rungalileo/agent-leaderboard
Ranking LLMs on agentic tasks
Combines synthetic dataset generation with multi-turn conversation simulation across five enterprise domains (banking, healthcare, insurance, telecom, investment) to evaluate agents on task completion and tool-calling accuracy. Introduces two metrics, Action Completion (AC) and Tool Selection Quality (TSQ), which measure real-world effectiveness (whether agents accomplish all user goals with correct tool selection) rather than isolated capability scores. Hosts a live leaderboard on Hugging Face comparing 17+ models and provides open datasets for reproducible agent evaluation in production scenarios.
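To make the metrics concrete, here is a minimal Python sketch of a Tool Selection Quality check: the fraction of expected tool calls that the agent actually made with matching name and arguments. The function and the call format below are hypothetical simplifications; the repository's own TSQ definition may score argument correctness differently.

def tool_selection_quality(expected_calls, actual_calls):
    """Fraction of expected tool calls matched by name and arguments.

    Hypothetical simplification of TSQ; the repository's actual metric
    may weight partial argument matches or call ordering differently.
    """
    if not expected_calls:
        return 1.0
    matched = 0
    remaining = list(actual_calls)
    for exp in expected_calls:
        for i, act in enumerate(remaining):
            if act["name"] == exp["name"] and act["args"] == exp["args"]:
                matched += 1
                del remaining[i]  # each actual call can satisfy one expectation
                break
    return matched / len(expected_calls)

# Example: one of two expected calls was made correctly -> TSQ = 0.5
expected = [{"name": "get_balance", "args": {"account": "123"}},
            {"name": "transfer", "args": {"to": "456", "amount": 50}}]
actual = [{"name": "get_balance", "args": {"account": "123"}}]
print(tool_selection_quality(expected, actual))  # 0.5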
Stars: 217
Forks: 23
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/rungalileo/agent-leaderboard"
Open to everyone: 100 requests/day, no key needed; a free key raises the limit to 1,000/day.
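The same endpoint can be queried from Python. A minimal sketch using only the standard library, assuming the response body is JSON (its schema is not documented here):

import json
import urllib.request

# Endpoint from the curl example above; the response is assumed to be JSON.
URL = ("https://pt-edge.onrender.com/api/v1/quality/agents/"
       "rungalileo/agent-leaderboard")

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))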
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards