rungalileo/agent-leaderboard
Ranking LLMs on agentic tasks
Combines synthetic dataset generation with multi-turn conversation simulation across five enterprise domains (banking, healthcare, insurance, telecom, investment) to evaluate agents on task completion and tool-calling accuracy. Introduces two metrics, Action Completion (AC) and Tool Selection Quality (TSQ), which measure real-world effectiveness (whether agents accomplish all user goals with correct tool selection) rather than isolated capability scores. Hosts a live leaderboard on Hugging Face comparing 17+ models and provides open datasets for reproducible agent evaluation in production scenarios.
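To make the metrics concrete, here is a minimal Python sketch of a Tool Selection Quality check: the fraction of expected tool calls that the agent actually made with matching name and arguments. The function and the call format below are hypothetical simplifications; the repository's own TSQ definition may score argument correctness differently.

def tool_selection_quality(expected_calls, actual_calls):
    """Fraction of expected tool calls matched by name and arguments.

    Hypothetical simplification of TSQ; the repository's actual metric
    may weight partial argument matches or call ordering differently.
    """
    if not expected_calls:
        return 1.0
    matched = 0
    remaining = list(actual_calls)
    for exp in expected_calls:
        for i, act in enumerate(remaining):
            if act["name"] == exp["name"] and act["args"] == exp["args"]:
                matched += 1
                del remaining[i]  # each actual call can satisfy one expectation
                break
    return matched / len(expected_calls)

# Example: one of two expected calls was made correctly -> TSQ = 0.5
expected = [{"name": "get_balance", "args": {"account": "123"}},
            {"name": "transfer", "args": {"to": "456", "amount": 50}}]
actual = [{"name": "get_balance", "args": {"account": "123"}}]
print(tool_selection_quality(expected, actual))  # 0.5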
Stars: 217
Forks: 23
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 18, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/rungalileo/agent-leaderboard"
Open to everyone: 100 requests/day, no key needed; a free key raises the limit to 1,000/day.
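The same endpoint can be queried from Python. A minimal sketch using only the standard library, assuming the response body is JSON (its schema is not documented here):

import json
import urllib.request

# Endpoint from the curl example above; the response is assumed to be JSON.
URL = ("https://pt-edge.onrender.com/api/v1/quality/agents/"
       "rungalileo/agent-leaderboard")

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

print(json.dumps(data, indent=2))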
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards