Agent Evaluation & Benchmarking for AI Agents
Frameworks, platforms, and harnesses for systematically testing, benchmarking, and evaluating autonomous agent performance across capabilities like tool-use, reasoning, cost-efficiency, and safety. Does NOT include agent building frameworks, deployment infrastructure, or multi-agent competition environments designed primarily for training rather than evaluation.
There are 149 agent evaluation and benchmarking projects tracked in this category. One scores above 70 (Verified tier). The highest-rated is StonyBrookNLP/appworld at 72/100, with 388 stars and 771 monthly downloads. 1 of the top 10 is actively maintained.
Get all 149 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=agent-evaluation-benchmarking&limit=20"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
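For scripted access, the same endpoint can be called from Python. The sketch below is a minimal example under stated assumptions: the response is JSON, and the project entries carry fields such as `name`, `score`, and `tier`. Those field names (and the `results`/`data` wrapper keys) are not documented here and are assumptions; adjust them after inspecting a real response.

```python
import json
import urllib.request

# Same quality-dataset endpoint as the curl example above.
URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=agents&subcategory=agent-evaluation-benchmarking&limit=20"
)


def fetch_projects(url: str = URL) -> list:
    """Fetch the dataset and return the list of project entries.

    The top-level response shape (a bare list vs. an object wrapping a
    list) is an assumption, so both cases are handled.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    if isinstance(data, dict):
        # Hypothetical wrapper keys; replace with the real ones if they differ.
        return data.get("results") or data.get("data") or []
    return data


if __name__ == "__main__":
    for entry in fetch_projects():
        # "name", "score", and "tier" are assumed field names.
        print(entry.get("name"), entry.get("score"), entry.get("tier"))
```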
| # | Agent | Description | Score | Tier |
|---|---|---|---|---|
| 1 | StonyBrookNLP/appworld | 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking... | 72 | Verified |
| 2 | qualifire-dev/rogue | AI Agent Evaluator & Red Team Platform | | Established |
| 3 | future-agi/ai-evaluation | Evaluation Framework for all your AI related Workflows | | Established |
| 4 | microsoft/WindowsAgentArena | Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and... | | Established |
| 5 | agentscope-ai/OpenJudge | OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards | | Established |
| 6 | SparkBeyond/agentune | Tune your AI Agent to best meet its KPI with a cyclic process of analyze,... | | Established |
| 7 | dreadnode/AIRTBench-Code | Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming... | | Established |
| 8 | hidai25/eval-view | Regression testing for AI agents. Snapshot behavior, diff tool calls, catch... | | Established |
| 9 | RouteWorks/RouterArena | RouterArena: An open framework for evaluating LLM routers with standardized... | | Established |
| 10 | steel-dev/leaderboard | Open leaderboard for browser agents | | Emerging |
| 11 | alepot55/agentrial | Statistical evaluation framework for AI agents | | Emerging |
| 12 | Farama-Foundation/chatarena | ChatArena (or Chat Arena) is a Multi-Agent Language Game Environments for... | | Emerging |
| 13 | SAILResearch/awesome-foundation-model-leaderboards | A curated list of awesome leaderboard-oriented resources for AI domain | | Emerging |
| 14 | ag2ai/Agents_Failure_Attribution | Benchmark for automated failure attributions in agentic systems (🏆 ICML 2025... | | Emerging |
| 15 | rungalileo/agent-leaderboard | Ranking LLMs on agentic tasks | | Emerging |
| 16 | ltzheng/agent-studio | [ICLR 2025] A trinity of environments, tools, and benchmarks for general... | | Emerging |
| 17 | Cognitive-AI-Systems/pogema-benchmark | This is an umbrella repository that contains links and information about all... | | Emerging |
| 18 | justindobbs/Tracecore | Deterministic runtime for agent evaluation | | Emerging |
| 19 | SWE-bench/swe-bench.github.io | Landing page + leaderboard for SWE-Bench benchmark | | Emerging |
| 20 | AISmithLab/HumanStudy-Bench | HumanStudy-Bench: Towards AI Agent Design for Participant Simulation | | Emerging |
| 21 | geval-labs/geval | Eval-driven release gates for AI applications | | Emerging |
| 22 | plaited/agent-eval-harness | Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters... | | Emerging |
| 23 | laiso/ts-bench | Measure and compare the performance of AI coding agents on TypeScript tasks. | | Emerging |
| 24 | Vexp-ai/vexp-swe-bench | Open benchmark for AI coding agents on SWE-bench Verified. Compare... | | Emerging |
| 25 | biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks | Safety challenges for RL and LLM agents' ability to learn and use... | | Emerging |
| 26 | shubchat/loab | LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending... | | Emerging |
| 27 | HumanStudy-Hub/HumanStudy-Bench | HumanStudy-Bench: Community Edition — Standardized human study replays for... | | Emerging |
| 28 | jackjin1997/AgentBench-Live | The open benchmark for AI agent task execution. Claude Code vs Gemini CLI —... | | Emerging |
| 29 | lechmazur/elimination_game | A multi-player tournament benchmark that tests LLMs in social reasoning,... | | Emerging |
| 30 | future-agi/futureagi-sdk | Production-grade AI evaluation, prompt management & observability SDK.... | | Emerging |
| 31 | wallezhang/agent-eval | A YAML-config-driven CLI tool for evaluating AI agents | | Emerging |
| 32 | CosmosYi/AutoControl-Arena | 🛡️ AutoControl Arena: Synthesizing Executable Test Environments for Frontier... | | Emerging |
| 33 | OpenSymbolicAI/benchmark-py-legalbench | LegalBench benchmark: GoalSeeking agent for 162 legal reasoning tasks | | Emerging |
| 34 | Privatris/AgentLeak | AgentLeak: Open benchmark for privacy leakage in LLM agents — 7 channels,... | | Emerging |
| 35 | itbench-hub/ITBench-Scenarios | ⚠️ ARCHIVED - All development moved to... | | Emerging |
| 36 | elliot736/modelab | Open-source A/B testing framework for LLM systems with deterministic... | | Emerging |
| 37 | LeoYeAI/myclaw-bench | The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers.... | | Emerging |
| 38 | 8monkey-ai/hebo-evals | Markdown for Evals, a human-first format | | Emerging |
| 39 | StonyBrookNLP/appworld-leaderboard | 🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and... | | Emerging |
| 40 | yjyddq/RiOSWorld | [NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of... | | Emerging |
| 41 | vectorize-io/agent-memory-benchmark | Agent Memory Benchmark | | Emerging |
| 42 | campfirein/brv-bench | Benchmark suite for evaluating retrieval quality and latency of AI agent... | | Emerging |
| 43 | nottelabs/open-operator-evals | Open-source benchmark evaluating web operators/agents performance | | Emerging |
| 44 | stchakwdev/Secret_H_Evals | Multi-agent strategic deception evaluation framework for LLMs using Secret... | | Emerging |
| 45 | Icarus603/tech-innovation-eval-agent | Agent for evaluating enterprise technology-innovation capability | | Experimental |
| 46 | BUAA-CLab/CircuitMind | The code about TC-Bench and CircuitMind | | Experimental |
| 47 | lechmazur/step_game | Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception... | | Experimental |
| 48 | madhavkrishangarg/ReviewEval | ReviewEval: An Evaluation Framework for AI-Generated Reviews | | Experimental |
| 49 | sstklen/washin-api-benchmark | From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing... | | Experimental |
| 50 | xyva-yuangui/smartness-eval | 🎯 12-Dimension AI Agent Intelligence Assessment (automated 12-dimension agent intelligence evaluation skill)... | | Experimental |
| 51 | DUBSOpenHub/shadow-score-spec | A framework-agnostic metric for measuring AI code generation quality.... | | Experimental |
| 52 | Terminus-Lab/themis | LLM evaluation service with validated judges. Multi-dimensional scoring... | | Experimental |
| 53 | 4xxpray/ai-eval | 🤖 Evaluate and optimize LLM prompts with multi-provider support, rich... | | Experimental |
| 54 | yotambraun/Toolscore | Python framework for evaluating LLM tool-calling behavior with comprehensive... | | Experimental |
| 55 | lechmazur/pgg_bench | Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent... | | Experimental |
| 56 | clouatre-labs/llm-agent-experiments | Benchmarking open-weight LLM coding agents as SCOUT delegates: model... | | Experimental |
| 57 | justindobbs/awesome-certified-agents | A community catalog of autonomous agents and bundles certified by passing... | | Experimental |
| 58 | IlyasFardaouix/Agent-racing-league | The world's first racing league for AI agents. Think F1, but the drivers are AI. | | Experimental |
| 59 | mlbio-epfl/HeurekaBench | [ICLR 2026] A framework to "create benchmarks" and "evaluate AI... | | Experimental |
| 60 | melchiorhering/GUI-OS-AI-Agent-Benchmarking | A modular framework for benchmarking multimodal AI agents in a reproducible,... | | Experimental |
| 61 | yazcaleb/can-is-not-may | Authority Models for Governable AI Agents — paper, AuthorityBench (54... | | Experimental |
| 62 | pauldebdeep9/awesome-agentic-evaluation | A curated list of benchmarks, environments, papers, and tooling for agentic... | | Experimental |
| 63 | mireya001/evalops-kit | CI-native evals for tool-using agents: datasets, traces, deterministic... | | Experimental |
| 64 | kadubon/search-stability-lab | Theory-to-experiment lab for search stability in long-running agents under... | | Experimental |
| 65 | digital-rain-tech/ara-eval | ARA-Eval: Agentic Readiness Assessment — evaluation framework for... | | Experimental |
| 66 | yiyangzhang-ai/open-agent-eval | Lightweight open-source toolkit for evaluating tool-calling AI agents on... | | Experimental |
| 67 | AaronZhou-THU/agent-eval-workbench | A practical workbench for prompt, model, and mocked workflow evaluation with... | | Experimental |
| 68 | tsanthoshreddy/agent-qa-lab | Trace-aware regression harness for tool-using Strands agents with... | | Experimental |
| 69 | Ethandata/crucible-sim | Crucible — The Economic Autonomy Standard. Stress-test AI agents under... | | Experimental |
| 70 | MukundaKatta/AgentBench | Agent evaluation and benchmarking suite — accuracy, efficiency, and tool... | | Experimental |
| 71 | Vinashu/razor-cascade | Framework to benchmark same-provider LLM cascading and measure API cost,... | | Experimental |
| 72 | choutos/agent-eval-framework | Lightweight, practical evaluation framework for AI agents in production.... | | Experimental |
| 73 | dario-github/agent-self-evolution | Automated evaluation, ablation testing, and continuous improvement framework... | | Experimental |
| 74 | ristponex/awesome-minimax-m2.7 | 🧠 Awesome MiniMax M2.7 — Self-evolving coding AI. Integrations, benchmarks,... | | Experimental |
| 75 | davidgracemann/statma | stat-my-agent; benchmark consistency, tool-use, failure-recovery and... | | Experimental |
| 76 | evan66547/Contract-Reviewer-Agent-Eval | ⚖️ Benchmark evaluation framework for AI-powered legal contract review... | | Experimental |
| 77 | dairongzhen3-creator/illusion-of-emergence | Why your multi-agent LLM deception experiment might be measuring prompt... | | Experimental |
| 78 | alexmar07/agent-arena | A self-regulating arena where AI agents compete for work through sealed-bid auctions | | Experimental |
| 79 | dikatwoone/FluxCodeBench | 🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench,... | | Experimental |
| 80 | BayramAnnakov/eval-coach | Agent Skill for Evaluation-Driven Development (EDD) - guide AI evaluation... | | Experimental |
| 81 | nagu-io/agent-settlement-bench | Benchmark for evaluating safety of AI agents in irreversible financial... | | Experimental |
| 82 | ian-flores/securebench | Evaluation and benchmarking framework for R LLM agents | | Experimental |
| 83 | NeoSkillFactory/llm-benchmark | Automatically benchmarks LLM responses across multiple models using... | | Experimental |
| 84 | leaderboard-md/spec | LEADERBOARD.md — Open standard for AI agent performance benchmarking. Track... | | Experimental |
| 85 | The-Swarm-Corporation/ModelArena | ModelArena: A Competitive Environment for Multi-Agent Training | | Experimental |
| 86 | GZQKCHQM/M_bench | Measure Apple Silicon performance for Python and NumPy workloads, providing... | | Experimental |
| 87 | azurefr/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents | Benchmark autonomous AI agents by measuring their reasoning and competitive... | | Experimental |
| 88 | joshualamerton/agent-evaluation-lab | Sandbox platform for testing and evaluating autonomous agents | | Experimental |
| 89 | osheryadgar/tendedloop-arena | Python SDK for TendedLoop Arena — multi-agent gamification research... | | Experimental |
| 90 | Parslee-ai/statebench | Conformance test for stateful AI agents. Measures state correctness over time. | | Experimental |
| 91 | Syncause/syncause-benchmark | AI-driven RCA benchmark evaluating Syncause's accuracy, interpretability,... | | Experimental |
| 92 | datalayer-challenges/dabench-leaderboard | 🤖 A2A-compatible DABench evaluation leaderboard with AgentBeats architecture. | | Experimental |
| 93 | someonehereexists/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents | AI Arena is a competitive evaluation framework where multiple AI agents... | | Experimental |
| 94 | widingmarcus-cyber/opengym | 240 challenges to test if your AI agent actually works — not just the model,... | | Experimental |
| 95 | AnLuo1/Assisted-DS | This is the official page of the paper "AssistedDS: Benchmarking How... | | Experimental |
| 96 | dataanswer/awesome-agent-benchmarks | A curated collection of the world's most advanced benchmark datasets for... | | Experimental |
| 97 | FishIntelGlobal/uncertainty-axioms | Computational validation suite for The First Principles of Uncertainty... | | Experimental |
| 98 | eliumusk/agentreflect | AI agent self-reflection & self-evaluation tool. Built by an AI, for AIs. | | Experimental |
| 99 | thisisyoussef/ghostfolio-agent-eval-dataset | Deterministic golden eval dataset for finance-domain agent testing... | | Experimental |
| 100 | akshan-main/equitas-benchmark | Corruption-robustness benchmark for hierarchical multi-LLM committees | | Experimental |
| 101 | messeb/py-deepeval-behave-bdd-testing-example | An example that combines Behave (BDD testing) with DeepEval (LLM evaluation)... | | Experimental |
| 102 | jonradoff/hiddenbench | HiddenBench: Benchmark for evaluating collective reasoning in multi-agent LLM systems | | Experimental |
| 103 | manishklach/agentic_cpu_bottleneck_bench | Vendor-neutral simulator + benchmark for agent runtime overhead: fan-out,... | | Experimental |
| 104 | Pashasan/llm_price_sensitivity_evaluation | Conjoint experiment measuring price sensitivity and economic preferences of... | | Experimental |
| 105 | jstilb/meaningful_metrics | Open-source evaluation frameworks for human-centered metrics, AI evaluation... | | Experimental |
| 106 | zahere/stochastic-circuit-breaker | Statistically optimal circuit breaker for stochastic systems. 4-state... | | Experimental |
| 107 | robobobby/agenteval | Behavior test framework for AI agents. Define tests in YAML. Run against... | | Experimental |
| 108 | deathlabs/sunshower | Declarative and Distributed Benchmarking for AI Agents | | Experimental |
| 109 | SainathPattipati/agent-evaluation-harness | Framework to benchmark and evaluate multi-agent system performance,... | | Experimental |
| 110 | HomenShum/nodebench-boilerplate | Production-ready boilerplate for AI agent projects using NodeBench MCP. 129... | | Experimental |
| 111 | 1sdeb/sidemind.ai | AI Assurance Metrics Analyzer - Evaluate LLM outputs with 15 quality... | | Experimental |
| 112 | fraction12/open-rank | The open benchmark for AI agents — daily puzzles, public rankings | | Experimental |
| 113 | greynewell/swe-bench-pro-action | GitHub Action for SWE-bench Pro evaluation powered by mcpbr | | Experimental |
| 114 | jstilb/llm-eval-framework | LLM evaluation framework with custom metrics, LLM-as-judge, and... | | Experimental |
| 115 | speed785/evalforge | Agent Evaluation Harness — write repeatable, measurable evals for AI agents.... | | Experimental |
| 116 | diorwave/agent-playground | A minimal sandbox to run, score, and compare AI agent outputs locally. | | Experimental |
| 117 | pyros-projects/agent-comparison | Qualitative benchmark suite for evaluating AI coding agents and... | | Experimental |
| 118 | The-Swarm-Corporation/Xray-Bench | XRayBench is a state-of-the-art evaluation platform designed specifically... | | Experimental |
| 119 | axxafo/awesome-agent-benchmarks | 🧠 Discover and evaluate advanced benchmark datasets for Large Language Model... | | Experimental |
| 120 | vvsotnikov/astro-bench | Can AI agents do real science? Benchmarking AI agents on KASCADE cosmic ray... | | Experimental |
| 121 | vectorize-io/hindsight-benchmarks | Hindsight Benchmarks Results | | Experimental |
| 122 | Jesutofunmie/Haiku-4.5-vs-Minimax-2.1 | 🧠 Benchmark Haiku 4.5 and MiniMax M2.1 on agentic tasks, revealing strengths... | | Experimental |
| 123 | josephsenior/agent-evaluation-platform | 🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider... | | Experimental |
| 124 | tostechbr/evoloop | Framework-agnostic eval toolkit for AI agents — capture traces, judge... | | Experimental |
| 125 | BAAI-Agents/SWITCH | SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in... | | Experimental |
| 126 | crabsatellite/lem-experiments | Reproducible experiments for: LLM Exposure Monitoring — A Security Framework... | | Experimental |
| 127 | graciegould/agent-performance-tests | Benchmarks how codebase structure affects AI agent efficiency — tool calls,... | | Experimental |
| 128 | memstate-ai/memstate-benchmark | Open-source benchmark for AI agent memory systems — compare Memstate, mem0,... | | Experimental |
| 129 | avdolgikh/poker-coach-eval-harness | LLM-powered evaluation harness for detecting orchestration failures in AI... | | Experimental |
| 130 | Ritvik777/Galileo_Project | Galileo: Observations and Evals | | Experimental |
| 131 | jamjet-labs/jamjet-benchmarks | JamJet benchmarks, migration guides, and feature comparisons vs LangGraph,... | | Experimental |
| 132 | lintware/AI_Agent_Frameworks_Comparison | Benchmark comparing 8 AI agent frameworks (SmolAgents, OpenAI Agents SDK,... | | Experimental |
| 133 | memvid/memvidbench | Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational... | | Experimental |
| 134 | patrikmarshall/opencode-benchmark-dashboard | Measure and compare speed and accuracy of large language models using... | | Experimental |
| 135 | Emersoft76/ai-agent-systems-advanced-benchmarking | Modular AI agent system with LLMs, tools, and benchmark optimization | | Experimental |
| 136 | Lap-Platform/Lap-benchmark-docs | LAP benchmark results — 500 runs, 50 specs, 5 formats. Agents run 35%... | | Experimental |
| 137 | Red1-Rahman/Prompt-Eval | Streamlit prompt evaluation tool that auto-generates test cases, runs evals,... | | Experimental |
| 138 | Software-Engineering-Arena/SWE-Agent-Arena | Compare agents pairwise via multi-round evaluations for SE tasks. | | Experimental |
| 139 | Jojodicus/ai-identity-benchmark | Does the identity in a system prompt change performance? | | Experimental |
| 140 | brianjmarvin/datasnack-ai | The DataSnack AI Agent Evaluator is a CLI tool that automates the testing of... | | Experimental |
| 141 | mohsinsheikhani/support-fte-evals | Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent... | | Experimental |
| 142 | yzotop/ab-factory-demo | Deterministic multi-agent A/B test evaluation system with policy engine,... | | Experimental |
| 143 | EmZod/Haiku-4.5-vs-Minimax-2.1 | Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic... | | Experimental |
| 144 | ImSudhakar07/RivalReview-Evals | An eval platform that continuously monitors the quality of the /RivalReview... | | Experimental |
| 145 | prajaktapandit7/conversational-AI-evaluation | Structured evaluation of 30 support bot conversations measuring containment,... | | Experimental |
| 146 | EmZod/Earth-Magnetic-Field-Research-Minimax-w-subagents-in-pi- | Multi-agent research orchestration using MiniMax-M2.1 with thinking enabled.... | | Experimental |
| 147 | codedbyelif/els-judge | Multi-LLM consensus engine for automated code review, diff analysis, and... | | Experimental |
| 148 | abhi9avx/deepeval-llm-evaluation | LLM & RAG evaluation framework using DeepEval. Includes 11+ executable tests... | | Experimental |
| 149 | corradocavalli/agentic_evaluation | Demonstration of testing and evaluation patterns for AI agents using Azure... | | Experimental |