Agent Evaluation & Benchmarking (AI Agents)

Frameworks, platforms, and harnesses for systematically testing, benchmarking, and evaluating autonomous agent performance across capabilities like tool-use, reasoning, cost-efficiency, and safety. Does NOT include agent building frameworks, deployment infrastructure, or multi-agent competition environments designed primarily for training rather than evaluation.

There are 149 agent evaluation and benchmarking projects tracked. One scores above 70 (Verified tier). The highest-rated is StonyBrookNLP/appworld at 72/100, with 388 stars and 771 monthly downloads. Only 1 of the top 10 is actively maintained.

Get all 149 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=agent-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
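
For working with the feed programmatically, here is a minimal Python sketch built on the same endpoint. The response schema, the optional "items" wrapper, the "name" and "score" field names, the "X-API-Key" header, and the limit=149 value are all assumptions rather than documented behavior; adjust them to whatever the API actually returns.

# A minimal sketch for pulling the dataset and filtering it locally.
# Assumptions (not documented on this page): the endpoint accepts limit=149,
# the JSON payload is either a bare list or an object with an "items" key,
# each record exposes "name" and "score" fields, and an optional API key is
# sent via an "X-API-Key" header.
import json
import urllib.request

API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=agents&subcategory=agent-evaluation-benchmarking&limit=149"
)


def fetch_projects(api_key: str | None = None) -> list[dict]:
    """Download the project list; the key is optional on the 100 req/day tier."""
    request = urllib.request.Request(API_URL)
    if api_key:
        request.add_header("X-API-Key", api_key)  # assumed header name
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload if isinstance(payload, list) else payload.get("items", [])


if __name__ == "__main__":
    # Print everything at or above the Established band seen in this listing.
    for project in fetch_projects():
        if project.get("score", 0) >= 50:
            print(f"{project.get('name', '?')}: {project.get('score')}")

The standard library keeps the sketch dependency-free; swapping in requests or adding pagination is straightforward if the dataset outgrows a single call.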

Rank | Agent | Description | Score | Tier
1 StonyBrookNLP/appworld

🌍 AppWorld: A Controllable World of Apps and People for Benchmarking...

72
Verified
2 qualifire-dev/rogue

AI Agent Evaluator & Red Team Platform

61
Established
3 future-agi/ai-evaluation

Evaluation Framework for all your AI related Workflows

57
Established
4 microsoft/WindowsAgentArena

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and...

56
Established
5 agentscope-ai/OpenJudge

OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards

53
Established
6 SparkBeyond/agentune

Tune your AI Agent to best meet its KPI with a cyclic process of analyze,...

53
Established
7 dreadnode/AIRTBench-Code

Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming...

53
Established
8 hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch ...

52
Established
9 RouteWorks/RouterArena

RouterArena: An open framework for evaluating LLM routers with standardized...

50
Established
10 steel-dev/leaderboard

Open leaderboard for browser agents

49
Emerging
11 alepot55/agentrial

Statistical evaluation framework for AI agents

49
Emerging
12 Farama-Foundation/chatarena

ChatArena (or Chat Arena) is a Multi-Agent Language Game Environment for...

48
Emerging
13 SAILResearch/awesome-foundation-model-leaderboards

A curated list of awesome leaderboard-oriented resources for AI domain

48
Emerging
14 ag2ai/Agents_Failure_Attribution

Benchmark for automated failure attributions in agentic systems (🏆 ICML 2025...

48
Emerging
15 rungalileo/agent-leaderboard

Ranking LLMs on agentic tasks

47
Emerging
16 ltzheng/agent-studio

[ICLR 2025] A trinity of environments, tools, and benchmarks for general...

45
Emerging
17 Cognitive-AI-Systems/pogema-benchmark

This is an umbrella repository that contains links and information about all...

44
Emerging
18 justindobbs/Tracecore

Deterministic runtime for agent evaluation

42
Emerging
19 SWE-bench/swe-bench.github.io

Landing page + leaderboard for SWE-Bench benchmark

42
Emerging
20 AISmithLab/HumanStudy-Bench

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

41
Emerging
21 geval-labs/geval

Eval-driven release gates for AI applications

40
Emerging
22 plaited/agent-eval-harness

Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters...

39
Emerging
23 laiso/ts-bench

Measure and compare the performance of AI coding agents on TypeScript tasks.

39
Emerging
24 Vexp-ai/vexp-swe-bench

Open benchmark for AI coding agents on SWE-bench Verified. Compare...

38
Emerging
25 biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks

Safety challenges for RL and LLM agents' ability to learn and use...

38
Emerging
26 shubchat/loab

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending...

38
Emerging
27 HumanStudy-Hub/HumanStudy-Bench

HumanStudy-Bench: Community Edition — Standardized human study replays for...

38
Emerging
28 jackjin1997/AgentBench-Live

The open benchmark for AI agent task execution. Claude Code vs Gemini CLI —...

37
Emerging
29 lechmazur/elimination_game

A multi-player tournament benchmark that tests LLMs in social reasoning,...

37
Emerging
30 future-agi/futureagi-sdk

Production-grade AI evaluation, prompt management & observability SDK....

36
Emerging
31 wallezhang/agent-eval

A YAML-config-driven CLI tool for evaluating AI agents

36
Emerging
32 CosmosYi/AutoControl-Arena

🛡️AutoControl Arena: Synthesizing Executable Test Environments for Frontier...

35
Emerging
33 OpenSymbolicAI/benchmark-py-legalbench

LegalBench benchmark: GoalSeeking agent for 162 legal reasoning tasks

35
Emerging
34 Privatris/AgentLeak

AgentLeak: Open benchmark for privacy leakage in LLM agents — 7 channels,...

35
Emerging
35 itbench-hub/ITBench-Scenarios

⚠️ ARCHIVED - All development moved to...

35
Emerging
36 elliot736/modelab

Open-source A/B testing framework for LLM systems with deterministic...

35
Emerging
37 LeoYeAI/myclaw-bench

The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers....

35
Emerging
38 8monkey-ai/hebo-evals

Markdown for Evals, a human-first format

33
Emerging
39 StonyBrookNLP/appworld-leaderboard

🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and...

32
Emerging
40 yjyddq/RiOSWorld

[NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of...

32
Emerging
41 vectorize-io/agent-memory-benchmark

Agent Memory Benchmark

31
Emerging
42 campfirein/brv-bench

Benchmark suite for evaluating retrieval quality and latency of AI agent...

31
Emerging
43 nottelabs/open-operator-evals

Opensource benchmark evaluating web operators/agents performance

30
Emerging
44 stchakwdev/Secret_H_Evals

Multi-agent strategic deception evaluation framework for LLMs using Secret...

30
Emerging
45 Icarus603/tech-innovation-eval-agent

Agent for evaluating enterprise science and technology innovation capability

28
Experimental
46 BUAA-CLab/CircuitMind

The code about TC-Bench and CircuitMind

28
Experimental
47 lechmazur/step_game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception...

27
Experimental
48 madhavkrishangarg/ReviewEval

ReviewEval: An Evaluation Framework for AI-Generated Reviews

26
Experimental
49 sstklen/washin-api-benchmark

From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing...

26
Experimental
50 xyva-yuangui/smartness-eval

🎯 12-Dimension AI Agent Intelligence Assessment | automated 12-dimension AI agent intelligence evaluation skill |...

24
Experimental
51 DUBSOpenHub/shadow-score-spec

A framework-agnostic metric for measuring AI code generation quality....

24
Experimental
52 Terminus-Lab/themis

LLM evaluation service with validated judges. Multi-dimensional scoring...

24
Experimental
53 4xxpray/ai-eval

🤖 Evaluate and optimize LLM prompts with multi-provider support, rich...

23
Experimental
54 yotambraun/Toolscore

Python framework for evaluating LLM tool-calling behavior with comprehensive...

23
Experimental
55 lechmazur/pgg_bench

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent...

23
Experimental
56 clouatre-labs/llm-agent-experiments

Benchmarking open-weight LLM coding agents as SCOUT delegates: model...

23
Experimental
57 justindobbs/awesome-certified-agents

A community catalog of autonomous agents and bundles certified by passing...

23
Experimental
58 IlyasFardaouix/Agent-racing-league

The world's first racing league for AI agents. Think F1, but the drivers are AI.

23
Experimental
59 mlbio-epfl/HeurekaBench

[ICLR 2026] A framework to "create benchmarks" and "evaluate AI...

23
Experimental
60 melchiorhering/GUI-OS-AI-Agent-Benchmarking

A modular framework for benchmarking multimodal AI agents in a reproducible,...

23
Experimental
61 yazcaleb/can-is-not-may

Authority Models for Governable AI Agents — paper, AuthorityBench (54...

23
Experimental
62 pauldebdeep9/awesome-agentic-evaluation

A curated list of benchmarks, environments, papers, and tooling for agentic...

23
Experimental
63 mireya001/evalops-kit

CI-native evals for tool-using agents: datasets, traces, deterministic...

22
Experimental
64 kadubon/search-stability-lab

Theory-to-experiment lab for search stability in long-running agents under...

22
Experimental
65 digital-rain-tech/ara-eval

ARA-Eval: Agentic Readiness Assessment — evaluation framework for...

22
Experimental
66 yiyangzhang-ai/open-agent-eval

Lightweight open-source toolkit for evaluating tool-calling AI agents on...

22
Experimental
67 AaronZhou-THU/agent-eval-workbench

A practical workbench for prompt, model, and mocked workflow evaluation with...

22
Experimental
68 tsanthoshreddy/agent-qa-lab

Trace-aware regression harness for tool-using Strands agents with...

22
Experimental
69 Ethandata/crucible-sim

Crucible — The Economic Autonomy Standard. Stress-test AI agents under...

22
Experimental
70 MukundaKatta/AgentBench

Agent evaluation and benchmarking suite — accuracy, efficiency, and tool...

22
Experimental
71 Vinashu/razor-cascade

Framework to benchmark same-provider LLM cascading and measure API cost,...

22
Experimental
72 choutos/agent-eval-framework

Lightweight, practical evaluation framework for AI agents in production....

22
Experimental
73 dario-github/agent-self-evolution

Automated evaluation, ablation testing, and continuous improvement framework...

22
Experimental
74 ristponex/awesome-minimax-m2.7

🧠 Awesome MiniMax M2.7 — Self-evolving coding AI. Integrations, benchmarks,...

22
Experimental
75 davidgracemann/statma

stat-my-agent; benchmark consistency, tool-use, failure-recovery and...

22
Experimental
76 evan66547/Contract-Reviewer-Agent-Eval

⚖️ Benchmark evaluation framework for AI-powered legal contract review...

22
Experimental
77 dairongzhen3-creator/illusion-of-emergence

Why your multi-agent LLM deception experiment might be measuring prompt...

22
Experimental
78 alexmar07/agent-arena

A self-regulating arena where AI agents compete for work through sealed-bid auctions

22
Experimental
79 dikatwoone/FluxCodeBench

🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench,...

22
Experimental
80 BayramAnnakov/eval-coach

Agent Skill for Evaluation-Driven Development (EDD) - guide AI evaluation...

22
Experimental
81 nagu-io/agent-settlement-bench

Benchmark for evaluating safety of AI agents in irreversible financial...

22
Experimental
82 ian-flores/securebench

Evaluation and benchmarking framework for R LLM agents

22
Experimental
83 NeoSkillFactory/llm-benchmark

Automatically benchmarks LLM responses across multiple models using...

22
Experimental
84 leaderboard-md/spec

LEADERBOARD.md — Open standard for AI agent performance benchmarking. Track...

22
Experimental
85 The-Swarm-Corporation/ModelArena

ModelArena: A Competitive Environment for Multi-Agent Training

22
Experimental
86 GZQKCHQM/M_bench

Measure Apple Silicon performance for Python and NumPy workloads, providing...

22
Experimental
87 azurefr/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents

Benchmark autonomous AI agents by measuring their reasoning and competitive...

22
Experimental
88 joshualamerton/agent-evaluation-lab

Sandbox platform for testing and evaluating autonomous agents

22
Experimental
89 osheryadgar/tendedloop-arena

Python SDK for TendedLoop Arena — multi-agent gamification research...

22
Experimental
90 Parslee-ai/statebench

Conformance test for stateful AI agents. Measures state correctness over time.

22
Experimental
91 Syncause/syncause-benchmark

AI-driven RCA benchmark evaluating Syncause’s accuracy, interpretability,...

20
Experimental
92 datalayer-challenges/dabench-leaderboard

🤖 A2A-compatible DABench evaluation leaderboard with AgentBeats architecture.

20
Experimental
93 someonehereexists/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents

AI Arena is a competitive evaluation framework where multiple AI agents...

20
Experimental
94 widingmarcus-cyber/opengym

240 challenges to test if your AI agent actually works — not just the model,...

20
Experimental
95 AnLuo1/Assisted-DS

This is the official page of the paper "AssistedDS: Benchmarking How...

19
Experimental
96 dataanswer/awesome-agent-benchmarks

A curated collection of the world’s most advanced benchmark datasets for...

19
Experimental
97 FishIntelGlobal/uncertainty-axioms

Computational validation suite for The First Principles of Uncertainty...

19
Experimental
98 eliumusk/agentreflect

AI agent self-reflection & self-evaluation tool. Built by an AI, for AIs.

19
Experimental
99 thisisyoussef/ghostfolio-agent-eval-dataset

Deterministic golden eval dataset for finance-domain agent testing...

19
Experimental
100 akshan-main/equitas-benchmark

Corruption-robustness benchmark for hierarchical multi-LLM committees

19
Experimental
101 messeb/py-deepeval-behave-bdd-testing-example

An example that combines Behave (BDD testing) with DeepEval (LLM evaluation)...

19
Experimental
102 jonradoff/hiddenbench

HiddenBench: Benchmark for evaluating collective reasoning in multi-agent LLM systems

19
Experimental
103 manishklach/agentic_cpu_bottleneck_bench

Vendor-neutral simulator + benchmark for agent runtime overhead: fan-out,...

19
Experimental
104 Pashasan/llm_price_sensitivity_evaluation

Conjoint experiment measuring price sensitivity and economic preferences of...

19
Experimental
105 jstilb/meaningful_metrics

Open-source evaluation frameworks for human-centered metrics, AI evaluation...

19
Experimental
106 zahere/stochastic-circuit-breaker

Statistically optimal circuit breaker for stochastic systems. 4-state...

19
Experimental
107 robobobby/agenteval

Behavior test framework for AI agents. Define tests in YAML. Run against...

19
Experimental
108 deathlabs/sunshower

Declarative and Distributed Benchmarking for AI Agents

19
Experimental
109 SainathPattipati/agent-evaluation-harness

Framework to benchmark and evaluate multi-agent system performance,...

19
Experimental
110 HomenShum/nodebench-boilerplate

Production-ready boilerplate for AI agent projects using NodeBench MCP. 129...

19
Experimental
111 1sdeb/sidemind.ai

AI Assurance Metrics Analyzer - Evaluate LLM outputs with 15 quality...

19
Experimental
112 fraction12/open-rank

The open benchmark for AI agents — daily puzzles, public rankings

19
Experimental
113 greynewell/swe-bench-pro-action

GitHub Action for SWE-bench Pro evaluation powered by mcpbr

19
Experimental
114 jstilb/llm-eval-framework

LLM evaluation framework with custom metrics, LLM-as-judge, and...

19
Experimental
115 speed785/evalforge

Agent Evaluation Harness — write repeatable, measurable evals for AI agents....

19
Experimental
116 diorwave/agent-playground

A minimal sandbox to run, score, and compare AI agent outputs locally.

18
Experimental
117 pyros-projects/agent-comparison

Qualitative benchmark suite for evaluating AI coding agents and...

17
Experimental
118 The-Swarm-Corporation/Xray-Bench

XRayBench is a state-of-the-art evaluation platform designed specifically...

17
Experimental
119 axxafo/awesome-agent-benchmarks

🧠 Discover and evaluate advanced benchmark datasets for Large Language Model...

17
Experimental
120 vvsotnikov/astro-bench

Can AI agents do real science? Benchmarking AI agents on KASCADE cosmic ray...

17
Experimental
121 vectorize-io/hindsight-benchmarks

Hindsight Benchmarks Results

16
Experimental
122 Jesutofunmie/Haiku-4.5-vs-Minimax-2.1

🧠 Benchmark Haiku 4.5 and MiniMax M2.1 on agentic tasks, revealing strengths...

15
Experimental
123 josephsenior/agent-evaluation-platform

🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider...

15
Experimental
124 tostechbr/evoloop

Framework-agnostic eval toolkit for AI agents — capture traces, judge...

15
Experimental
125 BAAI-Agents/SWITCH

SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in...

15
Experimental
126 crabsatellite/lem-experiments

Reproducible experiments for: LLM Exposure Monitoring — A Security Framework...

14
Experimental
127 graciegould/agent-performance-tests

Benchmarks how codebase structure affects AI agent efficiency — tool calls,...

14
Experimental
128 memstate-ai/memstate-benchmark

Open-source benchmark for AI agent memory systems — compare Memstate, mem0,...

14
Experimental
129 avdolgikh/poker-coach-eval-harness

LLM-powered evaluation harness for detecting orchestration failures in AI...

14
Experimental
130 Ritvik777/Galileo_Project

Galileo: Observations and Evals

14
Experimental
131 jamjet-labs/jamjet-benchmarks

JamJet benchmarks, migration guides, and feature comparisons vs LangGraph,...

14
Experimental
132 lintware/AI_Agent_Frameworks_Comparison

Benchmark comparing 8 AI agent frameworks (SmolAgents, OpenAI Agents SDK,...

14
Experimental
133 memvid/memvidbench

Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational...

14
Experimental
134 patrikmarshall/opencode-benchmark-dashboard

Measure and compare speed and accuracy of large language models using...

14
Experimental
135 Emersoft76/ai-agent-systems-advanced-benchmarking

Modular AI agent system with LLMs, tools, and benchmark optimization

12
Experimental
136 Lap-Platform/Lap-benchmark-docs

LAP benchmark results — 500 runs, 50 specs, 5 formats. Agents run 35%...

12
Experimental
137 Red1-Rahman/Prompt-Eval

Streamlit prompt evaluation tool that auto-generates test cases, run evals,...

12
Experimental
138 Software-Engineering-Arena/SWE-Agent-Arena

Compare agents pairwise via multi‑round evaluations for SE tasks.

12
Experimental
139 Jojodicus/ai-identity-benchmark

Does the identity in a system prompt change performance?

11
Experimental
140 brianjmarvin/datasnack-ai

The DataSnack AI Agent Evaluator is a CLI tool that automates the testing of...

11
Experimental
141 mohsinsheikhani/support-fte-evals

Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent...

11
Experimental
142 yzotop/ab-factory-demo

Deterministic multi-agent A/B test evaluation system with policy engine,...

11
Experimental
143 EmZod/Haiku-4.5-vs-Minimax-2.1

Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic...

11
Experimental
144 ImSudhakar07/RivalReview-Evals

An eval platform that continuously monitors the quality of the /RivalReview...

11
Experimental
145 prajaktapandit7/conversational-AI-evaluation

Structured evaluation of 30 support bot conversations measuring containment,...

11
Experimental
146 EmZod/Earth-Magnetic-Field-Research-Minimax-w-subagents-in-pi-

Multi-agent research orchestration using MiniMax-M2.1 with thinking enabled....

11
Experimental
147 codedbyelif/els-judge

Multi-LLM consensus engine for automated code review, diff analysis, and...

11
Experimental
148 abhi9avx/deepeval-llm-evaluation

LLM & RAG evaluation framework using DeepEval. Includes 11+ executable tests...

11
Experimental
149 corradocavalli/agentic_evaluation

Demonstration of testing and evaluation patterns for AI agents using Azure...

11
Experimental
