Agent Evaluation & Benchmarking (AI Agents)

Frameworks, platforms, and harnesses for systematically testing, benchmarking, and evaluating autonomous agent performance across capabilities like tool-use, reasoning, cost-efficiency, and safety. Does NOT include agent building frameworks, deployment infrastructure, or multi-agent competition environments designed primarily for training rather than evaluation.

There are 149 agent evaluation and benchmarking projects tracked. One scores above 70 (Verified tier). The highest-rated is StonyBrookNLP/appworld at 72/100, with 388 stars and 771 monthly downloads. Only 1 of the top 10 is actively maintained.

Get all 149 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=agents&subcategory=agent-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
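
For working with the feed programmatically, here is a minimal Python sketch built on the same endpoint. The response schema, the optional "items" wrapper, the "name" and "score" field names, the "X-API-Key" header, and the limit=149 value are all assumptions rather than documented behavior; adjust them to whatever the API actually returns.

# A minimal sketch for pulling the dataset and filtering it locally.
# Assumptions (not documented on this page): the endpoint accepts limit=149,
# the JSON payload is either a bare list or an object with an "items" key,
# each record exposes "name" and "score" fields, and an optional API key is
# sent via an "X-API-Key" header.
import json
import urllib.request

API_URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=agents&subcategory=agent-evaluation-benchmarking&limit=149"
)


def fetch_projects(api_key: str | None = None) -> list[dict]:
    """Download the project list; the key is optional on the 100 req/day tier."""
    request = urllib.request.Request(API_URL)
    if api_key:
        request.add_header("X-API-Key", api_key)  # assumed header name
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload if isinstance(payload, list) else payload.get("items", [])


if __name__ == "__main__":
    # Print everything at or above the Established band seen in this listing.
    for project in fetch_projects():
        if project.get("score", 0) >= 50:
            print(f"{project.get('name', '?')}: {project.get('score')}")

The standard library keeps the sketch dependency-free; swapping in requests or adding pagination is straightforward if the dataset outgrows a single call.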

Rank | Agent | Description | Score | Tier
1 StonyBrookNLP/appworld

🌍 AppWorld: A Controllable World of Apps and People for Benchmarking...

72
Verified
2 qualifire-dev/rogue

AI Agent Evaluator & Red Team Platform

61
Established
3 future-agi/ai-evaluation

Evaluation Framework for all your AI related Workflows

57
Established
4 microsoft/WindowsAgentArena

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and...

56
Established
5 agentscope-ai/OpenJudge

OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards

53
Established
6 SparkBeyond/agentune

Tune your AI Agent to best meet its KPI with a cyclic process of analyze,...

53
Established
7 dreadnode/AIRTBench-Code

Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming...

53
Established
8 hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch ...

52
Established
9 RouteWorks/RouterArena

RouterArena: An open framework for evaluating LLM routers with standardized...

50
Established
10 steel-dev/leaderboard

Open leaderboard for browser agents

49
Emerging
11 alepot55/agentrial

Statistical evaluation framework for AI agents

49
Emerging
12 Farama-Foundation/chatarena

ChatArena (or Chat Arena) is a Multi-Agent Language Game Environment for...

48
Emerging
13 SAILResearch/awesome-foundation-model-leaderboards

A curated list of awesome leaderboard-oriented resources for AI domain

48
Emerging
14 ag2ai/Agents_Failure_Attribution

Benchmark for automated failure attributions in agentic systems (🏆 ICML 2025...

48
Emerging
15 rungalileo/agent-leaderboard

Ranking LLMs on agentic tasks

47
Emerging
16 ltzheng/agent-studio

[ICLR 2025] A trinity of environments, tools, and benchmarks for general...

45
Emerging
17 Cognitive-AI-Systems/pogema-benchmark

This is an umbrella repository that contains links and information about all...

44
Emerging
18 justindobbs/Tracecore

Deterministic runtime for agent evaluation

42
Emerging
19 SWE-bench/swe-bench.github.io

Landing page + leaderboard for SWE-Bench benchmark

42
Emerging
20 AISmithLab/HumanStudy-Bench

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

41
Emerging
21 geval-labs/geval

Eval-driven release gates for AI applications

40
Emerging
22 plaited/agent-eval-harness

Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters...

39
Emerging
23 laiso/ts-bench

Measure and compare the performance of AI coding agents on TypeScript tasks.

39
Emerging
24 Vexp-ai/vexp-swe-bench

Open benchmark for AI coding agents on SWE-bench Verified. Compare...

38
Emerging
25 biological-alignment-benchmarks/biological-alignment-gridagents-benchmarks

Safety challenges for RL and LLM agents' ability to learn and use...

38
Emerging
26 shubchat/loab

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending...

38
Emerging
27 HumanStudy-Hub/HumanStudy-Bench

HumanStudy-Bench: Community Edition — Standardized human study replays for...

38
Emerging
28 jackjin1997/AgentBench-Live

The open benchmark for AI agent task execution. Claude Code vs Gemini CLI —...

37
Emerging
29 lechmazur/elimination_game

A multi-player tournament benchmark that tests LLMs in social reasoning,...

37
Emerging
30 future-agi/futureagi-sdk

Production-grade AI evaluation, prompt management & observability SDK....

36
Emerging
31 wallezhang/agent-eval

A YAML-config-driven CLI tool for evaluating AI agents

36
Emerging
32 CosmosYi/AutoControl-Arena

🛡️AutoControl Arena: Synthesizing Executable Test Environments for Frontier...

35
Emerging
33 OpenSymbolicAI/benchmark-py-legalbench

LegalBench benchmark: GoalSeeking agent for 162 legal reasoning tasks

35
Emerging
34 Privatris/AgentLeak

AgentLeak: Open benchmark for privacy leakage in LLM agents — 7 channels,...

35
Emerging
35 itbench-hub/ITBench-Scenarios

⚠️ ARCHIVED - All development moved to...

35
Emerging
36 elliot736/modelab

Open-source A/B testing framework for LLM systems with deterministic...

35
Emerging
37 LeoYeAI/myclaw-bench

The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers....

35
Emerging
38 8monkey-ai/hebo-evals

Markdown for Evals, a human-first format

33
Emerging
39 StonyBrookNLP/appworld-leaderboard

🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and...

32
Emerging
40 yjyddq/RiOSWorld

[NeurIPS 2025] Official repository of RiOSWorld: Benchmarking the Risk of...

32
Emerging
41 vectorize-io/agent-memory-benchmark

Agent Memory Benchmark

31
Emerging
42 campfirein/brv-bench

Benchmark suite for evaluating retrieval quality and latency of AI agent...

31
Emerging
43 nottelabs/open-operator-evals

Opensource benchmark evaluating web operators/agents performance

30
Emerging
44 stchakwdev/Secret_H_Evals

Multi-agent strategic deception evaluation framework for LLMs using Secret...

30
Emerging
45 Icarus603/tech-innovation-eval-agent

Agent for evaluating enterprise science and technology innovation capability

28
Experimental
46 BUAA-CLab/CircuitMind

The code about TC-Bench and CircuitMind

28
Experimental
47 lechmazur/step_game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception...

27
Experimental
48 madhavkrishangarg/ReviewEval

ReviewEval: An Evaluation Framework for AI-Generated Reviews

26
Experimental
49 sstklen/washin-api-benchmark

From Benchmarks to Architecture — We tested 30+ AI APIs, designed routing...

26
Experimental
50 xyva-yuangui/smartness-eval

🎯 12-Dimension AI Agent Intelligence Assessment | automated 12-dimension AI agent intelligence evaluation skill |...

24
Experimental
51 DUBSOpenHub/shadow-score-spec

A framework-agnostic metric for measuring AI code generation quality....

24
Experimental
52 Terminus-Lab/themis

LLM evaluation service with validated judges. Multi-dimensional scoring...

24
Experimental
53 4xxpray/ai-eval

🤖 Evaluate and optimize LLM prompts with multi-provider support, rich...

23
Experimental
54 yotambraun/Toolscore

Python framework for evaluating LLM tool-calling behavior with comprehensive...

23
Experimental
55 lechmazur/pgg_bench

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent...

23
Experimental
56 clouatre-labs/llm-agent-experiments

Benchmarking open-weight LLM coding agents as SCOUT delegates: model...

23
Experimental
57 justindobbs/awesome-certified-agents

A community catalog of autonomous agents and bundles certified by passing...

23
Experimental
58 IlyasFardaouix/Agent-racing-league

The world's first racing league for AI agents. Think F1, but the drivers are AI.

23
Experimental
59 mlbio-epfl/HeurekaBench

[ICLR 2026] A framework to "create benchmarks" and "evaluate AI...

23
Experimental
60 melchiorhering/GUI-OS-AI-Agent-Benchmarking

A modular framework for benchmarking multimodal AI agents in a reproducible,...

23
Experimental
61 yazcaleb/can-is-not-may

Authority Models for Governable AI Agents — paper, AuthorityBench (54...

23
Experimental
62 pauldebdeep9/awesome-agentic-evaluation

A curated list of benchmarks, environments, papers, and tooling for agentic...

23
Experimental
63 mireya001/evalops-kit

CI-native evals for tool-using agents: datasets, traces, deterministic...

22
Experimental
64 kadubon/search-stability-lab

Theory-to-experiment lab for search stability in long-running agents under...

22
Experimental
65 digital-rain-tech/ara-eval

ARA-Eval: Agentic Readiness Assessment — evaluation framework for...

22
Experimental
66 yiyangzhang-ai/open-agent-eval

Lightweight open-source toolkit for evaluating tool-calling AI agents on...

22
Experimental
67 AaronZhou-THU/agent-eval-workbench

A practical workbench for prompt, model, and mocked workflow evaluation with...

22
Experimental
68 tsanthoshreddy/agent-qa-lab

Trace-aware regression harness for tool-using Strands agents with...

22
Experimental
69 Ethandata/crucible-sim

Crucible — The Economic Autonomy Standard. Stress-test AI agents under...

22
Experimental
70 MukundaKatta/AgentBench

Agent evaluation and benchmarking suite — accuracy, efficiency, and tool...

22
Experimental
71 Vinashu/razor-cascade

Framework to benchmark same-provider LLM cascading and measure API cost,...

22
Experimental
72 choutos/agent-eval-framework

Lightweight, practical evaluation framework for AI agents in production....

22
Experimental
73 dario-github/agent-self-evolution

Automated evaluation, ablation testing, and continuous improvement framework...

22
Experimental
74 ristponex/awesome-minimax-m2.7

🧠 Awesome MiniMax M2.7 — Self-evolving coding AI. Integrations, benchmarks,...

22
Experimental
75 davidgracemann/statma

stat-my-agent; benchmark consistency, tool-use, failure-recovery and...

22
Experimental
76 evan66547/Contract-Reviewer-Agent-Eval

⚖️ Benchmark evaluation framework for AI-powered legal contract review...

22
Experimental
77 dairongzhen3-creator/illusion-of-emergence

Why your multi-agent LLM deception experiment might be measuring prompt...

22
Experimental
78 alexmar07/agent-arena

A self-regulating arena where AI agents compete for work through sealed-bid auctions

22
Experimental
79 dikatwoone/FluxCodeBench

🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench,...

22
Experimental
80 BayramAnnakov/eval-coach

Agent Skill for Evaluation-Driven Development (EDD) - guide AI evaluation...

22
Experimental
81 nagu-io/agent-settlement-bench

Benchmark for evaluating safety of AI agents in irreversible financial...

22
Experimental
82 ian-flores/securebench

Evaluation and benchmarking framework for R LLM agents

22
Experimental
83 NeoSkillFactory/llm-benchmark

Automatically benchmarks LLM responses across multiple models using...

22
Experimental
84 leaderboard-md/spec

LEADERBOARD.md — Open standard for AI agent performance benchmarking. Track...

22
Experimental
85 The-Swarm-Corporation/ModelArena

ModelArena: A Competitive Environment for Multi-Agent Training

22
Experimental
86 GZQKCHQM/M_bench

Measure Apple Silicon performance for Python and NumPy workloads, providing...

22
Experimental
87 azurefr/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents

Benchmark autonomous AI agents by measuring their reasoning and competitive...

22
Experimental
88 joshualamerton/agent-evaluation-lab

Sandbox platform for testing and evaluating autonomous agents

22
Experimental
89 osheryadgar/tendedloop-arena

Python SDK for TendedLoop Arena — multi-agent gamification research...

22
Experimental
90 Parslee-ai/statebench

Conformance test for stateful AI agents. Measures state correctness over time.

22
Experimental
91 Syncause/syncause-benchmark

AI-driven RCA benchmark evaluating Syncause’s accuracy, interpretability,...

20
Experimental
92 datalayer-challenges/dabench-leaderboard

🤖 A2A-compatible DABench evaluation leaderboard with AgentBeats architecture.

20
Experimental
93 someonehereexists/AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents

AI Arena is a competitive evaluation framework where multiple AI agents...

20
Experimental
94 widingmarcus-cyber/opengym

240 challenges to test if your AI agent actually works — not just the model,...

20
Experimental
95 AnLuo1/Assisted-DS

This is the official page of the paper "AssistedDS: Benchmarking How...

19
Experimental
96 dataanswer/awesome-agent-benchmarks

A curated collection of the world’s most advanced benchmark datasets for...

19
Experimental
97 FishIntelGlobal/uncertainty-axioms

Computational validation suite for The First Principles of Uncertainty...

19
Experimental
98 eliumusk/agentreflect

AI agent self-reflection & self-evaluation tool. Built by an AI, for AIs.

19
Experimental
99 thisisyoussef/ghostfolio-agent-eval-dataset

Deterministic golden eval dataset for finance-domain agent testing...

19
Experimental
100 akshan-main/equitas-benchmark

Corruption-robustness benchmark for hierarchical multi-LLM committees

19
Experimental
101 messeb/py-deepeval-behave-bdd-testing-example

An example that combines Behave (BDD testing) with DeepEval (LLM evaluation)...

19
Experimental
102 jonradoff/hiddenbench

HiddenBench: Benchmark for evaluating collective reasoning in multi-agent LLM systems

19
Experimental
103 manishklach/agentic_cpu_bottleneck_bench

Vendor-neutral simulator + benchmark for agent runtime overhead: fan-out,...

19
Experimental
104 Pashasan/llm_price_sensitivity_evaluation

Conjoint experiment measuring price sensitivity and economic preferences of...

19
Experimental
105 jstilb/meaningful_metrics

Open-source evaluation frameworks for human-centered metrics, AI evaluation...

19
Experimental
106 zahere/stochastic-circuit-breaker

Statistically optimal circuit breaker for stochastic systems. 4-state...

19
Experimental
107 robobobby/agenteval

Behavior test framework for AI agents. Define tests in YAML. Run against...

19
Experimental
108 deathlabs/sunshower

Declarative and Distributed Benchmarking for AI Agents

19
Experimental
109 SainathPattipati/agent-evaluation-harness

Framework to benchmark and evaluate multi-agent system performance,...

19
Experimental
110 HomenShum/nodebench-boilerplate

Production-ready boilerplate for AI agent projects using NodeBench MCP. 129...

19
Experimental
111 1sdeb/sidemind.ai

AI Assurance Metrics Analyzer - Evaluate LLM outputs with 15 quality...

19
Experimental
112 fraction12/open-rank

The open benchmark for AI agents — daily puzzles, public rankings

19
Experimental
113 greynewell/swe-bench-pro-action

GitHub Action for SWE-bench Pro evaluation powered by mcpbr

19
Experimental
114 jstilb/llm-eval-framework

LLM evaluation framework with custom metrics, LLM-as-judge, and...

19
Experimental
115 speed785/evalforge

Agent Evaluation Harness — write repeatable, measurable evals for AI agents....

19
Experimental
116 diorwave/agent-playground

A minimal sandbox to run, score, and compare AI agent outputs locally.

18
Experimental
117 pyros-projects/agent-comparison

Qualitative benchmark suite for evaluating AI coding agents and...

17
Experimental
118 The-Swarm-Corporation/Xray-Bench

XRayBench is a state-of-the-art evaluation platform designed specifically...

17
Experimental
119 axxafo/awesome-agent-benchmarks

🧠 Discover and evaluate advanced benchmark datasets for Large Language Model...

17
Experimental
120 vvsotnikov/astro-bench

Can AI agents do real science? Benchmarking AI agents on KASCADE cosmic ray...

17
Experimental
121 vectorize-io/hindsight-benchmarks

Hindsight Benchmarks Results

16
Experimental
122 Jesutofunmie/Haiku-4.5-vs-Minimax-2.1

🧠 Benchmark Haiku 4.5 and MiniMax M2.1 on agentic tasks, revealing strengths...

15
Experimental
123 josephsenior/agent-evaluation-platform

🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider...

15
Experimental
124 tostechbr/evoloop

Framework-agnostic eval toolkit for AI agents — capture traces, judge...

15
Experimental
125 BAAI-Agents/SWITCH

SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in...

15
Experimental
126 crabsatellite/lem-experiments

Reproducible experiments for: LLM Exposure Monitoring — A Security Framework...

14
Experimental
127 graciegould/agent-performance-tests

Benchmarks how codebase structure affects AI agent efficiency — tool calls,...

14
Experimental
128 memstate-ai/memstate-benchmark

Open-source benchmark for AI agent memory systems — compare Memstate, mem0,...

14
Experimental
129 avdolgikh/poker-coach-eval-harness

LLM-powered evaluation harness for detecting orchestration failures in AI...

14
Experimental
130 Ritvik777/Galileo_Project

Galileo: Observations and Evals

14
Experimental
131 jamjet-labs/jamjet-benchmarks

JamJet benchmarks, migration guides, and feature comparisons vs LangGraph,...

14
Experimental
132 lintware/AI_Agent_Frameworks_Comparison

Benchmark comparing 8 AI agent frameworks (SmolAgents, OpenAI Agents SDK,...

14
Experimental
133 memvid/memvidbench

Benchmark tool for evaluating Memvid on the LoCoMo (Long-term Conversational...

14
Experimental
134 patrikmarshall/opencode-benchmark-dashboard

Measure and compare speed and accuracy of large language models using...

14
Experimental
135 Emersoft76/ai-agent-systems-advanced-benchmarking

Modular AI agent system with LLMs, tools, and benchmark optimization

12
Experimental
136 Lap-Platform/Lap-benchmark-docs

LAP benchmark results — 500 runs, 50 specs, 5 formats. Agents run 35%...

12
Experimental
137 Red1-Rahman/Prompt-Eval

Streamlit prompt evaluation tool that auto-generates test cases, run evals,...

12
Experimental
138 Software-Engineering-Arena/SWE-Agent-Arena

Compare agents pairwise via multi‑round evaluations for SE tasks.

12
Experimental
139 Jojodicus/ai-identity-benchmark

Does the identity in a system prompt change performance?

11
Experimental
140 brianjmarvin/datasnack-ai

The DataSnack AI Agent Evaluator is a CLI tool that automates the testing of...

11
Experimental
141 mohsinsheikhani/support-fte-evals

Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent...

11
Experimental
142 yzotop/ab-factory-demo

Deterministic multi-agent A/B test evaluation system with policy engine,...

11
Experimental
143 EmZod/Haiku-4.5-vs-Minimax-2.1

Systematic benchmark comparing Claude Haiku 4.5 vs MiniMax M2.1 on agentic...

11
Experimental
144 ImSudhakar07/RivalReview-Evals

An eval platform that continuously monitors the quality of the /RivalReview...

11
Experimental
145 prajaktapandit7/conversational-AI-evaluation

Structured evaluation of 30 support bot conversations measuring containment,...

11
Experimental
146 EmZod/Earth-Magnetic-Field-Research-Minimax-w-subagents-in-pi-

Multi-agent research orchestration using MiniMax-M2.1 with thinking enabled....

11
Experimental
147 codedbyelif/els-judge

Multi-LLM consensus engine for automated code review, diff analysis, and...

11
Experimental
148 abhi9avx/deepeval-llm-evaluation

LLM & RAG evaluation framework using DeepEval. Includes 11+ executable tests...

11
Experimental
149 corradocavalli/agentic_evaluation

Demonstration of testing and evaluation patterns for AI agents using Azure...

11
Experimental
