AgentBench and LLM-Agent-Benchmark-List
About AgentBench
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
This project helps developers and researchers evaluate how well large language models (LLMs) can act as autonomous 'agents' in various real-world scenarios. It takes an LLM as input and runs it through a standardized set of tasks, like interacting with an operating system, using a database, or shopping online. The output is a performance score, showing how effectively the LLM completes these multi-step, interactive tasks.
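To make the input-tasks-score flow concrete, here is a minimal sketch of the kind of evaluation loop an AgentBench-style harness runs: the model under test repeatedly observes an interactive environment, proposes an action, and is scored on whether it reaches the goal. All names here (run_episode, MockOSEnv, llm_act) are hypothetical illustrations, not AgentBench's actual API.

```python
# Hypothetical sketch of evaluating an LLM as an agent on one interactive task.
# None of these names come from AgentBench itself.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MockOSEnv:
    """Toy stand-in for an interactive task environment (e.g. an OS shell)."""
    goal: str = "create a file named report.txt"
    done: bool = False
    steps: List[str] = field(default_factory=list)

    def observe(self) -> str:
        # What the model sees each turn: the goal plus its action history.
        return f"Goal: {self.goal}. History: {self.steps}"

    def step(self, action: str) -> None:
        # Apply the model's action and check for task completion.
        self.steps.append(action)
        if "touch report.txt" in action:
            self.done = True


def run_episode(llm_act: Callable[[str], str], env: MockOSEnv, max_turns: int = 5) -> float:
    """Let the model act for up to max_turns; score 1.0 on success, else 0.0."""
    for _ in range(max_turns):
        action = llm_act(env.observe())  # the LLM under evaluation proposes the next action
        env.step(action)
        if env.done:
            return 1.0
    return 0.0


if __name__ == "__main__":
    # A trivial "model" that always issues the right command, just to show the flow.
    score = run_episode(lambda obs: "touch report.txt", MockOSEnv())
    print(f"task score: {score}")
```

A real harness averages such per-task scores across many environments (operating system, database, web shopping, and so on) to produce the overall agent score.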
About LLM-Agent-Benchmark-List
zhangxjohn/LLM-Agent-Benchmark-List
A benchmark list for the evaluation of large language models.
This resource helps AI researchers and developers understand and compare how well Large Language Models (LLMs) and LLM-powered agents perform on different tasks. It provides a structured list of benchmarks, including papers and project pages, allowing you to select appropriate evaluation methods for specific LLM applications. This is for anyone building, researching, or deploying LLMs and agent systems who needs to rigorously assess their capabilities.
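As a rough illustration of how such a curated list can be used programmatically, the sketch below represents one entry and filters by capability. The field names and example data are illustrative assumptions, not the repo's actual schema; only the THUDM/AgentBench URL is taken from this page.

```python
# Hypothetical representation of a benchmark-list entry; not the repo's real format.

from dataclasses import dataclass


@dataclass
class BenchmarkEntry:
    name: str
    capability: str      # e.g. "agent tasks", "tool use", "code generation"
    paper_url: str
    project_url: str


entries = [
    BenchmarkEntry(
        name="AgentBench",
        capability="general agent tasks",
        paper_url="<paper link>",  # placeholder; see the list for the actual reference
        project_url="https://github.com/THUDM/AgentBench",
    ),
]

# Pick benchmarks relevant to the capability you want to evaluate.
relevant = [e for e in entries if "agent" in e.capability]
print([e.name for e in relevant])
```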