bigcodebench and AgentBench

BigCodeBench evaluates code generation through programming tasks, while AgentBench evaluates LLMs across diverse agent-based reasoning tasks. The two are complementary rather than competing benchmarks: they assess different dimensions of LLM capability, coding versus agentic behavior.

                 bigcodebench                AgentBench
Overall score    64 (Established)            55 (Established)
Maintenance      6/25                        10/25
Adoption         20/25                       10/25
Maturity         18/25                       16/25
Community        20/25                       19/25
Stars            484                         3,234
Forks            64                          241
Downloads        18,917                      —
Commits (30d)    0                           0
Language         Python                      Python
License          Apache-2.0                  Apache-2.0
Risk flags       None                        No Package, No Dependents
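The overall score appears to be the plain sum of the four 25-point subscores (6+20+18+20 = 64 and 10+10+16+19 = 55). A minimal Python sketch verifying that arithmetic, assuming an unweighted sum:

```python
# Hypothetical check: overall score as the sum of the four 25-point subscores.
# The assumption of an unweighted sum is ours; the site's exact formula may differ.
scores = {
    "bigcodebench": {"maintenance": 6, "adoption": 20, "maturity": 18, "community": 20},
    "AgentBench":   {"maintenance": 10, "adoption": 10, "maturity": 16, "community": 19},
}

for name, parts in scores.items():
    total = sum(parts.values())  # 64 for bigcodebench, 55 for AgentBench
    print(f"{name}: {total}/100")
```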

About bigcodebench

bigcode-project/bigcodebench

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

About AgentBench

THUDM/AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

AgentBench comprises eight diverse task environments, including OS interaction, database queries, knowledge graphs, web shopping and browsing, card games, and puzzles, with containerized deployment via Docker Compose. Agents are evaluated through multi-turn interactions using function-calling prompts, and the framework integrates with AgentRL for end-to-end reinforcement learning workflows. Standardized dev/test splits and performance leaderboards allow comparison across LLM implementations; a sketch of the multi-turn loop follows.
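As an illustration of that multi-turn, function-calling evaluation loop, here is a minimal sketch. It is not AgentBench's actual API: `env`, `llm_complete`, and the message and tool schemas are hypothetical stand-ins.

```python
# Minimal sketch of a multi-turn agent evaluation loop of the kind AgentBench
# describes: the agent receives an observation, replies with a function call,
# and the environment returns the next observation until the episode ends.
# `env`, `llm_complete`, and the message schema are hypothetical stand-ins.
from typing import Any

def evaluate_episode(env: Any, llm_complete: Any, max_turns: int = 20) -> float:
    """Run one dev/test task episode and return its task-specific score."""
    messages = [{"role": "system", "content": env.instructions}]
    observation = env.reset()
    for _ in range(max_turns):
        messages.append({"role": "user", "content": observation})
        # The model answers with a structured function call (tool invocation).
        action = llm_complete(messages, tools=env.tool_schemas)
        messages.append({"role": "assistant", "content": str(action)})
        observation, done = env.step(action)
        if done:
            break
    return env.score()  # aggregated into the per-environment leaderboard
```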

Scores updated daily from GitHub, PyPI, and npm data.