AgentBench and MemoryAgentBench

These two benchmarks are complementary: AgentBench evaluates LLMs as agents across a broad set of task environments, while MemoryAgentBench focuses specifically on the memory capabilities of LLM agents, evaluated through incremental multi-turn interactions.

| Metric | AgentBench | MemoryAgentBench |
| --- | --- | --- |
| Score | 55 | 46 |
| Tier | Established | Emerging |
| Maintenance | 10/25 | 10/25 |
| Adoption | 10/25 | 10/25 |
| Maturity | 16/25 | 7/25 |
| Community | 19/25 | 19/25 |
| Stars | 3,234 | 253 |
| Forks | 241 | 41 |
| Downloads | not listed | not listed |
| Commits (30d) | 0 | 0 |
| Language | Python | Python |
| License | Apache-2.0 | none listed |
| Flags | No Package, No Dependents | No License, No Package, No Dependents |
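The headline scores appear to be the plain sum of the four category scores, each out of 25. That aggregation is inferred from the numbers shown here, not from a documented formula, so the sketch below is only a consistency check; the names `SUBSCORES` and `total_score` are illustrative.

```python
# Category scores as listed on this page, each out of 25.
SUBSCORES = {
    "AgentBench": {
        "maintenance": 10, "adoption": 10, "maturity": 16, "community": 19,
    },
    "MemoryAgentBench": {
        "maintenance": 10, "adoption": 10, "maturity": 7, "community": 19,
    },
}


def total_score(subscores: dict[str, int]) -> int:
    """Sum the four category scores (assumed aggregation, maximum 100)."""
    return sum(subscores.values())


totals = {name: total_score(parts) for name, parts in SUBSCORES.items()}
print(totals)  # {'AgentBench': 55, 'MemoryAgentBench': 46}
```

Under that assumption, the 9-point gap between the two projects comes entirely from the Maturity category (16 versus 7).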

About AgentBench

THUDM/AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

AgentBench comprises 8 diverse task environments (OS interaction, database queries, knowledge graphs, web shopping/browsing, card games, and puzzles), deployed as containers via Docker Compose. It evaluates agents through multi-turn interactions using function-calling prompts and integrates with AgentRL for end-to-end reinforcement learning workflows. It also provides standardized dev/test splits, with performance leaderboards across different LLM implementations.
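To make the idea of multi-turn, function-calling evaluation concrete, here is a minimal sketch of an episode loop. It does not use AgentBench's actual API: `ToyCounterEnv`, `greedy_agent`, `run_episode`, and `success_rate` are hypothetical names, and the rule-based agent stands in for an LLM that would normally produce the tool call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyCounterEnv:
    """Hypothetical task: reach `target` by repeatedly calling the `add` tool."""
    target: int
    value: int = 0
    max_turns: int = 5

    def observe(self) -> str:
        return f"value={self.value}, target={self.target}"

    def call(self, name: str, args: dict) -> None:
        # Only one tool exists in this toy environment.
        if name == "add":
            self.value += int(args["amount"])

    def solved(self) -> bool:
        return self.value == self.target


def greedy_agent(observation: str) -> dict:
    """Stand-in for an LLM: reads the observation and emits one tool call."""
    fields = dict(part.split("=") for part in observation.split(", "))
    gap = int(fields["target"]) - int(fields["value"])
    # The tool call is capped at +/-3 per turn, so large targets need many turns.
    return {"name": "add", "args": {"amount": max(-3, min(3, gap))}}


def run_episode(env: ToyCounterEnv, agent: Callable[[str], dict]) -> bool:
    """Alternate observation and tool call until solved or out of turns."""
    for _ in range(env.max_turns):
        if env.solved():
            return True
        action = agent(env.observe())
        env.call(action["name"], action["args"])
    return env.solved()


def success_rate(targets: list[int], agent: Callable[[str], dict]) -> float:
    results = [run_episode(ToyCounterEnv(target=t), agent) for t in targets]
    return sum(results) / len(results)


dev_targets = [2, 7, 15, 20]  # hypothetical "dev split"
print(success_rate(dev_targets, greedy_agent))  # 0.75
```

The target of 20 fails because five turns of at most 3 each cannot reach it, which shows how a turn budget shapes the success metric in this kind of evaluation.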

About MemoryAgentBench

HUST-AI-HYZ/MemoryAgentBench

Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
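The paper title suggests a protocol in which an agent receives information incrementally over several turns and is questioned afterwards. The sketch below illustrates that general idea only. It is not MemoryAgentBench's code or data: `KeywordMemoryAgent`, `evaluate`, and the example chunks are hypothetical, and simple word-overlap retrieval stands in for a real memory mechanism.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set used for naive overlap matching."""
    return set(re.findall(r"\w+", text.lower()))


class KeywordMemoryAgent:
    """Hypothetical agent: stores every chunk, answers with the best-matching one."""

    def __init__(self) -> None:
        self.memory: list[str] = []

    def ingest(self, chunk: str) -> None:
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        q = tokens(question)
        return max(self.memory, key=lambda chunk: len(q & tokens(chunk)))


def evaluate(agent, chunks, qa_pairs) -> float:
    # Phase 1: feed the content one chunk per turn.
    for chunk in chunks:
        agent.ingest(chunk)
    # Phase 2: ask questions that depend on what was seen earlier.
    hits = [expected in agent.answer(question) for question, expected in qa_pairs]
    return sum(hits) / len(hits)


chunks = [
    "Alice moved to Paris in 2019.",
    "Bob adopted a cat named Miso.",
    "Alice later moved to Berlin in 2022.",
]
qa_pairs = [
    ("What is the name of Bob's cat?", "Miso"),
    ("Where did Alice move in 2022?", "Berlin"),
]
print(evaluate(KeywordMemoryAgent(), chunks, qa_pairs))  # 1.0
```

The second question requires distinguishing an earlier fact from a later update, the kind of case where a weaker memory strategy would be expected to fail.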

Scores are updated daily from GitHub, PyPI, and npm data.