ag2ai/Agents_Failure_Attribution
Benchmark for automated failure attribution in agentic systems (🏆 ICML 2025 Spotlight)
Introduces the "Who&When" benchmark of 184 annotated failure trajectories drawn from both algorithm-generated (CaptainAgent) and hand-crafted (Magnetic-One) multi-agent systems, with fine-grained labels for the responsible agent, the critical error step, and a failure explanation. Implements three attribution methods (All-at-Once, Step-by-Step, and Binary Search) that work with multiple LLM backends (GPT-4o, Llama, Qwen) to automatically pinpoint failure causes in complex agentic workflows. Evaluates performance on realistic scenarios derived from the GAIA and AssistantBench datasets, enabling rapid debugging iteration and reward signals for agent self-correction.
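A minimal sketch of the Binary Search strategy's core idea: repeatedly halve the trajectory and keep whichever half an LLM judge blames, converging on the error step in roughly log2(n) queries. The simulated judge below stands in for a real LLM call (the repo uses GPT-4o, Llama, or Qwen with its own prompts), so the halving logic runs without an API key; all function names here are illustrative, not the repository's API.

# Runnable sketch of binary-search failure attribution (not the repo's code).
def make_simulated_judge(true_error_step: int):
    # Stand-in for an LLM call that answers which half contains the error.
    def judge(lo: int, mid: int, hi: int) -> str:
        # A real judge would read steps[lo:hi]; this oracle compares indices.
        return "first" if true_error_step < mid else "second"
    return judge

def binary_search_attribution(num_steps: int, judge) -> int:
    # Repeatedly halve the trajectory, keeping the half the judge blames.
    lo, hi = 0, num_steps
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge(lo, mid, hi) == "first":
            hi = mid
        else:
            lo = mid
    return lo  # index of the step judged to contain the critical error

if __name__ == "__main__":
    judge = make_simulated_judge(true_error_step=13)
    print(binary_search_attribution(num_steps=40, judge=judge))  # -> 13

On a 40-step trajectory with the error planted at step 13, this converges in six judge calls instead of the forty checks a step-by-step sweep would need, which is the trade-off between the three methods.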
Stars: 349
Forks: 23
Language: Python
License: MIT
Category:
Last pushed: Feb 11, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/ag2ai/Agents_Failure_Attribution"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
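If you would rather call the endpoint from Python than curl, here is a hedged sketch; the response schema is not documented on this page, so treat the JSON handling as an assumption to adjust.

# Assumes only the URL shown above; the response fields are unknown.
import requests

url = (
    "https://pt-edge.onrender.com/api/v1/quality/"
    "agents/ag2ai/Agents_Failure_Attribution"
)
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()  # assumed to be a JSON object
print(data)         # inspect whichever fields the API actually returns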
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation framework for all your AI-related workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards