StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agents. ACL'24 Best Resource Paper.
Provides a simulated environment of 9 interconnected apps (457 APIs in total) populated by ~100 autonomous digital people, in which agents solve natural-language task instructions by writing multi-step interactive code. Supports evaluation via direct Python execution, an MCP (Model Context Protocol) server, or pre-built agent frameworks; includes a safety sandbox for code execution and task generators for creating diverse benchmarks. Available as a PyPI package with CLI tools for task exploration, leaderboard tracking, and parallelizing experiments across multiple agent implementations.
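For orientation, here is a minimal sketch of driving the environment from Python after installing the PyPI package and its data. It follows the package's README-style interface; the split name, experiment name, and the executed code string are illustrative assumptions, not a verified recipe.

from appworld import AppWorld, load_task_ids

# Task ids for a data split; "train" is one of the provided splits.
task_ids = load_task_ids("train")

# Open one task under a throwaway experiment name (illustrative).
with AppWorld(task_id=task_ids[0], experiment_name="minimal_example") as world:
    print(world.task.instruction)  # natural-language task instruction
    # Run agent-written Python inside the sandboxed environment.
    output = world.execute("print(apis.api_docs.show_app_descriptions())")
    print(output)
    print(world.evaluate())  # check whether the task is complete

An agent loop would repeat the execute step, feeding each output back to the model until it decides the task is done.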
388 GitHub stars and 771 monthly downloads on PyPI.
Stars: 388
Forks: 59
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 17, 2026
Monthly downloads: 771
Commits (30d): 0
Dependencies: 35
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/StonyBrookNLP/appworld"
Open to everyone: 100 requests per day with no key needed. A free key raises the limit to 1,000 per day.
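The same endpoint can be queried programmatically; here is a minimal Python sketch using only the standard library, assuming the response body is JSON (the payload schema is not documented here, so the whole object is printed rather than guessing field names).

import json
import urllib.request

# Quality-data endpoint for this repository (same URL as the curl command above).
URL = "https://pt-edge.onrender.com/api/v1/quality/agents/StonyBrookNLP/appworld"

# Fetch and decode the response; assumes the API returns a JSON object.
with urllib.request.urlopen(URL, timeout=10) as response:
    data = json.load(response)

# Pretty-print the payload; inspect it to learn the actual field names.
print(json.dumps(data, indent=2))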
Related agents
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
dreadnode/AIRTBench-Code
Code Repository for: AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards