AISmithLab/HumanStudy-Bench
HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
Combines an Execution Engine that reconstructs full experimental protocols from published studies with standardized evaluation metrics (Probability Alignment Score, Effect Consistency Score) to measure whether LLM agents reach the same scientific conclusions as human participants. Supports modular agent design through customizable persona and prompt presets, enabling systematic comparison of configuration choices independent of base model capabilities. Includes 12 foundational studies spanning cognition and social psychology with over 6,000 trials, plus automated tooling to add new studies from research PDFs.
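The alignment idea behind metrics like the Probability Alignment Score can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the repository's implementation: the function name probability_alignment_score and the use of total variation distance between agent and human response distributions are guesses at how such a comparison might look.

    # Hypothetical sketch: compare the distribution of answers produced by
    # simulated agents with the distribution reported for human participants.
    # The actual HumanStudy-Bench formula may differ; this only shows the idea.
    from collections import Counter

    def response_distribution(responses, options):
        """Empirical probability of each answer option."""
        counts = Counter(responses)
        total = len(responses)
        return {opt: counts.get(opt, 0) / total for opt in options}

    def probability_alignment_score(agent_responses, human_probs):
        """1 minus total variation distance between agent and human choices.

        Returns 1.0 for a perfect match and 0.0 for fully disjoint choices.
        """
        options = list(human_probs)
        agent_probs = response_distribution(agent_responses, options)
        tvd = 0.5 * sum(abs(agent_probs[o] - human_probs[o]) for o in options)
        return 1.0 - tvd

    # Example: a two-option framing task where 72% of humans chose option A.
    human = {"A": 0.72, "B": 0.28}
    agents = ["A"] * 70 + ["B"] * 30
    print(probability_alignment_score(agents, human))  # ~0.98

A score near 1.0 would indicate that agents reproduce the human choice distribution closely; whether the benchmark also thresholds this per study to decide "same conclusion" is not specified here.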
Stars
12
Forks
3
Language
Python
License
MIT
Category
Last pushed
Mar 08, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/agents/AISmithLab/HumanStudy-Bench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
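A minimal Python equivalent of the curl call above, for scripting. It assumes the endpoint returns JSON and uses only the standard library; no API key is required at the free 100 requests/day tier.

    # Fetch the quality data for this repository from the pt-edge API.
    # Assumes a JSON response body, which is not confirmed by the listing.
    import json
    from urllib.request import urlopen

    URL = "https://pt-edge.onrender.com/api/v1/quality/agents/AISmithLab/HumanStudy-Bench"

    with urlopen(URL, timeout=10) as resp:
        data = json.load(resp)

    print(json.dumps(data, indent=2))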
Higher-rated alternatives
StonyBrookNLP/appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and...
qualifire-dev/rogue
AI Agent Evaluator & Red Team Platform
future-agi/ai-evaluation
Evaluation Framework for all your AI related Workflows
microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of...
agentscope-ai/OpenJudge
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards