ryoungj/ToolEmu

[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use

/ 100

Emerging

Uses LMs (e.g., GPT-4) to emulate tool execution in a virtual sandbox, so no actual API implementations are required. This enables rapid prototyping across diverse scenarios, including ones involving high-stakes tools. Includes automated LM-based safety and helpfulness evaluators for scalable risk assessment, paired with a curated benchmark of 36 toolkits and 144 test cases for quantitative agent evaluation. The extensible architecture lets users contribute new toolkits and test cases by specifying tool schemas and scenarios.
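
As a rough illustration of the idea described above (a tool defined only by its schema, with an LM producing the observation instead of a real API), here is a minimal sketch. It is not ToolEmu's actual code or file format: `ToolSpec`, `build_emulator_prompt`, `emulate_tool_call`, `stub_lm`, and the `BankTransfer` tool are all hypothetical names introduced for this example, and the LM is replaced by a canned stub so the snippet runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolSpec:
    """Hypothetical tool schema; ToolEmu's real toolkit format may differ."""
    name: str
    description: str
    parameters: Dict[str, str]  # argument name -> short description
    returns: Dict[str, str]     # field name -> short description


def build_emulator_prompt(tool: ToolSpec, arguments: Dict[str, object]) -> str:
    """Render the schema and one concrete call into a prompt for the emulator LM."""
    return (
        "You are simulating a tool. Reply with a JSON object only.\n"
        f"Tool: {tool.name}\n"
        f"Description: {tool.description}\n"
        f"Parameters: {json.dumps(tool.parameters, sort_keys=True)}\n"
        f"Returns: {json.dumps(tool.returns, sort_keys=True)}\n"
        f"Call arguments: {json.dumps(arguments, sort_keys=True)}\n"
    )


def emulate_tool_call(
    tool: ToolSpec,
    arguments: Dict[str, object],
    lm: Callable[[str], str],
) -> Dict[str, object]:
    """Ask the LM to produce the tool's observation; no real API is called."""
    # Assumption for this sketch: every declared parameter is required.
    missing = [p for p in tool.parameters if p not in arguments]
    if missing:
        return {"error": "missing arguments: " + ", ".join(missing)}
    raw = lm(build_emulator_prompt(tool, arguments))
    try:
        observation = json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "emulator returned non-JSON output"}
    if not isinstance(observation, dict):
        return {"error": "emulator returned a non-object value"}
    return observation


def stub_lm(prompt: str) -> str:
    """Stand-in for a real LM call (e.g., GPT-4); returns a canned observation."""
    return json.dumps({"success": True, "transaction_id": "txn-0001"})


# A hypothetical high-stakes tool, specified by schema alone.
transfer = ToolSpec(
    name="BankTransfer",
    description="Transfer money between two accounts.",
    parameters={"to_account": "destination account id", "amount": "amount in USD"},
    returns={"success": "whether the transfer succeeded", "transaction_id": "id"},
)

observation = emulate_tool_call(transfer, {"to_account": "123", "amount": 50}, stub_lm)
print(observation)
```

Passing the LM in as a plain callable keeps the sketch independent of any particular provider; in a real setup the stub would be swapped for an actual model call, and a separate evaluator LM could then score the resulting trajectory for safety and helpfulness.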

192 stars. No commits in the last 6 months.

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 9 / 25

Community 14 / 25


Stars 192

Forks

Language Python

License Apache-2.0

Featured in

You're Shipping AI You Can't Measure

Higher-rated alternatives

microsoft/promptbench

A unified evaluation framework for large language models

uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications....

microsoftarchive/promptbench

A unified evaluation framework for large language models

gabe-mousa/Apolien

AI Safety Evaluation Library

levitation-opensource/Manipulative-Expression-Recognition

MER is a software that identifies and highlights manipulative communication in text from human...
