microsoftarchive/promptbench
A unified evaluation framework for large language models
Archived
Provides modular support for prompt engineering techniques (few-shot chain-of-thought, emotion prompting), adversarial robustness evaluation via prompt attacks, and dynamic test data generation to mitigate benchmark contamination. Built on PyTorch with extensible components for datasets, models, and evaluation methods, it integrates specialized frameworks such as DyVal for dynamic evaluation and PromptEval for efficient multi-prompt assessment, covering standard benchmarks (MMLU, BIG-Bench Hard, GLUE) and multi-modal datasets.
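For orientation, here is a minimal classification-evaluation sketch in the style of the upstream promptbench quickstart. The DatasetLoader, LLMModel, Prompt, InputProcess, OutputProcess, and Eval names follow the project's README; the dataset and model choices are illustrative assumptions, not recommendations.

import promptbench as pb
from tqdm import tqdm

# Load a supported dataset and model (choices here are illustrative assumptions).
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

# Score each candidate prompt over the dataset.
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}"])
for prompt in prompts:
    preds, labels = [], []
    for data in tqdm(dataset):
        input_text = pb.InputProcess.basic_format(prompt, data)  # fill the {content} slot
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, model.model_name))  # normalize raw output to a class label
        labels.append(data["label"])
    print(prompt, pb.Eval.compute_cls_accuracy(preds, labels))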
Stars
2,787
Forks
219
Language
Python
License
MIT
Category
Prompt Engineering
Last pushed
Feb 20, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoftarchive/promptbench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
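To consume the endpoint from code rather than the shell, a minimal sketch using Python's requests library is shown below. The URL is taken from the curl command above; the response schema is not documented here, so the sketch prints the JSON as-is, and how to pass an API key for the higher limit is left to the API docs rather than guessed.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/prompt-engineering/microsoftarchive/promptbench"

# No key needed for up to 100 requests/day; consult the API docs
# for how to attach a free key to raise the limit to 1,000/day.
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect the raw JSON; the schema is not documented here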
Higher-rated alternatives
microsoft/promptbench
A unified evaluation framework for large language models
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications....
gabe-mousa/Apolien
AI Safety Evaluation Library
babelcloud/LLM-RGB
LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.
PromptMixerDev/prompt-mixer-app-ce
A desktop application for comparing outputs from different Large Language Models (LLMs).