tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Implements length-controlled win-rate scoring to mitigate output-length bias, achieving a 0.98 Spearman correlation with Chatbot Arena rankings while costing under $10 and completing in under 3 minutes per evaluation. Uses LLM-based pairwise comparisons (GPT-4 by default) against a reference model, validated against 20K human annotations, with built-in caching and randomized output ordering to avoid position bias. Provides a toolkit for constructing custom evaluators with batching and multi-annotator support, plus curated evaluation datasets and leaderboards for benchmarking instruction-following models.
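For orientation, here is a minimal sketch of running the evaluator on a file of model outputs. It assumes the pip package alpaca-eval and the CLI shown in the repo README; the annotator name and JSON fields below are taken from the repo's examples, so check the README for current values and flags.

pip install alpaca-eval
export OPENAI_API_KEY=<your_api_key>   # the default annotator calls an OpenAI model
# model_outputs: JSON list of records with "instruction" and "output" (plus a "generator" name)
alpaca_eval --model_outputs 'example/outputs.json' \
            --annotators_config 'weighted_alpaca_eval_gpt4_turbo'
# prints a leaderboard row for the model, including win rate and length-controlled win rate
# against the reference model; annotations are cached so reruns do not re-query the API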
1,957 stars. No commits in the last 6 months.
Stars: 1,957
Forks: 305
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Aug 09, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/tatsu-lab/alpaca_eval"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Related repositories
eth-sri/matharena
Evaluation of LLMs on latest math competitions
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
princeton-nlp/LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following