tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Quality score: 51 / 100 (Established)

Implements length-controlled win-rate scoring to mitigate output-length bias, achieving 0.98 Spearman correlation with Chatbot Arena while costing under $10 and running in under 3 minutes. Uses LLM-based pairwise comparisons (GPT-4 by default) against a reference model, validated against 20K human annotations, with built-in caching and output randomization. Also provides a toolkit for building custom evaluators with batching and multi-annotator support, plus curated evaluation datasets and leaderboards for benchmarking instruction-following models.
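The length-controlled idea can be illustrated with a minimal sketch: instead of reporting the raw fraction of pairwise wins, fit a logistic model of win probability on the length difference and report the predicted win probability at zero length difference. This is a simplified stand-in for AlpacaEval's actual regression (which conditions on more covariates); the data and helper below are hypothetical.

```python
import math

# Toy pairwise results: (model_won, length_diff), where length_diff is the
# model's output length minus the reference's, in tokens. Hypothetical data.
results = [(1, 120), (1, 80), (0, -40), (1, 200), (0, 10), (0, -100), (1, 150), (0, -60)]

# Naive win rate: fraction of comparisons the model won.
raw_win_rate = sum(won for won, _ in results) / len(results)

def fit_logistic(data, lr=0.1, steps=20000):
    """Fit P(win) = sigmoid(a + b * length_diff/100) by batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        ga = gb = 0.0
        for won, diff in data:
            x = diff / 100.0  # rescale so gradient steps are well-conditioned
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += p - won
            gb += (p - won) * x
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

a, b = fit_logistic(results)
# Length-controlled win rate: predicted win probability at zero length difference.
lc_win_rate = 1.0 / (1.0 + math.exp(-a))
print(f"raw={raw_win_rate:.2f} length-controlled={lc_win_rate:.2f} length_coef={b:.2f}")
```

In this toy sample the model's wins coincide with longer outputs, so the length coefficient is positive and the length-controlled win rate drops below the raw one, which is exactly the bias the method is designed to remove.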

1,957 stars. No commits in the last 6 months.

Flags: Stale (6 months) · No Package · No Dependents
Maintenance 2 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 23 / 25


Stars: 1,957
Forks: 305
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Aug 09, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/tatsu-lab/alpaca_eval"

Open to everyone: 100 requests/day with no key needed; a free key raises the limit to 1,000/day.
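The same request can be made from Python with the standard library; a minimal sketch, assuming the endpoint returns JSON. The `quality_url` helper is hypothetical and simply mirrors the path shown in the curl example; the fetch itself is left commented out because it needs network access.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1"

def quality_url(path: str) -> str:
    # Hypothetical helper: joins the base URL with the path from the curl example.
    return f"{BASE}/quality/{path}"

url = quality_url("transformers/tatsu-lab/alpaca_eval")

# Uncomment to fetch (network access required; the unauthenticated 100/day limit applies):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
```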