microsoft/benchmark-qed

Automated benchmarking of Retrieval-Augmented Generation (RAG) systems

Score: 52 / 100 (Established)

Comprises three interconnected LLM-powered components: AutoQ synthesizes queries spanning local to global data scopes, AutoE performs side-by-side answer evaluation with an LLM-as-a-judge on metrics such as relevance and comprehensiveness, and AutoD samples and summarizes datasets to provide consistent benchmarking inputs. Curated evaluation datasets (podcast transcripts and AP News articles) are included, enabling reproducible RAG testing at scale without manual ground-truth annotation.
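To make the AutoE step concrete, here is a conceptual Python sketch of side-by-side LLM-as-a-judge comparison. Everything in it (the `call_llm` stand-in, the prompt wording, the JSON reply format) is an illustrative assumption, not benchmark-qed's actual API.

import json

# Conceptual sketch of pairwise LLM-as-a-judge evaluation -- the pattern
# AutoE implements. None of this is benchmark-qed's real API.
JUDGE_PROMPT = (
    "You are comparing two answers to the same question.\n"
    "Question: {question}\n"
    "Answer A: {answer_a}\n"
    "Answer B: {answer_b}\n"
    'Judge only the criterion "{criterion}" and reply with JSON:\n'
    '{{"winner": "A" or "B" or "tie", "reason": "<one sentence>"}}'
)

def judge_pair(call_llm, question, answer_a, answer_b, criterion):
    # call_llm is a hypothetical stand-in: any function that takes a
    # prompt string and returns the judge model's text reply.
    prompt = JUDGE_PROMPT.format(
        question=question,
        answer_a=answer_a,
        answer_b=answer_b,
        criterion=criterion,
    )
    return json.loads(call_llm(prompt))

Running each pair in both orders helps offset position bias in judge models; aggregating win rates across many synthesized queries yields the side-by-side scores.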

No package · No dependents
Maintenance: 10 / 25
Adoption: 9 / 25
Maturity: 16 / 25
Community: 17 / 25

The overall score is the sum of the four subscores (each out of 25): 10 + 9 + 16 + 17 = 52.

Stars: 78
Forks: 14
Language: Python
License: MIT
Last pushed: Mar 04, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
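The same request from Python, as a minimal sketch using `requests`. The endpoint is the one in the curl line above; the response schema is not documented here, so the code prints the whole JSON payload rather than assuming field names.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

# Anonymous tier: 100 requests/day, no API key required.
resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# Inspect the full payload instead of guessing at field names.
print(resp.json())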