microsoft/benchmark-qed
Automated benchmarking of Retrieval-Augmented Generation (RAG) systems
Comprises three interconnected LLM-powered components: AutoQ synthesizes local-to-global queries across variable data scopes; AutoE performs side-by-side, LLM-as-a-judge evaluation of answers on metrics such as relevance and comprehensiveness; and AutoD samples and summarizes datasets to produce consistent benchmarking inputs. Also includes curated evaluation datasets (podcast transcripts and AP News articles) that enable reproducible RAG testing at scale without manual ground-truth annotation.
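As an illustration of the pairwise LLM-as-a-judge pattern that AutoE applies, the sketch below asks a judge model to compare two candidate answers on one criterion. The prompt wording, the judge_pair helper, and the model choice are illustrative assumptions, not benchmark-qed's actual API; see the repo for its real interfaces.

# Illustrative sketch of AutoE-style pairwise judging; not benchmark-qed's API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

On the criterion of {criterion}, which answer is better?
Reply with exactly one of: A, B, TIE."""

def judge_pair(question: str, answer_a: str, answer_b: str,
               criterion: str = "comprehensiveness") -> str:
    """Ask an LLM judge to pick the better answer on a single criterion."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model; assumption, not prescribed by the repo
        temperature=0,   # deterministic judging
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a,
            answer_b=answer_b, criterion=criterion)}],
    )
    return resp.choices[0].message.content.strip()

Judge pipelines of this kind typically repeat each comparison with the answer order swapped to offset position bias before aggregating verdicts.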
Stars: 78
Forks: 14
Language: Python
License: MIT
Category:
Last pushed: Mar 04, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
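For programmatic use, the same endpoint can be called from Python. The response schema is not documented here, so the returned JSON should be inspected before relying on specific fields; everything below other than the URL is an assumption.

# Minimal sketch of calling the endpoint above with the requests library.
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

resp = requests.get(URL, timeout=10)   # anonymous access: 100 requests/day
if resp.status_code == 429:            # standard HTTP rate-limit response
    raise SystemExit("Rate limit hit; retry later or use an API key")
resp.raise_for_status()
data = resp.json()                     # schema undocumented here; inspect before use
print(data)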
Related tools
HZYAI/RagScore
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or...
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced...
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
2501Pr0ject/RAGnarok-AI
Local-first RAG evaluation framework for LLM applications. 100% local, no API keys required.