microsoft/benchmark-qed
Automated benchmarking of Retrieval-Augmented Generation (RAG) systems
Comprises three interconnected LLM-powered components: AutoQ synthesizes local-to-global queries across variable data scopes; AutoE performs side-by-side, LLM-as-a-judge evaluation of answers on metrics such as relevance and comprehensiveness; and AutoD samples and summarizes datasets to produce consistent benchmarking inputs. Also includes curated evaluation datasets (podcast transcripts and AP News articles) that enable reproducible RAG testing at scale without manual ground-truth annotation.
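As an illustration of the pairwise LLM-as-a-judge pattern that AutoE applies, the sketch below asks a judge model to compare two candidate answers on one criterion. The prompt wording, the judge_pair helper, and the model choice are illustrative assumptions, not benchmark-qed's actual API; see the repo for its real interfaces.

# Illustrative sketch of AutoE-style pairwise judging; not benchmark-qed's API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

On the criterion of {criterion}, which answer is better?
Reply with exactly one of: A, B, TIE."""

def judge_pair(question: str, answer_a: str, answer_b: str,
               criterion: str = "comprehensiveness") -> str:
    """Ask an LLM judge to pick the better answer on a single criterion."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model; assumption, not prescribed by the repo
        temperature=0,   # deterministic judging
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a,
            answer_b=answer_b, criterion=criterion)}],
    )
    return resp.choices[0].message.content.strip()

Judge pipelines of this kind typically repeat each comparison with the answer order swapped to offset position bias before aggregating verdicts.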
Stars: 78
Forks: 14
Language: Python
License: MIT
Category:
Last pushed: Mar 04, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
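For programmatic use, the same endpoint can be called from Python. The response schema is not documented here, so the returned JSON should be inspected before relying on specific fields; everything below other than the URL is an assumption.

# Minimal sketch of calling the endpoint above with the requests library.
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/microsoft/benchmark-qed"

resp = requests.get(URL, timeout=10)   # anonymous access: 100 requests/day
if resp.status_code == 429:            # standard HTTP rate-limit response
    raise SystemExit("Rate limit hit; retry later or use an API key")
resp.raise_for_status()
data = resp.json()                     # schema undocumented here; inspect before use
print(data)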
Related tools
HZYAI/RagScore
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or...
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced...
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
2501Pr0ject/RAGnarok-AI
Local-first RAG evaluation framework for LLM applications. 100% local, no API keys required.