IAAR-Shanghai/CRUD_RAG
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
Evaluates RAG systems across four distinct tasks (Create, Read, Update, Delete) using 80,000+ Chinese news documents as a retrieval corpus and the Milvus vector database for indexing. Implements multiple evaluation metrics, including BLEU, ROUGE, BERTScore, and RAGQuestEval (which leverages GPT for question generation and answering). Supports flexible LLM integration through modular APIs for GPT models, locally deployed instances, and remote endpoints, with configurable retrieval parameters and prompt templates optimized for different model scales.
362 stars. No commits in the last 6 months.
Stars: 362
Forks: 28
Language: Python
License: —
Category:
Last pushed: May 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/IAAR-Shanghai/CRUD_RAG"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
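For programmatic access, the curl command above can be reproduced in Python. Only the URL pattern (`/api/v1/quality/rag/{owner}/{repo}`) is taken from this page; the shape of the JSON response body is an assumption, so the sketch below just parses whatever JSON comes back rather than relying on specific field names.

```python
import json
import urllib.request

# Base endpoint taken from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repository.

    Assumes the endpoint returns a JSON object; the exact fields
    are not documented here, so the caller should inspect the dict.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Same repository as the curl example above.
    print(quality_url("IAAR-Shanghai", "CRUD_RAG"))
```

Note that unauthenticated calls are rate-limited to 100 requests/day, so cache responses locally if you poll many repositories.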
Higher-rated alternatives
HZYAI/RagScore
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or...
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced...
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
microsoft/benchmark-qed
Automated benchmarking of Retrieval-Augmented Generation (RAG) systems