IAAR-Shanghai/CRUD_RAG
CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
Evaluates RAG systems across four distinct tasks (Create, Read, Update, Delete) using 80,000+ Chinese news documents as a retrieval corpus and the Milvus vector database for indexing. Implements multiple evaluation metrics, including BLEU, ROUGE, BERTScore, and RAGQuestEval (which leverages GPT for question generation and answering). Supports flexible LLM integration through modular APIs for GPT models, locally deployed instances, and remote endpoints, with configurable retrieval parameters and prompt templates optimized for different model scales.
362 stars. No commits in the last 6 months.
Stars: 362
Forks: 28
Language: Python
License: —
Category:
Last pushed: May 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/IAAR-Shanghai/CRUD_RAG"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
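For programmatic access, the curl command above can be reproduced in Python. Only the URL pattern (`/api/v1/quality/rag/{owner}/{repo}`) is taken from this page; the shape of the JSON response body is an assumption, so the sketch below just parses whatever JSON comes back rather than relying on specific field names.

```python
import json
import urllib.request

# Base endpoint taken from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repository.

    Assumes the endpoint returns a JSON object; the exact fields
    are not documented here, so the caller should inspect the dict.
    """
    with urllib.request.urlopen(quality_url(owner, repo), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Same repository as the curl example above.
    print(quality_url("IAAR-Shanghai", "CRUD_RAG"))
```

Note that unauthenticated calls are rate-limited to 100 requests/day, so cache responses locally if you poll many repositories.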
Higher-rated alternatives
HZYAI/RagScore
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or...
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
DocAILab/XRAG
XRAG: eXamining the Core - Benchmarking Foundational Component Modules in Advanced...
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
microsoft/benchmark-qed
Automated benchmarking of Retrieval-Augmented Generation (RAG) systems