RAG Evaluation Frameworks RAG Tools

Tools and benchmarks for assessing RAG system performance across metrics like retrieval quality, generation accuracy, and end-to-end pipeline evaluation. Does NOT include RAG implementations themselves, embedding model comparisons, or domain-specific applications.

82 RAG evaluation framework tools are tracked. 3 score 50 or above (Established tier). The highest-rated is HZYAI/RagScore at 59/100, with 30 stars and 1,052 monthly downloads.

Get all 82 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-frameworks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
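The same request can be made from Python. This is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented on this page, so the code only decodes it. The curl example passes `limit=20`; whether a larger value returns all 82 projects in one call is not stated here.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="rag-evaluation-frameworks", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    The response schema is an unknown here, so the decoded object is
    returned as-is without assuming any field names.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects(build_url())` performs the same request as the curl command; each call counts against the daily request allowance described above.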

#	Tool	Score	Tier	Stars	Language
1	HZYAI/RagScore ⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in...	59	Established	30	Python
2	vectara/open-rag-eval RAG evaluation without the need for "golden answers"	52	Established	347	Python
3	2501Pr0ject/RAGnarok-AI Local-first RAG evaluation framework for LLM applications. 100% local, no...	50	Established	13	Python
4	DocAILab/XRAG XRAG: eXamining the Core - Benchmarking Foundational Component Modules in...	49	Emerging	120	Python
5	AIAnytime/rag-evaluator A library for evaluating Retrieval-Augmented Generation (RAG) systems (The...	49	Emerging	42	Python
6	microsoft/benchmark-qed Automated benchmarking of Retrieval-Augmented Generation (RAG) systems	45	Emerging	78	Python
7	nuclia/nuclia-eval Library for evaluating RAG using Nuclia's models	39	Emerging	18	Python
8	syy12335/rag-eval-scaffold Lightweight, decoupled RAG evaluation scaffold (dataset → vector store → RAG...	36	Emerging	17	Python
9	TonicAI/tonic_validate Metrics to evaluate the quality of responses of your Retrieval Augmented...	36	Emerging	324	Python
10	avnlp/rag-pipelines Advanced RAG Pipelines and Evaluation	34	Emerging	10	Python
11	AQ-MedAI/PRGB [AAAI 2026]RAG, Benchmark, robust RAG generation	33	Emerging	34	Python
12	vectara/mirage-bench Repository for Multilingual Generation, RAG evaluations, and surrogate...	32	Emerging	10	Python
13	SciPhi-AI/RAG-Performance Measuring RAG solutions throughput and latency	31	Emerging	19	Python
14	AQ-MedAI/RagQALeaderboard RAG-QA Leaderboard	30	Emerging	25	Python
15	christopherkormpos/ragret Lightweight evaluation framework for Retrieval Augmented Generation systems,...	29	Experimental	3	Python
16	gomate-community/rageval Evaluation tools for Retrieval-augmented Generation (RAG) methods.	29	Experimental	170	Python
17	RulinShao/RAG-evaluation-harnesses An evaluation suite for Retrieval-Augmented Generation (RAG).	28	Experimental	23	Python
18	RUC-NLPIR/OmniEval Open source code of the paper: "OmniEval: An Omnidirectional and Automatic...	28	Experimental	82	Python
19	sitta07/RAGScope A lightweight observability tool for visualizing and comparing RAG retrieval...	28	Experimental	2	Python
20	IAAR-Shanghai/CRUD_RAG CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented...	27	Experimental	362	Python
21	RagView/RagView We believe that every SOTA result is only valid on its own dataset. RAGView...	26	Experimental	79	—
22	TonicAI/tvallogging A tool for evaluating and tracking your RAG experiments. This repo contains...	26	Experimental	8	Python
23	GURPREETKAURJETHRA/RAG-Evaluator A library for evaluating Retrieval-Augmented Generation (RAG) systems	26	Experimental	4	Python
24	antgroup/ravig-bench Official implementation of "RAViG-Bench: A Benchmark for Retrieval-Augmented...	24	Experimental	10	Python
25	chu2bard/ragcraft End-to-end RAG pipeline with built-in evaluation metrics	24	Experimental	11	Python
26	Abanoubr/rag-eval-toolkit Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for...	23	Experimental	5	Python
27	gomate-community/rag-bench RAG-Bench is to summarize all datasets used to evaluate RAG, from document...	23	Experimental	2	—
28	Ziqing110/rag-evidence-attack-lab Scientific QA robustness evaluation pipeline for evidence-missing RAG...	23	Experimental	1	Python
29	Aamirofficiall/rag-playbook Stop guessing which RAG pattern to use. Compare all 8 patterns with real...	23	Experimental	1	Python
30	rodolfboctor/rag-eval-toolkit Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for...	23	Experimental	5	Python
31	Sabyasachig/ragtrace DevTools for RAG pipelines	23	Experimental	1	Python
32	AKIVA-AI/toolkit-rag-quality Deterministic RAG evaluation toolkit -- retrieval metrics (recall,...	23	Experimental	1	Python
33	EmmanuelleB985/mmeval-vrag Evaluation Framework for Multimodal RAG Systems	22	Experimental	—	Python
34	Miro96/nova-rag-benchmark Benchmark for Code RAG MCP Servers — measure how well RAG helps AI find the...	22	Experimental	—	Python
35	OpenSymbolicAI/benchmark-py-MultiHopRAG MultiHop-RAG Benchmark using GoalSeeking pattern from opensymbolicai-core	22	Experimental	—	Python
36	wigtn/wigtnOCR-v1 A research framework to evaluate how document parsing...	22	Experimental	—	—
37	nblomerus/rag-bench RAG system for asking questions about AI/ML research papers	22	Experimental	—	Python
38	dbhavery/ragtest RAG evaluation suite — benchmark retrieval accuracy, generation quality, and...	22	Experimental	—	Python
39	srivsr/evalkit QA-grade RAG evaluation framework diagnosing retrieval, grounding,...	22	Experimental	—	Python
40	utkuakbay/RAG_Benchmark Benchmark LLMs for your RAG system - Compare Gemini, GPT, Claude & local...	22	Experimental	4	Python
41	sunilp/enterprise-rag-bench Production RAG patterns for enterprise: chunking strategies, retrieval...	22	Experimental	—	Python
42	berangerthomas/ForzaEmbed A Python framework for text embedding model evaluation and comparison	22	Experimental	—	HTML
43	itamaker/ragcheck Score retrieval runs with Precision@k, Recall@k, HitRate@k, and MRR@k.	22	Experimental	—	Go
44	amazon-science/MEMERAG MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval...	22	Experimental	4	Python
45	amazon-science/GaRAGe [ACL 2025] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation.	22	Experimental	12	—
46	rajantripathi/soas-rag-evaluation Bilingual retrieval benchmark for culturally grounded QA in English and Uzbek	22	Experimental	—	Python
47	Monke1/ragcraft 📚 Build and evaluate RAG pipelines to ingest, embed, retrieve, and answer...	22	Experimental	—	Python
48	amitk741/RAGnarok-AI 🛠️ Evaluate and benchmark your RAG pipelines locally with RAGnarok-AI—no API...	22	Experimental	—	Python
49	tarekmasryo/rag-qa-logs-corpus-data Synthetic multi-table RAG QA telemetry benchmark...	21	Experimental	2	Python
50	clouatre-labs/rag-reranking-benchmarks Supplementary benchmarks for Making Legacy Knowledge Searchable with RAG	20	Experimental	1	Python
51	hari-sherith/bayesian-rag-uncertainty RAG system with Bayesian uncertainty quantification using Beta priors and...	20	Experimental	1	Jupyter Notebook
52	foreai-co/fore The fore client package	20	Experimental	13	Python
53	oztrkoguz/RAG-Framework-Evaluation This project aims to compare different Retrieval-Augmented Generation (RAG)...	20	Experimental	14	Python
54	infrixo-systems/rag-evaluation-starter Minimal Python script to evaluate your RAG pipeline against a golden set. No...	19	Experimental	—	Python
55	anita-builds/aurora-rag-evaluation Policy-grounded assistant notes: RAG and evaluation approach	19	Experimental	—	—
56	SURESHBEEKHANI/LLMops-beginner-to-advanced RAG evaluation suite for AI Engineering Report	19	Experimental	—	Jupyter Notebook
57	antdragiotis/rag-evaluation-framework-II An evaluation example for Retrieval-Augmented Generation (RAG) that provides...	19	Experimental	—	Jupyter Notebook
58	ALucek/custom-rag-evals Applying domain specific evaluations to RAG chunking and embedding functions	19	Experimental	18	Jupyter Notebook
59	Edouard-Legoupil/rag_extraction A tutorial on how to build Summary Brief from Evaluation Report - Offline+Open Source	18	Experimental	5	HTML
60	ssisOneTeam/Korean-Embedding-Model-Performance-Benchmark-for-Retriever Korean Sentence Embedding Model Performance Benchmark for RAG	16	Experimental	50	Jupyter Notebook
61	Eustema-S-p-A/SCARF SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular...	15	Experimental	7	Python
62	fkapsahili/EntRAG EntRAG - Enterprise RAG Benchmark	15	Experimental	5	Python
63	nidhip1611/GroundedGeo A Benchmark for Citation-Grounded Geographic QA	15	Experimental	—	TeX
64	daniel-e-alarcon/rag-explorer Local-first RAG application with retrieval evaluation (hit@k, MRR) and...	15	Experimental	—	Python
65	yashk1103/Enhanced-Multi-Turn-RAG-Benchmark-Framework Comprehensive benchmarking framework for evaluating 13+ embedding models on...	15	Experimental	—	Python
66	shaadclt/EvalRAG A comprehensive evaluation toolkit for assessing Retrieval-Augmented...	14	Experimental	4	Python
67	iom/evaluation_knowledge A module to turn Evaluation Reports into AI knowledge	14	Experimental	—	HTML
68	rubsj/ai-rag-evaluation-framework RAG pipeline evaluation framework with RAGAS metrics and statistical bias correction	14	Experimental	—	Python
69	Hyeongseob91/research-vlm-based-document-parsing A research framework to evaluate how document parsing...	14	Experimental	—	Python
70	NamaWho/pyterrier-nuggetizer Nuggetizer: A PyTerrier Open-Source Framework for Evaluating...	13	Experimental	2	Python
71	tsdata/ranx-k Korean-optimized RAG evaluation toolkit with Kiwi tokenizer, ROUGE metrics, ...	13	Experimental	2	Python
72	c21051997/ragscope 🏆 An open-source library for the comprehensive, end-to-end evaluation of RAG...	13	Experimental	2	Python
73	ash-hun/BERGEN-UP E2E Evaluation Pipeline for ONLY RAG. Benchmark to BERGEN from NAVER Labs...	12	Experimental	1	Python
74	sumit9000/Deep-Evaluation_Rag The Deep Evaluation notebook helps you understand how well your machine...	11	Experimental	—	Jupyter Notebook
75	chandana999/retrieval-evaluation-api RAG retrieval evaluation tool with RAGAS. Compare 6 retriever strategies...	11	Experimental	—	Jupyter Notebook
76	beingdutta/Self-Refining-Lecture-RAG-For-Educational-Videos Lecture-RAG is a grounding-aware Video-RAG framework that reduces...	11	Experimental	—	Jupyter Notebook
77	labofone/rag-eval Reference-free evaluation of Retrieval-Augmented Generation (RAG) pipelines.	11	Experimental	—	Python
78	JhaAyush01/SEMALEX A comprehensive RAG Evaluation Metric designed to measure the weighted...	11	Experimental	2	Python
79	hideyuki001/research-rag-instruction-pack Research & Education oriented LangChain RAG framework (5P Principles + EUQS...	11	Experimental	—	Python
80	alp-oz/rag-metrics RAG-Metrics: A modular framework for evaluating Retrieval-Augmented...	11	Experimental	—	Python
81	Mizokuiam/rag-eval-kit A lightweight, modular Python toolkit for evaluating and benchmarking...	11	Experimental	2	Python
82	i-partalas/industrial-rag-qna-benchmark Benchmarking the performance of proprietary vs open-source LLMs in...	10	Experimental	1	Python
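For readers who copy the table into a script, the rows can be split on tabs. The sketch below is illustrative only: `parse_row` and `tier_for_score` are hypothetical helpers, and the tier cut-offs (50 and 30) are an assumption inferred from the rows above (50 is listed as Established, 49 and 30 as Emerging, 29 as Experimental), not a published scoring rule.

```python
def tier_for_score(score):
    """Map a score to a tier name.

    Assumption: the thresholds are inferred from the table rows,
    not taken from any documented rule.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


def parse_row(line):
    """Split one tab-separated table row into a dict.

    The Tool column holds "owner/repo description"; the first
    space separates the repository name from its description.
    A "—" in Stars or Language means no value is listed.
    """
    rank, tool, score, tier, stars, language = line.rstrip("\n").split("\t")
    name, _, description = tool.partition(" ")
    return {
        "rank": int(rank),
        "name": name,
        "description": description,
        "score": int(score),
        "tier": tier,
        "stars": None if stars == "—" else int(stars.replace(",", "")),
        "language": None if language == "—" else language,
    }
```

A row whose listed tier differs from `tier_for_score(row["score"])` would indicate that the inferred thresholds are wrong.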

Comparisons in this category

open-rag-eval and rageval (52 vs 29)
open-rag-eval and rag-evaluator (52 vs 49)
open-rag-eval and RAG-evaluation-harnesses (52 vs 28)
rag-evaluator and rageval (49 vs 29)
rageval and RAG-evaluation-harnesses (29 vs 28)
rag-evaluator and RAG-evaluation-harnesses (49 vs 28)