RAG Evaluation Frameworks RAG Tools

Tools and benchmarks for assessing RAG system performance across metrics like retrieval quality, generation accuracy, and end-to-end pipeline evaluation. Does NOT include RAG implementations themselves, embedding model comparisons, or domain-specific applications.

There are 82 RAG evaluation framework tools tracked. 3 score 50 or above (Established tier). The highest-rated is HZYAI/RagScore at 59/100 with 30 stars and 1,052 monthly downloads.

Get all 82 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-frameworks&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
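The same request can be made from Python. The following is a minimal sketch using only the standard library; the endpoint and query parameters are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The response schema is not documented here, so the parsed JSON is returned unchanged. Whether `limit` accepts values above the 20 shown in the sample request is an assumption that has not been verified.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example; everything else is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL with the same parameters as the curl example."""
    params = {
        "domain": "rag",
        "subcategory": "rag-evaluation-frameworks",
        "limit": limit,
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_projects(limit=20, timeout=30):
    """Perform the GET request and return the decoded JSON payload as-is.

    No API key is sent, so the call counts against the keyless daily quota.
    """
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Printing the URL does not touch the network.
print(build_url())
```

Calling `fetch_projects()` issues the actual request. Since the keyless quota is 100 requests/day, it is worth caching the result locally rather than re-fetching on every run.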

# Tool Score Tier
1 HZYAI/RagScore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in...

59
Established
2 vectara/open-rag-eval

RAG evaluation without the need for "golden answers"

52
Established
3 2501Pr0ject/RAGnarok-AI

Local-first RAG evaluation framework for LLM applications. 100% local, no...

50
Established
4 DocAILab/XRAG

XRAG: eXamining the Core - Benchmarking Foundational Component Modules in...

49
Emerging
5 AIAnytime/rag-evaluator

A library for evaluating Retrieval-Augmented Generation (RAG) systems (The...

49
Emerging
6 microsoft/benchmark-qed

Automated benchmarking of Retrieval-Augmented Generation (RAG) systems

45
Emerging
7 nuclia/nuclia-eval

Library for evaluating RAG using Nuclia's models

39
Emerging
8 syy12335/rag-eval-scaffold

Lightweight, decoupled RAG evaluation scaffold (dataset → vector store → RAG...

36
Emerging
9 TonicAI/tonic_validate

Metrics to evaluate the quality of responses of your Retrieval Augmented...

36
Emerging
10 avnlp/rag-pipelines

Advanced RAG Pipelines and Evaluation

34
Emerging
11 AQ-MedAI/PRGB

[AAAI 2026] RAG, Benchmark, robust RAG generation

33
Emerging
12 vectara/mirage-bench

Repository for Multilingual Generation, RAG evaluations, and surrogate...

32
Emerging
13 SciPhi-AI/RAG-Performance

Measuring RAG solutions throughput and latency

31
Emerging
14 AQ-MedAI/RagQALeaderboard

RAG-QA Leaderboard

30
Emerging
15 christopherkormpos/ragret

Lightweight evaluation framework for Retrieval Augmented Generation systems,...

29
Experimental
16 gomate-community/rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.

29
Experimental
17 RulinShao/RAG-evaluation-harnesses

An evaluation suite for Retrieval-Augmented Generation (RAG).

28
Experimental
18 RUC-NLPIR/OmniEval

Open source code of the paper: "OmniEval: An Omnidirectional and Automatic...

28
Experimental
19 sitta07/RAGScope

A lightweight observability tool for visualizing and comparing RAG retrieval...

28
Experimental
20 IAAR-Shanghai/CRUD_RAG

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented...

27
Experimental
21 RagView/RagView

We believe that every SOTA result is only valid on its own dataset. RAGView...

26
Experimental
22 TonicAI/tvallogging

A tool for evaluating and tracking your RAG experiments. This repo contains...

26
Experimental
23 GURPREETKAURJETHRA/RAG-Evaluator

A library for evaluating Retrieval-Augmented Generation (RAG) systems

26
Experimental
24 antgroup/ravig-bench

Official implementation of "RAViG-Bench: A Benchmark for Retrieval-Augmented...

24
Experimental
25 chu2bard/ragcraft

End-to-end RAG pipeline with built-in evaluation metrics

24
Experimental
26 Abanoubr/rag-eval-toolkit

Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for...

23
Experimental
27 gomate-community/rag-bench

RAG-Bench is to summarize all datasets used to evaluate RAG, from document...

23
Experimental
28 Ziqing110/rag-evidence-attack-lab

Scientific QA robustness evaluation pipeline for evidence-missing RAG...

23
Experimental
29 Aamirofficiall/rag-playbook

Stop guessing which RAG pattern to use. Compare all 8 patterns with real...

23
Experimental
30 rodolfboctor/rag-eval-toolkit

Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for...

23
Experimental
31 Sabyasachig/ragtrace

DevTools for RAG pipelines

23
Experimental
32 AKIVA-AI/toolkit-rag-quality

Deterministic RAG evaluation toolkit -- retrieval metrics (recall,...

23
Experimental
33 EmmanuelleB985/mmeval-vrag

Evaluation Framework for Multimodal RAG Systems

22
Experimental
34 Miro96/nova-rag-benchmark

Benchmark for Code RAG MCP Servers — measure how well RAG helps AI find the...

22
Experimental
35 OpenSymbolicAI/benchmark-py-MultiHopRAG

MultiHop-RAG Benchmark using GoalSeeking pattern from opensymbolicai-core

22
Experimental
36 wigtn/wigtnOCR-v1

A research framework to evaluate how document parsing...

22
Experimental
37 nblomerus/rag-bench

RAG system for asking questions about AI/ML research papers

22
Experimental
38 dbhavery/ragtest

RAG evaluation suite — benchmark retrieval accuracy, generation quality, and...

22
Experimental
39 srivsr/evalkit

QA-grade RAG evaluation framework diagnosing retrieval, grounding,...

22
Experimental
40 utkuakbay/RAG_Benchmark

Benchmark LLMs for your RAG system - Compare Gemini, GPT, Claude & local...

22
Experimental
41 sunilp/enterprise-rag-bench

Production RAG patterns for enterprise: chunking strategies, retrieval...

22
Experimental
42 berangerthomas/ForzaEmbed

A Python framework for text embedding model evaluation and comparison

22
Experimental
43 itamaker/ragcheck

Score retrieval runs with Precision@k, Recall@k, HitRate@k, and MRR@k.

22
Experimental
44 amazon-science/MEMERAG

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval...

22
Experimental
45 amazon-science/GaRAGe

[ACL 2025] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation.

22
Experimental
46 rajantripathi/soas-rag-evaluation

Bilingual retrieval benchmark for culturally grounded QA in English and Uzbek

22
Experimental
47 Monke1/ragcraft

📚 Build and evaluate RAG pipelines to ingest, embed, retrieve, and answer...

22
Experimental
48 amitk741/RAGnarok-AI

🛠️ Evaluate and benchmark your RAG pipelines locally with RAGnarok-AI—no API...

22
Experimental
49 tarekmasryo/rag-qa-logs-corpus-data

Synthetic multi-table RAG QA telemetry benchmark...

21
Experimental
50 clouatre-labs/rag-reranking-benchmarks

Supplementary benchmarks for Making Legacy Knowledge Searchable with RAG

20
Experimental
51 hari-sherith/bayesian-rag-uncertainty

RAG system with Bayesian uncertainty quantification using Beta priors and...

20
Experimental
52 foreai-co/fore

The fore client package

20
Experimental
53 oztrkoguz/RAG-Framework-Evaluation

This project aims to compare different Retrieval-Augmented Generation (RAG)...

20
Experimental
54 infrixo-systems/rag-evaluation-starter

Minimal Python script to evaluate your RAG pipeline against a golden set. No...

19
Experimental
55 anita-builds/aurora-rag-evaluation

Policy-grounded assistant notes: RAG and evaluation approach

19
Experimental
56 SURESHBEEKHANI/LLMops-beginner-to-advanced

Short description: RAG evaluation suite for AI Engineering Report

19
Experimental
57 antdragiotis/rag-evaluation-framework-II

An evaluation example for Retrieval-Augmented Generation (RAG) that provides...

19
Experimental
58 ALucek/custom-rag-evals

Applying domain specific evaluations to RAG chunking and embedding functions

19
Experimental
59 Edouard-Legoupil/rag_extraction

A tutorial on how to build Summary Brief from Evaluation Report - Offline+Open Source

18
Experimental
60 ssisOneTeam/Korean-Embedding-Model-Performance-Benchmark-for-Retriever

Korean Sentence Embedding Model Performance Benchmark for RAG

16
Experimental
61 Eustema-S-p-A/SCARF

SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular...

15
Experimental
62 fkapsahili/EntRAG

EntRAG - Enterprise RAG Benchmark

15
Experimental
63 nidhip1611/GroundedGeo

A Benchmark for Citation-Grounded Geographic QA

15
Experimental
64 daniel-e-alarcon/rag-explorer

Local-first RAG application with retrieval evaluation (hit@k, MRR) and...

15
Experimental
65 yashk1103/Enhanced-Multi-Turn-RAG-Benchmark-Framework

Comprehensive benchmarking framework for evaluating 13+ embedding models on...

15
Experimental
66 shaadclt/EvalRAG

A comprehensive evaluation toolkit for assessing Retrieval-Augmented...

14
Experimental
67 iom/evaluation_knowledge

A module to turn Evaluation Reports into AI knowledge

14
Experimental
68 rubsj/ai-rag-evaluation-framework

RAG pipeline evaluation framework with RAGAS metrics and statistical bias correction

14
Experimental
69 Hyeongseob91/research-vlm-based-document-parsing

A research framework to evaluate how document parsing...

14
Experimental
70 NamaWho/pyterrier-nuggetizer

Nuggetizer: A PyTerrier Open-Source Framework for Evaluating...

13
Experimental
71 tsdata/ranx-k

Korean-optimized RAG evaluation toolkit with Kiwi tokenizer, ROUGE metrics, ...

13
Experimental
72 c21051997/ragscope

🏆 An open-source library for the comprehensive, end-to-end evaluation of RAG...

13
Experimental
73 ash-hun/BERGEN-UP

E2E Evaluation Pipeline for ONLY RAG. Benchmark to BERGEN from NAVER Labs...

12
Experimental
74 sumit9000/Deep-Evaluation_Rag

The Deep Evaluation notebook helps you understand how well your machine...

11
Experimental
75 chandana999/retrieval-evaluation-api

RAG retrieval evaluation tool with RAGAS. Compare 6 retriever strategies...

11
Experimental
76 beingdutta/Self-Refining-Lecture-RAG-For-Educational-Videos

Lecture-RAG is a grounding-aware Video-RAG framework that reduces...

11
Experimental
77 labofone/rag-eval

Reference-free evaluation of Retrieval-Augmented Generation (RAG) pipelines.

11
Experimental
78 JhaAyush01/SEMALEX

A comprehensive RAG Evaluation Metric designed to measure the weighted...

11
Experimental
79 hideyuki001/research-rag-instruction-pack

Research & Education oriented LangChain RAG framework (5P Principles + EUQS...

11
Experimental
80 alp-oz/rag-metrics

RAG-Metrics: A modular framework for evaluating Retrieval-Augmented...

11
Experimental
81 Mizokuiam/rag-eval-kit

A lightweight, modular Python toolkit for evaluating and benchmarking...

11
Experimental
82 i-partalas/industrial-rag-qna-benchmark

Benchmarking the performance of proprietary vs open-source LLMs in...

10
Experimental
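The tier labels in the table line up with score bands. The sketch below reproduces that mapping, assuming cutoffs inferred from the rows above (50 is listed as Established, 49 and 30 as Emerging, 29 as Experimental). The site's actual scoring rule is not stated, so the thresholds and the function name `tier_for` are assumptions.

```python
def tier_for(score):
    """Map a 0-100 score to a tier label.

    Cutoffs are inferred from the table rows, not from a published rule.
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# A few (tool, score) pairs copied from the table.
sample = [
    ("HZYAI/RagScore", 59),
    ("2501Pr0ject/RAGnarok-AI", 50),
    ("DocAILab/XRAG", 49),
    ("AQ-MedAI/RagQALeaderboard", 30),
    ("christopherkormpos/ragret", 29),
]

for tool, score in sample:
    print(f"{tool}: {score} -> {tier_for(score)}")
```

Each printed label matches the tier shown for that tool in the table, which is the only check available for the inferred cutoffs.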