RAG Evaluation Frameworks (RAG Tools)
Tools and benchmarks for assessing RAG system performance across metrics like retrieval quality, generation accuracy, and end-to-end pipeline evaluation. Does NOT include RAG implementations themselves, embedding model comparisons, or domain-specific applications.
82 RAG evaluation framework tools are tracked. Three score above 50 (Established tier). The highest-rated is HZYAI/RagScore at 59/100, with 30 stars and 1,052 monthly downloads.
Get all 82 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-frameworks&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
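The same request can be sketched in Python using only the standard library. The base URL and the `domain`, `subcategory`, and `limit` parameters are copied from the curl example above; the helper names `build_url` and `fetch_projects` are illustrative, and the shape of the JSON response is not documented here, so the sketch returns it unparsed beyond `json.load`. Whether the endpoint accepts a larger `limit` or paginates is also not stated, so the default stays at 20.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example; nothing else about the API is assumed.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="rag-evaluation-frameworks", limit=20):
    """Build the query URL with the same parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10.0):
    """Send a GET request and decode the JSON body.

    The response schema is not documented on this page, so the decoded
    object is returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, so it is left commented out):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the daily request quota.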
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | HZYAI/RagScore | ⚡️ The "1-Minute RAG Audit" - Generate QA datasets & evaluate RAG systems in... | 59 | Established |
| 2 | vectara/open-rag-eval | RAG evaluation without the need for "golden answers" | | Established |
| 3 | 2501Pr0ject/RAGnarok-AI | Local-first RAG evaluation framework for LLM applications. 100% local, no... | | Established |
| 4 | DocAILab/XRAG | XRAG: eXamining the Core - Benchmarking Foundational Component Modules in... | | Emerging |
| 5 | AIAnytime/rag-evaluator | A library for evaluating Retrieval-Augmented Generation (RAG) systems (The... | | Emerging |
| 6 | microsoft/benchmark-qed | Automated benchmarking of Retrieval-Augmented Generation (RAG) systems | | Emerging |
| 7 | nuclia/nuclia-eval | Library for evaluating RAG using Nuclia's models | | Emerging |
| 8 | syy12335/rag-eval-scaffold | Lightweight, decoupled RAG evaluation scaffold (dataset → vector store → RAG... | | Emerging |
| 9 | TonicAI/tonic_validate | Metrics to evaluate the quality of responses of your Retrieval Augmented... | | Emerging |
| 10 | avnlp/rag-pipelines | Advanced RAG Pipelines and Evaluation | | Emerging |
| 11 | AQ-MedAI/PRGB | [AAAI 2026] RAG, Benchmark, robust RAG generation | | Emerging |
| 12 | vectara/mirage-bench | Repository for Multilingual Generation, RAG evaluations, and surrogate... | | Emerging |
| 13 | SciPhi-AI/RAG-Performance | Measuring RAG solutions throughput and latency | | Emerging |
| 14 | AQ-MedAI/RagQALeaderboard | RAG-QA Leaderboard | | Emerging |
| 15 | christopherkormpos/ragret | Lightweight evaluation framework for Retrieval Augmented Generation systems,... | | Experimental |
| 16 | gomate-community/rageval | Evaluation tools for Retrieval-augmented Generation (RAG) methods. | | Experimental |
| 17 | RulinShao/RAG-evaluation-harnesses | An evaluation suite for Retrieval-Augmented Generation (RAG). | | Experimental |
| 18 | RUC-NLPIR/OmniEval | Open source code of the paper: "OmniEval: An Omnidirectional and Automatic... | | Experimental |
| 19 | sitta07/RAGScope | A lightweight observability tool for visualizing and comparing RAG retrieval... | | Experimental |
| 20 | IAAR-Shanghai/CRUD_RAG | CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented... | | Experimental |
| 21 | RagView/RagView | We believe that every SOTA result is only valid on its own dataset. RAGView... | | Experimental |
| 22 | TonicAI/tvallogging | A tool for evaluating and tracking your RAG experiments. This repo contains... | | Experimental |
| 23 | GURPREETKAURJETHRA/RAG-Evaluator | A library for evaluating Retrieval-Augmented Generation (RAG) systems | | Experimental |
| 24 | antgroup/ravig-bench | Official implementation of "RAViG-Bench: A Benchmark for Retrieval-Augmented... | | Experimental |
| 25 | chu2bard/ragcraft | End-to-end RAG pipeline with built-in evaluation metrics | | Experimental |
| 26 | Abanoubr/rag-eval-toolkit | Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for... | | Experimental |
| 27 | gomate-community/rag-bench | RAG-Bench is to summarize all datasets used to evaluate RAG, from document... | | Experimental |
| 28 | Ziqing110/rag-evidence-attack-lab | Scientific QA robustness evaluation pipeline for evidence-missing RAG... | | Experimental |
| 29 | Aamirofficiall/rag-playbook | Stop guessing which RAG pattern to use. Compare all 8 patterns with real... | | Experimental |
| 30 | rodolfboctor/rag-eval-toolkit | Open-source Python toolkit for evaluating RAG pipelines. LLM-as-judge for... | | Experimental |
| 31 | Sabyasachig/ragtrace | DevTools for RAG pipelines | | Experimental |
| 32 | AKIVA-AI/toolkit-rag-quality | Deterministic RAG evaluation toolkit -- retrieval metrics (recall,... | | Experimental |
| 33 | EmmanuelleB985/mmeval-vrag | Evaluation Framework for Multimodal RAG Systems | | Experimental |
| 34 | Miro96/nova-rag-benchmark | Benchmark for Code RAG MCP Servers - measure how well RAG helps AI find the... | | Experimental |
| 35 | OpenSymbolicAI/benchmark-py-MultiHopRAG | MultiHop-RAG Benchmark using GoalSeeking pattern from opensymbolicai-core | | Experimental |
| 36 | wigtn/wigtnOCR-v1 | A research framework to evaluate how document parsing... | | Experimental |
| 37 | nblomerus/rag-bench | RAG system for asking questions about AI/ML research papers | | Experimental |
| 38 | dbhavery/ragtest | RAG evaluation suite - benchmark retrieval accuracy, generation quality, and... | | Experimental |
| 39 | srivsr/evalkit | QA-grade RAG evaluation framework diagnosing retrieval, grounding,... | | Experimental |
| 40 | utkuakbay/RAG_Benchmark | Benchmark LLMs for your RAG system - Compare Gemini, GPT, Claude & local... | | Experimental |
| 41 | sunilp/enterprise-rag-bench | Production RAG patterns for enterprise: chunking strategies, retrieval... | | Experimental |
| 42 | berangerthomas/ForzaEmbed | A Python framework for text embedding model evaluation and comparison | | Experimental |
| 43 | itamaker/ragcheck | Score retrieval runs with Precision@k, Recall@k, HitRate@k, and MRR@k. | | Experimental |
| 44 | amazon-science/MEMERAG | MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval... | | Experimental |
| 45 | amazon-science/GaRAGe | [ACL 2025] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation. | | Experimental |
| 46 | rajantripathi/soas-rag-evaluation | Bilingual retrieval benchmark for culturally grounded QA in English and Uzbek | | Experimental |
| 47 | Monke1/ragcraft | 📚 Build and evaluate RAG pipelines to ingest, embed, retrieve, and answer... | | Experimental |
| 48 | amitk741/RAGnarok-AI | 🛠️ Evaluate and benchmark your RAG pipelines locally with RAGnarok-AI - no API... | | Experimental |
| 49 | tarekmasryo/rag-qa-logs-corpus-data | Synthetic multi-table RAG QA telemetry benchmark... | | Experimental |
| 50 | clouatre-labs/rag-reranking-benchmarks | Supplementary benchmarks for Making Legacy Knowledge Searchable with RAG | | Experimental |
| 51 | hari-sherith/bayesian-rag-uncertainty | RAG system with Bayesian uncertainty quantification using Beta priors and... | | Experimental |
| 52 | foreai-co/fore | The fore client package | | Experimental |
| 53 | oztrkoguz/RAG-Framework-Evaluation | This project aims to compare different Retrieval-Augmented Generation (RAG)... | | Experimental |
| 54 | infrixo-systems/rag-evaluation-starter | Minimal Python script to evaluate your RAG pipeline against a golden set. No... | | Experimental |
| 55 | anita-builds/aurora-rag-evaluation | Policy-grounded assistant notes: RAG and evaluation approach | | Experimental |
| 56 | SURESHBEEKHANI/LLMops-beginner-to-advanced | Short description: RAG evaluation suite for AI Engineering Report | | Experimental |
| 57 | antdragiotis/rag-evaluation-framework-II | An evaluation example for Retrieval-Augmented Generation (RAG) that provides... | | Experimental |
| 58 | ALucek/custom-rag-evals | Applying domain specific evaluations to RAG chunking and embedding functions | | Experimental |
| 59 | Edouard-Legoupil/rag_extraction | A tutorial on how to build Summary Brief from Evaluation Report - Offline+Open Source | | Experimental |
| 60 | ssisOneTeam/Korean-Embedding-Model-Performance-Benchmark-for-Retriever | Korean Sentence Embedding Model Performance Benchmark for RAG | | Experimental |
| 61 | Eustema-S-p-A/SCARF | SCARF (System for Comprehensive Assessment of RAG Frameworks) is a modular... | | Experimental |
| 62 | fkapsahili/EntRAG | EntRAG - Enterprise RAG Benchmark | | Experimental |
| 63 | nidhip1611/GroundedGeo | A Benchmark for Citation-Grounded Geographic QA | | Experimental |
| 64 | daniel-e-alarcon/rag-explorer | Local-first RAG application with retrieval evaluation (hit@k, MRR) and... | | Experimental |
| 65 | yashk1103/Enhanced-Multi-Turn-RAG-Benchmark-Framework | Comprehensive benchmarking framework for evaluating 13+ embedding models on... | | Experimental |
| 66 | shaadclt/EvalRAG | A comprehensive evaluation toolkit for assessing Retrieval-Augmented... | | Experimental |
| 67 | iom/evaluation_knowledge | A module to turn Evaluation Reports into AI knowledge | | Experimental |
| 68 | rubsj/ai-rag-evaluation-framework | RAG pipeline evaluation framework with RAGAS metrics and statistical bias correction | | Experimental |
| 69 | Hyeongseob91/research-vlm-based-document-parsing | A research framework to evaluate how document parsing... | | Experimental |
| 70 | NamaWho/pyterrier-nuggetizer | Nuggetizer: A PyTerrier Open-Source Framework for Evaluating... | | Experimental |
| 71 | tsdata/ranx-k | Korean-optimized RAG evaluation toolkit with Kiwi tokenizer, ROUGE metrics, ... | | Experimental |
| 72 | c21051997/ragscope | 🏆 An open-source library for the comprehensive, end-to-end evaluation of RAG... | | Experimental |
| 73 | ash-hun/BERGEN-UP | E2E Evaluation Pipeline for ONLY RAG. Benchmark to BERGEN from NAVER Labs... | | Experimental |
| 74 | sumit9000/Deep-Evaluation_Rag | The Deep Evaluation notebook helps you understand how well your machine... | | Experimental |
| 75 | chandana999/retrieval-evaluation-api | RAG retrieval evaluation tool with RAGAS. Compare 6 retriever strategies... | | Experimental |
| 76 | beingdutta/Self-Refining-Lecture-RAG-For-Educational-Videos | Lecture-RAG is a grounding-aware Video-RAG framework that reduces... | | Experimental |
| 77 | labofone/rag-eval | Reference-free evaluation of Retrieval-Augmented Generation (RAG) pipelines. | | Experimental |
| 78 | JhaAyush01/SEMALEX | A comprehensive RAG Evaluation Metric designed to measure the weighted... | | Experimental |
| 79 | hideyuki001/research-rag-instruction-pack | Research & Education oriented LangChain RAG framework (5P Principles + EUQS... | | Experimental |
| 80 | alp-oz/rag-metrics | RAG-Metrics: A modular framework for evaluating Retrieval-Augmented... | | Experimental |
| 81 | Mizokuiam/rag-eval-kit | A lightweight, modular Python toolkit for evaluating and benchmarking... | | Experimental |
| 82 | i-partalas/industrial-rag-qna-benchmark | Benchmarking the performance of proprietary vs open-source LLMs in... | | Experimental |