RAG Evaluation & Benchmarking Tools
Frameworks and tools for evaluating, benchmarking, and scoring RAG systems, LLMs, and prompts through automated testing, metrics, and judge-based assessment. Does NOT include general observability/monitoring platforms, standalone hallucination-detection tools, or prompt-engineering platforms.
There are 32 RAG evaluation and benchmarking tools tracked; one scores above 70 (Verified tier). The highest-rated is modelscope/evalscope at 83/100, with 2,501 stars and 29,097 monthly downloads. One of the top 10 is actively maintained.
Get all 32 projects as JSON (the `limit` query parameter caps how many records come back per request):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-benchmarking&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
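
For programmatic use, the same endpoint can be called from a script. Below is a minimal sketch in Python using only the standard library. The URL and query parameters are taken from the curl command above; the shape of the JSON response (an `items` list with `name`, `score`, and `tier` fields) is an assumption to check against the actual payload.

```python
# Minimal sketch: fetch the RAG-evaluation dataset and print one line per tool.
# The URL and query string come from the curl command above. The response
# field names ("items", "name", "score", "tier") are ASSUMPTIONS; inspect
# the real payload and rename accordingly.
import json
from urllib.request import Request, urlopen

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=rag&subcategory=rag-evaluation-benchmarking&limit=20"
)

def fetch_projects(url: str = URL) -> list[dict]:
    req = Request(url, headers={"User-Agent": "rag-eval-survey/0.1"})
    with urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    # Handle either a bare JSON array or an object wrapping the records
    # under an assumed "items" key.
    if isinstance(data, dict):
        return data.get("items", [])
    return data

if __name__ == "__main__":
    for project in fetch_projects():
        print(project.get("name"), project.get("score"), project.get("tier"))
```

If you have an API key for the higher 1,000/day limit, it would presumably be passed as a header or query parameter; the exact mechanism isn't documented here, so none is shown.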
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | modelscope/evalscope | A streamlined and customizable framework for efficient large model (LLM,... | 83 | Verified |
| 2 | Kareem-Rashed/rubric-eval | Independent framework to test, benchmark, and evaluate LLMs & AI agents locally. | | Established |
| 3 | izam-mohammed/ragrank | 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts,... | | Emerging |
| 4 | justplus/llm-eval | LLM evaluation platform supporting multiple benchmarks, custom datasets, and performance testing; supports RAG evaluation on custom datasets. | | Emerging |
| 5 | dokimos-dev/dokimos | Evaluation Framework for LLM applications in Java and Kotlin | | Emerging |
| 6 | cleanlab/tlm | Score the trustworthiness of outputs from any LLM in real-time | | Emerging |
| 7 | relari-ai/continuous-eval | Data-Driven Evaluation for LLM-Powered Applications | | Emerging |
| 8 | ccmbioinfo/ccm_benchmate | A knowledge aggregator for computational biology | | Emerging |
| 9 | Addepto/contextcheck | MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via... | | Emerging |
| 10 | zjunlp/InnoEval | InnoEval: On Research Idea Evaluation as a Knowledge-Grounded,... | | Emerging |
| 11 | kolenaIO/autoarena | Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation | | Emerging |
| 12 | amitbad/llm-evaluation | Hands-on LLM evaluation learning repo — local models via Ollama, no paid... | | Experimental |
| 13 | saikiranAnnam/TraceDog | TraceDog — LLM Evaluation and AI observability platform to trace, evaluate,... | | Experimental |
| 14 | piog/llm-eval-bench | Evaluation harness for prompts, structured outputs, and RAG workflows | | Experimental |
| 15 | wwx99921/llm-rank | Rank passages using BM25 and large language models to improve retrieval... | | Experimental |
| 16 | Lengi96/ai-qa-framework | Requirements-driven AI/LLM QA framework with traceability, release gates,... | | Experimental |
| 17 | honeyhiveai/realign | Realign is a testing and simulation framework for AI applications. | | Experimental |
| 18 | dariero/RagaliQ | LLM & RAG evaluation testing framework ✨ | | Experimental |
| 19 | dataaispark-spec/TrustScoreEval | TrustScoreEval: Trust Scores for AI/LLM Responses — Detect hallucinations,... | | Experimental |
| 20 | lsidore/AcmeEval | Opinionated framework that offers a simple and swift solution for RAG evaluation | | Experimental |
| 21 | nitinumaretiya123/eval-dashboard | LLM evaluation and benchmarking dashboard for RAG pipelines and fine-tuned models | | Experimental |
| 22 | VascoSch92/bench-lab | The goal is to develop a unified framework for evaluating LLMs, agents, and... | | Experimental |
| 23 | alexbeattie/llm-eval-harness | Lightweight eval framework for RAG pipelines: retrieval metrics,... | | Experimental |
| 24 | busra-yesilbas/llm-evals-observability-lab | Framework for evaluating, tracing, and analyzing RAG and LLM systems across... | | Experimental |
| 25 | johnnyzyn/lvlm-eval-agents | Three-agent LVLM evaluation pipeline (LangGraph + ChromaDB): Rule Retrieval... | | Experimental |
| 26 | rutvik29/llm-eval-framework | LLM evaluation framework with RAGAS metrics, hallucination detection, safety... | | Experimental |
| 27 | peng-gao-lab/CTIArena | The first benchmark to evaluate LLM performance on heterogeneous CTI under... | | Experimental |
| 28 | QinnniQ/infernomics | LLM inference cost–latency–quality optimization dashboard with RAG +... | | Experimental |
| 29 | Vaibhavi-Sita/model-arena | AI benchmarking platform to evaluate and compare multiple LLM outputs using... | | Experimental |
| 30 | Asyasyarif/openjudges | OpenJudges is an interactive CLI tool that uses LLMs as judges to evaluate... | | Experimental |
| 31 | Trenn1x/model-evaluation-app | FastAPI app for LLM evaluation with latency tracking and RAG experiments. | | Experimental |
| 32 | sundi133/rageval-ui | UI for RagEval (https://github.com/sundi133/rag-eval) | | Experimental |