RAG Evaluation & Benchmarking Tools

Frameworks and tools for evaluating, benchmarking, and scoring RAG systems, LLMs, and prompts through automated testing, metrics, and judge-based assessment. Does NOT include general observability/monitoring tools, standalone hallucination-detection tools, or prompt engineering platforms.

There are 32 RAG evaluation and benchmarking tools tracked; one scores above 70 (Verified tier). The highest-rated is modelscope/evalscope at 83/100, with 2,501 stars and 29,097 monthly downloads. One of the top 10 is actively maintained.

Get all 32 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-benchmarking&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
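For programmatic access, here is a minimal Python sketch built around the curl command above. It assumes the same endpoint and query parameters, that raising `limit` returns the full list of 32 records, and that an optional API key would be sent as a request header; the `X-API-Key` header name and the shape of the JSON payload are not documented here, so both are illustrative assumptions and the script simply prints the raw response.

    import requests  # HTTP client; install with `pip install requests`

    API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

    def fetch_rag_eval_tools(limit=32, api_key=None):
        # Query parameters taken from the curl example above; `limit` is
        # assumed to cap the number of returned records.
        params = {
            "domain": "rag",
            "subcategory": "rag-evaluation-benchmarking",
            "limit": limit,
        }
        headers = {}
        if api_key:
            # Hypothetical header name; check the API docs for the real one.
            headers["X-API-Key"] = api_key
        resp = requests.get(API_URL, params=params, headers=headers, timeout=30)
        resp.raise_for_status()  # surface 4xx/5xx (e.g. rate-limit) errors
        return resp.json()

    if __name__ == "__main__":
        # Without a key, the anonymous tier allows 100 requests/day.
        print(fetch_rag_eval_tools())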

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | modelscope/evalscope | A streamlined and customizable framework for efficient large model (LLM,... | 83 | Verified |
| 2 | Kareem-Rashed/rubric-eval | Independent framework to test, benchmark, and evaluate LLMs & AI agents locally. | 52 | Established |
| 3 | izam-mohammed/ragrank | 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts,... | 45 | Emerging |
| 4 | justplus/llm-eval | Large language model evaluation platform supporting multiple benchmarks, custom datasets, and performance testing; supports RAG evaluation on custom datasets. | 39 | Emerging |
| 5 | dokimos-dev/dokimos | Evaluation Framework for LLM applications in Java and Kotlin | 37 | Emerging |
| 6 | cleanlab/tlm | Score the trustworthiness of outputs from any LLM in real-time | 36 | Emerging |
| 7 | relari-ai/continuous-eval | Data-Driven Evaluation for LLM-Powered Applications | 34 | Emerging |
| 8 | ccmbioinfo/ccm_benchmate | A knowledge aggregator for computational biology | 32 | Emerging |
| 9 | Addepto/contextcheck | MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via... | 31 | Emerging |
| 10 | zjunlp/InnoEval | InnoEval: On Research Idea Evaluation as a Knowledge-Grounded,... | 30 | Emerging |
| 11 | kolenaIO/autoarena | Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation | 30 | Emerging |
| 12 | amitbad/llm-evaluation | Hands-on LLM evaluation learning repo — local models via Ollama, no paid... | 29 | Experimental |
| 13 | saikiranAnnam/TraceDog | TraceDog — LLM Evaluation and AI observability platform to trace, evaluate,... | 24 | Experimental |
| 14 | piog/llm-eval-bench | Evaluation harness for prompts, structured outputs, and RAG workflows | 22 | Experimental |
| 15 | wwx99921/llm-rank | Rank passages using BM25 and large language models to improve retrieval... | 22 | Experimental |
| 16 | Lengi96/ai-qa-framework | Requirements-driven AI/LLM QA framework with traceability, release gates,... | 22 | Experimental |
| 17 | honeyhiveai/realign | Realign is a testing and simulation framework for AI applications. | 20 | Experimental |
| 18 | dariero/RagaliQ | LLM & RAG evaluation testing framework✨ | 20 | Experimental |
| 19 | dataaispark-spec/TrustScoreEval | TrustScoreEval: Trust Scores for AI/LLM Responses — Detect hallucinations,... | 18 | Experimental |
| 20 | lsidore/AcmeEval | Opinionated framework that offers a simple and swift solution for RAG evaluation | 15 | Experimental |
| 21 | nitinumaretiya123/eval-dashboard | LLM evaluation and benchmarking dashboard for RAG pipelines and fine-tuned models | 15 | Experimental |
| 22 | VascoSch92/bench-lab | The goal is to develop a unified framework for evaluating LLMs, agents, and... | 14 | Experimental |
| 23 | alexbeattie/llm-eval-harness | Lightweight eval framework for RAG pipelines: retrieval metrics,... | 14 | Experimental |
| 24 | busra-yesilbas/llm-evals-observability-lab | Framework for evaluating, tracing, and analyzing RAG and LLM systems across... | 14 | Experimental |
| 25 | johnnyzyn/lvlm-eval-agents | Three-agent LVLM evaluation pipeline (LangGraph + ChromaDB): Rule Retrieval... | 14 | Experimental |
| 26 | rutvik29/llm-eval-framework | LLM evaluation framework with RAGAS metrics, hallucination detection, safety... | 14 | Experimental |
| 27 | peng-gao-lab/CTIArena | The first benchmark to evaluate LLM performance on heterogeneous CTI under... | 12 | Experimental |
| 28 | QinnniQ/infernomics | LLM inference cost–latency–quality optimization dashboard with RAG +... | 11 | Experimental |
| 29 | Vaibhavi-Sita/model-arena | AI benchmarking platform to evaluate and compare multiple LLM outputs using... | 11 | Experimental |
| 30 | Asyasyarif/openjudges | OpenJudges is an interactive CLI tool that uses LLMs as judges to evaluate... | 11 | Experimental |
| 31 | Trenn1x/model-evaluation-app | FastAPI app for LLM evaluation with latency tracking and RAG experiments. | 11 | Experimental |
| 32 | sundi133/rageval-ui | UI for RagEval https://github.com/sundi133/rag-eval | 10 | Experimental |