RAG Evaluation & Benchmarking Tools
Frameworks and tools for evaluating, benchmarking, and scoring RAG systems, LLMs, and prompts through automated testing, metrics, and judge-based assessment. Does NOT include general observability/monitoring platforms, standalone hallucination-detection tools, or prompt-engineering platforms.
There are 32 RAG evaluation and benchmarking tools tracked; one scores above 70 (Verified tier). The highest-rated is modelscope/evalscope at 83/100, with 2,501 stars and 29,097 monthly downloads. One of the top 10 is actively maintained.
Get all 32 projects as JSON (the `limit` query parameter caps how many records come back per request):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=rag-evaluation-benchmarking&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
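
For programmatic use, the same endpoint can be called from a script. Below is a minimal sketch in Python using only the standard library. The URL and query parameters are taken from the curl command above; the shape of the JSON response (an `items` list with `name`, `score`, and `tier` fields) is an assumption to check against the actual payload.

```python
# Minimal sketch: fetch the RAG-evaluation dataset and print one line per tool.
# The URL and query string come from the curl command above. The response
# field names ("items", "name", "score", "tier") are ASSUMPTIONS; inspect
# the real payload and rename accordingly.
import json
from urllib.request import Request, urlopen

URL = (
    "https://pt-edge.onrender.com/api/v1/datasets/quality"
    "?domain=rag&subcategory=rag-evaluation-benchmarking&limit=20"
)

def fetch_projects(url: str = URL) -> list[dict]:
    req = Request(url, headers={"User-Agent": "rag-eval-survey/0.1"})
    with urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    # Handle either a bare JSON array or an object wrapping the records
    # under an assumed "items" key.
    if isinstance(data, dict):
        return data.get("items", [])
    return data

if __name__ == "__main__":
    for project in fetch_projects():
        print(project.get("name"), project.get("score"), project.get("tier"))
```

If you have an API key for the higher 1,000/day limit, it would presumably be passed as a header or query parameter; the exact mechanism isn't documented here, so none is shown.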
| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | modelscope/evalscope | A streamlined and customizable framework for efficient large model (LLM,... | 83 | Verified |
| 2 | Kareem-Rashed/rubric-eval | Independent framework to test, benchmark, and evaluate LLMs & AI agents locally. | | Established |
| 3 | izam-mohammed/ragrank | 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts,... | | Emerging |
| 4 | justplus/llm-eval | LLM evaluation platform supporting multiple benchmarks, custom datasets, and performance testing; supports RAG evaluation on custom datasets. | | Emerging |
| 5 | dokimos-dev/dokimos | Evaluation Framework for LLM applications in Java and Kotlin | | Emerging |
| 6 | cleanlab/tlm | Score the trustworthiness of outputs from any LLM in real-time | | Emerging |
| 7 | relari-ai/continuous-eval | Data-Driven Evaluation for LLM-Powered Applications | | Emerging |
| 8 | ccmbioinfo/ccm_benchmate | A knowledge aggregator for computational biology | | Emerging |
| 9 | Addepto/contextcheck | MIT-licensed Framework for LLMs, RAGs, Chatbots testing. Configurable via... | | Emerging |
| 10 | zjunlp/InnoEval | InnoEval: On Research Idea Evaluation as a Knowledge-Grounded,... | | Emerging |
| 11 | kolenaIO/autoarena | Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation | | Emerging |
| 12 | amitbad/llm-evaluation | Hands-on LLM evaluation learning repo — local models via Ollama, no paid... | | Experimental |
| 13 | saikiranAnnam/TraceDog | TraceDog — LLM Evaluation and AI observability platform to trace, evaluate,... | | Experimental |
| 14 | piog/llm-eval-bench | Evaluation harness for prompts, structured outputs, and RAG workflows | | Experimental |
| 15 | wwx99921/llm-rank | Rank passages using BM25 and large language models to improve retrieval... | | Experimental |
| 16 | Lengi96/ai-qa-framework | Requirements-driven AI/LLM QA framework with traceability, release gates,... | | Experimental |
| 17 | honeyhiveai/realign | Realign is a testing and simulation framework for AI applications. | | Experimental |
| 18 | dariero/RagaliQ | LLM & RAG evaluation testing framework ✨ | | Experimental |
| 19 | dataaispark-spec/TrustScoreEval | TrustScoreEval: Trust Scores for AI/LLM Responses — Detect hallucinations,... | | Experimental |
| 20 | lsidore/AcmeEval | Opinionated framework that offers a simple and swift solution for RAG evaluation | | Experimental |
| 21 | nitinumaretiya123/eval-dashboard | LLM evaluation and benchmarking dashboard for RAG pipelines and fine-tuned models | | Experimental |
| 22 | VascoSch92/bench-lab | The goal is to develop a unified framework for evaluating LLMs, agents, and... | | Experimental |
| 23 | alexbeattie/llm-eval-harness | Lightweight eval framework for RAG pipelines: retrieval metrics,... | | Experimental |
| 24 | busra-yesilbas/llm-evals-observability-lab | Framework for evaluating, tracing, and analyzing RAG and LLM systems across... | | Experimental |
| 25 | johnnyzyn/lvlm-eval-agents | Three-agent LVLM evaluation pipeline (LangGraph + ChromaDB): Rule Retrieval... | | Experimental |
| 26 | rutvik29/llm-eval-framework | LLM evaluation framework with RAGAS metrics, hallucination detection, safety... | | Experimental |
| 27 | peng-gao-lab/CTIArena | The first benchmark to evaluate LLM performance on heterogeneous CTI under... | | Experimental |
| 28 | QinnniQ/infernomics | LLM inference cost–latency–quality optimization dashboard with RAG +... | | Experimental |
| 29 | Vaibhavi-Sita/model-arena | AI benchmarking platform to evaluate and compare multiple LLM outputs using... | | Experimental |
| 30 | Asyasyarif/openjudges | OpenJudges is an interactive CLI tool that uses LLMs as judges to evaluate... | | Experimental |
| 31 | Trenn1x/model-evaluation-app | FastAPI app for LLM evaluation with latency tracking and RAG experiments. | | Experimental |
| 32 | sundi133/rageval-ui | UI for RagEval (https://github.com/sundi133/rag-eval) | | Experimental |