rag-evaluator and rageval
These two tools compete directly: both are libraries for evaluating Retrieval-Augmented Generation (RAG) systems.
About rag-evaluator
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems (The traditional ways).
Computes eleven evaluation metrics including BLEU, ROUGE, BERT Score, METEOR, and MAUVE to assess generated responses across semantic similarity, fluency, readability, and bias dimensions. Provides both a Python API for programmatic evaluation and a Streamlit web interface for interactive analysis. Designed for end-to-end RAG pipeline assessment without requiring external model APIs.
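To give a sense of what a "traditional" overlap metric like BLEU measures, here is an illustrative, pure-Python sketch of sentence-level BLEU with add-one smoothing and a brevity penalty. This is not rag-evaluator's implementation (the library wraps established metric packages); real BLEU typically uses up to 4-grams and corpus-level statistics, while this sketch stops at bigrams for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified sentence-level BLEU: smoothed n-gram precision
    (n = 1..max_n, geometric mean) times a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = max(sum(cand_counts.values()), 1)
        # add-one smoothing so a zero overlap does not zero out the score
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # brevity penalty discourages very short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

An identical candidate and reference score 1.0; disjoint sentences score near zero. The other metric families the description lists (BERT Score, METEOR, MAUVE) go beyond surface n-gram overlap and need model embeddings or alignment, which is why the library bundles them rather than leaving users to reimplement each one.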
About rageval
gomate-community/rageval
Evaluation tools for Retrieval-augmented Generation (RAG) methods.
Provides modular evaluation across six RAG pipeline stages—query rewriting, retrieval, compression, evidence verification, generation, and validation—with 30+ metrics spanning answer correctness (F1, ROUGE, EM), groundedness (citation precision/recall), and context adequacy. Supports both LLM-based and string-matching evaluators, with pluggable integrations for OpenAI APIs or open-source models via vllm. Includes benchmark implementations on ASQA and other QA datasets with reproducible evaluation scripts.
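For the string-matching side of answer correctness, the F1 and EM metrics named above are typically computed SQuAD-style: normalize both strings, then compare token bags. The sketch below is illustrative only and is not rageval's code; the normalization rules (lowercasing, dropping punctuation and English articles) follow the common QA-evaluation convention, which rageval may refine.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over the multiset intersection of tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Note the contrast with the LLM-based evaluators the description mentions: F1/EM are cheap and deterministic but reward surface overlap, so "Paris is the capital of France" and "The capital of France is Paris" get a perfect F1 yet an EM of 0, while a correct paraphrase with different wording can score poorly on both.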