open-rag-eval and rag-evaluator
These are competing projects with different evaluation methodologies: Vectara's framework performs reference-free evaluation, using LLM judges to assess RAG output quality directly, while AIAnytime's library implements traditional evaluation, which requires ground-truth "golden" answers for comparison.
About open-rag-eval
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
Implements reference-free evaluation metrics (UMBRELA, AutoNuggetizer) based on research from UWaterloo, eliminating the need for golden answers while supporting optional reference-based metrics when available. Provides modular connectors for Vectara, LlamaIndex, and LangChain RAG platforms, with built-in TREC-RAG benchmark metrics and per-query scoring for detailed analysis. Uses LLM judges and open-source hallucination detection models (HHEM) to assess retrieval quality and factual consistency across RAG pipelines.
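To make the nugget-based idea concrete, here is a simplified sketch of how an AutoNuggetizer-style score works: an answer is rated by the fraction of atomic facts ("nuggets") it covers. This is not the open-rag-eval API; the function name is hypothetical, and a naive substring check stands in for the LLM judge the real metric uses.

```python
# Simplified sketch of nugget-based answer scoring (AutoNuggetizer-style).
# Assumption: a plain substring match stands in for the LLM judge that
# decides nugget coverage in the actual framework.

def nugget_coverage(answer: str, nuggets: list[str]) -> float:
    """Return the fraction of nuggets found in the answer (case-insensitive)."""
    if not nuggets:
        return 0.0
    answer_lower = answer.lower()
    covered = sum(1 for n in nuggets if n.lower() in answer_lower)
    return covered / len(nuggets)

answer = "Paris is the capital of France and sits on the Seine."
nuggets = ["capital of France", "on the Seine", "population over 2 million"]
print(nugget_coverage(answer, nuggets))  # 2 of 3 nuggets covered
```

Because no golden answer is needed, only a list of nuggets (which can itself be LLM-generated from the retrieved documents), this style of scoring scales to queries where no reference answer exists.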
About rag-evaluator
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems the traditional way.
Computes eleven evaluation metrics including BLEU, ROUGE, BERT Score, METEOR, and MAUVE to assess generated responses across semantic similarity, fluency, readability, and bias dimensions. Provides both a Python API for programmatic evaluation and a Streamlit web interface for interactive analysis. Designed for end-to-end RAG pipeline assessment without requiring external model APIs.
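For contrast with the reference-free approach, here is a minimal illustration of one traditional reference-based metric: unigram ROUGE-1 recall, the fraction of reference tokens that also appear in the generated answer. Real ROUGE implementations (as used by rag-evaluator) add stemming, longer n-grams, and F-measures; this sketch only shows the core token-overlap idea, and the function name is illustrative.

```python
# Minimal ROUGE-1 recall: fraction of reference unigrams covered by the
# generated answer, with clipped counts for repeated tokens.
from collections import Counter

def rouge1_recall(generated: str, reference: str) -> float:
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each token's overlap at its count in the generated answer.
    overlap = sum(min(gen_counts[tok], cnt) for tok, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge1_recall(generated, reference))  # 5 of 6 reference tokens covered
```

The key practical difference from the reference-free metrics above is the `reference` argument: overlap metrics like this one are undefined without a golden answer to compare against.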