open-rag-eval and rag-evaluator
These are competing projects with different evaluation methodologies: Vectara's framework performs reference-free evaluation, using LLM judges to assess RAG output quality directly, while AIAnytime's library implements traditional evaluation, which requires ground-truth "golden" answers for comparison.
About open-rag-eval
vectara/open-rag-eval
RAG evaluation without the need for "golden answers"
Implements reference-free evaluation metrics (UMBRELA, AutoNuggetizer) based on research from UWaterloo, eliminating the need for golden answers while supporting optional reference-based metrics when available. Provides modular connectors for Vectara, LlamaIndex, and LangChain RAG platforms, with built-in TREC-RAG benchmark metrics and per-query scoring for detailed analysis. Uses LLM judges and open-source hallucination detection models (HHEM) to assess retrieval quality and factual consistency across RAG pipelines.
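To make the nugget-based idea concrete, here is a simplified sketch of how an AutoNuggetizer-style score works: an answer is rated by the fraction of atomic facts ("nuggets") it covers. This is not the open-rag-eval API; the function name is hypothetical, and a naive substring check stands in for the LLM judge the real metric uses.

```python
# Simplified sketch of nugget-based answer scoring (AutoNuggetizer-style).
# Assumption: a plain substring match stands in for the LLM judge that
# decides nugget coverage in the actual framework.

def nugget_coverage(answer: str, nuggets: list[str]) -> float:
    """Return the fraction of nuggets found in the answer (case-insensitive)."""
    if not nuggets:
        return 0.0
    answer_lower = answer.lower()
    covered = sum(1 for n in nuggets if n.lower() in answer_lower)
    return covered / len(nuggets)

answer = "Paris is the capital of France and sits on the Seine."
nuggets = ["capital of France", "on the Seine", "population over 2 million"]
print(nugget_coverage(answer, nuggets))  # 2 of 3 nuggets covered
```

Because no golden answer is needed, only a list of nuggets (which can itself be LLM-generated from the retrieved documents), this style of scoring scales to queries where no reference answer exists.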
About rag-evaluator
AIAnytime/rag-evaluator
A library for evaluating Retrieval-Augmented Generation (RAG) systems the traditional way.
Computes eleven evaluation metrics including BLEU, ROUGE, BERT Score, METEOR, and MAUVE to assess generated responses across semantic similarity, fluency, readability, and bias dimensions. Provides both a Python API for programmatic evaluation and a Streamlit web interface for interactive analysis. Designed for end-to-end RAG pipeline assessment without requiring external model APIs.
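For contrast with the reference-free approach, here is a minimal illustration of one traditional reference-based metric: unigram ROUGE-1 recall, the fraction of reference tokens that also appear in the generated answer. Real ROUGE implementations (as used by rag-evaluator) add stemming, longer n-grams, and F-measures; this sketch only shows the core token-overlap idea, and the function name is illustrative.

```python
# Minimal ROUGE-1 recall: fraction of reference unigrams covered by the
# generated answer, with clipped counts for repeated tokens.
from collections import Counter

def rouge1_recall(generated: str, reference: str) -> float:
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each token's overlap at its count in the generated answer.
    overlap = sum(min(gen_counts[tok], cnt) for tok, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge1_recall(generated, reference))  # 5 of 6 reference tokens covered
```

The key practical difference from the reference-free metrics above is the `reference` argument: overlap metrics like this one are undefined without a golden answer to compare against.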