evalscope and ragrank
These are complements: evalscope provides a general-purpose LLM evaluation framework while ragrank specializes in RAG-specific metrics (factual accuracy, context understanding, tone), allowing them to be used together for comprehensive RAG system evaluation.
About evalscope
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics (TTFT, TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
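The latency metrics mentioned above, TTFT (time to first token) and TPOT (time per output token), can be computed from a stream of token arrival timestamps. The sketch below is illustrative only, assuming a simple list of timestamps; it is not evalscope's actual profiling code, and the function name `profile_stream` is hypothetical.

```python
def profile_stream(token_times, start_time):
    """Compute TTFT and TPOT from token arrival timestamps (hypothetical helper).

    TTFT (time to first token): delay between sending the request and
    receiving the first token. TPOT (time per output token): average gap
    between consecutive tokens after the first.
    """
    ttft = token_times[0] - start_time
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0  # a single token gives no inter-token gap to average
    return ttft, tpot

# Simulated stream: request at t=0, first token at 0.5s, then one every 0.1s
ttft, tpot = profile_stream([0.5, 0.6, 0.7, 0.8], 0.0)
print(f"TTFT={ttft:.2f}s TPOT={tpot:.2f}s")
```

SLA auto-tuning then amounts to raising service concurrency until metrics like these cross a target threshold.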
About ragrank
izam-mohammed/ragrank
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
Specialized for RAG pipeline evaluation with metrics like response relevancy, context understanding, and factual accuracy. Built as a Python toolkit that integrates with OpenAI's API by default but supports custom LLM models, enabling flexible assessment workflows through a dataset-to-metrics evaluation pattern. Provides structured evaluation results exportable to dataframes for analysis and integration with downstream data processing pipelines.
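The dataset-to-metrics pattern described above can be sketched as follows. This is not ragrank's actual API; `EvalItem`, `keyword_overlap`, and `evaluate` are hypothetical names, and the toy lexical-overlap metric stands in for the LLM-judged relevancy metrics a real run would use.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    # One record in a RAG evaluation dataset (hypothetical schema)
    question: str
    context: str
    response: str

def keyword_overlap(item: EvalItem) -> float:
    """Toy stand-in for a relevancy metric: fraction of response words
    that also appear in the retrieved context."""
    ctx = set(item.context.lower().split())
    words = item.response.lower().split()
    return sum(w in ctx for w in words) / len(words) if words else 0.0

def evaluate(dataset, metrics):
    """Apply each metric to each item; return rows of plain dicts,
    which can be loaded directly into a dataframe for analysis."""
    return [
        {"question": item.question,
         **{m.__name__: round(m(item), 2) for m in metrics}}
        for item in dataset
    ]

data = [EvalItem("What is RAG?",
                 "retrieval augmented generation grounds answers in documents",
                 "RAG grounds answers in retrieved documents")]
print(evaluate(data, [keyword_overlap]))
```

The row-of-dicts output mirrors the structured, dataframe-exportable results the blurb describes, making it easy to feed scores into downstream data processing.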