evalscope and ragrank

These tools are complementary: evalscope provides a general-purpose LLM evaluation framework, while ragrank specializes in RAG-specific metrics (factual accuracy, context understanding, tone), so the two can be combined for comprehensive RAG system evaluation.

| | evalscope | ragrank |
| --- | --- | --- |
| Overall score | 90 (Verified) | 52 (Established) |
| Maintenance | 23/25 | 10/25 |
| Adoption | 21/25 | 8/25 |
| Maturity | 25/25 | 16/25 |
| Community | 21/25 | 18/25 |
| Stars | 2,501 | 45 |
| Forks | 285 | 14 |
| Downloads | 29,097 | |
| Commits (30d) | 36 | 0 |
| Language | Python | Python |
| License | Apache-2.0 | Apache-2.0 |
| Risk flags | None | No Package, No Dependents |

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics (TTFT, TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
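
A minimal sketch of how a benchmark run might look, assuming evalscope's documented `run_task` / `TaskConfig` entry points; the exact import path, model identifier, and dataset names are illustrative and may differ between evalscope versions.

```python
# Sketch of a basic evalscope benchmark run (names assumed from the docs;
# verify against the installed version before relying on them).
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any model id the chosen backend can load
    datasets=["gsm8k", "arc"],           # registered benchmark names
    limit=10,                            # evaluate only the first 10 samples per dataset
)

# Runs the evaluation and writes reports (typically under an outputs/ directory)
# that the WebUI dashboard can later visualize.
run_task(task_cfg=task_cfg)
```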

About ragrank

izam-mohammed/ragrank

🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

Specialized for RAG pipeline evaluation with metrics like response relevancy, context understanding, and factual accuracy. Built as a Python toolkit that integrates with OpenAI's API by default but supports custom LLM models, enabling flexible assessment workflows through a dataset-to-metrics evaluation pattern. Provides structured evaluation results exportable to dataframes for analysis and integration with downstream data processing pipelines.
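
The dataset-to-metrics pattern might look like the sketch below, assuming ragrank's `from_dict` loader, `evaluate` entry point, and DataFrame export described in its README; the helper names and the example data are illustrative, and the default judge expects an OpenAI API key in the environment.

```python
# Sketch of a ragrank evaluation (API names assumed from the project README;
# requires OPENAI_API_KEY for the default OpenAI-backed judge).
from ragrank import evaluate
from ragrank.dataset import from_dict

# A single RAG interaction: the user question, retrieved context, and model response.
data = from_dict({
    "question": "What does evalscope measure?",
    "context": ["evalscope benchmarks LLM accuracy and serving performance."],
    "response": "It benchmarks both model quality and serving latency.",
})

result = evaluate(data)        # runs the default RAG metrics (e.g. response relevancy)
print(result.to_dataframe())   # structured scores as a DataFrame for downstream analysis
```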

Scores are updated daily from GitHub, PyPI, and npm data.