modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics such as time to first token (TTFT) and time per output token (TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
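As a sketch of how the registry-based benchmarks are invoked, the snippet below follows EvalScope's documented TaskConfig/run_task quickstart pattern; the model ID and dataset name are illustrative placeholders, and argument names may vary between releases.

# Minimal evaluation sketch (assumes `pip install evalscope`).
# Model ID and dataset name are illustrative; check the EvalScope
# docs for the benchmark names registered in your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # a ModelScope/HuggingFace model ID
    datasets=['gsm8k'],                  # benchmark names from the registry
    limit=5,                             # evaluate only the first 5 samples
)

run_task(task_cfg=task_cfg)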
2,501 stars and 29,097 monthly downloads. Used by 1 other package. Actively maintained with 36 commits in the last 30 days. Available on PyPI.
Stars: 2,501
Forks: 285
Language: Python
License: Apache-2.0
Last pushed: Mar 11, 2026
Monthly downloads: 29,097
Commits (30d): 36
Dependencies: 38
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/modelscope/evalscope"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000/day.
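For scripted access, the same endpoint can be fetched from Python; the response schema is not documented here, so this sketch simply prints the raw JSON.

# Fetch this package's quality data from the endpoint shown above.
# No key is needed for up to 100 requests/day; consult the API docs
# for how to attach a key if you need the higher limit.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/rag/modelscope/evalscope"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()  # schema not documented here; inspect the result
print(data)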
Related tools
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
justplus/llm-eval
A large language model evaluation platform supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
dokimos-dev/dokimos
Evaluation Framework for LLM applications in Java and Kotlin
cleanlab/tlm
Score the trustworthiness of outputs from any LLM in real-time