evalscope and llm-eval-bench
These are competitors: both provide evaluation frameworks for LLMs and RAG systems. evalscope offers broader coverage (LLM, VLM, AIGC), while llm-eval-bench focuses specifically on prompts and structured outputs, so they are alternative choices rather than tools designed to work together.
About evalscope
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics (time to first token, TTFT; time per output token, TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
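To make the registry-based workflow concrete, here is a minimal sketch of an evalscope run via its Python API. The model ID, dataset names, and sample limit are illustrative placeholders, and exact TaskConfig fields may vary between evalscope versions:

```python
# Minimal evalscope evaluation sketch (Python API).
# Model ID, dataset names, and `limit` are placeholders; check the
# evalscope docs for the options supported by your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model identifier
    datasets=["gsm8k", "arc"],           # benchmarks resolved from the registry
    limit=5,                             # evaluate only a few samples per dataset
)

run_task(task_cfg=task_cfg)
```

For the performance-profiling side (TTFT/TPOT, concurrency stress tests), recent releases also ship an `evalscope perf` CLI that targets an OpenAI-compatible serving endpoint.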
About llm-eval-bench
piog/llm-eval-bench
Evaluation harness for prompts, structured outputs, and RAG workflows