evalscope and llm-eval-bench
These are competitors: both provide evaluation frameworks for LLMs and RAG systems. evalscope offers broader coverage (LLM, VLM, AIGC), while llm-eval-bench focuses specifically on prompts and structured outputs, so they are alternative choices rather than tools designed to work together.
About evalscope
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics (time to first token, TTFT; time per output token, TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
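To make the registry-based workflow concrete, here is a minimal sketch of an evalscope run via its Python API. The model ID, dataset names, and sample limit are illustrative placeholders, and exact TaskConfig fields may vary between evalscope versions:

```python
# Minimal evalscope evaluation sketch (Python API).
# Model ID, dataset names, and `limit` are placeholders; check the
# evalscope docs for the options supported by your installed version.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model identifier
    datasets=["gsm8k", "arc"],           # benchmarks resolved from the registry
    limit=5,                             # evaluate only a few samples per dataset
)

run_task(task_cfg=task_cfg)
```

For the performance-profiling side (TTFT/TPOT, concurrency stress tests), recent releases also ship an `evalscope perf` CLI that targets an OpenAI-compatible serving endpoint.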
About llm-eval-bench
piog/llm-eval-bench
Evaluation harness for prompts, structured outputs, and RAG workflows