evalscope and continuous-eval
These are complementary tools. evalscope provides a broad evaluation framework spanning multiple model types (LLMs, VLMs, AIGC), while continuous-eval focuses on data-driven, production-oriented evaluation metrics for LLM-powered applications, so teams can use each at a different stage of the evaluation workflow.
About evalscope
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. It also offers performance profiling with latency metrics such as time to first token (TTFT) and time per output token (TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
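As a rough illustration of how a registry-driven evaluation run looks in practice, the sketch below assumes evalscope's Python entry points `TaskConfig` and `run_task`; the model and dataset identifiers are placeholders, so check the project's README for the exact API of the version you install.

```python
# Minimal sketch of a local benchmark run (API and identifiers assumed;
# verify against the evalscope documentation for your installed version).
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # model identifier, assumed to be available
    datasets=['gsm8k', 'arc'],           # benchmark names resolved via the registry (assumed)
    limit=10,                            # score only a handful of samples as a smoke test
)

run_task(task_cfg=task_cfg)
```

In recent versions, the same configuration object is also where a pluggable backend such as OpenCompass or VLMEvalKit can be selected, though the exact field name may vary between releases.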
About continuous-eval
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
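To make the data-driven focus concrete, the sketch below scores a single retrieval example with a deterministic metric; the module path and metric name follow the project's documented layout but should be treated as assumptions against the current release.

```python
# Sketch: scoring one retrieval datum with a deterministic metric.
# Module path and metric name are assumptions; consult the project README.
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital and most populous city of France.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1()
print(metric(**datum))  # dict of precision/recall/F1 scores (exact keys may differ)
```

In production settings the same metrics are typically applied over batches of logged application traces rather than a single hand-written datum.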