evalscope and continuous-eval

These tools are complementary: evalscope provides a broad evaluation framework across multiple model types (LLMs, VLMs, AIGC), while continuous-eval specializes in production-focused, data-driven evaluation metrics for LLM-powered applications. Teams can use each for a different evaluation stage and purpose.

| | evalscope | continuous-eval |
|---|---|---|
| Overall score | 90 (Verified) | 41 (Emerging) |
| Maintenance | 23/25 | 0/25 |
| Adoption | 21/25 | 10/25 |
| Maturity | 25/25 | 16/25 |
| Community | 21/25 | 15/25 |
| Stars | 2,501 | 516 |
| Forks | 285 | 37 |
| Downloads | 29,097 | — |
| Commits (30d) | 36 | 0 |
| Language | Python | Python |
| License | Apache-2.0 | Apache-2.0 |
| Risk flags | None | Stale 6m, No Package, No Dependents |

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics (TTFT, TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
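The latency metrics named above have standard definitions: TTFT (Time To First Token) is the delay from sending a request until the first output token arrives, and TPOT (Time Per Output Token) is the average gap between subsequent tokens. A minimal pure-Python sketch, independent of evalscope's actual implementation (function and variable names are illustrative):

```python
def latency_metrics(token_times, request_time):
    """Compute TTFT and TPOT from per-token arrival timestamps.

    token_times: monotonically increasing timestamps (seconds) at which
    each output token arrived; request_time: when the request was sent.
    """
    ttft = token_times[0] - request_time  # Time To First Token
    if len(token_times) > 1:
        # Time Per Output Token: average gap between consecutive tokens
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


# Example: request sent at t=0, first token at 0.5s, then one every 0.1s
times = [0.5, 0.6, 0.7, 0.8, 0.9]
ttft, tpot = latency_metrics(times, 0.0)
print(f"TTFT={ttft:.2f}s TPOT={tpot:.3f}s/token")
```

evalscope aggregates these per-request values across a load test to profile a serving endpoint; the arithmetic per request is no more than the above.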

About continuous-eval

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

""" pii_check = CustomMetric( name="pii_check", criteria=criteria, rubric=rubric, metric_type="discrete", # can be 'discrete' or 'continuous' ) result = pii_check(answer="My name is John.") print(result) ``` ## Features - Modularized evaluation (evaluate each pipeline module with tailored metrics) - Metric library with deterministic, semantic, and LLM-based metrics - Support for probabilistic evaluation - Isolation of Pipeline components - Support for custom metrics and tests - Distributed evaluation (using Ray) - Integration with OpenAI and other LLM providers - All major frameworks (LangChain, LlamaIndex, Ollama, VertexAI, etc.) - Comprehensive documentation with examples ##

Scores updated daily from GitHub, PyPI, and npm data.