evalscope and continuous-eval

These tools are complementary: evalscope provides a broad evaluation framework across multiple model types (LLMs, VLMs, AIGC), while continuous-eval specializes in production-focused, data-driven evaluation metrics for LLM-powered applications. Teams can use each for a different evaluation stage and purpose.

| | evalscope | continuous-eval |
|---|---|---|
| Overall score | 90 (Verified) | 41 (Emerging) |
| Maintenance | 23/25 | 0/25 |
| Adoption | 21/25 | 10/25 |
| Maturity | 25/25 | 16/25 |
| Community | 21/25 | 15/25 |
| Stars | 2,501 | 516 |
| Forks | 285 | 37 |
| Downloads | 29,097 | — |
| Commits (30d) | 36 | 0 |
| Language | Python | Python |
| License | Apache-2.0 | Apache-2.0 |
| Risk flags | None | Stale 6m, No Package, No Dependents |

About evalscope

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Supports pluggable backend evaluation engines (OpenCompass, VLMEvalKit, RAGAS, MTEB) and integrates multi-modal benchmarks across LLMs, VLMs, embedding models, and code tasks through a registry-based architecture. Features performance profiling with latency metrics (TTFT, TPOT), SLA auto-tuning for service concurrency limits, and interactive WebUI dashboards powered by Gradio/Wandb for comparative analysis and arena-style model battles.
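The latency metrics named above have standard definitions: TTFT (Time To First Token) is the delay from sending a request until the first output token arrives, and TPOT (Time Per Output Token) is the average gap between subsequent tokens. A minimal pure-Python sketch, independent of evalscope's actual implementation (function and variable names are illustrative):

```python
def latency_metrics(token_times, request_time):
    """Compute TTFT and TPOT from per-token arrival timestamps.

    token_times: monotonically increasing timestamps (seconds) at which
    each output token arrived; request_time: when the request was sent.
    """
    ttft = token_times[0] - request_time  # Time To First Token
    if len(token_times) > 1:
        # Time Per Output Token: average gap between consecutive tokens
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


# Example: request sent at t=0, first token at 0.5s, then one every 0.1s
times = [0.5, 0.6, 0.7, 0.8, 0.9]
ttft, tpot = latency_metrics(times, 0.0)
print(f"TTFT={ttft:.2f}s TPOT={tpot:.3f}s/token")
```

evalscope aggregates these per-request values across a load test to profile a serving endpoint; the arithmetic per request is no more than the above.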

About continuous-eval

relari-ai/continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

""" pii_check = CustomMetric( name="pii_check", criteria=criteria, rubric=rubric, metric_type="discrete", # can be 'discrete' or 'continuous' ) result = pii_check(answer="My name is John.") print(result) ``` ## Features - Modularized evaluation (evaluate each pipeline module with tailored metrics) - Metric library with deterministic, semantic, and LLM-based metrics - Support for probabilistic evaluation - Isolation of Pipeline components - Support for custom metrics and tests - Distributed evaluation (using Ray) - Integration with OpenAI and other LLM providers - All major frameworks (LangChain, LlamaIndex, Ollama, VertexAI, etc.) - Comprehensive documentation with examples ##

Scores updated daily from GitHub, PyPI, and npm data.