VLMEvalKit and evalplus
The two toolkits are complementary: VLMEvalKit focuses on evaluating large multi-modality models (LMMs) across a broad set of benchmarks, while EvalPlus specializes in the rigorous evaluation of LLM-synthesized code.
About VLMEvalKit
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Provides generation-based evaluation across all supported models with two assessment modes, exact matching and LLM-based answer extraction, eliminating manual data preparation across fragmented benchmark repositories. Supports distributed inference via LMDeploy and vLLM to accelerate evaluation of large-scale deployments, with specialized handling for models that have reasoning/thinking modes and for long-form outputs that exceed standard spreadsheet cell limits. Integrates with the Hugging Face ecosystem (model hosting, datasets, Spaces for leaderboards) and supports video benchmarks via ModelScope for comprehensive vision-language assessment.
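To make the generation step concrete, the sketch below follows the single-model inference pattern shown in the VLMEvalKit README. The registry key `'qwen_chat'` and the image path are illustrative assumptions taken from upstream examples, not a guaranteed stable interface.

```python
# Minimal sketch of VLMEvalKit's inference API, following the pattern in
# the project README. The model key and image path below are assumptions
# for illustration; available keys depend on your installed version.
from vlmeval.config import supported_VLM

# Instantiate a supported model by its registry key.
model = supported_VLM['qwen_chat']()

# A multi-modal message is an interleaved list of image paths and strings.
response = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(response)
```

Benchmark-level runs are driven by the toolkit's `run.py` entry point (e.g. something like `python run.py --data MMBench_DEV_EN --model qwen_chat`), which layers the exact-matching or LLM-based answer extraction described above on top of this generation step.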
About evalplus
evalplus/evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
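As a concrete illustration of EvalPlus's workflow, the sketch below follows the pattern in the project README: fetch the augmented HumanEval+ problems, synthesize one solution per task, and write a `samples.jsonl` file for scoring. The `generate_solution` function is a hypothetical stand-in for whatever code LLM you use.

```python
# Sketch of producing an EvalPlus-compatible samples file, following the
# pattern shown in the project README.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical placeholder: call your code LLM here and return the
    # synthesized implementation as a string.
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is a separate CLI step, per the README:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```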