VLMEvalKit and evalplus

The two toolkits are complementary: VLMEvalKit focuses on evaluating large multi-modality models (LMMs) across a broad set of benchmarks, while EvalPlus specializes in the rigorous evaluation of LLM-synthesized code.

                 VLMEvalKit        evalplus
Score            72 (Verified)     70 (Verified)
Maintenance      23/25             2/25
Adoption         10/25             22/25
Maturity         16/25             25/25
Community        23/25             21/25
Stars            3,894             1,699
Forks            650               190
Downloads        -                 14,725
Commits (30d)    21                0
Language         Python            Python
License          Apache-2.0        Apache-2.0
Notes            No package,       Stale (6 months)
                 no dependents

About VLMEvalKit

open-compass/VLMEvalKit

An open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

Provides generation-based evaluation across all supported models with two assessment modes, exact matching and LLM-based answer extraction, eliminating manual data preparation across fragmented benchmark repositories. Supports distributed inference via LMDeploy and vLLM to accelerate evaluation of large-scale deployments, with specialized handling for models with reasoning/thinking modes and for long-form outputs that exceed standard spreadsheet cell limits. Integrates with the Hugging Face ecosystem (model hosting, datasets, Spaces for leaderboards) and supports video benchmarks via ModelScope for comprehensive vision-language assessment.
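For a concrete sense of the workflow, the sketch below shows single-image inference through the vlmeval Python API, following the pattern in the project's README; the model registry key ('qwen_chat') and the image path are illustrative placeholders.

```python
# Minimal sketch of single-image inference with VLMEvalKit's Python API.
# The registry key 'qwen_chat' and the image path are illustrative placeholders.
from vlmeval.config import supported_VLM

# Look up a supported model by its registry name and instantiate it.
model = supported_VLM['qwen_chat']()

# A message is a list of image path(s) followed by a text prompt.
response = model.generate(['apple.jpg', 'What fruit is shown in this image?'])
print(response)
```

Full benchmark runs (dataset download, inference, and scoring) typically go through the repository's run.py entry point rather than this API.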

About evalplus

evalplus/evalplus

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
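The typical EvalPlus workflow has two steps: generate one solution per HumanEval+ task and write them to a JSONL file, then score that file with the evalplus evaluator. The sketch below follows the data API shown in the project's README; generate_solution is a hypothetical stand-in for your own model call.

```python
# Sketch of the EvalPlus HumanEval+ generation step, per the project's README.
# `generate_solution` is a hypothetical stand-in for querying your own LLM.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    """Hypothetical: send `prompt` to your LLM and return a complete solution."""
    raise NotImplementedError("plug in your model here")

# Build one {task_id, solution} record per HumanEval+ problem.
samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring then runs via the CLI, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```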

Scores updated daily from GitHub, PyPI, and npm data.