VLMEvalKit and evalplus
The two toolkits are complementary: VLMEvalKit focuses on evaluating large multi-modality models (LMMs) across a broad set of benchmarks, while EvalPlus specializes in the rigorous evaluation of LLM-synthesized code.
About VLMEvalKit
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Provides generation-based evaluation across all supported models with two assessment modes, exact matching and LLM-based answer extraction, eliminating manual data preparation across fragmented benchmark repositories. Supports distributed inference via LMDeploy and vLLM to accelerate evaluation of large-scale deployments, with specialized handling for models that have reasoning/thinking modes and for long-form outputs that exceed standard spreadsheet cell limits. Integrates with the Hugging Face ecosystem (model hosting, datasets, Spaces for leaderboards) and supports video benchmarks via ModelScope for comprehensive vision-language assessment.
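To make the generation step concrete, the sketch below follows the single-model inference pattern shown in the VLMEvalKit README. The registry key `'qwen_chat'` and the image path are illustrative assumptions taken from upstream examples, not a guaranteed stable interface.

```python
# Minimal sketch of VLMEvalKit's inference API, following the pattern in
# the project README. The model key and image path below are assumptions
# for illustration; available keys depend on your installed version.
from vlmeval.config import supported_VLM

# Instantiate a supported model by its registry key.
model = supported_VLM['qwen_chat']()

# A multi-modal message is an interleaved list of image paths and strings.
response = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(response)
```

Benchmark-level runs are driven by the toolkit's `run.py` entry point (e.g. something like `python run.py --data MMBench_DEV_EN --model qwen_chat`), which layers the exact-matching or LLM-based answer extraction described above on top of this generation step.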
About evalplus
evalplus/evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
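As a concrete illustration of EvalPlus's workflow, the sketch below follows the pattern in the project README: fetch the augmented HumanEval+ problems, synthesize one solution per task, and write a `samples.jsonl` file for scoring. The `generate_solution` function is a hypothetical stand-in for whatever code LLM you use.

```python
# Sketch of producing an EvalPlus-compatible samples file, following the
# pattern shown in the project README.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical placeholder: call your code LLM here and return the
    # synthesized implementation as a string.
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is a separate CLI step, per the README:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```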