VLMEvalKit and SciEvalKit
The two toolkits compete in the model-evaluation space: VLMEvalKit offers broader coverage of large multi-modality models (LMMs) and benchmarks, while SciEvalKit provides a specialized evaluation toolkit and leaderboard focused on scientific intelligence across the full research workflow.
About VLMEvalKit
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Provides generation-based evaluation across all supported models, with two assessment modes (exact matching and LLM-based answer extraction), removing the need for manual data preparation across fragmented benchmark repositories. Supports distributed inference via LMDeploy and vLLM to accelerate evaluation of large-scale deployments, with specialized handling for models with reasoning/thinking modes and for long-form outputs that exceed standard spreadsheet cell limits. Integrates with the Hugging Face ecosystem (model hosting, datasets, Spaces for leaderboards) and supports video benchmarks via ModelScope for comprehensive vision-language assessment.
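For illustration, here is a minimal sketch of a generation-based evaluation call, following the shape of VLMEvalKit's documented Python quickstart; the registry key, image path, and CLI flags shown are examples and may differ across versions, so treat the specifics as assumptions rather than a definitive API reference.

```python
# Hedged sketch of VLMEvalKit's quickstart-style API; the model key
# 'idefics_9b_instruct' and the image path are illustrative only.
from vlmeval.config import supported_VLM

# Instantiate a registered LMM by its config key.
model = supported_VLM['idefics_9b_instruct']()

# Generation-based evaluation: the model produces free-form text for an
# (image, question) pair; downstream scoring then applies exact matching
# or LLM-based answer extraction to the response.
response = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(response)

# Full benchmark runs typically go through the CLI entry point, e.g.:
#   python run.py --data MMBench_DEV_EN --model qwen_chat --verbose
```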
About SciEvalKit
InternScience/SciEvalKit
A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision–language models across the full research workflow.