// the find

open-compass/VLMEvalKit

★ 4,219 · Python · Apache-2.0 · updated Jun 2026

Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks

VLMEvalKit is a unified evaluation harness for vision-language models: you point it at one of 220+ supported models and one of 80+ benchmarks, and it handles data download, inference, and scoring. It's maintained by the OpenCompass team at Shanghai AI Lab and is the backend for the HuggingFace Open VLM Leaderboard, so the numbers it produces are the ones people actually cite in papers.

The abstraction is well-designed: adding a new model requires implementing only `generate_inner()`, and the framework handles batching, retries, data prep, and metric calculation. Using LLM-based answer extraction as a fallback when exact matching fails is a genuinely useful design choice, because it catches valid answers that don't match the expected string exactly. Multi-node distributed inference via LMDeploy and vLLM was added for thinking models, which matters when you're running models with 72B+ parameters. Because the toolkit backs the leaderboard, results produced here are directly comparable to the leaderboard's published numbers.
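To make the two design points above concrete, here is a minimal, self-contained sketch. It does not import VLMEvalKit: `BaseModelSketch`, `FlakyEchoModel`, `extract_choice`, and `regex_fallback` are hypothetical names invented for illustration, the list-of-dicts message shape is an assumption, and the regex fallback merely stands in for the LLM call the toolkit would make. Only the `generate_inner()` method name comes from the project.

```python
import re
from abc import ABC, abstractmethod


class BaseModelSketch(ABC):
    """Hypothetical stand-in for a toolkit base class (not the real one)."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries

    @abstractmethod
    def generate_inner(self, message, dataset=None):
        """Model-specific inference: the only method a new model supplies."""

    def generate(self, message, dataset=None):
        # Shared plumbing: retry transient failures around the model call.
        last_error = None
        for _ in range(self.max_retries):
            try:
                return self.generate_inner(message, dataset=dataset)
            except RuntimeError as err:
                last_error = err
        raise RuntimeError(f"failed after {self.max_retries} attempts") from last_error


class FlakyEchoModel(BaseModelSketch):
    """Toy model: fails on the first call, then echoes the text parts."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def generate_inner(self, message, dataset=None):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("simulated transient failure")
        return " ".join(m["value"] for m in message if m["type"] == "text")


def regex_fallback(prediction, choices):
    # Stand-in for an LLM extractor: pick the first standalone option letter.
    for letter in re.findall(r"\b([A-Z])\b", prediction):
        if letter in choices:
            return letter
    return None


def extract_choice(prediction, choices, fallback=None):
    """Try exact matching first; defer to the fallback only when it fails."""
    cleaned = prediction.strip().upper()
    if cleaned in choices:
        return cleaned
    return fallback(prediction, choices) if fallback is not None else None
```

A new model only overrides `generate_inner()`, and callers go through `generate()`. On the scoring side, `extract_choice("The answer is (C).", ["A", "B", "C", "D"], fallback=regex_fallback)` recovers `"C"` where exact matching alone would return nothing.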

The transformers version matrix is a maintenance nightmare: different models require specific versions ranging from 4.33.0 to 5.2.0+, so you essentially can't evaluate multiple model families in the same environment without container isolation. Results may not reproduce exact paper numbers, because the toolkit uses generation-based evaluation whereas some benchmarks use perplexity-based (PPL) evaluation, so you can't use it to directly validate a paper's claimed score. The project structure sprawls considerably (100+ dataset files in a flat directory with inconsistent naming conventions), and there's no evidence of a clean plugin system despite the stated goal of easy extensibility. Checkpointing and resumability for long evaluation runs spanning hundreds of questions are handled through TSV/xlsx output files rather than a proper job system, which is fragile on preemptible compute.
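For readers unfamiliar with file-based resumption, the sketch below shows the general pattern and why it is fragile. It is not the toolkit's code: `load_done`, `run_resumable`, and the `index`/`prediction` column names are assumptions made up for this example.

```python
import csv
import os


def load_done(path):
    """Return the set of question indices already recorded in the TSV."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["index"] for row in csv.DictReader(f, delimiter="\t")}


def run_resumable(questions, predict, path):
    """Append one row per answered question, skipping ones already on disk."""
    done = load_done(path)
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["index", "prediction"], delimiter="\t"
        )
        if needs_header:
            writer.writeheader()
        for q in questions:
            if str(q["index"]) in done:
                continue  # answered in an earlier, interrupted run
            writer.writerow({"index": q["index"], "prediction": predict(q)})
            f.flush()  # hand each row to the OS before starting the next one
    return load_done(path)
```

The output file doubles as the checkpoint, which is simple but has no locking or atomic writes. A preemption in the middle of a row can leave a truncated line that the next run has to tolerate, and two workers appending to the same file can interleave rows.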
