// the find

EvolvingLMMs-Lab/lmms-eval

★ 4,261 · Python · NOASSERTION · updated Jun 2026

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

lmms-eval is an evaluation harness for multimodal language models (images, video, audio, and text), built as a fork of EleutherAI's lm-evaluation-harness. It targets ML researchers and teams that need reproducible benchmark numbers across 100+ tasks without stitching together a separate pipeline for each modality.

The HTTP eval server with async job submission is genuinely useful for decoupled training/eval workflows: submit a checkpoint eval job and keep training without blocking. The two-tier model API (Chat vs Simple) is a clean architectural decision; the Chat path, with structured `ChatMessages` and `apply_chat_template()`, handles interleaved multimodal content properly instead of relying on the `<image>` placeholder hack. Statistical rigor (confidence intervals, paired t-tests, clustered standard errors) puts it ahead of most eval tools, which report single numbers as if they were facts. The TorchCodec video I/O overhaul (up to 3.58x faster) addresses a real bottleneck that makes video eval painful on most setups.
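To make the point about single numbers concrete, here is a minimal sketch of a paired comparison between two models scored on the same benchmark items. It is not lmms-eval's implementation: the function name `paired_comparison` and the sample scores are hypothetical, and the interval uses a normal approximation (z = 1.96) rather than an exact t critical value, which is loose for small samples.

```python
import math
import statistics


def paired_comparison(scores_a, scores_b, z=1.96):
    """Compare two models scored on the same items (hypothetical helper).

    Returns the mean per-item difference, a paired t-statistic, and an
    approximate 95% confidence interval for the difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs scores for the same items")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two items to estimate variance")

    # Work on per-item differences so item difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)

    if std_err == 0:
        # Every item has the same difference, so the statistic degenerates.
        t_stat = 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    else:
        t_stat = mean_diff / std_err

    # Normal approximation; an exact interval would use a t quantile.
    ci = (mean_diff - z * std_err, mean_diff + z * std_err)
    return {"mean_diff": mean_diff, "t_stat": t_stat, "ci": ci}


# Made-up per-item correctness (1 = correct) for two models on 8 items.
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 0]
result = paired_comparison(model_a, model_b)
print(result)
```

A headline gap of 75% vs 37.5% accuracy looks decisive, but with only eight items the interval on the difference is wide, which is exactly the information a single number hides.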

The model support story is fragmented by design: because HuggingFace hasn't unified multimodal input/output formats, every model family needs its own wrapper class, and there are currently 80+ of them across `models/simple/` and `models/chat/`. That's a maintenance burden that scales poorly, and it creates an inconsistent experience when a model you care about only has a Simple wrapper. The eval server has no authentication at all and the README just says 'trusted environments only'. That's a reasonable scope call, but it means you can't expose it even on an internal network without bolting something on yourself. Dependency hell is real here: the README explicitly documents pinning httpx, protobuf, and numpy to work around conflicts, which is a sign of accumulated debt in the dependency graph. With task configs scattered across 100+ YAML files and no validation tooling, silent misconfiguration is easy to miss until your numbers look wrong.
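As a rough illustration of what a pre-flight check for task configs could look like, the sketch below validates an already-parsed config dict before a run. The schema is an assumption made for this example: the keys `task`, `dataset_path`, and `metric_list` follow lm-evaluation-harness conventions, but `REQUIRED_KEYS` and `validate_task_config` are hypothetical and not part of lmms-eval.

```python
# Hypothetical minimal schema: key name -> expected Python type.
REQUIRED_KEYS = {
    "task": str,
    "dataset_path": str,
    "metric_list": list,
}


def validate_task_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Check that each required key is present and has the expected type.
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected):
            actual = type(config[key]).__name__
            problems.append(f"{key} should be {expected.__name__}, got {actual}")

    # Each metric entry is assumed to be a mapping with a 'metric' name.
    metrics = config.get("metric_list")
    if isinstance(metrics, list):
        if not metrics:
            problems.append("metric_list is empty")
        for i, entry in enumerate(metrics):
            if not isinstance(entry, dict) or "metric" not in entry:
                problems.append(f"metric_list[{i}] has no 'metric' name")

    return problems


# A config with a missing key and a nameless metric entry.
broken = {"task": "demo_task", "metric_list": [{}]}
for problem in validate_task_config(broken):
    print(problem)
```

Running a check like this over every config in CI would turn a wrong-looking score into an explicit error message, at the cost of keeping the schema in sync with the task format.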
