// the find
MMMU-Benchmark/MMMU
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
MMMU is an academic benchmark for testing multimodal models on college-level questions across 30 subjects, covering everything from music theory to clinical medicine. It's aimed at researchers evaluating vision-language models, not practitioners building products. The companion MMMU-Pro variant adds harder conditions: questions embedded in images, and multiple choice with up to 10 options instead of 4.
The breadth is genuine: 183 subfields and 30 image types (chemical structures, music sheets, maps) mean you actually stress-test domain-specific visual reasoning, not just captioning. The test set answers were recently released (Feb 2026), so you can now run full evaluations locally without going through EvalAI. MMMU-Pro's vision-only setting (question embedded in the image) is a meaningful difficulty increase: the model has to read the question from pixels, which makes it harder to score well through text shortcuts instead of real comprehension. Hosting on HuggingFace means data loading is one import away.
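For a sense of what a local run might involve, here is a minimal sketch in Python. It assumes the data lives at `MMMU/MMMU` on the Hub with one config per subject, and that each record carries `question`, `options` (a stringified list), and `answer` fields; check those against the dataset card before relying on them. The helper names (`load_subject`, `format_prompt`, `accuracy`) are made up for illustration and are not part of the repo.

```python
import ast
import string


def load_subject(subject="Accounting", split="validation"):
    """Fetch one subject from the Hub (repo id and config name are assumptions)."""
    # Lazy import: needs `pip install datasets` plus network access.
    from datasets import load_dataset
    return load_dataset("MMMU/MMMU", subject, split=split)


def format_prompt(record):
    """Render a record as a lettered multiple-choice prompt."""
    options = record["options"]
    if isinstance(options, str):
        # Assumption: options are stored as a stringified Python list.
        options = ast.literal_eval(options)
    lines = [record["question"]]
    lines += [
        f"({letter}) {text}"
        for letter, text in zip(string.ascii_uppercase, options)
    ]
    return "\n".join(lines)


def accuracy(records, predictions):
    """Fraction of predicted letters that match the gold `answer` field."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    hits = sum(
        pred.strip().upper() == rec["answer"].strip().upper()
        for rec, pred in zip(records, predictions)
    )
    return hits / len(records)
```

The two helpers work on plain dicts, so you can dry-run the scoring logic offline before pulling any data; only `load_subject` touches the network.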
581 stars for a CVPR 2024 paper is low. The benchmark has been largely superseded in the leaderboard race by harder successors, and the community has moved on. The evaluation code covers only a handful of models (LLaVA 1.5, Qwen-VL, Gemini, GPT) with no abstraction layer; plugging in a new model means copy-pasting an inference script. The repo has seen essentially no active development since the test-set answer release, and open issues and PRs are not being addressed. With multiple choice as the overwhelmingly dominant format, you're also mostly measuring answer selection, not generation, which misses a lot of what makes multimodal reasoning hard.
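To make the missing abstraction layer concrete, here is a minimal sketch of what one could look like. Nothing in it comes from the repo: `Question`, `VisionLanguageModel`, `AlwaysA`, and `evaluate` are hypothetical names, and a real adapter would wrap an actual model call where the stub returns a fixed letter.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Question:
    """Hypothetical container for one benchmark item."""
    prompt: str
    image_paths: Sequence[str]
    answer: str  # gold option letter, e.g. "B"


class VisionLanguageModel(Protocol):
    """Anything with this method can be evaluated; no inheritance needed."""

    def answer(self, prompt: str, image_paths: Sequence[str]) -> str:
        ...


class AlwaysA:
    """Stub model standing in for a real adapter (API client, local model)."""

    def answer(self, prompt: str, image_paths: Sequence[str]) -> str:
        return "A"


def evaluate(model: VisionLanguageModel, questions: Sequence[Question]) -> float:
    """Run every question through the model and return plain accuracy."""
    if not questions:
        return 0.0
    hits = 0
    for q in questions:
        predicted = model.answer(q.prompt, q.image_paths)
        if predicted.strip().upper() == q.answer.strip().upper():
            hits += 1
    return hits / len(questions)
```

With an interface like this, adding a model is one small class rather than a copied script, and the evaluation loop stays in a single place.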