// the find
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
OpenCompass is a benchmarking framework for evaluating LLMs across 100+ datasets covering reasoning, math, code, long-context, and subjective tasks. It supports both local HuggingFace models and API-based models (OpenAI, Claude, etc.), with distributed evaluation driven by a config-file pipeline. The primary audience is ML researchers who need reproducible, standardized benchmark runs rather than vibe-checking a model.
The backend-agnostic inference layer is genuinely useful: you can swap between HuggingFace, vLLM, and LMDeploy with a single CLI flag, which matters when you're comparing models at different scales. The dataset coverage is wide and growing fast, with properly versioned configs so runs are reproducible (the hash suffixes on config filenames like `ARC_c_gen_1e0de5.py` serve a real purpose). The CascadeEvaluator added in 2025 lets you chain rule-based and LLM-judge evaluation steps, which is the right answer for math/reasoning benchmarks where regex matching alone fails. Active maintenance from a real org (Shanghai AI Lab), with CI pipelines and a public leaderboard, means benchmark configs track model releases quickly.
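To make the cascade idea concrete, here is a minimal sketch of a two-stage check: a cheap regex rule runs first, and only answers it rejects go to a judge. This is not OpenCompass's CascadeEvaluator API; `rule_check`, `cascade_score`, and `stub_judge` are hypothetical names, and the judge is a hard-coded stand-in for a real LLM call.

```python
import re
from typing import Callable, Tuple

# Assumed interface: a judge takes (prediction, reference) and returns a verdict.
Judge = Callable[[str, str], bool]


def rule_check(prediction: str, reference: str) -> bool:
    """Cheap first stage: the last number in the prediction must equal the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", prediction)
    return bool(numbers) and float(numbers[-1]) == float(reference)


def cascade_score(prediction: str, reference: str, judge: Judge) -> Tuple[bool, str]:
    """Return (is_correct, stage_that_decided)."""
    if rule_check(prediction, reference):
        return True, "rule"
    # Only answers the rule rejects reach the (expensive) judge.
    return judge(prediction, reference), "judge"


def stub_judge(prediction: str, reference: str) -> bool:
    """Hypothetical stand-in for an LLM judge; it only knows 'one half' means 0.5."""
    return "one half" in prediction.lower() and float(reference) == 0.5


print(cascade_score("The answer is 42.", "42", stub_judge))     # (True, 'rule')
print(cascade_score("It equals one half.", "0.5", stub_judge))  # (True, 'judge')
print(cascade_score("I think it is 7.", "8", stub_judge))       # (False, 'judge')
```

The second call is the case the regex alone would get wrong: the answer is correct but contains no digits, so the rule stage rejects it and the judge stage rescues it.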
Setup friction is real: datasets aren't bundled, so you either wget a zip from a 2024 release or configure ModelScope, and the list of datasets supported for auto-download is a fraction of what the framework can run. The config system uses mmcv-style Python files, which is powerful but produces a proliferation of near-identical files with hash suffixes and no obvious way to know which is current without reading the docs. Dependency conflicts between inference backends (vLLM vs LMDeploy) require separate virtual environments, which the README buries as a tip rather than treating as a first-class concern. The project is heavily oriented toward Chinese-language model evaluation (InternLM, Qwen, CEVAL, CMMLU get most attention), so if you're evaluating English-only models, you'll spend time filtering noise.
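One way to get a handle on the config sprawl is to group filenames by their base name and list the hash variants side by side. The sketch below assumes the suffix is always six lowercase hex characters, as in `ARC_c_gen_1e0de5.py`; apart from that one, the filenames in the example list are made up for illustration, and `group_by_base` is a hypothetical helper, not part of OpenCompass.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: versioned configs look like <base>_<6 hex chars>.py
CONFIG_NAME = re.compile(r"^(?P<base>.+)_(?P<hash>[0-9a-f]{6})\.py$")


def group_by_base(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each config base name to the hash suffixes found for it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        match = CONFIG_NAME.match(name)
        if match:  # files without a hash suffix are skipped
            groups[match.group("base")].append(match.group("hash"))
    return dict(groups)


# Only the first filename comes from the text; the rest are illustrative.
example_files = [
    "ARC_c_gen_1e0de5.py",
    "ARC_c_gen_abc123.py",
    "ARC_c_ppl_a450bd.py",
    "ARC_c_gen.py",
]

for base, hashes in group_by_base(example_files).items():
    print(f"{base}: {len(hashes)} variant(s) -> {', '.join(hashes)}")
```

In practice you would feed it something like `os.listdir()` on a dataset config directory. It tells you how many variants exist per dataset, not which one is current; that still takes a look at the docs or the file contents.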