// the find
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
The de facto standard benchmarking framework for LLMs, used by Hugging Face's Open LLM Leaderboard and hundreds of papers. It supports 60+ academic benchmarks through a unified interface that works across local HF models, vLLM, SGLang, and commercial APIs. It is aimed primarily at ML researchers and teams who need reproducible, comparable evaluation numbers.
- Backend coverage is genuinely impressive: HF transformers, vLLM, SGLang, llama.cpp, NeMo, Megatron-LM, OpenVINO, and a dozen commercial APIs all work through the same task interface, so switching backends doesn't require rewriting eval logic.
- YAML-based task configuration lets you define or modify benchmarks without touching Python: Jinja2 prompt templating, answer-extraction regexes, and metric selection are all declarative, which makes adding new tasks or prompt variants low-friction.
- The parallelism story is well thought out: data-parallel via the accelerate launcher, tensor-parallel via parallelize=True or tp_plan=auto, and vLLM/SGLang handle their own batching, so you have real options for large models instead of one awkward path.
- It's the reference implementation used for public leaderboards, which means results are directly comparable to published numbers without having to reverse-engineer someone else's eval setup.
- The tasks directory has exploded into hundreds of YAML files with massive duplication — afrimgsm alone has 5 prompt variants × 20 languages × 2 cot/non-cot modes as separate files. There's no clear templating discipline enforced, so the repo is hard to navigate and PRs adding tasks routinely just copy-paste existing YAMLs.
- vLLM and HF can produce different logprob results, and the harness acknowledges this with a comparator script rather than actually resolving the discrepancy — if you're doing loglikelihood-based evaluations, you need to verify which backend you trust and stick with it.
- Multimodal support (hf-multimodal, vllm-vlm) is explicitly labeled a prototype and is incomplete; the README itself redirects users to a fork (lmms-eval) for serious multimodal work, so this isn't something you can rely on.
- No native multi-node support for HF models — the docs just say 'use an external inference server' and point to GPT-NeoX as an example. For anyone running evaluations on clusters without vLLM, this is a real gap.
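
On the logprob discrepancy point above, here is a minimal sketch of the kind of check you might run yourself before trusting a backend. It is not the harness's comparator script and it does not call any lm-evaluation-harness API. The function name compare_backends, the Scores layout (doc id mapped to one loglikelihood per answer choice), and the sample numbers are all hypothetical, chosen only to illustrate the idea: for multiple-choice tasks, small numeric drift matters mostly when it changes which choice scores highest.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: doc id -> loglikelihood for each answer choice.
Scores = Dict[str, List[float]]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_backends(a: Scores, b: Scores) -> Tuple[float, List[str]]:
    """Compare per-choice loglikelihoods from two backends.

    Returns the largest absolute difference seen across shared docs,
    plus the ids of docs where the top-scoring choice differs.
    """
    max_diff = 0.0
    flipped: List[str] = []
    for doc_id in sorted(set(a) & set(b)):
        xs, ys = a[doc_id], b[doc_id]
        if not xs or len(xs) != len(ys):
            raise ValueError(f"choice lists do not line up for {doc_id}")
        max_diff = max(max_diff, max(abs(x - y) for x, y in zip(xs, ys)))
        if argmax(xs) != argmax(ys):
            flipped.append(doc_id)
    return max_diff, flipped


# Made-up numbers for illustration only, not measured results.
hf_scores = {
    "doc-0": [-1.20, -3.40, -2.10],
    "doc-1": [-2.00, -1.95, -4.00],
}
vllm_scores = {
    "doc-0": [-1.25, -3.38, -2.10],
    "doc-1": [-1.93, -1.97, -4.00],
}

max_diff, flipped = compare_backends(hf_scores, vllm_scores)
print(f"max abs diff: {max_diff:.3f}, docs with a different top choice: {flipped}")
```

In this made-up sample the numeric drift is small, yet doc-1 still ends up with a different top choice, which is the case that would move an accuracy number. How you export per-document loglikelihoods from each backend is left out here and depends on your setup.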