// the find
stanford-crfm/helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
HELM is Stanford CRFM's framework for running standardized evaluations across dozens of LLM benchmarks (MMLU, GPQA, WildBench, etc.) against models from OpenAI, Anthropic, Google, and others. It's aimed at researchers and ML engineers who want reproducible, apples-to-apples benchmark comparisons instead of vendor-reported numbers. Note: it entered maintenance mode on June 1, 2026.
The unified adapter layer is genuinely useful: you write one run spec and it handles prompt formatting, few-shot construction, and metric collection across wildly different benchmarks. The caching layer for API responses means you can re-run analysis without re-spending API budget. The web UI for inspecting individual prompt/response pairs is the kind of thing that catches prompt format bugs that aggregate scores hide. Coverage is broad: text, vision-language, audio, image generation, medical. Most serious evaluation needs already exist as scenarios.
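To make the caching point concrete, here is a minimal sketch of the general technique: hash the request, look it up, and only call the API on a miss. This is not HELM's actual cache code. `request_key`, `ResponseCache`, and `get_or_compute` are hypothetical names invented for illustration, and the SQLite table layout is an assumption.

```python
import hashlib
import json
import sqlite3


def request_key(request: dict) -> str:
    """Return a stable hash for a request; sort_keys makes key order irrelevant."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical on-disk cache keyed by request hash (illustration only)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, body TEXT)"
        )

    def get_or_compute(self, request: dict, compute):
        """Return (response, was_cached); compute(request) runs only on a miss."""
        key = request_key(request)
        row = self.db.execute(
            "SELECT body FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0]), True
        response = compute(request)  # the expensive API call would happen here
        self.db.execute(
            "INSERT INTO responses (key, body) VALUES (?, ?)",
            (key, json.dumps(response)),
        )
        self.db.commit()
        return response, False
```

The practical consequence is the one described above: once every request in a run has been answered, re-scoring or re-summarizing the same run hits only the cache, so changing a metric costs no additional API budget.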
Maintenance mode is a real concern for production adoption: bug fixes will happen, but new model integrations and benchmark additions are unlikely to land quickly, which in a field moving this fast means it will fall behind. The codebase is large and the abstraction layers (scenarios, adapters, annotators, metrics, runners) add real onboarding friction; adding a new benchmark is not a 30-minute job. Running full evaluations at scale requires significant API spend and wall-clock time, with no built-in cost estimation before you commit. The frontend is a static file server tied to pre-baked JSON outputs, so you can't query or filter results dynamically without re-running summarization.
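Since there is no built-in cost estimate, a rough back-of-the-envelope calculation before a large run is worth doing by hand. The sketch below is one way to do it, not part of HELM. `RunPlan` and `estimate_cost_usd` are hypothetical names, and the instance counts, token averages, and per-million-token prices are placeholders you would replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class RunPlan:
    """Hypothetical description of one benchmark run (illustration only)."""
    name: str
    num_instances: int
    avg_prompt_tokens: int       # includes few-shot examples in the prompt
    avg_completion_tokens: int


def estimate_cost_usd(
    plan: RunPlan,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate API spend as tokens times price, with prices per million tokens."""
    input_tokens = plan.num_instances * plan.avg_prompt_tokens
    output_tokens = plan.num_instances * plan.avg_completion_tokens
    total = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder figures, not measured values or real vendor prices.
    plans = [
        RunPlan("benchmark-a", 5_000, 1_800, 5),
        RunPlan("benchmark-b", 1_000, 600, 900),
    ]
    for plan in plans:
        cost = estimate_cost_usd(plan, input_price_per_mtok=1.0, output_price_per_mtok=2.0)
        print(f"{plan.name}: ~${cost:.2f}")
```

Few-shot prompts dominate the input side, so even a crude average prompt length gets you within the right order of magnitude before committing to a full suite.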