// the find

openai/evals

★ 18,672 · Python · NOASSERTION · updated Apr 2026

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

OpenAI's framework for building and running evaluations against LLMs, paired with a registry of community-contributed benchmarks. It is aimed at teams who need repeatable, structured tests for model quality as they upgrade or fine-tune: the kind of thing you reach for when 'it seems better' isn't good enough. The eval registry itself is the real asset; the framework is glue code around it.

The YAML-driven eval definition means non-engineers can contribute test cases without touching Python: samples live in JSONL files, grading logic in a config file. The model-graded eval pattern (using a second LLM as judge) is well thought out and handles cases where exact-match scoring breaks down. The elsuite collection covers genuinely interesting agent scenarios (tool use, multi-turn games, ML agent benchmarks) that go beyond 'does the model know the capital of France'. Git LFS for the data registry is the right call, since it keeps the repo cloneable without pulling gigabytes by default.
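To make the exact-match versus model-graded distinction concrete, here is a minimal sketch in plain Python. It does not use the evals library: the `input`/`ideal` field names, the prompt wording, and every function here are assumptions for illustration, and `stub_judge` is a hypothetical stand-in for the second LLM.

```python
import json
from typing import Callable

# Test cases as JSON lines. The "input"/"ideal" field names are assumptions.
SAMPLES_JSONL = """\
{"input": "What is the capital of France?", "ideal": "Paris"}
{"input": "Name the capital of Japan.", "ideal": "Tokyo"}
"""


def load_samples(text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def exact_match(answer: str, ideal: str) -> bool:
    """Strict scoring: passes only if the strings are identical after trimming."""
    return answer.strip() == ideal.strip()


def model_graded(sample: dict, answer: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge (normally a second LLM) whether the answer agrees with the ideal."""
    prompt = (
        f"Question: {sample['input']}\n"
        f"Expected: {sample['ideal']}\n"
        f"Submitted: {answer}\n"
        "Does the submission agree with the expected answer? Reply Y or N."
    )
    return judge(prompt).strip().upper().startswith("Y")


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: 'Y' if the expected text appears in the submission."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    return "Y" if fields["Expected"].lower() in fields["Submitted"].lower() else "N"


def score(samples: list[dict], answers: list[str], judge: Callable[[str], str]) -> dict:
    """Count passes under each grading scheme."""
    pairs = list(zip(samples, answers))
    return {
        "exact": sum(exact_match(a, s["ideal"]) for s, a in pairs),
        "judged": sum(model_graded(s, a, judge) for s, a in pairs),
    }
```

Given the answers "The capital of France is Paris." and "Kyoto", exact match passes neither, while the judged score passes the first: a correct but verbose answer is exactly the case where strict string comparison breaks down.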

It's effectively OpenAI-only despite the 'LLM-agnostic' framing: the completion function abstraction exists, but the tooling, docs, and registry all assume you're calling OpenAI endpoints. Support for running it against Claude or Gemini feels like an afterthought. The README now opens with what amounts to 'just use the dashboard instead', which is a signal that this repo is in maintenance mode rather than active development. The elsuite evals each have their own bespoke structure with no shared conventions, so reading one doesn't help you read the next. Snowflake as the only supported database backend for results (beyond local log files) is a strange choice that most teams will have to ignore or work around.
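For readers wondering what a completion function abstraction buys you, the sketch below shows the general idea in plain Python: the eval loop depends only on a callable that returns completions, so a non-OpenAI backend is in principle just another adapter. All names here are illustrative and not imported from the evals library, and `UppercaseCompletionFn` is a hypothetical toy backend standing in for a real provider SDK call.

```python
from typing import Protocol


class CompletionResult(Protocol):
    """Anything that can hand back a list of completion strings."""

    def get_completions(self) -> list[str]: ...


class CompletionFn(Protocol):
    """Anything callable with a prompt that returns a CompletionResult."""

    def __call__(self, prompt: str, **kwargs) -> CompletionResult: ...


class TextResult:
    """Minimal result wrapper holding a single completion."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class UppercaseCompletionFn:
    """Hypothetical backend: a real adapter would call another provider's SDK here."""

    def __call__(self, prompt: str, **kwargs) -> TextResult:
        return TextResult(prompt.upper())


def run_prompts(completion_fn: CompletionFn, prompts: list[str]) -> list[str]:
    """Backend-agnostic loop: it only relies on the protocol above."""
    return [completion_fn(p).get_completions()[0] for p in prompts]
```

The abstraction itself is cheap to satisfy; the friction described above comes from everything around it (docs, registry defaults, tooling) assuming one provider.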
