// the find

bigcode-project/bigcodebench

★ 509 · Python · Apache-2.0 · updated Jan 2026

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

BigCodeBench is a code generation benchmark with 1,140 tasks that test LLMs on real-world programming scenarios requiring diverse library calls and multi-step reasoning, not just toy algorithmic puzzles. It's aimed at ML researchers and teams deciding which model to use for code generation. The ICLR'25 acceptance and adoption by DeepSeek, Meta, Qwen, and others give it credibility as a serious eval.

The split into Complete (docstring-driven) and Instruct (natural language) variants lets you test base models and chat models separately without conflating them. Pre-generated samples from 160+ models are publicly available, so you can compare your model against the field without running everything yourself. The decontamination artifacts (n-gram overlap checks against StackOverflow and StarCoder training data) show the authors actually thought about dataset contamination, which most benchmarks skip. Docker images for both generation and evaluation make the pipeline reproducible without wrestling with dependency hell.
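For readers who haven't seen one, an n-gram overlap check of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not the repo's actual decontamination code: the function names, whitespace tokenization, and default `n` are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[Ngram]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(task_text: str, corpus_docs: Iterable[str], n: int = 10) -> float:
    """Fraction of the task's n-grams that also appear somewhere in the corpus.

    Hypothetical helper: a high ratio hints that the task text may have
    leaked into training data; a ratio of 0.0 means no shared n-grams.
    """
    task_grams = ngrams(task_text, n)
    if not task_grams:
        return 0.0
    corpus_grams: Set[Ngram] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(task_grams & corpus_grams) / len(task_grams)


if __name__ == "__main__":
    task = "read a csv file and plot the mean of each column"
    corpus = ["how do I read a csv file and plot it"]
    # With trigrams, 4 of the task's 9 n-grams also occur in the corpus document.
    print(overlap_ratio(task, corpus, n=3))
```

A real check would use a proper tokenizer and an index over the corpus rather than an in-memory set, but the comparison being made is the same.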

The `tools/fix_v019.py` through `fix_v025.py` patch files are a red flag: a run of version-specific fixup scripts suggests the dataset has had ongoing correctness issues that were patched retroactively rather than handled through a clean versioned release process. Batch inference results vary with batch size because of vLLM behavior, which the README admits but doesn't fully resolve; you can set the batch size to 1 for greedy decoding, but that makes full evaluations expensive. Remote evaluation depends on either E2B (slow, paid) or a Hugging Face Gradio space (also slow, shared), so running this at any scale means standing up your own evaluator endpoint. The last push was in January 2026, and the repo feels like it's in maintenance mode, with no active development trajectory beyond pointing at BigCodeArena.
