// the find

karpathy/autoresearch

★ 88,456 · Python · updated Mar 2026

AI agents running research on single-GPU nanochat training automatically

autoresearch gives an AI agent a minimal single-GPU LLM training setup and lets it run experiments overnight on its own: edit train.py, measure val_bpb, keep improvements, discard regressions, repeat. It's Karpathy's take on automated ML research, distilled to three files and one metric. Aimed at researchers who want to explore architecture and optimizer ideas without babysitting each run.
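The edit / measure / keep-or-discard cycle described above amounts to a greedy search over versions of train.py. Here is a minimal sketch of that idea, not the repo's actual code: `greedy_search`, `propose`, and `evaluate` are hypothetical names, with `propose` standing in for the agent's edit and `evaluate` standing in for a training run that returns val_bpb (lower is better).

```python
from typing import Callable, List, Tuple


def greedy_search(
    baseline: str,
    propose: Callable[[str], str],
    evaluate: Callable[[str], float],
    n_experiments: int,
) -> Tuple[str, float, List[Tuple[float, bool]]]:
    """Keep a candidate only if it strictly lowers the metric.

    `propose` and `evaluate` are hypothetical stand-ins for the agent's
    edit to train.py and for one time-boxed training run.
    """
    best_src = baseline
    best_score = evaluate(baseline)
    history: List[Tuple[float, bool]] = []

    for _ in range(n_experiments):
        candidate = propose(best_src)      # agent edits the current best
        score = evaluate(candidate)        # one fixed-budget run
        kept = score < best_score          # improvement: keep it
        if kept:
            best_src, best_score = candidate, score
        history.append((score, kept))      # regressions are recorded, then dropped

    return best_src, best_score, history
```

Because each candidate is derived from the current best version, a rejected edit leaves no trace in the next proposal, which is what keeps the diffs small.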

The fixed 5-minute time budget per experiment is a good constraint: it makes runs on the same machine comparable regardless of model size or batch size changes, and it means the agent optimizes for your specific hardware. The scope is deliberately narrow (one file the agent touches, one metric to optimize), which keeps diffs reviewable and the loop honest. Using val_bpb as the metric is the right call: bits per byte is normalized by the byte length of the text rather than the token count, so it is vocab-size-independent and architectural experiments that change tokenization aren't penalized unfairly. The program.md indirection is clever; you're writing research org instructions, not code, which is a natural way to steer what the agent tries next.
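To see why a bits-per-byte metric does not depend on vocabulary size, here is a small sketch using the standard definition: total negative log-likelihood in nats, divided by ln 2 times the number of bytes in the evaluated text. The function name `bits_per_byte` is an assumption for illustration, and the repo's own computation may differ in detail.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed token-level loss (nats) into bits per byte of text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Same 400-byte text, two hypothetical tokenizers:
# a coarse one (100 tokens at 2.0 nats/token) and
# a fine one (200 tokens at 1.0 nats/token).
coarse = bits_per_byte(100 * 2.0, 400)
fine = bits_per_byte(200 * 1.0, 400)
```

The per-token losses differ by a factor of two, yet `coarse` and `fine` are equal, because the denominator counts bytes rather than tokens.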

Requires an H100 or similar NVIDIA GPU to run as-is. The README acknowledges this and punts CPU/MPS support to forks, so most people can't reproduce the headline results without renting hardware. Results are not portable across machines: because the budget is fixed wall-clock time, an H100 run and a 4090 run complete different amounts of training and are incomparable, which limits community benchmarking. There's no experiment tracking beyond whatever the agent logs to stdout: no database of runs, no diff history of train.py changes, no way to retrospectively understand why a change helped. The agent integration is entirely manual: you paste a prompt into Claude or Codex yourself. There's no wrapper that actually closes the loop programmatically, so 'autonomous' is somewhat aspirational unless you build the scaffolding yourself.
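If you do build that scaffolding, the missing run history is cheap to add. Below is a minimal sketch using only the Python standard library; the table schema and the names `open_log` and `record_run` are assumptions for illustration and are not part of the repo.

```python
import difflib
import sqlite3
import time


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a run log. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ts REAL NOT NULL, "
        "val_bpb REAL NOT NULL, "
        "kept INTEGER NOT NULL, "
        "diff TEXT NOT NULL)"
    )
    return conn


def record_run(
    conn: sqlite3.Connection,
    before: str,
    after: str,
    val_bpb: float,
    kept: bool,
) -> int:
    """Store one experiment: the train.py diff, its score, and the decision."""
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="train.py (before)",
            tofile="train.py (after)",
        )
    )
    cur = conn.execute(
        "INSERT INTO runs (ts, val_bpb, kept, diff) VALUES (?, ?, ?, ?)",
        (time.time(), val_bpb, int(kept), diff),
    )
    conn.commit()
    return cur.lastrowid
```

Storing the diff next to the score makes it possible to ask later which edits were kept and what they changed, for example with `SELECT diff FROM runs WHERE kept = 1 ORDER BY val_bpb`.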
