// the find

aiming-lab/AutoResearchClaw

★ 13,386 · Python · MIT · updated Jun 2026

Fully autonomous & self-evolving research from idea to paper. Chat an Idea. Get a Paper. 🦞

AutoResearchClaw is a 23-stage Python pipeline that takes a plain-text research topic and outputs a LaTeX paper with real literature citations, sandbox-executed experiments, figures, and multi-agent peer review. It targets ML researchers and academics who want to rapidly prototype research ideas or explore new directions, with optional human-in-the-loop collaboration at key decision points. Think of it as a research scaffolding tool, not a replacement for original scientific thought.

The 4-layer citation verification (arXiv ID → CrossRef DOI → Semantic Scholar title match → LLM relevance scoring) directly addresses the hallucinated-references problem that plagues LLM writing tools: fake references get killed before the paper is written, not after. The HITL co-pilot system has real depth: six intervention modes with per-stage policies, a 'SmartPause' that triggers on low-confidence stages, and CLI attach/approve/reject commands that let you supervise a running pipeline from another terminal. ARC-Bench, a 55-topic benchmark with rubrics spanning ML, physics, quantum, biology, and statistics, is a genuine contribution independent of the main tool, because the field needs a standardized eval for autonomous research agents. Self-healing experiment execution with NaN/Inf detection, AST-validated code generation, and up to 10 iterative repair rounds is the right architecture for unattended sandbox runs.
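To make the self-healing idea concrete, here is a minimal sketch of a bounded repair loop: validate generated code with the standard-library `ast` module, run it, reject NaN/Inf metrics, and hand any failure to a repair step for at most 10 rounds. This is an illustration under stated assumptions, not AutoResearchClaw's actual code. Every name here (`self_heal`, `check_syntax`, `check_metrics`, `run_experiment`, `toy_repair`) is hypothetical, the convention that an experiment exposes a `metrics` dict is invented for the example, and plain `exec()` stands in for a real sandbox.

```python
import ast
import math
from typing import Callable, Dict, Optional, Tuple

MAX_REPAIR_ROUNDS = 10  # mirrors the "up to 10 repair rounds" described above


def check_syntax(source: str) -> Optional[str]:
    """Return an error message if the source does not parse, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error: {exc.msg} (line {exc.lineno})"
    return None


def check_metrics(metrics: Dict[str, float]) -> Optional[str]:
    """Return an error message if any metric is NaN or infinite, else None."""
    for name, value in metrics.items():
        if not math.isfinite(value):
            return f"non-finite metric: {name}={value}"
    return None


def run_experiment(source: str) -> Dict[str, float]:
    """Execute the source and return its `metrics` dict (assumed convention).

    Plain exec() is NOT a sandbox; it only stands in for an isolated runner.
    """
    namespace: dict = {}
    exec(compile(source, "<experiment>", "exec"), namespace)
    return dict(namespace.get("metrics", {}))


def self_heal(
    source: str,
    repair: Callable[[str, str], str],
    max_rounds: int = MAX_REPAIR_ROUNDS,
) -> Tuple[Optional[Dict[str, float]], int]:
    """Validate, run, and repair until the run is clean or the budget is spent.

    Returns (metrics, repair_rounds_used); metrics is None if repair gave up.
    """
    rounds = 0
    while True:
        metrics = None
        error = check_syntax(source)
        if error is None:
            try:
                metrics = run_experiment(source)
                error = check_metrics(metrics)
            except Exception as exc:  # any runtime failure goes to repair
                error = f"runtime error: {exc!r}"
        if error is None:
            return metrics, rounds
        if rounds >= max_rounds:
            return None, rounds
        source = repair(source, error)  # in a real system, an LLM call
        rounds += 1


def toy_repair(source: str, error: str) -> str:
    """Hypothetical stand-in for an LLM repair step."""
    return source.replace("float('nan')", "0.25")


broken = "metrics = {'loss': float('nan')}"
result, rounds_used = self_heal(broken, toy_repair)
print(result, rounds_used)
```

The hard cap is the important design choice: without it, a repair step that keeps returning broken code would loop forever on an unattended run. Returning the number of rounds used also gives the caller a cheap signal for how fragile a given experiment was.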

The showcase papers are all low-stakes ML recombination tasks (random matrix theory, LoRA variants, token merging ablations). There is no evidence the pipeline produces anything a domain expert would consider novel, and the 'fully autonomous' framing papers over that gap. The star count (13K in roughly 3 months) is a hype signal, not a validation signal: there is almost no public evidence of anyone running this to completion on a real research problem they cared about. The config surface is enormous. One YAML file controls 23 stages, 6 agent subsystems, HITL policies, the MetaClaw bridge, Docker networking, and cost guardrails, so when something breaks at stage 14 of a multi-hour run you are debugging a 300-line YAML against sparse logs. The 'fully autonomous from idea to paper' claim quietly assumes your research question has answers findable via OpenAlex + Semantic Scholar and runnable in a Python sandbox in under 5 minutes. Anything requiring domain infrastructure (wet lab, specialized simulators, proprietary datasets) hits a hard wall fast.
