// the find

suyoumo/ClawProBench

★ 809 · Python · Apache-2.0 · updated Jun 2026

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

ClawProBench is a benchmark harness for evaluating LLM agents inside the OpenClaw runtime, a specific agent execution environment. It runs scenarios live (actual agent execution, not mocked), grades them deterministically, and supports multi-trial runs to measure reliability. It is aimed at researchers and teams who already use OpenClaw and want reproducible capability comparisons across models.

The live-first execution model is the right call: replaying transcripts tells you almost nothing about agent reliability. The FinalScore formula weights pass^3 (success on all three trials) more heavily than pass@3 (success on at least one of three), a principled choice that penalizes variance appropriately. Checkpoint resume and rerun-on-failure support make long evaluation runs practical rather than all-or-nothing. The scenario catalog spans distinct capability domains: constraints, error recovery, planning, safety, synthesis, and tool use.
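The gap between pass@3 and pass^3 is easiest to see on a small example. The sketch below is illustrative only and is not taken from the repo: the scenario names, the `summarize` helper, and the trial outcomes are hypothetical, and it assumes the usual reading of the two metrics (pass@k means at least one of k trials succeeded, pass^k means all k succeeded). It does not reproduce the FinalScore weights, which are not stated here.

```python
from typing import Dict, List


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate per-scenario trial outcomes into pass@k and pass^k rates.

    `results` maps a (hypothetical) scenario name to its k trial outcomes.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no scenarios to summarize")
    # pass@k: the scenario counts if ANY trial succeeded (one lucky run is enough).
    pass_at_k = sum(any(trials) for trials in results.values()) / n
    # pass^k: the scenario counts only if ALL trials succeeded (stable success).
    pass_hat_k = sum(all(trials) for trials in results.values()) / n
    return {"pass@k": pass_at_k, "pass^k": pass_hat_k}


# Made-up outcomes for three scenarios, three trials each.
results = {
    "stable_scenario": [True, True, True],
    "flaky_scenario": [True, False, False],
    "failing_scenario": [False, False, False],
}

summary = summarize(results)
print(summary)
```

Here the flaky scenario counts toward pass@3 but not toward pass^3, so a score that leans on pass^3 rewards the agent that succeeds consistently over the one that succeeds occasionally.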

Everything depends on OpenClaw being available and working correctly, so the harness has zero utility if you are not already in that ecosystem; there is no abstraction layer for other agent runtimes. The leaderboard is operated by the repo author, who also contacts vendors for model access. That is a conflict of interest even if unintentional, and there is no third-party audit of submitted results. The contributor list literally says 'waiting', and with 52 forks against 809 stars, community engagement is thin. That matters a lot for a benchmark that needs diverse scenario contributions to avoid gaming. The criteria for promoting scenarios from incubating to active are not documented, so it is unclear whether the 60 incubating scenarios are good tasks being validated or abandoned work.
