// the find
modal-labs/devlooper
A program synthesis agent that autonomously fixes its output by running tests!
devlooper is a code generation agent that wraps smol-developer with an autonomous test-fix loop: it generates code, runs tests in a Modal sandbox, feeds failures back to GPT-4, and iterates until tests pass. It targets developers who want to experiment with agentic code synthesis without babysitting the LLM. Hard dependency on both Modal and OpenAI means it's not self-hostable without rewriting core parts.
- The two-step debug loop (diagnosis first, then DebugPlan) is a smart design choice that reportedly improves accuracy over asking the LLM to jump straight to a fix. It mirrors the benefits of chain-of-thought prompting in a concrete, structured way.
- Using Modal Sandboxes for isolated, incrementally cached test execution is cleaner than spinning up Docker containers yourself; the image layer caching means repeated package installs don't re-run from scratch.
- The DebugPlan abstraction (fix file / install package / run command) is minimal but extensible, and the codebase is small enough (~5 files) that you can actually read and modify it in an afternoon.
- Supports three real language/framework targets (React/Jest, Python, Rust) rather than being Python-only, which gives it broader practical usefulness as a demo.
- Hard vendor lock-in to Modal makes this unusable if you don't have or want a Modal account — there's no local execution fallback, and porting the sandbox logic to plain Docker would require non-trivial work.
- No loop termination guard beyond test passage: if the LLM keeps generating subtly broken code, it will burn your OpenAI credits indefinitely. The README mentions this as a known issue but there's no max-iteration cap in the current code.
- Commit activity appears to have stalled, which suggests the project is essentially abandoned: the 'Showcase' section still reads 'Coming soon' and the listed future directions haven't materialized, so adopters are on their own for bugs and improvements.
- Test harness is entirely LLM-generated along with the code, meaning the tests themselves can be trivially wrong or tautological — passing tests don't guarantee the output actually does what the prompt asked.
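
The two-step debug loop and the missing iteration cap discussed above can be sketched together. This is a minimal illustration under stated assumptions, not devlooper's actual code: the fields of `DebugPlan`, the `RunResult` type, `fix_loop`, and the four callables are hypothetical names, and the LLM and sandbox calls are passed in as plain callables so the sketch runs without Modal or OpenAI.

```python
from dataclasses import dataclass
from typing import Callable, Literal


@dataclass
class DebugPlan:
    # Hypothetical shape: one action per iteration, mirroring the three
    # kinds of fix named in the review (fix file / install package / run command).
    action: Literal["fix_file", "install_package", "run_command"]
    target: str        # file path, package name, or shell command
    content: str = ""  # new file contents, used only by "fix_file"


@dataclass
class RunResult:
    passed: bool
    output: str  # captured test output, fed back to the model on failure


def fix_loop(
    run_tests: Callable[[], RunResult],
    diagnose: Callable[[str], str],
    plan_fix: Callable[[str], DebugPlan],
    apply_plan: Callable[[DebugPlan], None],
    max_iterations: int = 5,
) -> bool:
    """Return True if the tests pass after at most max_iterations fix attempts."""
    for _ in range(max_iterations):
        result = run_tests()
        if result.passed:
            return True
        diagnosis = diagnose(result.output)  # step 1: explain the failure in words
        plan = plan_fix(diagnosis)           # step 2: turn the diagnosis into one action
        apply_plan(plan)
    # Budget exhausted: check once more, then give up instead of looping forever.
    return run_tests().passed
```

With stubbed callables whose tests never pass, `fix_loop(..., max_iterations=3)` returns `False` after three fix attempts, which bounds the number of model calls and therefore the API spend. Keeping `diagnose` and `plan_fix` as separate calls is what makes the loop two-step; collapsing them into one callable would give the "jump straight to a fix" variant the review compares against.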