// the find

starpig1129/DATAGEN

★ 1,764 · Python · MIT · updated Jun 2026

DATAGEN: AI-driven multi-agent research assistant automating hypothesis generation, data analysis, and report writing.

DATAGEN is a multi-agent pipeline for automated data analysis: feed it a CSV, it generates hypotheses, writes analysis code, runs it, creates visualizations, and produces a report. It targets data scientists or researchers who want to automate exploratory work rather than write the pipeline themselves.

The per-agent model configuration via agent_models.yaml is genuinely useful: you can point the cheap hypothesis work at a small model and the code generation at something stronger, which keeps costs sane. The LangGraph-based state graph is a reasonable choice for orchestrating a multi-step workflow with human-in-the-loop checkpoints (the hypothesis approval step). Agent instructions live in per-agent AGENT.md files, which makes the system's behavior auditable without reading Python. MCP integration for filesystem, GitHub, and web search means agents can reach outside the local environment without custom glue code.
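To make the per-agent routing idea concrete, here is a minimal sketch of how a config like agent_models.yaml could resolve to a model per agent. It is not DATAGEN's actual schema or loader: the dict stands in for the parsed YAML, and the agent names, keys, and model identifiers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the parsed contents of agent_models.yaml.
# Agent names, keys, and model identifiers are illustrative only,
# not DATAGEN's real schema.
AGENT_MODELS = {
    "default": {"model": "small-model", "temperature": 0.2},
    "agents": {
        "hypothesis_agent": {"model": "small-model"},
        "code_agent": {"model": "strong-model", "temperature": 0.0},
    },
}


@dataclass(frozen=True)
class ModelChoice:
    model: str
    temperature: float


def resolve_model(agent_name: str, config: dict = AGENT_MODELS) -> ModelChoice:
    """Merge an agent's overrides onto the defaults.

    Agents with no entry fall back to the default settings.
    """
    overrides = config.get("agents", {}).get(agent_name, {})
    merged = {**config["default"], **overrides}
    return ModelChoice(model=merged["model"], temperature=merged["temperature"])


if __name__ == "__main__":
    for name in ("hypothesis_agent", "code_agent", "report_agent"):
        print(name, resolve_model(name))
```

The point of the pattern is that the expensive model is opt-in per agent, so anything not listed stays on the cheap default.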

The setup is brittle in practice: you need Conda, a Chromedriver binary, and up to five different API keys before anything runs, and none of that is automated. A missing env var silently degrades capabilities instead of failing with a clear error. The README openly lists 'NoteTaker Efficiency Improvement' and 'Overall Runtime Optimization' as known open issues, which is honest but signals the system can be slow and context-heavy for non-trivial datasets. There's a prominent warning that the agent may modify your source data; a backup is recommended but not enforced, which is a real footgun. The frontend directory exists but is empty, so the 'platform' framing in the README is aspirational.
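If you try it anyway, a small preflight wrapper of your own can cover both gaps: fail loudly on missing keys and copy the input file before the agents touch it. This is a sketch under stated assumptions, not part of DATAGEN. The environment variable names are hypothetical placeholders, and the `.bak` naming is an arbitrary choice.

```python
import os
import shutil
from pathlib import Path

# Hypothetical variable names; substitute the keys your setup actually needs.
REQUIRED_ENV_VARS = ("LLM_API_KEY", "SEARCH_API_KEY")


def missing_env_vars(required=REQUIRED_ENV_VARS, env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def backup_file(source: Path) -> Path:
    """Copy source next to itself with a .bak suffix and return the copy's path."""
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)  # copy2 also preserves timestamps
    return backup


def preflight(data_path, required=REQUIRED_ENV_VARS, env=None) -> Path:
    """Raise on missing configuration, then back up the input data."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    source = Path(data_path)
    if not source.is_file():
        raise FileNotFoundError(source)
    return backup_file(source)
```

Calling `preflight("data.csv")` before launching the pipeline either raises with the list of missing variables or returns the path of the backup copy.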
