// the find
ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
DocETL is a declarative map-reduce framework for running LLM pipelines over large document collections. You describe each processing step in natural language, and it handles parallelization, caching, and automatic prompt/model optimization. It's for data engineers and researchers who want to process thousands of documents with LLMs without hand-wiring every API call and retry loop.
The MOAR optimizer is the real differentiator: it runs a multi-objective search over prompt rewrites, model swaps, and operation decompositions to trade off cost against accuracy, which replaces much of the tuning you would otherwise do by hand. The operator set goes beyond basic map/filter: resolve (entity deduplication), gather (context injection across chunks), and cluster handle the messy parts of document processing that most frameworks ignore. The dual interface (Python fluent API plus YAML) means it works for both one-off notebooks and reproducible pipelines without a context switch. It is backed by published VLDB papers with benchmarks, so the optimization claims are at least testable, not just marketing.
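To make "trade off cost vs. accuracy" concrete, here is a minimal sketch of the general idea behind multi-objective selection: keep only the candidate plans that no other plan beats on both axes. This is not DocETL's code or API; the `Plan` type, the plan names, and every number are made up for illustration.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    name: str        # hypothetical label for a candidate pipeline variant
    cost: float      # dollars per run (lower is better), illustrative only
    accuracy: float  # score in [0, 1] (higher is better), illustrative only


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.cost <= b.cost
        and a.accuracy >= b.accuracy
        and (a.cost < b.cost or a.accuracy > b.accuracy)
    )


def pareto_frontier(plans: list[Plan]) -> list[Plan]:
    """Keep only plans that no other plan dominates, sorted by cost."""
    frontier = [p for p in plans if not any(dominates(q, p) for q in plans)]
    return sorted(frontier, key=lambda p: p.cost)


# Invented candidates, not measured results.
candidates = [
    Plan("big-model", 4.00, 0.91),
    Plan("small-model", 0.40, 0.78),
    Plan("small-model+rewritten-prompt", 0.45, 0.86),
    Plan("big-model+split-map", 5.50, 0.90),
]
frontier = pareto_frontier(candidates)
print([p.name for p in frontier])
```

With these invented numbers, `big-model+split-map` drops out because `big-model` is both cheaper and more accurate; the other three survive, and picking among them is the cost/accuracy call you still have to make yourself.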
Cost visibility is per-pipeline-run only; there's no built-in budget cap that stops a runaway optimize pass before it burns $50, so you have to set rate limits yourself and hope. The resolve operator does pairwise LLM comparisons for entity deduplication, which is O(n²) in the number of records and will quietly become very expensive on large datasets if you don't read the docs carefully. State management between pipeline runs relies on a file cache that isn't atomic, so interrupted runs can leave partial state that's hard to reason about. The DocWrangler UI is a separate Next.js app that needs its own setup and a running FastAPI server, so 'try it locally' is more involved than the README implies.
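The O(n²) point is easy to underestimate, so here is a back-of-envelope sketch of how fast all-pairs comparison grows. It assumes the worst case where every unordered pair triggers one LLM call, and the per-call price is a placeholder, not a quoted rate; neither function is part of DocETL.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def estimated_cost(n: int, price_per_comparison: float) -> float:
    """Rough dollar cost if every pair triggers exactly one LLM call."""
    return pairwise_comparisons(n) * price_per_comparison


# Assumed price of $0.001 per comparison call (a placeholder for illustration).
PRICE = 0.001

for n in (100, 1_000, 10_000):
    pairs = pairwise_comparisons(n)
    print(f"{n:>6} records -> {pairs:>10,} pairs -> ${estimated_cost(n, PRICE):,.2f}")
```

Going from 1,000 to 10,000 records multiplies the pair count by roughly 100, from 499,500 to 49,995,000, so a run that looked cheap on a sample can be two orders of magnitude pricier on the full dataset.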