// the find

Zipstack/unstract

★ 6,650 · Python · AGPL-3.0 · updated Jun 2026

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

Unstract is a document-to-JSON extraction platform: you define what you want from PDFs, scans, and images using natural language in a UI called Prompt Studio, then deploy that schema as a REST API or ETL pipeline. It's aimed at teams in finance, insurance, and compliance who need to reliably pull structured data from messy, varying documents at scale without writing per-template parsers.

Multi-provider LLM support is the practical win here: you can point it at Ollama for local or private data, or swap to Bedrock for compliance reasons, without touching your extraction schema. The 'write a prompt once, handle vendor document variations' approach solves a real problem that brittle regex pipelines do not. The connector ecosystem (S3, GCS, Snowflake, BigQuery, Redshift) means it slots into existing data infrastructure rather than requiring a new pipeline. CI hygiene is solid (SonarCloud gates, pre-commit hooks, uv for dependency management), a sign this isn't abandonware.
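To make the regex brittleness concrete, here is a minimal sketch of a per-template parser meeting a second vendor's layout. The two invoice snippets, the pattern, and the `parse_total` helper are all invented for illustration; none of this is Unstract code.

```python
import re

# Hypothetical invoice text from two vendors; both layouts are made up.
VENDOR_A = "Invoice No: INV-1042\nTotal Due: $1,250.00"
VENDOR_B = "INVOICE #77-A\nAmount payable (USD) 1250.00"

# A pattern written against vendor A's template only.
TOTAL_A = re.compile(r"Total Due:\s*\$([\d,]+\.\d{2})")


def parse_total(text):
    """Return the invoice total as a float, or None if the template does not match."""
    match = TOTAL_A.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return float(match.group(1).replace(",", ""))


print(parse_total(VENDOR_A))  # 1250.0
print(parse_total(VENDOR_B))  # None: same amount, different wording
```

Each new vendor layout needs another pattern and another branch. A natural-language field description is meant to absorb that maintenance cost, at the price of LLM calls and the accuracy checks they need.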

AGPL-3.0 is a hard stop for most teams building a closed-source commercial product, and the interesting enterprise features (dual-LLM verification, human-in-the-loop review, SSO) are behind a paid tier, so you get the license risk without getting the production-critical features. The self-hosted stack is heavy: a Django backend, Celery workers, a FastAPI platform service, RabbitMQ, Redis, and PostgreSQL all running together is a lot of infrastructure to operate for what is essentially 'POST document, GET JSON'. LLMWhisperer appears to be their preferred text extractor and is their own paid SaaS; the open-source alternatives (Unstructured.io, LlamaIndex Parse) work, but the path of least resistance leads you toward a second vendor dependency. Table extraction from PDFs is a known hard problem, and the README says nothing about how Unstract handles it, which is exactly where document extraction pipelines fall apart in production.
