// the find

CatchTheTornado/text-extract-api

★ 3,104 · Python · MIT · updated Dec 2025

Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art OCRs and Ollama-supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.

A self-hosted FastAPI service that converts PDFs, images, and Office documents to Markdown or structured JSON using a pluggable OCR backend (EasyOCR, llama3.2-vision, minicpm-v, or an external marker-pdf server) with optional Ollama LLM post-processing. Targets teams that need document extraction without sending data to a cloud provider. Redis + Celery handle async task queuing.

The strategy abstraction is genuinely useful: swapping between EasyOCR, vision LLMs, and remote marker-pdf takes only a query parameter change, not code changes. The Redis caching layer means repeated extractions of the same document skip the expensive OCR pass entirely. Docker Compose bundles everything (FastAPI, Celery worker, Redis, Ollama), so the first-run experience is actually plausible for a self-hosted tool. Storage profiles (local, S3, Google Drive) are YAML-configured and can be swapped without touching application code.
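To make the pattern concrete, here is a minimal sketch of a query-parameter strategy switch sitting in front of a cache. It is not the project's code: `STRATEGIES`, `extract`, and the stub backends are hypothetical names, a plain dict stands in for Redis, and keying the cache on strategy name plus content hash is an assumption about how such a cache would reasonably work.

```python
import hashlib
from typing import Callable, Dict


# Hypothetical stand-ins for real OCR backends (EasyOCR, a vision LLM, ...).
def easyocr_stub(data: bytes) -> str:
    return f"[easyocr] {len(data)} bytes"


def vision_llm_stub(data: bytes) -> str:
    return f"[llama_vision] {len(data)} bytes"


# Registry: the value of a `strategy` query parameter selects the backend.
STRATEGIES: Dict[str, Callable[[bytes], str]] = {
    "easyocr": easyocr_stub,
    "llama_vision": vision_llm_stub,
}

_cache: Dict[str, str] = {}   # in-memory dict standing in for Redis
ocr_calls = {"count": 0}      # counts how often a backend actually runs


def extract(data: bytes, strategy: str = "easyocr") -> str:
    """Run the chosen backend, reusing a cached result when one exists."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: cache key combines the strategy name and a content hash,
    # so the same file processed by a different backend is not conflated.
    key = f"{strategy}:{hashlib.sha256(data).hexdigest()}"
    if key in _cache:
        return _cache[key]
    ocr_calls["count"] += 1
    result = STRATEGIES[strategy](data)
    _cache[key] = result
    return result
```

Calling `extract` twice with the same bytes and strategy runs the backend once; changing only the `strategy` argument routes the same bytes to a different backend, which is the whole appeal of the design.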

The local-run setup is a mess: you need Docker for Redis even in the 'no Docker' path, Celery must be started separately on Mac, and the venv instructions are scattered across three different README sections. The API has no authentication whatsoever; if you expose port 8000 anywhere beyond localhost, you've got an open document extraction endpoint. The LLM post-processing relies entirely on whatever Ollama model you pull, with no check that the model output is actually valid JSON before it is returned to the caller, and prompt engineering is left entirely to the user. The last push was in December 2025, and several referenced strategies (docling among them) appear in the directory tree but have no README coverage, which suggests the feature set has outpaced the docs.
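Until the service validates model output itself, a caller can guard against malformed responses on their own side. The sketch below uses only the standard library; `parse_llm_json` is a hypothetical helper, not part of the project, and the fence-stripping step assumes the model sometimes wraps its answer in a Markdown code fence.

```python
import json
from typing import Any

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_json(raw: str) -> Any:
    """Return parsed JSON from model output, or raise ValueError."""
    text = raw.strip()
    # Assumption: models often wrap JSON in a Markdown fence; unwrap it.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
```

A caller would wrap the response body in `parse_llm_json` and retry or fall back to the raw Markdown when it raises, rather than passing unparsed text downstream.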
