
// the find

docling-project/docling

★ 62,254 · Python · MIT · updated Jun 2026

Get your documents ready for gen AI

Docling is a document parsing library from IBM Research that converts PDFs, Word docs, PowerPoints, spreadsheets, HTML, EPUB, email files, and a dozen other formats into a unified internal representation, then exports to Markdown, JSON, HTML, or DocTags. It's aimed primarily at teams building RAG pipelines or document-heavy AI workflows who need something more serious than PyMuPDF string extraction. The PDF path is the main event: it reconstructs reading order, table structure, and formulas rather than just dumping text.

The PDF understanding pipeline is what sets Docling apart: layout detection, reading order reconstruction, and table structure recovery run as separate ML stages rather than a single regex pass, which means complex academic papers and financial reports come out sensibly structured. The DoclingDocument intermediate representation is a clean abstraction that gives you a typed document tree you can traverse programmatically before committing to an export format. Local execution is a first-class concern, not an afterthought: air-gapped deployments with on-device models work out of the box. The plugin architecture for OCR backends (Tesseract, EasyOCR, RapidOCR) and VLM inference engines (vLLM, MLX, HF Transformers, KServe) means you can swap components without rewriting the pipeline.
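To make the "traverse before you export" point concrete, here is a minimal sketch. It assumes the Docling Python API as shown in the project README (`DocumentConverter().convert(...)`, `result.document`, `export_to_markdown()`) plus `DoclingDocument.iterate_items()`, which yields `(item, depth)` pairs. The helper names `count_labels` and `inspect_then_export` are hypothetical, and given the breaking changes noted in this review, exact signatures may differ in the version you install.

```python
from collections import Counter
from typing import Any, Iterable, Tuple


def count_labels(items: Iterable[Tuple[Any, int]]) -> Counter:
    """Tally node labels from (item, depth) pairs.

    Hypothetical helper: it accepts anything shaped like the output of
    DoclingDocument.iterate_items(), so it can be exercised without docling.
    """
    counts: Counter = Counter()
    for item, _depth in items:
        label = getattr(item, "label", None)
        # Labels may be enum members (use .value) or plain strings.
        counts[str(getattr(label, "value", label))] += 1
    return counts


def inspect_then_export(source: str) -> Tuple[Counter, str]:
    """Convert a document, inspect the tree, then export to Markdown.

    Assumption: docling is installed. The import is deferred so this module
    still loads on machines without it.
    """
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    doc = result.document
    return count_labels(doc.iterate_items()), doc.export_to_markdown()
```

Calling `inspect_then_export("report.pdf")` (the path is a placeholder) would return a label tally alongside the Markdown, which is a cheap way to check whether tables and headings were detected before you feed the output to a RAG pipeline.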

The dependency footprint is enormous: installing docling also means downloading multiple ML model weights on first run, which makes Docker image sizes painful and serverless cold starts a nonstarter. The `docling-parse` PDF backend is a compiled C++ extension with separate versioning (v2 and v4 are different binaries), meaning OS/arch incompatibilities surface in ways that are annoying to debug. Chart understanding and audio/ASR support feel bolted on rather than integrated. They live in separate pipeline classes with different configuration objects, so a mixed document (a PDF with embedded audio) isn't really handled. The IBM provenance is double-edged: the GraniteDocling VLM is useful, but the project's governance and release cadence are tied to IBM Research priorities, and the CHANGELOG shows occasional breaking changes to pipeline options without deprecation warnings.

