// the find
ispras/dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
Dedoc is a Python library from the ISPRAS research institute that extracts structured content (headings, tables, formatting metadata) from a wide range of document formats — DOCX, PDF, scanned images, HTML, XLS, and more — into a unified tree representation. It's primarily aimed at NLP preprocessing pipelines where you need to feed structured text into downstream analysis rather than raw bytes. Think document ingestion for search indexes, compliance systems, or RAG pipelines.
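To make the "unified tree" idea concrete, here is a minimal sketch of walking such a tree once it has been serialized to plain dicts. It does not call Dedoc at all: the sample document is hand-written, and the key names (`text`, `metadata`, `paragraph_type`, `subparagraphs`) are assumptions about the output shape that should be checked against the project's own docs.

```python
from typing import Iterator

# Hypothetical parsed output, hand-written for illustration.
# The key names below are assumptions, not a confirmed schema.
parsed = {
    "text": "",
    "metadata": {"paragraph_type": "root"},
    "subparagraphs": [
        {
            "text": "1. Scope",
            "metadata": {"paragraph_type": "header"},
            "subparagraphs": [
                {
                    "text": "This policy applies to all staff.",
                    "metadata": {"paragraph_type": "raw_text"},
                    "subparagraphs": [],
                },
            ],
        },
        {
            "text": "2. Definitions",
            "metadata": {"paragraph_type": "header"},
            "subparagraphs": [],
        },
    ],
}


def walk(node: dict, depth: int = 0) -> Iterator[tuple[int, str, str]]:
    """Yield (depth, paragraph_type, text) for every node, in document order."""
    yield depth, node["metadata"]["paragraph_type"], node["text"]
    for child in node["subparagraphs"]:
        yield from walk(child, depth + 1)


def headings(root: dict) -> list[str]:
    """Collect the text of every node tagged as a header."""
    return [text for _, kind, text in walk(root) if kind == "header"]


# Print an indented outline of the whole tree.
for depth, kind, text in walk(parsed):
    print("  " * depth + f"[{kind}] {text}")
```

The point of a tree over flat text is that a heading keeps its body as children, so chunking for a search index or RAG pipeline can split on structure instead of on character counts.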
The auto-detection of PDF text layer quality is genuinely useful: it classifies whether a PDF has a real text layer or is just a scan, then routes it accordingly without you having to decide. The annotation system is thorough: you get bounding boxes, font size, bold/italic, indentation, and spacing preserved at the character-span level, not just the paragraph level. Format coverage is unusually wide for a single library; the same API handles a DOC from 2003 and a multi-column scanned image. A Docker image is available on Docker Hub and works without any setup on your end.
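Character-span annotations are easier to picture with a small example. The sketch below is not Dedoc's own class: `Annotation` is a hypothetical stand-in with assumed fields (`start`, `end`, `name`, `value`), and the sample line and spans are made up. It only illustrates how formatting attached to offsets can be sliced back out of a line of text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical span annotation; field names are assumptions."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    name: str   # e.g. "bold", "italic", "size"
    value: str  # e.g. "True", "11"


def spans_with(text: str, annotations: list[Annotation],
               name: str, value: str) -> list[str]:
    """Return the substrings covered by annotations matching name and value."""
    matching = [a for a in annotations if a.name == name and a.value == value]
    return [text[a.start:a.end] for a in sorted(matching, key=lambda a: a.start)]


line = "Total due: 1,250 EUR by 1 March"
annotations = [
    Annotation(0, 10, "bold", "True"),     # "Total due:"
    Annotation(0, 31, "size", "11"),       # whole line
    Annotation(11, 20, "italic", "True"),  # "1,250 EUR"
]

print(spans_with(line, annotations, "bold", "True"))
print(spans_with(line, annotations, "italic", "True"))
```

Span-level formatting matters downstream: a bold run inside a paragraph is often a label or a defined term, and paragraph-level metadata would lose it.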
Ubuntu 20+ is the only supported platform for pip installs; on anything else you're in Docker or you're debugging dependency hell. The scanned document pipeline hard-requires tables to have explicit visible borders — any borderless or partial-border table gets silently skipped, which will bite you on real-world financial or government documents. At 712 stars for a project that's been around since at least 2022, adoption is thin, which means sparse community answers and slow bug fixes. The Java dependency (tabby, shipped as bundled JARs) is an odd choice that adds a hidden JRE requirement most Python shops won't anticipate.
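Given the hidden JRE requirement, a cheap pre-flight check at service startup can save a confusing failure later. This is a minimal sketch using only the standard library; `find_java` and `preflight` are hypothetical helper names, and it assumes the bundled JARs are launched through a `java` executable on `PATH`, which the project docs should confirm.

```python
import shutil
from typing import Optional


def find_java(path: Optional[str] = None) -> Optional[str]:
    """Return the full path of a `java` executable, or None if none is found.

    `path` overrides the search path; None means use the PATH environment variable.
    """
    return shutil.which("java", path=path)


def preflight(path: Optional[str] = None) -> str:
    """Describe whether a Java runtime looks available."""
    java = find_java(path)
    if java is None:
        return "no `java` on PATH: JAR-backed PDF parsing is likely to fail"
    return f"java found at {java}"


print(preflight())
```

Finding the executable does not prove the runtime version is compatible, so treat a positive result as necessary rather than sufficient.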