// the find

pymupdf/PyMuPDF

★ 9,996 · Python · AGPL-3.0 · updated Jun 2026

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

PyMuPDF is a Python binding for the MuPDF C library, giving you programmatic control over PDFs and a handful of other document formats — text extraction, annotation, redaction, rendering, table detection, form filling. It's maintained by Artifex (the company behind MuPDF and Ghostscript), so it's not a weekend project. The AI angle is real: `pymupdf4llm` produces structure-aware Markdown from PDFs without a GPU, which is genuinely useful for RAG pipelines.
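For a sense of what the RAG use looks like, here is a minimal sketch, assuming `pymupdf4llm` is installed and that `pymupdf4llm.to_markdown(path)` returns the whole document as one Markdown string. The helper names `load_markdown` and `split_by_heading` are hypothetical, not part of the library, and the splitter is deliberately naive: it treats any line starting with `#` as a heading, including ones inside fenced code.

```python
import re


def load_markdown(path):
    """Convert a PDF to Markdown (needs the third-party pymupdf4llm package)."""
    import pymupdf4llm  # deferred so the splitter below works without it

    return pymupdf4llm.to_markdown(path)


def split_by_heading(markdown):
    """Split Markdown into (heading, body) pairs at ATX headings.

    Text before the first heading is returned with an empty heading.
    Hypothetical helper for illustration; not part of pymupdf4llm.
    """
    sections = []
    heading, body = "", []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Something like `split_by_heading(load_markdown("report.pdf"))` would then give heading-scoped chunks to embed; `report.pdf` is a placeholder path.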

- Speed is not marketing — MuPDF is a proper C engine and PyMuPDF's text extraction and page rendering benchmarks beat pure-Python alternatives by an order of magnitude or more. For batch processing thousands of PDFs this matters.

- `get_text('dict')` returns the font name, size, color, and bounding box of every text span. That is the level of detail layout-aware extraction needs, and most PDF libraries only get you there after a lot of post-processing.

- Reusing a `TextPage` object cuts repeated extraction cost by 50–95% — the library exposes the right abstraction rather than hiding it and forcing you to pay the parse cost twice.

- Runs fully local with no outbound calls, which matters when you're processing legal, medical, or financial documents. No telemetry, no license callbacks, works air-gapped.

- The AGPL-3.0 license is a hard blocker for most proprietary applications, whether you distribute them or serve them over a network, unless you pay Artifex for a commercial license. This isn't buried, yet a lot of teams discover it late, so check it first.

- No multithreading at all — MuPDF's thread safety guarantees are partial and PyMuPDF doesn't paper over this. You have to reach for multiprocessing and manage inter-process overhead yourself, which is awkward for async server workloads.

- Office document support (DOCX, XLSX, PPTX) is behind a paid Pro license with a 3-page evaluation cap. The free tier has no Office support at all, which the README doesn't make obvious until you're already several sections in.

- OCR via Tesseract requires either a system install and PATH setup or manually pointing at tessdata — the integration is functional but not smooth, and the error messages when tessdata isn't found are confusing enough that it has its own FAQ entry.
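Three of the points above (per-span detail from `get_text('dict')`, `TextPage` reuse, and processes in place of threads) fit together in one batch-extraction sketch. It assumes a recent PyMuPDF that can be imported as `pymupdf` (older releases use `import fitz`); the function names `collect_spans`, `extract_file`, and `extract_many` are hypothetical, and the worker count is an arbitrary default.

```python
from concurrent.futures import ProcessPoolExecutor


def collect_spans(page_dict):
    """Flatten a get_text("dict") result into one record per text span."""
    spans = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append({
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    "color": f"#{span['color']:06x}",  # sRGB integer -> hex
                    "bbox": tuple(span["bbox"]),
                })
    return spans


def extract_file(path):
    """Worker: runs in its own process and opens its own Document."""
    import pymupdf  # assumption: recent PyMuPDF; older versions use `fitz`

    results = []
    with pymupdf.open(path) as doc:
        for page in doc:
            textpage = page.get_textpage()  # parse the page once
            page_dict = page.get_text("dict", textpage=textpage)
            plain = page.get_text("text", textpage=textpage)  # reuses the parse
            results.append({
                "page": page.number,
                "chars": len(plain),
                "spans": collect_spans(page_dict),
            })
    return path, results


def extract_many(paths, max_workers=4):
    """Fan files out across processes, one Document per worker call."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(extract_file, paths))
```

Only plain dicts and strings cross the process boundary, so nothing from MuPDF has to be pickled. On platforms that spawn workers (Windows, macOS), call `extract_many` from under an `if __name__ == "__main__":` guard.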
