// the find

microsoft/markitdown

★ 151,082 · Python · MIT · updated May 2026

Python tool for converting files and office documents to Markdown.

MarkItDown converts Office documents, PDFs, images, audio, and assorted other formats to Markdown, primarily for feeding content into LLMs. It's a document-to-text pipeline tool, not a document renderer: the output is meant to be machine-readable, not pretty. It's aimed squarely at developers building RAG pipelines, document ingestion workflows, or LLM preprocessing steps.
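
As a rough illustration of where the tool sits in an ingestion step, here is a minimal sketch. It assumes the `MarkItDown().convert(path)` call and the `text_content` attribute shown in the project's README; `build_converter` and `ingest` are hypothetical helper names, not part of the library.

```python
from pathlib import Path


def build_converter():
    # Imported lazily so this module still loads where markitdown is not installed.
    # Assumption: the README's `MarkItDown()` constructor with default settings.
    from markitdown import MarkItDown

    return MarkItDown()


def ingest(paths, converter):
    """Convert each file and return {file name: Markdown text}.

    `converter` is anything with a `convert(path)` method whose result
    exposes `text_content`, which is the shape the README shows.
    """
    docs = {}
    for path in paths:
        result = converter.convert(str(path))
        docs[Path(path).name] = result.text_content
    return docs
```

Passing the converter in as an argument keeps the pipeline step easy to test with a stub and makes it simple to swap in a differently configured instance later.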

The optional-dependency model is well thought out: you install only the extras you need (pptx, pdf, and xlsx are separate), which avoids pulling in a heavy dependency tree for a use case that only touches Word files. The plugin architecture is a real extensibility point, with a published sample plugin and a discoverable naming convention (#markitdown-plugin). The security section acknowledges the SSRF-adjacent risks of a convert() that accepts URIs and recommends narrower APIs, which is more than most similar tools bother with. The Azure Content Understanding integration handles the genuinely hard cases (video, scanned documents with structured field extraction) where local converters can't compete.
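
To make the "narrower APIs" point concrete, here is one possible guard for a service that only ever needs to convert uploaded files. It is a sketch under assumptions: `convert_local` is taken to be the local-path-only method the project's security notes point to, while `safe_local_path`, `convert_upload`, and the upload root are hypothetical names for illustration.

```python
from pathlib import Path


def safe_local_path(source: str, root: Path) -> Path:
    """Return a resolved path inside `root`, or raise ValueError.

    Rejects URI-looking input outright, then rejects any path that
    resolves outside the upload root (absolute paths, `..` traversal).
    """
    lowered = source.lower()
    if "://" in lowered or lowered.startswith(("data:", "file:")):
        raise ValueError(f"refusing URI input: {source!r}")

    root = root.resolve()
    candidate = (root / source).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes {root}: {source!r}") from None
    return candidate


def convert_upload(source: str, root: Path, converter) -> str:
    """Validate `source`, then convert it through the local-only entry point."""
    path = safe_local_path(source, root)
    # Assumption: `convert_local` accepts a filesystem path and never fetches URLs.
    return converter.convert_local(str(path)).text_content
```

The check runs before the converter sees the input, so a caller-supplied string such as a cloud metadata URL never reaches any code that might fetch it.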

PDF output from the built-in converter is mediocre, and this is well documented: for anything that isn't a text-layer PDF you need Azure Document Intelligence or the OCR plugin, which means a cloud dependency and billing for the common case. The LLM image description feature is hardcoded to the OpenAI client interface, so if you're already using Anthropic or Gemini you have to wrap it yourself. The monorepo structure (packages/markitdown, packages/markitdown-mcp, and packages/markitdown-ocr as separate packages) adds friction for contributors who just want to fix a bug and aren't sure which package owns what. There's no streaming output: for large documents you wait for the full conversion before getting any text back, which matters in pipeline contexts where you'd want to start processing chunks early.
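
The "wrap it yourself" step can be fairly small. The sketch below assumes the image description code only calls `client.chat.completions.create(model=..., messages=...)` and reads `choices[0].message.content` from the result; that assumption should be checked against the installed version. `OpenAIShapedClient` and the `complete` callable are hypothetical names, and the callable is where the translation to another provider's SDK would live.

```python
from types import SimpleNamespace
from typing import Callable


class OpenAIShapedClient:
    """Expose `client.chat.completions.create(...)` over any backend.

    `complete(model, messages)` receives OpenAI-style messages and must
    return the reply text. Converting those messages (including any
    image parts) into another provider's format is its responsibility.
    """

    def __init__(self, complete: Callable[[str, list], str]):
        self._complete = complete
        # Mirror the attribute path an OpenAI client exposes.
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, *, model, messages, **kwargs):
        # Extra keyword arguments are accepted and ignored in this sketch.
        text = self._complete(model, messages)
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
```

An instance would then be passed where the README passes an OpenAI client, for example `MarkItDown(llm_client=OpenAIShapedClient(my_backend), llm_model="...")`, with `my_backend` standing in for whatever function calls the other provider.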
