// the find

benbalter/word-to-markdown

★ 1,550 · Ruby · MIT · updated Oct 2025

A ruby gem to liberate content from Microsoft Word documents

A Ruby gem that converts Word documents to Markdown by shelling out to LibreOffice's headless mode, then post-processing the HTML output with Nokogiri. Useful for anyone migrating legacy content pipelines away from Word; government documentation workflows are explicitly the target audience.
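
To make the shell-out step concrete, here is a minimal sketch of what driving LibreOffice in headless mode can look like from Ruby. It is not the gem's actual code: `soffice_command` and `docx_to_html` are hypothetical helper names, the binary is assumed to be on the PATH as `soffice`, and the Nokogiri post-processing step is left out.

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: builds the LibreOffice command line without running it.
# Assumes the binary is reachable as "soffice"; the real path varies by platform.
def soffice_command(docx_path, out_dir, binary: "soffice")
  [binary, "--headless", "--convert-to", "html", "--outdir", out_dir, docx_path]
end

# Hypothetical helper: runs the conversion and returns the generated HTML.
# This call blocks until LibreOffice exits.
def docx_to_html(docx_path, binary: "soffice")
  Dir.mktmpdir do |out_dir|
    command = soffice_command(docx_path, out_dir, binary: binary)
    output, status = Open3.capture2e(*command)
    raise "LibreOffice failed: #{output}" unless status.success?

    # LibreOffice writes <basename>.html into the output directory.
    html_path = File.join(out_dir, "#{File.basename(docx_path, '.*')}.html")
    File.read(html_path)
  end
end
```

The HTML string returned here is the kind of input a post-processing pass would then clean up and turn into Markdown.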

LibreOffice as the conversion backend is the right call: it handles the format complexity so the gem doesn't have to reinvent Word parsing. Implicit heading detection (inferring headings from font-size ratios when explicit styles weren't applied) is a genuinely useful heuristic for badly styled real-world documents. The fixture-per-feature test structure is solid: each edge case gets its own .docx, so regressions are easy to pin down. The Docker setup means you don't have to fight a local LibreOffice installation just to try it.
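
The font-size heuristic is easier to judge with an example. The sketch below is one plausible way to infer heading levels from size ratios, not the gem's actual algorithm: `infer_heading_levels` is a hypothetical name, and the 1.2 ratio threshold is an assumption chosen for illustration.

```ruby
# Hypothetical heuristic: treat the most common font size as body text, then
# map each distinct size that is sufficiently larger to a heading level,
# with the largest size becoming level 1.
def infer_heading_levels(font_sizes, min_ratio: 1.2, max_level: 6)
  return {} if font_sizes.empty?

  # The most frequent size is assumed to be the body text size.
  body_size = font_sizes.tally.max_by { |_size, count| count }.first

  heading_sizes = font_sizes.uniq
                            .select { |size| size.to_f / body_size >= min_ratio }
                            .sort
                            .reverse
                            .first(max_level)

  heading_sizes.each_with_index.map { |size, index| [size, index + 1] }.to_h
end

# Font sizes (in points) of consecutive paragraphs in an imaginary document.
sizes = [24, 12, 12, 18, 12, 12, 18, 12, 13]
levels = infer_heading_levels(sizes)
# levels maps 24 to heading level 1 and 18 to level 2; 13pt stays body text.
```

A ratio threshold like this keeps slightly enlarged text (captions, emphasis) from being promoted to a heading, at the cost of missing headings that are only marginally larger than the body.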

LibreOffice as a runtime dependency is a heavy ask: it's 300MB+ and version-sensitive, so the setup cost on servers is real, and the Windows CI on AppVeyor was already showing its age. Conversion quality degrades fast on documents with complex formatting: tables inside lists, custom styles, embedded objects, and tracked changes are all best-effort. The project has been in low-maintenance mode for years (the author moved on, and most recent activity is Dependabot bumps), so if LibreOffice changes its HTML output format you're on your own. There is no streaming or async support: the whole file is processed synchronously via a subprocess, which blocks on large documents.
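
If the blocking call is a problem, one workaround is to push the subprocess onto a background thread on the caller's side. This is a sketch under stated assumptions, not something the gem provides: `run_in_background` is a hypothetical helper, and a child Ruby process stands in for the slow LibreOffice conversion.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: runs a command in a background thread so the caller
# is not blocked while the subprocess works. Returns the Thread; calling
# #value on it waits for completion and returns [output, success].
def run_in_background(*command)
  Thread.new do
    output, status = Open3.capture2e(*command)
    [output, status.success?]
  end
end

# Stand-in for a slow conversion: a child Ruby process that prints a marker.
job = run_in_background(RbConfig.ruby, "-e", "print 'converted'")

# ... the caller is free to do other work here ...

output, ok = job.value # waits only at the point the result is needed
```

This does not make the conversion itself faster or streamed; it only moves the wait off the calling thread.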
