// the find
benbalter/word-to-markdown
A ruby gem to liberate content from Microsoft Word documents
A Ruby gem that converts Word documents to Markdown by shelling out to LibreOffice's headless mode, then post-processing the HTML output with Nokogiri. Useful for anyone migrating legacy content pipelines away from Word; government documentation workflows are explicitly the target audience.
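The conversion step described above can be sketched roughly as follows. This is an illustration, not the gem's actual code: `soffice_command` and `convert_to_html` are hypothetical helper names, and the sketch assumes a `soffice` binary is on the PATH. The `--headless`, `--convert-to`, and `--outdir` flags are standard LibreOffice command-line options.

```ruby
require "open3"
require "tmpdir"

# Build the LibreOffice command line (hypothetical helper, not part of the gem).
def soffice_command(docx_path, outdir, binary: "soffice")
  [binary, "--headless", "--convert-to", "html", "--outdir", outdir, docx_path]
end

# Run the conversion and return the generated HTML as a string.
# Assumes LibreOffice is installed; raises if the subprocess exits non-zero.
def convert_to_html(docx_path)
  Dir.mktmpdir do |outdir|
    output, status = Open3.capture2e(*soffice_command(docx_path, outdir))
    raise "LibreOffice failed: #{output}" unless status.success?

    # LibreOffice writes <basename>.html into the output directory.
    File.read(File.join(outdir, "#{File.basename(docx_path, '.*')}.html"))
  end
end
```

The returned HTML string is what a post-processing stage (Nokogiri, in the gem's case) would then clean up before emitting Markdown.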
LibreOffice as the conversion backend is the right call: it handles the format complexity so the gem doesn't have to reinvent Word parsing. Implicit heading detection (inferring headings from font-size ratios when explicit styles weren't applied) is a genuinely useful heuristic for badly styled real-world documents. The fixture-per-feature test structure is solid: each edge case gets its own .docx, so regressions are easy to pin down. The Docker setup means you don't have to fight a local LibreOffice installation just to try it.
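To make the font-size heuristic concrete, here is a minimal sketch of one way it could work. It is not the gem's implementation: the `Paragraph` struct, the `infer_headings` name, and the `min_ratio` threshold of 1.2 are all assumptions chosen for illustration. The idea is to treat the most common font size as body text and map noticeably larger sizes to heading levels, largest first.

```ruby
# Hypothetical input shape: one entry per paragraph, font size in points.
Paragraph = Struct.new(:text, :size)

# Return Markdown lines, promoting paragraphs with large fonts to headings.
def infer_headings(paragraphs, min_ratio: 1.2)
  return [] if paragraphs.empty?

  sizes = paragraphs.map(&:size)
  # Treat the most frequent font size as body text.
  body_size = sizes.tally.max_by { |_size, count| count }.first
  # Distinct sizes clearly larger than body text, largest first (max six levels).
  heading_sizes = sizes.uniq
                       .select { |s| s.to_f / body_size >= min_ratio }
                       .sort.reverse.first(6)

  paragraphs.map do |para|
    level = heading_sizes.index(para.size)
    level ? "#{'#' * (level + 1)} #{para.text}" : para.text
  end
end
```

A real implementation would also need to guard against false positives such as large-font pull quotes, for example by capping heading length.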
LibreOffice as a runtime dependency is a heavy ask: it's 300MB+ and version-sensitive, the Windows CI on AppVeyor was already showing its age, and the setup cost on servers is real. Conversion quality degrades fast on documents with complex formatting: tables inside lists, custom styles, embedded objects, and tracked changes are all best-effort. The project has been in low-maintenance mode for years (the author moved on, and the most recent activity is Dependabot bumps), so if LibreOffice changes its HTML output format you're on your own. There is no streaming or async support: the whole file is processed synchronously via a subprocess, which blocks the caller on large documents.
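If the blocking behaviour is a concern, a caller can at least bound how long a conversion subprocess is allowed to run. The sketch below is a generic workaround, not something the gem provides: `run_with_timeout` is a hypothetical helper built only on Ruby's standard `Process` API.

```ruby
# Run a command in a child process, giving up after `seconds`.
# Returns true or false for the exit status, or nil if the command timed out.
def run_with_timeout(cmd, seconds)
  pid = Process.spawn(*cmd)
  waiter = Process.detach(pid) # background thread that reaps the child

  if waiter.join(seconds)      # join returns nil when the limit is hit
    !!waiter.value.success?
  else
    begin
      Process.kill("TERM", pid)
    rescue Errno::ESRCH
      # The child exited between the timeout and the kill; nothing to do.
    end
    waiter.join
    nil
  end
end
```

This only caps the wait; it does not make the conversion itself incremental, so very large documents would still need to be queued or processed in a background job.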