// the find

yobix-ai/extractous

★ 1,754 · Rust · Apache-2.0 · updated Dec 2024

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

Extractous extracts text and metadata from documents (PDF, Office, email, images) without calling an external API or running a JVM. The core is Rust; Apache Tika is compiled to a native shared library with GraalVM ahead-of-time compilation and called over JNI, which gives the project Tika's broad format support without the usual Java runtime overhead. The primary audience is Python and Rust developers building RAG pipelines or ETL jobs that currently depend on unstructured-io.

The GraalVM AOT trick is genuinely clever: they get Tika's coverage of 1,000+ formats without shipping a JVM or spawning a server process. The claimed benchmarks against unstructured-io (18x faster, 11x less memory) are plausible given that unstructured-io is famously slow and leaky. The streaming API (it returns a std::io::Read) is the right design for large files: you don't blow memory loading the extracted text of a 500MB PDF into a single string. Python bindings bypass the GIL for multi-threaded extraction, which matters for CPU-bound batch workloads.
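To make the streaming point concrete, here is a minimal Python sketch of consuming an extracted-text stream in bounded chunks instead of one big string. It deliberately does not call Extractous: the only assumption is that the reader you get back behaves like a binary file object with a `read(n)` method, and `io.BytesIO` stands in for it. The helper names `iter_chunks` and `count_bytes_and_lines` are hypothetical, made up for this illustration.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_chunks(reader: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from a file-like reader until it is exhausted."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


def count_bytes_and_lines(reader: BinaryIO, chunk_size: int = 64 * 1024) -> Tuple[int, int]:
    """Process a stream while holding at most one chunk in memory at a time."""
    total_bytes = 0
    newline_count = 0
    for chunk in iter_chunks(reader, chunk_size):
        total_bytes += len(chunk)
        newline_count += chunk.count(b"\n")
    return total_bytes, newline_count


# Stand-in for an extracted-text stream (assumption: the real reader is file-like).
fake_stream = io.BytesIO(b"line one\nline two\nline three\n")
total_bytes, newline_count = count_bytes_and_lines(fake_stream, chunk_size=8)
```

Peak memory here is bounded by `chunk_size` rather than by document size, which is the whole argument for a reader-style API over a "return the full string" API.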

The last commit was in December 2024. Six months of silence on a project this early is a yellow flag, especially with JavaScript bindings listed as 'upcoming' and never shipped. The JNI boundary between Rust and native Tika is a reliability risk: crashes or OOMs in the native layer can take down your whole process in ways that are hard to debug. OCR depends on a system-installed Tesseract, an invisible deployment dependency that will bite you in containers unless your Dockerfile installs it explicitly. There are no Windows wheels on PyPI: the CI patches Linux and macOS wheels, but Windows is conspicuously absent from the release pipeline.
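One cheap way to surface the Tesseract dependency early is to fail at startup instead of on the first scanned PDF. The sketch below is a generic guard, not part of Extractous: it assumes the OCR engine is reached through a `tesseract` executable on PATH, and the function name `require_ocr_binary` is hypothetical. The `which` parameter exists only so the check can be exercised without Tesseract installed.

```python
import shutil
from typing import Callable, Optional


def require_ocr_binary(
    name: str = "tesseract",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the resolved path of the OCR executable, or raise if it is missing.

    Assumption: OCR is performed by an executable discoverable on PATH.
    """
    path = which(name)
    if path is None:
        raise RuntimeError(
            f"{name!r} not found on PATH; install it in the image before enabling OCR"
        )
    return path
```

Calling `require_ocr_binary()` once when the service boots turns a silent missing dependency in a container image into an immediate, readable error.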
