// the find
Ontos-AI/knowhere
Knowhere extracts, parses, and outputs structured chunks ready for AI Agents and RAG.
Knowhere is a document ingestion pipeline that sits between raw files (PDFs, Office docs, images) and AI agents. It parses each file, reconstructs the document hierarchy, and builds a navigation graph so retrieval goes beyond flat vector lookup. It's aimed at teams building RAG pipelines who've hit the ceiling on naive chunking and want something that preserves section context and cross-document relationships.
The tree-based hierarchy reconstruction is the actual differentiator here: most chunking tools flatten everything into a sequence and lose the logical structure, while Knowhere keeps section ancestry attached to each chunk. The agentic retrieval layer (navigating section trees and graph links rather than relying on k-NN alone) maps well to how documents are actually structured, which matters for long-form PDFs. The benchmark numbers (+36% first-try accuracy) are self-reported, but the methodology is at least described, and the demo documents include real financial filings you can run through the pipeline yourself. The project structure is clean: split API + worker, Alembic migrations tracked, `uv` for deps, lint/typecheck targets all wired up.
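To make the "section ancestry attached to each chunk" idea concrete, here is a minimal sketch in plain Python. It is not Knowhere's API or data model: `Section`, `Chunk`, `chunk_with_ancestry`, and `chunks_under` are hypothetical names invented for illustration, and the sample document is made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical node in a reconstructed document tree."""
    title: str
    text: str = ""
    children: list[Section] = field(default_factory=list)


@dataclass
class Chunk:
    """A piece of text plus the titles of every section above it."""
    ancestry: tuple[str, ...]
    text: str


def chunk_with_ancestry(node: Section, parents: tuple[str, ...] = ()) -> list[Chunk]:
    """Depth-first walk that emits one chunk per non-empty section,
    carrying the full root-to-section title path along with it."""
    path = parents + (node.title,)
    chunks = [Chunk(path, node.text)] if node.text else []
    for child in node.children:
        chunks.extend(chunk_with_ancestry(child, path))
    return chunks


def chunks_under(chunks: list[Chunk], prefix: tuple[str, ...]) -> list[Chunk]:
    """Structural lookup: every chunk whose ancestry starts with `prefix`."""
    return [c for c in chunks if c.ancestry[: len(prefix)] == prefix]


# Made-up sample document, loosely shaped like a financial filing.
doc = Section("10-K", children=[
    Section("Risk Factors", "Intro to risks.", [
        Section("Liquidity", "Cash may be insufficient."),
    ]),
    Section("Financial Statements", "See tables."),
])

chunks = chunk_with_ancestry(doc)
risk_chunks = chunks_under(chunks, ("10-K", "Risk Factors"))
for c in risk_chunks:
    print(" > ".join(c.ancestry), "|", c.text)
```

A flat chunker would keep only the `text` field. Keeping `ancestry` is what lets a retriever answer "everything under Risk Factors" by structure instead of hoping embedding similarity lands in the right section.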
The dependency chain is heavy and opinionated in ways that could bite you: MinerU for PDF parsing requires its own API key, and DeepSeek + Qwen-VL are baked in as defaults. Swapping to OpenAI is documented as 'set env vars', but it's not clear how deeply those specific providers are assumed elsewhere. The self-hosting story (separate repos for API, worker, dashboard, and the Docker Compose stack) means you're coordinating four repositories to deploy one product, which is friction most teams will underestimate. The benchmark is internal-only with no reproducible test set published, so the +36% claim is hard to verify. The project was open-sourced just six weeks ago, so the community is thin and the issue tracker won't tell you much about production edge cases yet.