// the find
thiswillbeyourgithub/wdoc
Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, advanced RAG, advanced summaries, scriptable, etc
wdoc is a CLI and library RAG tool for querying and summarizing large, heterogeneous document collections: PDFs, audio, YouTube, Anki decks, EPUBs, web pages, and more. Its query pipeline uses two LLMs: a cheap evaluator filters the retrieved chunks, a strong model answers from what survives, and semantic clustering runs before the final combine step. Built by a psychiatry resident who needed this to actually work on tens of thousands of medical documents.
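To make the shape of that pipeline concrete, here is a minimal sketch of the filter, answer, cluster, combine flow. It is not wdoc's code: `answer_query` and its three callables are hypothetical stand-ins for the LLM and clustering calls, and answering each surviving chunk separately before clustering the answers is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def answer_query(
    question: str,
    chunks: List[str],
    is_relevant: Callable[[str, str], bool],  # stand-in for the cheap evaluator LLM
    answer_from: Callable[[str, str], str],   # stand-in for the strong answering LLM
    cluster_key: Callable[[str], str],        # stand-in for semantic clustering
) -> str:
    # Stage 1: the cheap model discards chunks it judges irrelevant.
    kept = [c for c in chunks if is_relevant(question, c)]
    # Stage 2: the strong model answers from each surviving chunk (assumed per chunk).
    partials = [answer_from(question, c) for c in kept]
    # Stage 3: group similar partial answers so related points end up together.
    groups: Dict[str, List[str]] = {}
    for partial in partials:
        groups.setdefault(cluster_key(partial), []).append(partial)
    # Stage 4: combine, one line per cluster (a real combiner would be another LLM call).
    return "\n".join(" ".join(group) for group in groups.values())


def demo() -> str:
    # Toy stubs: keyword overlap as the filter, first word as the cluster.
    chunks = [
        "lithium requires serum level monitoring",
        "the cafeteria opens at noon",
        "lithium toxicity can cause tremor",
        "valproate monitoring includes liver function tests",
    ]
    return answer_query(
        question="lithium monitoring",
        chunks=chunks,
        is_relevant=lambda q, c: any(word in c.split() for word in q.split()),
        answer_from=lambda q, c: c.capitalize() + ".",
        cluster_key=lambda answer: answer.split()[0].lower(),
    )
```

The point of the split is cost: the expensive model only sees chunks that passed the cheap check, so the retrieval step can afford to pull in far more than a fixed top-k.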
The dual-LLM query pipeline (cheap filter → strong answerer → semantic-clustered combiner) is genuinely more careful than the typical embed-then-generate pattern: it handles large corpora without just hoping the top-k chunks are sufficient. PDF handling with 15 loaders and heuristic scoring is a real differentiator. PDF parsing quality varies wildly, and most tools just pick one loader and live with the failures. The modular extras system (`wdoc[youtube,audio,anki]`) means you're not forced to pull in torch and ffmpeg just to query a text file. Binary FAISS with zlib compression is a practical win for large indexes: storing one bit per dimension instead of a 32-bit float makes embeddings ~32x smaller, and with a reportedly negligible accuracy loss that is worth the complexity.
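The 32x figure is just arithmetic, and a small standard-library sketch shows where it comes from. This is generic sign-based binary quantization, not wdoc's or FAISS's implementation; the function names and the 768-dimension example are assumptions chosen for illustration.

```python
import struct
import zlib
from typing import List


def binarize(vec: List[float]) -> bytes:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits; binary codes are compared by Hamming distance."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def size_ratio(dim: int) -> float:
    """Bytes for `dim` float32 values divided by bytes for `dim` sign bits."""
    vec = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    float_bytes = struct.pack(f"{dim}f", *vec)  # 4 bytes per dimension
    return len(float_bytes) / len(binarize(vec))


def roundtrip(codes: List[bytes]) -> bool:
    """zlib is lossless: compressing the stored codes must not change them."""
    blob = b"".join(codes)
    return zlib.decompress(zlib.compress(blob)) == blob
```

For a hypothetical 768-dimensional embedding, `size_ratio(768)` compares 3072 bytes of float32 against 96 bytes of sign bits. How much zlib saves on top of that depends on the data, so the sketch only checks that the round trip is lossless.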
LangChain as a core dependency is a long-term liability: it has a history of breaking changes, over-abstracts simple things, and makes debugging harder when a chain misbehaves. The codebase acknowledges wanting to migrate to LangGraph, but that is also part of the LangChain ecosystem. The `private_mode` approach of overloading sockets to prevent data leaks is fragile and not something you'd trust in a genuinely sensitive environment. Test coverage is thin by the author's own admission: the roadmap lists 'add more tests' as most urgent, and a tool doing real production work on medical documents needs better than that. The Python API is explicitly flagged for a rewrite, so using it as a library today means coupling to internals that will change.
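To illustrate why in-process socket overloading is fragile, here is a sketch of what a guard of this general kind can look like. It is not wdoc's `private_mode` code, which may differ substantially; `loopback_only`, `is_allowed`, and `try_connect` are hypothetical names, and the allow-list is an assumption.

```python
import socket
from contextlib import contextmanager

LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}  # assumed allow-list


def is_allowed(address) -> bool:
    """Allow Unix-socket paths and loopback hosts; refuse everything else."""
    if not isinstance(address, tuple):
        return True  # e.g. an AF_UNIX path, which never leaves the machine
    return address[0] in LOOPBACK_HOSTS


@contextmanager
def loopback_only():
    """Temporarily replace socket.socket.connect with a guarded version."""
    original = socket.socket.connect

    def guarded_connect(self, address):
        if not is_allowed(address):
            raise PermissionError(f"outbound connection blocked: {address!r}")
        return original(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        socket.socket.connect = original  # always restore the real method


def try_connect(address) -> str:
    """Report whether the guard stopped a connection attempt."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(address)
        return "connected"
    except PermissionError:
        return "blocked"
    except OSError:
        return "allowed but failed"
    finally:
        sock.close()
```

A guard like this only covers code that goes through the patched method in the current interpreter. It does nothing about `connect_ex`, UDP `sendto` on an unconnected socket, child processes, or native libraries that make their own system calls, which is why OS-level isolation (a network namespace, a firewall rule, an offline machine) is the safer choice for sensitive data.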