// the find
HKUDS/RAG-Anything
"RAG-Anything: All-in-One RAG Framework"
RAG-Anything is a multimodal RAG framework built on top of LightRAG that handles PDFs, Office docs, images, tables, and equations through a unified pipeline. It uses MinerU for document parsing and vision models for image captioning, and it builds a knowledge graph across all content types. The target audience is researchers and developers who need to query mixed-content documents (academic papers, technical docs, financial reports) rather than plain-text corpora.
- The `insert_content_list` API is genuinely useful — you can bypass the built-in parser entirely and feed pre-parsed content from any external tool, avoiding lock-in to MinerU or their other parsers.
- Multiple parser backends (MinerU, Docling, PaddleOCR) with a clear plugin pattern for custom processors means you're not stuck if one parser handles your format poorly.
- VLM-enhanced query mode that automatically pulls images out of retrieved context and sends them to a vision model is a real differentiator over basic text-only RAG systems.
- Test suite is non-trivial with async tests, resilience tests, and integration tests for different parsers — more than most academic repos bother with.
- Heavy dependency chain: MinerU is a mandatory core dep and pulls in a lot of ML baggage (likely torch, detectron2, etc.) even if you only want text processing. There's no lightweight install path.
- Office document support (DOCX, PPTX, XLSX) silently requires LibreOffice to be installed and callable — no graceful degradation, just runtime failures if it's missing. The `office` extras group is literally empty.
- The knowledge graph is inherited from LightRAG, which uses LLM calls to extract entities and relationships at index time. Indexing costs can therefore spiral quickly for large document sets, and there's no clear cost estimation tooling.
- No built-in chunking strategy for long documents beyond delegating to LightRAG defaults. The `split_by_character` parameter exists, but there's no guidance on what settings actually work well for multimodal content, where splitting mid-table or mid-equation destroys context.
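
To make the `insert_content_list` point concrete, here is a minimal sketch of turning output from some external parser into a list of typed content blocks. The field names (`type`, `text`, `table_body`, `img_path`, `page_idx`) and the keyword arguments in the final comment are assumptions based on a reading of the project README, so check them against the current docs. `ParsedBlock` and `to_content_list` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ParsedBlock:
    """Hypothetical output record from whatever external parser you use."""
    kind: str      # "text", "table", or "image"
    payload: str   # text, table markup, or an image path
    page: int = 0


def to_content_list(blocks: list[ParsedBlock]) -> list[dict[str, Any]]:
    """Map parser output to typed content dicts (field names are assumed)."""
    # Assumed mapping from block kind to the key that carries its payload.
    payload_key = {"text": "text", "table": "table_body", "image": "img_path"}
    content_list = []
    for block in blocks:
        if block.kind not in payload_key:
            raise ValueError(f"unsupported block kind: {block.kind!r}")
        content_list.append(
            {
                "type": block.kind,
                payload_key[block.kind]: block.payload,
                "page_idx": block.page,
            }
        )
    return content_list


blocks = [
    ParsedBlock("text", "Revenue grew 12% year over year.", page=0),
    ParsedBlock("table", "| Year | Revenue |\n|---|---|\n| 2023 | 1.2M |", page=1),
]
content_list = to_content_list(blocks)

# Hand-off, not executed here (needs a configured RAGAnything instance;
# the argument names are an assumption):
#   await rag.insert_content_list(content_list=content_list, file_path="report.pdf")
```

Keeping the conversion in one small function means a parser swap only touches the code that produces `ParsedBlock` records.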
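
For the LibreOffice issue, a preflight check in your own code is one way to get the graceful degradation the library lacks. This is a sketch using only the standard library, not something RAG-Anything ships; `find_libreoffice` and `can_process` are hypothetical helpers, and the executable names are the ones LibreOffice commonly installs under, which may differ on your platform.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional, Sequence

# Common LibreOffice executable names; adjust for your platform.
LIBREOFFICE_CANDIDATES = ("soffice", "libreoffice")
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}


def find_libreoffice(
    candidates: Sequence[str] = LIBREOFFICE_CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def can_process(filename: str, libreoffice_path: Optional[str]) -> bool:
    """Office files need LibreOffice; every other file type passes through."""
    suffix = Path(filename).suffix.lower()
    return suffix not in OFFICE_SUFFIXES or libreoffice_path is not None
```

Calling `can_process(name, find_libreoffice())` before handing a file to the pipeline lets you skip or queue Office documents with a clear message instead of hitting a runtime failure mid-batch.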
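
Since there's no cost estimation tooling, a back-of-envelope calculation is better than nothing before indexing a large corpus. The sketch below assumes one LLM extraction call per chunk. Every number in the usage example (chunk size, prompt overhead, output length, prices) is a placeholder, not a measured LightRAG figure, and the estimate ignores chunk overlap, retries, any extra extraction passes, and vision-model calls for images.

```python
import math


def estimate_indexing_cost(
    total_tokens: int,
    chunk_tokens: int,
    prompt_overhead_tokens: int,
    output_tokens_per_chunk: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> dict[str, float]:
    """Rough cost of one extraction call per chunk (all inputs are assumptions)."""
    chunks = math.ceil(total_tokens / chunk_tokens)
    # Each call pays for the chunk text plus a fixed instruction prompt.
    input_tokens = total_tokens + chunks * prompt_overhead_tokens
    output_tokens = chunks * output_tokens_per_chunk
    cost = (
        input_tokens / 1000 * usd_per_1k_input
        + output_tokens / 1000 * usd_per_1k_output
    )
    return {
        "chunks": chunks,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "usd": round(cost, 2),
    }


# Placeholder inputs: a 1M-token corpus with made-up prices.
estimate = estimate_indexing_cost(
    total_tokens=1_000_000,
    chunk_tokens=1200,
    prompt_overhead_tokens=800,
    output_tokens_per_chunk=500,
    usd_per_1k_input=0.001,
    usd_per_1k_output=0.002,
)
```

The useful takeaway is the shape of the formula: the fixed prompt overhead is paid once per chunk, so smaller chunks raise the bill even when the corpus size stays the same.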
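
On the chunking gap, one workaround is to chunk at block boundaries yourself before content reaches the index. The sketch below is a generic greedy packer, not LightRAG's or RAG-Anything's chunker: it treats each block (paragraph, table, equation) as atomic and never splits one. `chunk_blocks` and the block dict shape are hypothetical, and it measures size in characters for simplicity where a real setup would probably count tokens.

```python
def chunk_blocks(blocks: list[dict], max_chars: int) -> list[list[dict]]:
    """Greedily pack blocks into chunks without ever splitting a block.

    Each block is a dict with "type" and "text" keys. A single block longer
    than max_chars gets a chunk of its own rather than being cut.
    """
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_len = 0
    for block in blocks:
        size = len(block["text"])
        # Close the current chunk if adding this block would overflow it.
        if current and current_len + size > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(block)
        current_len += size
    if current:
        chunks.append(current)
    return chunks
```

An oversized table still ends up as one large chunk here, which trades a bigger context for an intact table. Whether that trade is right depends on your embedding model's input limit.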