// the find

VectifyAI/PageIndex

★ 32,928 · Python · MIT · updated Jun 2026

📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG

PageIndex replaces vector similarity search with LLM-driven tree traversal over structured document indexes. You build a hierarchical table of contents from a PDF, then ask the LLM to reason its way through that tree to find the relevant sections. It is aimed at anyone doing RAG over long professional documents where semantic chunking fails: financial filings, legal documents, technical manuals.

The core insight is sound: for structured documents, traversing a semantic tree is more precise than cosine similarity over arbitrary chunks, and the reported 98.7% on FinanceBench is a credible benchmark result, not a toy demo. The JSON tree output is clean and LLM-friendly: each node carries a summary and a page range, so the LLM can prune branches without reading their full content. Multi-LLM support via LiteLLM is a practical choice that avoids vendor lock-in. The agentic example built on the OpenAI Agents SDK shows the pattern works end-to-end, not just in theory.
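The pruning idea can be sketched in a few lines. This is a minimal illustration, not PageIndex's actual schema or API: the `Node` fields, `retrieve`, and `keyword_selector` are hypothetical names, and the selector is a keyword-overlap stand-in for what would be an LLM call that sees only titles and summaries.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    # Hypothetical node shape: a title, a short summary, and a page range.
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


Selector = Callable[[str, List[Node]], List[Node]]


def keyword_selector(query: str, candidates: List[Node]) -> List[Node]:
    # Stand-in for an LLM call (assumption): keep candidates whose title or
    # summary shares at least one word with the query.
    terms = set(query.lower().split())
    return [
        n for n in candidates
        if terms & set((n.title + " " + n.summary).lower().split())
    ]


def retrieve(root: Node, query: str, select: Selector) -> List[Node]:
    """Descend the tree, expanding only the children the selector keeps."""
    hits: List[Node] = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if not node.children:
            hits.append(node)  # leaf: these pages would be read in full
            continue
        # Only summaries are inspected here; rejected branches are never opened.
        frontier.extend(select(query, node.children))
    return hits


root = Node("Annual Report", "full filing", 1, 120, [
    Node("Business Overview", "products and markets", 1, 30),
    Node("Risk Factors", "market credit and liquidity risk", 31, 60, [
        Node("Liquidity Risk", "funding and cash reserves", 31, 45),
        Node("Credit Risk", "counterparty exposure", 46, 60),
    ]),
    Node("Financial Statements", "balance sheet and income", 61, 120),
])

hits = retrieve(root, "liquidity", keyword_selector)
print([(n.title, n.start_page, n.end_page) for n in hits])
# → [('Liquidity Risk', 31, 45)]
```

Swapping `keyword_selector` for a function that prompts a model with the candidate summaries would give the reasoning-based variant; the traversal loop itself would not need to change.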

The open-source package is explicitly second-class: complex PDFs require the vendor's paid cloud OCR pipeline, which means the repo is essentially a demo of an idea with the real implementation behind a paywall. Index construction is LLM-heavy. Building the tree for a large document takes many API calls, which is slow and expensive, and there is no caching or incremental update story for documents that change. Retrieval latency is unpredictable: tree depth determines how many sequential LLM calls you need, so a narrow query against a deep document could be an order of magnitude slower than a vector search. There are no benchmarks beyond the single FinanceBench result, so performance on non-financial document types is uncharacterized.
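A back-of-envelope sketch of the latency point, assuming one selection call per internal node on the path being followed. The `children` key, both function names, and the 2-second per-call figure are illustrative placeholders, not PageIndex's format or a measurement.

```python
def max_sequential_calls(node: dict) -> int:
    """Worst case: one selection call per internal node on the longest path."""
    children = node.get("children", [])
    if not children:
        return 0  # a leaf needs no further selection step
    return 1 + max(max_sequential_calls(child) for child in children)


def estimated_latency_s(node: dict, seconds_per_call: float) -> float:
    # The calls are sequential, so their latencies add rather than overlap.
    return max_sequential_calls(node) * seconds_per_call


# Hypothetical document outline, three levels of internal nodes deep.
doc = {"title": "Manual", "children": [
    {"title": "Part I", "children": [
        {"title": "Chapter 1", "children": [
            {"title": "Section 1.1"},
            {"title": "Section 1.2"},
        ]},
    ]},
    {"title": "Appendix"},
]}

print(max_sequential_calls(doc))      # → 3
print(estimated_latency_s(doc, 2.0))  # → 6.0 (with the assumed 2 s per call)
```

Under this model the cost grows with depth rather than document length, which is why a deeply nested document is the bad case for a narrow query.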
