// the find
microsoft/graphrag
A modular graph-based Retrieval-Augmented Generation (RAG) system
GraphRAG is Microsoft Research's pipeline for building knowledge graphs from unstructured text, then using those graphs to answer questions that require synthesizing information across many documents — the classic RAG failure mode. It's aimed at researchers and engineers who need to query large private document corpora where a single retrieved chunk won't contain the answer.
The global search mode is the actual differentiator: it can answer 'what are the main themes across all these documents?' by aggregating community summaries, something flat vector search cannot do because no single retrieved chunk contains the answer. The repo has been actively maintained through three major versions with real migration tooling, which is rare for a research project. The modular package structure (graphrag-llm, graphrag-chunking, graphrag-cache as separate PyPI packages) means you can swap components without forking the whole thing. Prompt auto-tuning against your actual data is a practical feature, since off-the-shelf entity extraction prompts usually work poorly on domain-specific text.
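To make the global search idea concrete, here is a minimal sketch of the map-reduce shape it takes: ask the question against every community summary, keep the partial answers that scored as useful, and combine them. This is not GraphRAG's code. `global_search`, `PartialAnswer`, `toy_map` and `toy_reduce` are hypothetical names, and the two toy functions use keyword overlap as a stand-in for what would be LLM calls in a real pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PartialAnswer:
    community_id: int
    text: str
    score: int  # how useful this community was for the question


def global_search(
    question: str,
    community_summaries: dict[int, str],
    map_fn: Callable[[str, int, str], PartialAnswer],
    reduce_fn: Callable[[str, list[PartialAnswer]], str],
    top_k: int = 3,
) -> str:
    # Map: query every community summary independently.
    partials = [
        map_fn(question, cid, summary)
        for cid, summary in community_summaries.items()
    ]
    # Drop unhelpful partial answers and keep the top_k best.
    useful = sorted(
        (p for p in partials if p.score > 0),
        key=lambda p: p.score,
        reverse=True,
    )[:top_k]
    # Reduce: merge the surviving partial answers into one response.
    return reduce_fn(question, useful)


# Toy stand-ins (assumptions): a real system would call an LLM in both.
def toy_map(question: str, cid: int, summary: str) -> PartialAnswer:
    overlap = set(question.lower().split()) & set(summary.lower().split())
    return PartialAnswer(cid, summary, score=len(overlap))


def toy_reduce(question: str, partials: list[PartialAnswer]) -> str:
    return " | ".join(p.text for p in partials)


summaries = {
    0: "supply chain themes dominate the logistics reports",
    1: "staff turnover appears in the HR memos",
    2: "pricing themes recur across the sales reports",
}
answer = global_search("main themes across reports", summaries, toy_map, toy_reduce)
print(answer)
```

The point of the shape is that every community gets a vote, so the answer can draw on the whole corpus rather than on whichever chunks happen to sit nearest the query embedding. It also shows why global queries are not cheap: the map step is one call per community summary.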
Indexing cost is brutal and they say so themselves: extracting entities and relationships from every document chunk means many LLM calls per document, not one. On a corpus of any real size this can run to hundreds of dollars before you query anything. The output is also brittle across minor version bumps: the docs tell you to re-run `graphrag init --force` between minor versions, which overwrites your prompts. Community detection (the Leiden algorithm) only helps if the graph structure is meaningful, and if your documents don't have strong entity relationships the graph ends up as noise. Local search still relies on vector similarity to find the entities for entity-grounded questions, so you're not escaping embedding quality problems, just layering graph context on top.
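A back-of-the-envelope estimate shows how the indexing bill gets there. Every number in this sketch is an assumption to be replaced with your own: the chunk size, the calls per chunk, the prompt overhead, the output length, and especially the per-token prices, which are placeholders rather than any provider's actual rates. `estimate_indexing_cost` is a hypothetical helper, not part of GraphRAG.

```python
import math


def estimate_indexing_cost(
    total_tokens: int,
    chunk_size: int = 1200,             # assumed tokens per chunk
    calls_per_chunk: int = 3,           # assumed: extraction plus follow-up passes
    prompt_overhead: int = 1500,        # assumed instruction tokens per call
    output_tokens_per_call: int = 800,  # assumed
    usd_per_1m_input: float = 2.50,     # placeholder price
    usd_per_1m_output: float = 10.00,   # placeholder price
) -> dict:
    chunks = math.ceil(total_tokens / chunk_size)
    llm_calls = chunks * calls_per_chunk
    # Each call re-sends the chunk plus the extraction prompt.
    input_tokens = llm_calls * (chunk_size + prompt_overhead)
    output_tokens = llm_calls * output_tokens_per_call
    usd = (
        input_tokens / 1e6 * usd_per_1m_input
        + output_tokens / 1e6 * usd_per_1m_output
    )
    return {"chunks": chunks, "llm_calls": llm_calls, "usd": round(usd, 2)}


# A hypothetical 12M-token corpus.
est = estimate_indexing_cost(12_000_000)
print(est)
```

Under these made-up inputs a 12M-token corpus lands in the hundreds of dollars. The calls-per-chunk multiplier dominates, so a cheaper model or fewer extraction passes moves the total far more than tweaking chunk size.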