Applying graph RAG to a real product

Diagram of a property graph with labelled nodes and edges, the data structure underpinning graph-based RAG variants such as GraphRAG or LightRAG

Over the last year and a half, the idea of enriching traditional RAG with graph structures has moved from research topic to real product. The push came partly from Microsoft Research’s GraphRAG release in mid-2024, but the pattern has evolved much further: today there are lightweight variants (LightRAG, HippoRAG), open implementations running on Neo4j or Memgraph, and templates ready to try over a weekend. As of March 2025, the ecosystem is mature enough to start asking when it pays off and when it doesn’t.

This post sums up what I’ve seen work when applying graph RAG to concrete products, and the typical problems that appear moving from demo to daily use. It doesn’t dive into detailed theory; it does dive into practical architecture decisions.

Why graphs add over vector RAG

Classic RAG indexes text fragments as vectors and retrieves by similarity. It works very well when the user’s question is answered by one or two concrete text pieces: “what’s the return policy?”, “what does the manual say about error 503?”. The problem appears when the question requires understanding relationships between scattered entities: “give me the decisions Ana made on Project Apollo last quarter and who else was involved”.

That’s where vectors lose their footing. The relevant fragments are scattered across different documents, and pure similarity doesn’t capture the relations among people, projects, decisions, and dates. A graph lets you model that network explicitly: nodes for people, projects, artifacts, and decisions; edges for the links among them, extracted from the texts. When a question arrives, the system can traverse the graph, collect the relevant nodes, and only then hand the LLM that focused subset.

The graph’s second contribution is multi-hop reasoning. A question like “which customers have had incidents related to library X?” requires connecting incidents → components → libraries. Vectors do this poorly; a graph query does it very well. The LLM doesn’t have to make that jump itself; you hand it the result ready-made.
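To make the multi-hop idea concrete, here is a minimal sketch of the incidents → components → libraries traversal over a toy adjacency list. The schema, node identifiers, and edge labels are all illustrative, not taken from any specific product:

```python
# Hypothetical property graph as an adjacency list: node -> [(relation, target)].
# Identifiers and edge labels are invented for illustration.
GRAPH = {
    "incident:1042": [("affects", "component:billing"), ("reported_by", "customer:acme")],
    "incident:1043": [("affects", "component:search"), ("reported_by", "customer:globex")],
    "component:billing": [("uses", "lib:openssl")],
    "component:search": [("uses", "lib:lucene")],
}

def customers_with_incidents_on(library: str) -> set[str]:
    """Two-hop query: incidents whose affected component uses `library`,
    then the customers who reported those incidents."""
    hits: set[str] = set()
    for node, edges in GRAPH.items():
        if not node.startswith("incident:"):
            continue
        components = [target for rel, target in edges if rel == "affects"]
        # Hop 2: does any affected component use the library in question?
        if any(("uses", library) in GRAPH.get(c, []) for c in components):
            hits |= {target for rel, target in edges if rel == "reported_by"}
    return hits
```

In a real deployment this would be a single declarative query against the graph store (e.g. Cypher on Neo4j), but the shape of the traversal is the same.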

When it pays off and when it doesn’t

Adding a graph isn’t a decision to make because the concept is trendy. It pays off in two clear scenarios.

First, when the domain has well-defined entities and relationships users ask about. Technical support with customers, products, versions, incidents. Internal research with projects, people, decisions, artifacts. Compliance with policies, processes, controls, audits. In all these cases the graph reflects how people think about the domain, not an imposed abstraction.

Second, when the corpus is stable. A graph over documents changing every day requires a robust update pipeline, and that pipeline consumes a significant slice of cost. If documents update weekly or monthly, the graph is more manageable.

It doesn’t pay off when the corpus is small (fewer than a few thousand documents), because vector RAG with good reranking already solves it. Nor when questions are single-fragment lookups (“what does the contract say in clause X?”) with no real need to jump across entities. And least of all when the team has no graph-modelling experience; learning on the fly is costly, and early failures sink the project.

How to assemble it without over-engineering

The usual flow has four pieces.

First, entity and relationship extraction. A strong LLM (Claude 3.7 Sonnet, GPT-4o, Gemini 2.0) processes the corpus in batches and returns triples (subject, relation, object) with metadata. This sounds expensive but isn’t, particularly: for a typical corpus of tens of thousands of documents, the 2025 API bill sits well below what many imagine, provided extraction runs in batch with well-designed prompts. Prompt quality determines graph quality, so it’s worth iterating here before scaling.
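The extraction step’s output has to be validated before it enters the graph, because LLM responses are occasionally malformed. A minimal sketch, assuming the model is asked to return a JSON list of triples (the field names and the `source_doc` provenance key are my own conventions, not a standard):

```python
import json

def parse_triples(llm_output: str, doc_id: str) -> list[dict]:
    """Validate an LLM batch response (a JSON list of triples) and attach
    provenance metadata. Malformed rows are dropped, not fatal."""
    raw = json.loads(llm_output)
    triples = []
    for item in raw:
        # Require the three core fields; skip anything incomplete.
        if not all(key in item for key in ("subject", "relation", "object")):
            continue
        # Record which document the triple came from; this provenance
        # is what later makes per-document invalidation possible.
        triples.append({**item, "source_doc": doc_id})
    return triples
```

Keeping the per-document provenance on every triple from day one pays for itself later, when documents start changing.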

Second, consolidation. The LLM extracts many entities that are actually the same person or same project written differently. Consolidating them needs an embedding-similarity step plus domain-specific rules. Without consolidation, the graph becomes a forest of near-duplicate nodes that is painful to navigate.
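The shape of that consolidation step can be sketched as follows. A production pipeline would compare embedding vectors with cosine similarity; here, to stay self-contained, stdlib `SequenceMatcher` stands in for the similarity function, and the 0.85 threshold is an illustrative value you would tune per domain:

```python
from difflib import SequenceMatcher

def consolidate(entities: list[str], threshold: float = 0.85) -> dict[str, str]:
    """Map each entity name to a canonical representative.
    SequenceMatcher is a stdlib stand-in for embedding similarity."""
    canonical: dict[str, str] = {}
    representatives: list[str] = []
    for name in entities:
        key = name.lower().strip()
        # Find the first existing representative that is similar enough.
        match = next(
            (rep for rep in representatives
             if SequenceMatcher(None, key, rep.lower()).ratio() >= threshold),
            None,
        )
        if match is None:
            representatives.append(name)  # new canonical entity
            canonical[name] = name
        else:
            canonical[name] = match       # merge into existing entity
    return canonical
```

Domain rules (e.g. never merging two entities with different employee IDs) sit on top of the similarity check; similarity alone over-merges.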

Third, storage. Neo4j remains the default choice for property graphs, but there are lighter alternatives (Memgraph, KuzuDB) and also implementations on PostgreSQL with extensions. My advice, unless very large scale is needed, is to start with whatever the team already has installed. For Docker-based projects, KuzuDB is surprisingly capable and avoids an extra piece in the stack.

Fourth, query time. There are variants. Microsoft GraphRAG generates node communities beforehand and uses them as a hierarchical summary. LightRAG takes a leaner approach, without precomputing communities. HippoRAG emphasises hippocampus-inspired knowledge structure. In practice, for a first project, starting with the LightRAG pattern is usually faster to stand up and sufficient to validate the hypothesis.

Problems that appear in production

There are things you don’t see in the demo that hit after two weeks of real use.

First is document updates. When a document changes, you can’t simply re-extract it; you need to invalidate the triples that came from it and generate new ones, keeping consistency with entities already in the graph. Without a pipeline built on stable per-document identifiers, the graph slowly corrupts.
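The invalidation logic above reduces to indexing triples by their source document, so that re-ingesting a changed document atomically replaces exactly its own triples. A minimal sketch (class name and interface are hypothetical):

```python
class TripleStore:
    """Minimal sketch: triples keyed by source document, so that a changed
    document invalidates exactly the triples it produced."""

    def __init__(self) -> None:
        self.by_doc: dict[str, list[tuple[str, str, str]]] = {}

    def ingest(self, doc_id: str, triples: list[tuple[str, str, str]]) -> None:
        # Re-ingesting a document replaces its old triples in one step;
        # stale edges from the previous version cannot linger.
        self.by_doc[doc_id] = list(triples)

    def all_triples(self) -> set[tuple[str, str, str]]:
        return {t for triples in self.by_doc.values() for t in triples}
```

A real store would also re-run entity consolidation on the affected nodes after each ingest, since a replaced triple can orphan or split an entity.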

Second is verification. Triples extracted by the LLM are probabilistic: sometimes the relation doesn’t exist in the text, or a negation is misread. A production system needs at least human sampling to measure error rate, and sometimes a second automatic verification pass with a model reviewing dubious triples.
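The human-sampling part of verification is simple to operationalise: draw a reproducible random sample of triples per batch and route it to a review queue. A sketch, where the 5% rate and fixed seed are illustrative defaults, not recommendations:

```python
import random

def sample_for_review(triples: list[dict], rate: float = 0.05,
                      seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of triples for human spot-checking.
    The observed error rate on the sample estimates the batch's error rate."""
    rng = random.Random(seed)  # fixed seed -> the same sample is re-drawable
    k = max(1, round(len(triples) * rate))
    return rng.sample(triples, k)
```

Tracking the sampled error rate over time is also what tells you when a prompt or model change has silently degraded extraction quality.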

Third, often underestimated, is latency. A complete query (subgraph retrieval + expansion + generation) can easily exceed one second. When the product has a synchronous chat UX, that’s perceptible. You optimise the graph query with specific indexes and cache subgraphs for frequent questions.
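The subgraph cache for frequent questions can be as simple as an LRU cache keyed on a normalized form of the question, so near-identical phrasings share an entry. A sketch with the expensive retrieval stubbed out (in production it would hit the graph store); the call counter exists only to make the cache behaviour observable:

```python
from functools import lru_cache

CALLS = {"n": 0}  # instrumentation for the sketch, not production code

def normalize(question: str) -> str:
    # Collapse case and whitespace so trivially different phrasings
    # of the same question map to one cache key.
    return " ".join(question.lower().split())

@lru_cache(maxsize=1024)
def retrieve_subgraph(key: str) -> frozenset:
    CALLS["n"] += 1  # stand-in for an expensive multi-hop graph query
    return frozenset({key})  # hypothetical subgraph payload

def answer(question: str) -> frozenset:
    return retrieve_subgraph(normalize(question))
```

The limitation is obvious: lexical normalization only catches exact repeats, so some teams key the cache on an embedding-cluster ID instead, trading correctness risk for a higher hit rate.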

Fourth is operating cost. Extracting a graph is a high one-time expense, but keeping it fresh is recurring. If the corpus receives 1000 new documents a week, the extraction pipeline must run without breaking and without exhausting the API quota. I’ve seen this break several times in production systems that hadn’t sized the pipeline.

A hint on evaluation

Evaluating graph RAG is harder than evaluating vector RAG. Measuring the relevance of retrieved fragments isn’t enough, because the retrieved unit is a subgraph. What has worked for me is defining a set of synthetic domain questions with expected answers, measuring whether the retrieved subgraph contains the nodes and relations needed to answer, and then measuring the quality of the generated answer. Two metrics, not one; both matter.
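The first of those two metrics, subgraph coverage, reduces to a recall computation over the expected nodes and relations. A minimal sketch (the function name and set-based representation are my own conventions):

```python
def subgraph_recall(expected: set, retrieved: set) -> float:
    """Fraction of the required nodes/relations present in the retrieved
    subgraph. This is the retrieval-side metric; answer quality is
    judged separately, usually by an LLM grader or human review."""
    if not expected:
        return 1.0  # nothing was required, so retrieval cannot have failed
    return len(expected & retrieved) / len(expected)
```

Reporting this per question type (single-hop vs multi-hop) rather than as one aggregate number is what makes the comparison with the vector baseline informative.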

The comparison with a pure vector RAG baseline should always be done. Sometimes the graph improves a lot on multi-hop questions but worsens on simple ones because of extra retrieval cost. Measuring both types separately is the only honest way to decide whether the graph is worth it for the concrete product.

The bottom line

The balance, seeing projects that reached production and others that didn’t, is that graph RAG is valuable when the domain is inherently relational, the corpus is stable, and the team has enough experience to sustain the extraction pipeline. In those cases the quality jump on complex questions is real and users notice.

Outside those scenarios, the maintenance overhead doesn’t pay off. A good vector RAG with reranking and explicit source citation remains the right choice for most products I’ve seen. Adding a graph without real need is one of the classic traps of AI engineering in 2024-2025: it happens because it sounds good in the client pitch, not because it solves a user problem.

My practical advice is to spend a week building a prototype, evaluate it against the vector baseline with real product questions, and decide with numbers. If the improvement is marginal, it stays as a card for later; if it’s clear, you build the pipeline carefully and take it to production. I’ve never seen it work well as an a-priori decision without experimenting.
