RAG in Production: Patterns That Work and Those That Don’t


After two years of running RAG in production, the patterns that separate successful deployments from disappointing ones have become clear. This article compiles those lessons, both working techniques and anti-patterns to avoid, for teams that have the basics in place and want to scale.

Chunking: Not Simple

Anti-pattern: naive chunking (500-token fixed splits).

Working patterns:

  • Semantic chunking: split by semantic boundaries, not tokens.
  • Hierarchical chunks: document → sections → paragraphs.
  • Strategic overlap: 10-20% overlap between chunks to preserve context.
  • Metadata-rich: tags, author, date per chunk.

Libraries: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter, custom.
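A minimal sketch of the strategic-overlap pattern, assuming the text is already tokenised into a list; semantic and hierarchical splitting need an embedding model or document structure on top of this:

```python
def chunk_with_overlap(tokens, size=500, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks where consecutive
    chunks share ~overlap_ratio of their tokens, preserving context
    across chunk boundaries."""
    step = max(1, int(size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

The same skeleton generalises: replace the fixed `size` cut with sentence or heading boundaries to get semantic chunking.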

Hybrid Search

Vector-only search misses exact-match queries (SKUs, error codes, product names). Combine both approaches:

  • BM25 (keyword) + vector (semantic).
  • Reciprocal Rank Fusion (RRF): merge results.
  • Result: significantly better recall.

Elastic, OpenSearch, Weaviate, Qdrant offer native hybrid.
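RRF itself is a few lines. A sketch over assumed document IDs, using the conventional k = 60 smoothing constant:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores the sum of
    1 / (k + rank) over every ranked list it appears in; documents
    ranked well by either retriever float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Merge a keyword (BM25) ranking with a vector ranking:
bm25_hits = ["sku-123", "pricing-faq", "returns"]
vector_hits = ["pricing-faq", "shipping", "sku-123"]
fused = rrf_merge([bm25_hits, vector_hits])
```

Documents appearing in both lists ("pricing-faq", "sku-123") outrank those found by only one retriever.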

Re-Ranking

Production patterns:

  1. Retrieval: top-100 with embeddings.
  2. Re-rank: cross-encoder (Cohere Rerank, BGE-reranker) → top-10.
  3. LLM generation with top-10.

15-30% precision improvement.
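The three-step pipeline above can be sketched as follows; `score_fn` stands in for a real cross-encoder call (Cohere Rerank or BGE-reranker), and the word-overlap scorer is only a toy placeholder:

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Stage 2 of the pipeline: score each retrieved candidate
    against the query with a cross-encoder and keep the best top_k
    for the LLM context."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: shared-word count."""
    return len(set(query.split()) & set(doc.split()))
```

In production, `score_fn` would batch the (query, document) pairs through the cross-encoder in a single call rather than scoring one by one.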

Query Transformation

Bad user queries → bad retrieval. Techniques:

  • HyDE: generate hypothetical answer, embed that, search.
  • Query expansion: synonyms, related terms.
  • Sub-queries: decompose complex query into multiple.
  • Query classification: route per type.

An LLM can perform all of these transformations before retrieval.
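The HyDE flow can be sketched with the model calls injected as functions; `generate_fn`, `embed_fn`, and `search_fn` are placeholders for your LLM, embedder, and vector store:

```python
def hyde_search(query, generate_fn, embed_fn, search_fn, top_k=5):
    """HyDE: instead of embedding the (often poorly phrased) user
    query, generate a hypothetical answer, embed that, and search
    with the hypothetical's embedding."""
    hypothetical = generate_fn(
        f"Write a short passage that answers: {query}"
    )
    return search_fn(embed_fn(hypothetical), top_k)
```

The intuition: a hypothetical answer lives in the same embedding neighbourhood as real answer passages, whereas a terse question often does not.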

Metadata Filtering

To pre-filter retrieval space:

# Pre-filter by metadata before the vector search runs.
# (Filter syntax varies by vector store; this is the
# Mongo-style operator form some stores accept.)
results = vectorstore.similarity_search(
    query,
    filter={"department": "engineering", "date": {"$gt": "2024-01-01"}},
)

Faster + more relevant than searching everything and filtering after.

Evaluation

Continuous eval essential:

  • Golden dataset: 100-500 curated query-answer pairs.
  • Ragas metrics: faithfulness, relevance, precision.
  • Periodic human review.
  • A/B testing when changing components.

Without evaluation, every "improvement" is a placebo.
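A crude sketch of a golden-dataset check, using substring containment as the pass criterion; real setups would score faithfulness and relevance with Ragas instead:

```python
def evaluate_golden(pipeline, golden):
    """Run the RAG pipeline over curated (query, expected) pairs
    and return the fraction of answers containing the expected
    string -- a cheap regression signal, not a quality metric."""
    hits = sum(
        1
        for query, expected in golden
        if expected.lower() in pipeline(query).lower()
    )
    return hits / len(golden)
```

Run it on every component change; a drop on the golden set blocks the deploy.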

Anti-Patterns

1. Context Stuffing

Throwing 20 chunks at the LLM. Result: "lost in the middle" degradation and hallucinations.

Better: relevant top-5, force citations.

2. Embedding Staleness

Mixing embeddings from different models in the same index breaks similarity scores. Either keep the embedding model consistent, or do a complete reindex when you change it.

3. No Caching

Same repeated queries = same LLM costs. Cache:

  • Embedding cache: Redis by query text.
  • Result cache: by query + filters.

4. Ignoring Latency

Production RAG: <2s end-to-end target. Retrieve + rerank + LLM generation can easily exceed it.

Optimisation:

  • Parallelise retrieval + LLM prefetch.
  • Streaming responses.
  • Aggressive cache.

5. No Observability

Log query + retrieved chunks + answer. Weekly review. Without data, iteration is blind.

Architecture Pattern

Typical production stack:

[User query]
    ↓
[Query analysis + transformation]
    ↓
[Hybrid search] → [vector DB + keyword]
    ↓
[Re-ranker]
    ↓
[LLM with context + citations]
    ↓
[Response]

Cache at each step if applicable.

Vector DB Choice

For production:

  • pgvector: if already using Postgres (most teams are).
  • Qdrant: purpose-built, great performance.
  • Weaviate: native hybrid, more features.
  • Pinecone: managed, simple, pricier.
  • Elasticsearch/OpenSearch: mature hybrid.

Decision: pgvector default; specialised if scale requires.

Costs

  • Embeddings: one-time + updates.
  • LLM generation: per query.
  • Vector DB: storage + queries.
  • Re-ranker: per query.

Optimise: cache hits, smaller models where sufficient, batch processing.
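The per-query arithmetic is worth making explicit. The token counts and prices below are illustrative placeholders (expressed per million tokens), not real vendor pricing:

```python
def cost_per_query(prompt_tokens, completion_tokens,
                   price_in, price_out, rerank_cost=0.0):
    """Rough per-query cost: LLM input + output tokens (prices per
    1M tokens) plus an optional flat re-ranker charge."""
    llm = (prompt_tokens * price_in + completion_tokens * price_out) / 1e6
    return llm + rerank_cost

# e.g. 4k context + 500-token answer at $3/$15 per 1M tokens:
cost = cost_per_query(4000, 500, price_in=3.0, price_out=15.0)
```

Multiply by expected query volume and the cache hit rate becomes the single biggest lever.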

Streaming Responses

# Retrieve first
chunks = retriever.get_relevant_documents(query)

# Stream LLM response
for token in llm.stream(prompt_with_context):
    yield token

Significantly better perceived latency.

Production Monitoring

Track:

  • Retrieval recall: queries that retrieved relevant doc (requires ground truth).
  • Answer accuracy: via Ragas.
  • p50/p95/p99 latency.
  • Token usage per query.
  • Cost per query.
  • User satisfaction: thumbs up/down.

Continuous dashboard to see drift.
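p50/p95/p99 can be computed over logged latencies with a nearest-rank percentile; a minimal sketch:

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile (p in 0-100) over a list of
    latency samples, e.g. milliseconds per query."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Tracking p95/p99 rather than the mean is what surfaces the slow tail (cold caches, long reranks) that users actually feel.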

Iteration

Production RAG isn’t “deploy + forget”:

  • Add new docs continuously.
  • Re-evaluate periodically.
  • Update chunking strategy based on failures.
  • Swap components (LLM, embeddings, re-ranker) as better options emerge.

Conclusion

Production RAG is engineering discipline, not magic. Patterns are known; execution varies. Successful teams invest in evaluation, observability, iteration. Failing teams deploy “basic RAG” and expect perfect. For new projects, start with these patterns from day 1 — cheaper than rearchitecting. For existing systems, audit against this list and optimise priorities.

Follow us on jacar.es for more on RAG patterns, production ML, and AI architectures.
