RAG in Production: Patterns That Work and Those That Don’t

Updated: 2026-05-03

After two years of RAG in production, the patterns that separate successful from disappointing deployments are clear. This article compiles the lessons — working techniques and anti-patterns to avoid — for teams that already have a basic RAG running and want to scale it reliably.

Key takeaways

  • Naive 500-token fixed-split chunking is the most common error and the one with the highest impact on retrieval quality.
  • Hybrid search (BM25 + vector) significantly improves recall versus vector-only — especially for proper nouns, codes, and exact terms.
  • A re-ranking stage narrows the top-100 retrieved candidates to the top-10 sent to the LLM, with a typical precision improvement of 15-30%.
  • Without a golden dataset and continuous evaluation metrics, improvements are placebos — you cannot optimise what you don’t measure.
  • Production RAG is an engineering discipline, not magic: the patterns are known, the execution varies.

Chunking: not simple

Anti-pattern: naive chunking with fixed 500-token splits, ignoring document structure.

Working patterns:

  • Semantic chunking: split by semantic boundaries, not token count.
  • Hierarchical chunks: document → sections → paragraphs, enabling retrieval at multiple granularity levels.
  • Strategic overlap: 10-20% overlap between chunks to preserve context at boundaries.
  • Metadata-rich: tags, author, date, section per chunk — enables effective filtering.

Libraries: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter, custom implementations depending on document type.
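
As a rough illustration of structure-aware chunking with overlap and per-chunk metadata, here is a minimal sketch in plain Python. It is not how the libraries above work internally: `Chunk`, `chunk_paragraphs`, and the word-based size limit are assumptions made for the example, and paragraphs are treated as the only boundary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_paragraphs(text: str, max_words: int = 200, overlap_ratio: float = 0.15,
                     metadata: Optional[dict] = None) -> list:
    """Group whole paragraphs into chunks of roughly max_words words.

    Paragraphs are never split, so a single long paragraph can exceed max_words.
    Each new chunk starts with the last overlap_ratio share of the previous one.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []     # words of the chunk being built
    has_new = False  # True once `current` holds more than carried-over overlap
    for words in paragraphs:
        if has_new and len(current) + len(words) > max_words:
            chunks.append(current)
            keep = int(len(current) * overlap_ratio)
            current = current[-keep:] if keep else []
            has_new = False
        current = current + words
        has_new = True
    if has_new:
        chunks.append(current)
    # Attach shared metadata plus the position of the chunk in the document
    return [Chunk(" ".join(words), {**(metadata or {}), "chunk_index": i})
            for i, words in enumerate(chunks)]
```

In a real pipeline the boundary detection would follow the document type (headings for Markdown, sections for PDFs), and the size limit would be counted in tokens of the embedding model rather than words.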

Hybrid search: BM25 + vector

Vector-only search fails on exact-match cases: product SKUs, proper nouns, error codes, infrequent technical terms. The combination:

  • BM25 (keyword, exact match) + vector (semantic).
  • Reciprocal Rank Fusion (RRF): merging results from both paths.
  • Result: significantly better recall than either path alone.
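
Reciprocal Rank Fusion is simple enough to sketch in a few lines. The version below assumes each search path returns an ordered list of document ids; the function name and the sample ids are illustrative, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep the output deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["sku-123", "doc-a", "doc-b"]    # keyword path
vector_hits = ["doc-a", "doc-c", "sku-123"]  # semantic path
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF only uses ranks, it does not require BM25 scores and cosine similarities to be on a comparable scale, which is the main reason it is a popular fusion choice.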

Elasticsearch, OpenSearch, Weaviate and Qdrant offer native hybrid search. For vector database selection context, see pgvector: maturity in 2024.

Re-ranking: from 100 candidates to 10 relevant ones

The standard pipeline in mature production:

  1. Retrieval: top-100 with embeddings (high recall, moderate precision).
  2. Re-ranking: cross-encoder (Cohere Rerank, BGE-reranker) → top-10 (high precision).
  3. LLM generation with the top-10 in context.

Typical precision improvement: 15-30%. Re-ranking adds latency (50-150ms), but the trade-off is worth it for critical responses.
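
The two-stage shape can be sketched independently of any vendor. In the example below, `rerank` takes the scoring function as a parameter; `toy_overlap_score` is only a stand-in so the snippet runs without a model. In production that function would call a cross-encoder such as the ones named above.

```python
from typing import Callable


def rerank(query: str, candidates: list, score: Callable[[str, str], float],
           top_k: int = 10) -> list:
    """Score every (query, candidate) pair and keep the top_k best candidates."""
    scored = [(score(query, doc), doc) for doc in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0
```

Keeping the scorer pluggable also makes it easy to A/B test re-rankers against the same retrieval output.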

Query transformation: attacking the root of bad retrieval

Poorly formulated user queries produce bad retrieval. Techniques:

  • HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, and search with it. The answer embedding is usually closer to relevant document embeddings than the question embedding is.
  • Query expansion: add synonyms and related terms.
  • Sub-queries: decompose complex questions into multiple sub-searches.
  • Query classification: route to different indexes or strategies depending on question type.

The LLM can do all these transformations in a single pre-retrieval step.
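
To make HyDE concrete, here is a minimal sketch with the LLM and the embedding model passed in as plain functions. `hyde_search`, `generate`, `embed`, and the in-memory `index` of `(doc_id, vector)` pairs are assumptions for illustration; a real system would use its vector store instead of the brute-force cosine scan.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(question, generate, embed, index, top_k=3):
    """Search with the embedding of a generated answer instead of the question.

    generate: callable(question) -> hypothetical answer text (an LLM call in practice)
    embed:    callable(text) -> vector
    index:    list of (doc_id, vector) pairs
    """
    hypothetical_answer = generate(question)
    query_vector = embed(hypothetical_answer)  # embed the answer, not the question
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```

The hypothetical answer may be factually wrong; that is acceptable here because it is only used as a search probe and never shown to the user.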

Metadata filtering: pre-filter the search space

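```python
# Pre-filter by metadata so similarity is only computed over matching chunks.
# Filter syntax is store-specific; this Mongo-style form is one common variant.
results = vectorstore.similarity_search(
    query,
    filter={"department": "engineering", "date": {"$gt": "2024-01-01"}},
)
```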

Filtering before retrieval is faster and returns more relevant results than searching the full index and filtering afterwards. Well-designed metadata at chunking time is the foundation: you cannot filter on what was not indexed.

Continuous evaluation: without this, everything else is placebo

Golden dataset: 100-500 manually curated query-answer pairs, representative of real use cases.

Ragas metrics:

  • Faithfulness: is the answer supported by the retrieved context?
  • Answer relevance: does the answer address the question?
  • Context precision: are the retrieved chunks relevant?

Process:

  1. Evaluate against golden dataset before any change.
  2. A/B testing when changing components (embedding model, chunking strategy, re-ranker).
  3. Periodic human review of a production answer sample.
  4. Continuous metrics dashboard — retrieval quality drift is silent.

Without evaluation, improvements are placebos. Most teams reporting “improving the RAG” have no baseline metrics.
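
A baseline does not need heavy tooling. The sketch below is not Ragas; it computes two retrieval-level metrics (hit rate and mean reciprocal rank) against a golden dataset without any LLM judge. The dataset shape (`query`, `relevant_ids`) and the `retrieve` callable are assumptions for the example.

```python
def evaluate_retrieval(golden, retrieve, k=10):
    """Return hit rate and MRR of `retrieve` over a golden dataset.

    golden:   list of {"query": str, "relevant_ids": [doc ids]}
    retrieve: callable(query) -> ordered list of doc ids
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = 0
    reciprocal_ranks = []
    for example in golden:
        retrieved = retrieve(example["query"])[:k]
        relevant = set(example["relevant_ids"])
        # Rank (1-based) of the first relevant document, or None if none was found
        rank = next((i for i, doc_id in enumerate(retrieved, start=1)
                     if doc_id in relevant), None)
        if rank is not None:
            hits += 1
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(golden)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Running this before and after each change (new chunking, new embedding model, new re-ranker) gives the baseline comparison that step 1 of the process asks for.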

Most common anti-patterns

1. Context stuffing

Sending 20 chunks to the LLM produces the “lost in the middle” effect: the model tends to overlook relevant information placed in the middle of a long context. Better: send only the 5 most relevant chunks and force explicit citation.
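
One way to combine a small context with forced citation is to number the sources in the prompt and instruct the model to cite them. This is a sketch only: `build_prompt` and the exact instruction wording are assumptions, and the chunks are assumed to arrive already sorted by relevance.

```python
def build_prompt(question, chunks, max_chunks=5):
    """Build a prompt with at most max_chunks numbered sources and a citation rule."""
    selected = chunks[:max_chunks]  # chunks are assumed sorted by relevance
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1))
    return (
        "Answer using only the numbered sources below. "
        "Cite the source number in brackets after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

Numbered citations also make answers auditable: a reviewer can check each bracketed number against the logged chunk.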

2. Embedding staleness

Changing the embedding model without re-indexing the corpus produces silent inconsistencies. Better: full re-index or strict model consistency per index version.
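
A cheap guard against this is to make the embedding model part of the index identity and refuse queries embedded with anything else. The names below (`IndexVersion`, `check_query_model`, and the model ids in the usage note) are hypothetical and serve only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexVersion:
    corpus: str
    embedding_model: str
    dimensions: int

    @property
    def name(self) -> str:
        # The model id is baked into the index name, so a model change forces a new index
        safe_model = self.embedding_model.replace("/", "_")
        return f"{self.corpus}__{safe_model}__{self.dimensions}"


def check_query_model(index: IndexVersion, query_model: str) -> None:
    """Fail loudly instead of silently mixing vectors from two embedding models."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index {index.name} was built with {index.embedding_model}, "
            f"but the query was embedded with {query_model}"
        )
```

For example, `IndexVersion("docs", "embed-model-v1", 768).name` yields `docs__embed-model-v1__768`, and a query embedded with `embed-model-v2` is rejected until a matching index exists.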

3. No caching

Without caching, identical queries incur the same embedding, retrieval, and LLM cost every time:

  • Embedding cache: Redis keyed by query text. For a given embedding model, the embedding of “how much does the basic plan cost?” is always the same.
  • Result cache: by query + active filters.
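
The embedding cache can be sketched as follows. A plain dictionary stands in for Redis here, and `EmbeddingCache`, its whitespace normalisation, and the key format are assumptions for the example, not a prescribed design.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by (model name, normalised query text)."""

    def __init__(self, embed, model_name, store=None):
        self.embed = embed              # callable(text) -> vector
        self.model_name = model_name
        self.store = {} if store is None else store  # dict here; Redis in production

    @staticmethod
    def normalise(text):
        # Collapse whitespace so trivially different inputs share one entry
        return " ".join(text.split())

    def key(self, text):
        raw = f"{self.model_name}:{self.normalise(text)}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, text):
        cache_key = self.key(text)
        if cache_key not in self.store:
            # Cache miss: embed the normalised text once and remember it
            self.store[cache_key] = self.embed(self.normalise(text))
        return self.store[cache_key]
```

Including the model name in the key means a model upgrade naturally invalidates old entries instead of serving stale vectors.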

4. Ignoring latency

The production target is <2s total. Retrieve + re-rank + LLM can easily exceed it:

  • Run retrieval in parallel with any other pre-generation work, such as prefetching LLM context.
  • Stream the LLM response to improve perceived latency.
  • Aggressive caching on the slowest steps.
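
As one possible shape for the parallelisation point, the sketch below runs several independent retrieval paths concurrently with `asyncio` and caps each one with a timeout. `retrieve_parallel` and the half-second default are assumptions; the retrievers are assumed to be async callables that take the query and return a list.

```python
import asyncio


async def retrieve_parallel(query, retrievers, timeout_s=0.5):
    """Run all retrievers concurrently; a retriever that times out yields []."""

    async def guarded(retriever):
        try:
            return await asyncio.wait_for(retriever(query), timeout=timeout_s)
        except asyncio.TimeoutError:
            return []  # degrade gracefully rather than blocking the whole request

    # gather preserves the order of `retrievers` in its result list
    return await asyncio.gather(*(guarded(r) for r in retrievers))
```

With this shape the total retrieval time is bounded by the slowest path (or the timeout), not by the sum of all paths.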

5. No observability

Log query + retrieved chunks + generated answer. Weekly review. Without data, iteration is blind. Tools like Langfuse or Arize Phoenix enable full pipeline traceability.
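
If a dedicated tool is not yet in place, a structured log line per request already covers the minimum. The sketch below writes one JSON object per line; the field names and the `build_trace` / `log_trace` helpers are assumptions, and chunks are assumed to be dictionaries with `id` and `score` keys.

```python
import json
import time
import uuid


def build_trace(query, chunks, answer, latency_ms):
    """Collect what is needed to replay and review one RAG request."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        # Log ids and scores only; chunk text can be looked up by id when reviewing
        "retrieved": [{"id": c["id"], "score": c["score"]} for c in chunks],
        "answer": answer,
        "latency_ms": round(latency_ms, 1),
    }


def log_trace(trace, sink):
    """Append the trace as one JSON line to any file-like object."""
    sink.write(json.dumps(trace, ensure_ascii=False) + "\n")
```

JSON lines are easy to load into a notebook for the weekly review and to migrate later into a tracing tool.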

Typical production stack architecture

[User query]
    ↓
[Query analysis + transformation (HyDE, expansion)]
    ↓
[Hybrid search] → [Vector DB + BM25]
    ↓
[Re-ranker (cross-encoder)]
    ↓
[LLM with top-10 context + forced citation]
    ↓
[Response]

Cache at each step where applicable. For the agent orchestration component that queries the RAG, see CrewAI: agent teams.

Vector database choice

| Database | When to choose it |
|---|---|
| pgvector | If already using Postgres; fits most cases |
| Qdrant | Purpose-built, high performance, hybrid support |
| Weaviate | Native hybrid, more features out-of-the-box |
| Pinecone | Managed serverless, simple, pricier |
| Elasticsearch / OpenSearch | Mature hybrid, already in the stack |

Practical decision: pgvector by default; specialised vector database if scale requires it or native hybrid search is a hard requirement.

Streaming responses

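```python
# Assumes `retriever` and `llm` are already configured (LangChain-style objects)
def answer_stream(query):
    # Retrieve first; generation cannot start without context
    chunks = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in chunks)
    prompt_with_context = f"Context:\n{context}\n\nQuestion: {query}"

    # Stream the LLM response piece by piece as it is generated
    for token in llm.stream(prompt_with_context):
        yield token
```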

Perceived latency improves significantly with streaming — the user sees the first word before the complete answer is ready.

Production monitoring

Metrics to track:

  • Retrieval recall: share of queries for which at least one relevant document was retrieved (requires ground truth).
  • Answer accuracy: via Ragas on a sample.
  • p50/p95/p99 latency of the full pipeline.
  • Token usage per query.
  • Cost per query.
  • User satisfaction: thumbs up/down where applicable.

Keep a continuous dashboard to detect drift. RAG quality is not stable: corpus documents change, user queries evolve, and models are updated.
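
The latency percentiles are straightforward to compute from logged request timings with the standard library. This is a minimal sketch; `latency_percentiles` is an illustrative helper, and it needs at least two samples.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 of a list of request latencies in milliseconds."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking p95 and p99 alongside p50 matters because re-ranking and LLM calls tend to produce a long tail that the median hides.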

Iteration: RAG is not “deploy and forget”

  • Add new documents continuously with incremental re-indexing.
  • Re-evaluate periodically against the golden dataset.
  • Update chunking strategy based on the most common failures.
  • Swap components (LLM, embedding model, re-ranker) as better options emerge — see OpenAI text-embedding-3 for the impact on retrieval quality.

Conclusion

Production RAG is an engineering discipline, not magic. The patterns are known; the execution varies. Successful teams invest in evaluation, observability, and iteration from the start. Failing teams deploy “basic RAG” and expect perfection. For new projects: applying these patterns from day one is cheaper than rearchitecting. For existing systems: audit against this list and prioritise improvements by expected impact.

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.