RAG in Production: Patterns That Work and Those That Don’t
Table of contents
- Key takeaways
- Chunking: not simple
- Hybrid search: BM25 + vector
- Re-ranking: from 100 candidates to 10 relevant ones
- Query transformation: attacking the root of bad retrieval
- Metadata filtering: pre-filter the search space
- Continuous evaluation: without this, everything else is placebo
- Most common anti-patterns
  - 1. Context stuffing
  - 2. Embedding staleness
  - 3. No caching
  - 4. Ignoring latency
  - 5. No observability
- Typical production stack architecture
- Vector database choice
- Streaming responses
- Production monitoring
- Iteration: RAG is not “deploy and forget”
- Conclusion
Updated: 2026-05-03
After two years of RAG in production, the patterns that separate successful from disappointing deployments are clear. This article compiles the lessons — working techniques and anti-patterns to avoid — for teams that already have a basic RAG running and want to scale it reliably.
Key takeaways
- Naive 500-token fixed-split chunking is the most common error and the one with the highest impact on retrieval quality.
- Hybrid search (BM25 + vector) significantly improves recall versus vector-only — especially for proper nouns, codes, and exact terms.
- A re-ranking pipeline reduces the context sent to the LLM from top-100 to top-10 with a 15-30% precision improvement.
- Without a golden dataset and continuous evaluation metrics, improvements are placebos — you cannot optimise what you don’t measure.
- Production RAG is an engineering discipline, not magic: the patterns are known, the execution varies.
Chunking: not simple
Anti-pattern: naive chunking with fixed 500-token splits, ignoring document structure.
Working patterns:
- Semantic chunking: split by semantic boundaries, not token count.
- Hierarchical chunks: document → sections → paragraphs, enabling retrieval at multiple granularity levels.
- Strategic overlap: 10-20% overlap between chunks to preserve context at boundaries.
- Metadata-rich chunks: attach tags, author, date, and section to each chunk to enable effective filtering.
Libraries: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter, custom implementations depending on document type.
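As a rough illustration of structure-aware chunking with overlap, the sketch below packs whole paragraphs into chunks and carries the tail of each chunk into the next one. It is a minimal example, not the implementation of any library named above: `chunk_paragraphs`, `max_words` and `overlap_ratio` are hypothetical names, and it counts whitespace-separated words rather than model tokens.

```python
def chunk_paragraphs(text: str, max_words: int = 200, overlap_ratio: float = 0.15) -> list[str]:
    """Pack whole paragraphs into chunks and carry a tail of words into the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for para in paragraphs:
        words = para.split()
        # Close the current chunk if this paragraph would push it over the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            overlap = int(len(current) * overlap_ratio)
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap else []
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

A paragraph longer than `max_words` stays whole in this sketch; a production splitter would need a fallback (for example, sentence-level splitting) for that case.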
Hybrid search: BM25 + vector
Vector-only search fails on exact-match cases: product SKUs, proper nouns, error codes, infrequent technical terms. The combination:
- BM25 (keyword, exact match) + vector (semantic).
- Reciprocal Rank Fusion (RRF): merging results from both paths.
- Result: significantly better recall than either path alone.
Elasticsearch, OpenSearch, Weaviate and Qdrant offer native hybrid search. For vector database selection context, see pgvector: maturity in 2024.
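Reciprocal Rank Fusion itself is small enough to sketch in a few lines. The example below assumes each retrieval path returns an ordered list of document IDs; the function name and the sample IDs are illustrative, and `k = 60` is the constant commonly used in the RRF literature rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the keyword path and the vector path.
bm25_hits = ["sku-4411", "doc-a", "doc-b"]
vector_hits = ["doc-a", "doc-c", "sku-4411"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists accumulate score from each, so they rise above documents that only one path found. RRF uses ranks only, which sidesteps the problem that BM25 scores and cosine similarities are not on the same scale.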
Re-ranking: from 100 candidates to 10 relevant ones
The standard pipeline in mature production:
- Retrieval: top-100 with embeddings (high recall, moderate precision).
- Re-ranking: cross-encoder (Cohere Rerank, BGE-reranker) → top-10 (high precision).
- LLM generation with the top-10 in context.
Typical precision improvement: 15-30%. Re-ranking adds latency (50-150ms), but the trade-off is worth it for critical responses.
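The second stage can be sketched as a function that rescores every candidate against the query and keeps the best few. In the example below `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the code runs without a model; in a real pipeline `score_fn` would wrap a cross-encoder such as the ones named above.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 10,
) -> list[str]:
    """Score each (query, candidate) pair and keep the top_n highest-scoring candidates."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


candidates = [
    "shipping times for europe",
    "how to reset your password",
    "password policy and rotation rules",
]
top = rerank("reset password", candidates, overlap_score, top_n=2)
```

Because the scorer sees the query and the document together, it can be far more precise than comparing two independently computed embeddings, which is also why it is only affordable on a short candidate list.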
Query transformation: attacking the root of bad retrieval
Poorly formulated user queries produce bad retrieval. Techniques:
- HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, and search with it — the answer embedding is closer to relevant document embeddings than the question embedding.
- Query expansion: add synonyms and related terms.
- Sub-queries: decompose complex questions into multiple sub-searches.
- Query classification: route to different indexes or strategies depending on question type.
The LLM can do all these transformations in a single pre-retrieval step.
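HyDE in particular is easy to express as a thin wrapper around three calls. The sketch below assumes you already have functions for LLM generation, embedding, and vector search; `hyde_search` and the three callables are hypothetical names, and the `fake_*` functions exist only so the example runs without external services.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_search(
    question: str,
    generate: Callable[[str], str],              # hypothetical: wraps your LLM call
    embed: Callable[[str], Vector],              # hypothetical: wraps your embedding model
    search: Callable[[Vector, int], list[str]],  # hypothetical: wraps your vector index
    k: int = 5,
) -> list[str]:
    """Embed a hypothetical answer instead of the raw question, then search with it."""
    prompt = f"Write a short passage that answers the question: {question}"
    hypothetical_answer = generate(prompt)
    return search(embed(hypothetical_answer), k)


# Toy stand-ins so the sketch runs end to end.
def fake_generate(prompt: str) -> str:
    return "The basic plan costs a fixed monthly fee."


def fake_embed(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


def fake_search(vector: Vector, k: int) -> list[str]:
    return ["pricing.md#basic-plan", "pricing.md#faq"][:k]


hits = hyde_search("how much is basic?", fake_generate, fake_embed, fake_search, k=1)
```

The extra LLM call adds latency and cost to every query, so it is worth measuring HyDE against plain question embedding on your own evaluation set before enabling it everywhere.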
Metadata filtering: pre-filter the search space
```python
results = vectorstore.similarity_search(
    query,
    filter={"department": "engineering", "date": {"$gt": "2024-01-01"}},
)
```
Filtering before retrieval is faster and more relevant than searching the full index and filtering after. Well-designed metadata at chunking time is the foundation: you cannot filter by what was not indexed.
Continuous evaluation: without this, everything else is placebo
Golden dataset: 100-500 manually curated query-answer pairs, representative of real use cases.
Ragas metrics:
- Faithfulness: is the answer supported by the retrieved context?
- Answer relevance: does the answer address the question?
- Context precision: are the retrieved chunks relevant?
Process:
- Evaluate against golden dataset before any change.
- A/B testing when changing components (embedding model, chunking strategy, re-ranker).
- Periodic human review of a production answer sample.
- Continuous metrics dashboard — retrieval quality drift is silent.
Without evaluation, improvements are placebos. Many teams that report “improving the RAG” have no baseline metrics to compare against.
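A baseline can start very small. The sketch below computes a retrieval hit rate over a golden dataset and compares two retrievers; it assumes each golden entry also lists the IDs of the documents that should be retrieved, which goes beyond plain query-answer pairs. All names and sample data are hypothetical, and this covers only the retrieval side, not the Ragas answer metrics.

```python
from typing import Callable


def hit_rate_at_k(
    golden: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 10,
) -> float:
    """Share of golden queries with at least one relevant document in the top k."""
    if not golden:
        return 0.0
    hits = sum(
        1
        for item in golden
        if set(retrieve(item["query"])[:k]) & set(item["relevant_ids"])
    )
    return hits / len(golden)


# Hypothetical golden entries: query plus the IDs a good retriever should return.
golden = [
    {"query": "reset password", "relevant_ids": {"kb-12"}},
    {"query": "basic plan price", "relevant_ids": {"kb-40", "kb-41"}},
]


def baseline_retrieve(query: str) -> list[str]:
    return {"reset password": ["kb-12", "kb-99"]}.get(query, ["kb-07"])


def candidate_retrieve(query: str) -> list[str]:
    return {
        "reset password": ["kb-12"],
        "basic plan price": ["kb-03", "kb-41"],
    }.get(query, [])


baseline = hit_rate_at_k(golden, baseline_retrieve, k=2)
candidate = hit_rate_at_k(golden, candidate_retrieve, k=2)
```

Running the same function before and after a change gives the two numbers an A/B comparison needs; a change ships only if the candidate score does not regress.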
Most common anti-patterns
1. Context stuffing
Sending 20 chunks to the LLM invites the “lost in the middle” effect: the model tends to overlook relevant information placed in the middle of a long context. Better: send only the top 5 most relevant chunks and require explicit citations.
2. Embedding staleness
Changing the embedding model without re-indexing the corpus produces silent inconsistencies. Better: full re-index or strict model consistency per index version.
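One way to enforce model consistency is to record the embedding model alongside each index version and check it at query time. The sketch below is an assumption about how that could look; `IndexVersion`, `check_embedding_model`, and the model names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexVersion:
    name: str
    embedding_model: str  # model used to embed the corpus for this index


def check_embedding_model(index: IndexVersion, query_model: str) -> None:
    """Fail loudly if queries would be embedded with a different model than the corpus."""
    if index.embedding_model != query_model:
        raise ValueError(
            f"index '{index.name}' was built with '{index.embedding_model}', "
            f"but queries are embedded with '{query_model}'; re-index first"
        )


index_v1 = IndexVersion(name="docs-v1", embedding_model="embedding-model-a")
check_embedding_model(index_v1, "embedding-model-a")  # matching models pass silently
```

Turning the silent inconsistency into a startup or request-time error is the point: vectors from different models live in different spaces, so similarity scores between them are meaningless even when nothing crashes.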
3. No caching
Without a cache, each repeated query incurs the same embedding and LLM cost again:
- Embedding cache: Redis by query text — the embedding of “how much does the basic plan cost?” is always the same.
- Result cache: by query + active filters.
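The embedding cache can be sketched as a lookup keyed by a hash of the query text. The example below uses an in-memory dictionary as a stand-in for Redis so that it runs on its own; `EmbeddingCache` is a hypothetical name, and normalising case and whitespace before hashing is a design assumption that may not suit case-sensitive corpora.

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """Cache embeddings by a hash of the normalised query text."""

    def __init__(self, embed: Callable[[str], Sequence[float]]) -> None:
        self._embed = embed  # hypothetical: wraps your embedding model
        self._store: dict[str, Sequence[float]] = {}  # stand-in for Redis
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace and lowercase so trivial variants share one entry.
        normalised = " ".join(text.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Sequence[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]
```

A result cache follows the same shape, with the active filters serialised into the key next to the query text. Either cache should be invalidated when the embedding model or the index version changes.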
4. Ignoring latency
The production target is <2s total. Retrieve + re-rank + LLM can easily exceed it:
- Parallelise retrieval with an LLM context prefetch.
- Stream the LLM response to improve perceived latency.
- Aggressive caching on the slowest steps.
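One place where parallelism is straightforward is the two retrieval paths of a hybrid search. The sketch below runs a keyword search and a vector search concurrently with `asyncio.gather`; both search functions are hypothetical stubs that simulate a network round trip with `asyncio.sleep`.

```python
import asyncio


async def bm25_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulated network round trip
    return ["doc-kw-1", "doc-kw-2"]


async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulated network round trip
    return ["doc-vec-1", "doc-kw-1"]


async def retrieve_parallel(query: str) -> tuple[list[str], list[str]]:
    # Both searches are in flight at once, so the wait is roughly that of the
    # slower one rather than the sum of the two.
    bm25_hits, vector_hits = await asyncio.gather(
        bm25_search(query),
        vector_search(query),
    )
    return bm25_hits, vector_hits


bm25_hits, vector_hits = asyncio.run(retrieve_parallel("error E1234"))
```

The same pattern extends to any independent steps, such as querying several indexes after query classification. Steps that depend on each other's output, like re-ranking after retrieval, still have to run in sequence.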
5. No observability
Log query + retrieved chunks + generated answer. Weekly review. Without data, iteration is blind. Tools like Langfuse or Arize Phoenix enable full pipeline traceability.
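Before adopting a dedicated tool, the minimum useful trace is one structured record per request. The sketch below builds such a record as a JSON line; the function name and field names are assumptions for illustration, not the schema of Langfuse or Arize Phoenix.

```python
import json
import time
import uuid


def build_trace(query: str, chunk_ids: list[str], answer: str, latency_ms: float) -> str:
    """Serialise one RAG request as a JSON line suitable for any log sink."""
    record = {
        "trace_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "query": query,
        "retrieved_chunk_ids": chunk_ids,
        "answer": answer,
        "latency_ms": latency_ms,
    }
    return json.dumps(record, ensure_ascii=False)


line = build_trace(
    "how do I rotate keys?",
    ["sec-3", "sec-9"],
    "Use the admin console.",
    840.0,
)
```

Logging chunk IDs rather than full chunk text keeps records small, as long as the IDs can be resolved against the index version that served the request.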
Typical production stack architecture
```
[User query]
      ↓
[Query analysis + transformation (HyDE, expansion)]
      ↓
[Hybrid search] → [Vector DB + BM25]
      ↓
[Re-ranker (cross-encoder)]
      ↓
[LLM with top-10 context + forced citation]
      ↓
[Response]
```
Cache at each step where applicable. For the agent orchestration component that queries the RAG, see CrewAI: agent teams.
Vector database choice
| Database | When to choose it |
|---|---|
| pgvector | If already using Postgres — most cases |
| Qdrant | Purpose-built, high performance, hybrid support |
| Weaviate | Native hybrid, more features out-of-the-box |
| Pinecone | Managed serverless, simple, pricier |
| Elasticsearch / OpenSearch | Mature hybrid, already in the stack |
Practical decision: pgvector by default; specialised vector database if scale requires it or native hybrid search is a hard requirement.
Streaming responses
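```python
def stream_answer(query):
    # `retriever` and `llm` are assumed to be pre-configured LangChain-style objects.
    # Retrieve first (newer LangChain versions expose this as retriever.invoke(query)).
    chunks = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in chunks)
    prompt_with_context = f"Context:\n{context}\n\nQuestion: {query}"
    # Stream the LLM response token by token
    for token in llm.stream(prompt_with_context):
        yield token
```
Perceived latency improves significantly with streaming: the user sees the first word before the complete answer is ready.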
Production monitoring
Metrics to track:
- Retrieval recall: % of queries that retrieved a relevant document (requires ground truth).
- Answer accuracy: via Ragas on a sample.
- p50/p95/p99 latency of the full pipeline.
- Token usage per query.
- Cost per query.
- User satisfaction: thumbs up/down where applicable.
Continuous dashboard to detect drift. RAG quality is not stable — corpus documents change, user queries evolve, models update.
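The latency percentiles can be computed from raw per-request timings with the standard library alone. The sketch below assumes latencies are collected in milliseconds; `latency_percentiles` is a hypothetical helper and the sample data is synthetic.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95 and p99 from a list of per-request latencies."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic sample: 100 requests taking 1 ms to 100 ms.
sample = [float(ms) for ms in range(1, 101)]
stats = latency_percentiles(sample)
```

Tracking p95 and p99 alongside the median matters because an average can look healthy while a slow tail of requests blows past the latency target.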
Iteration: RAG is not “deploy and forget”
- Add new documents continuously with incremental re-indexing.
- Re-evaluate periodically against the golden dataset.
- Update chunking strategy based on the most common failures.
- Swap components (LLM, embedding model, re-ranker) as better options emerge — see OpenAI text-embedding-3 for the impact on retrieval quality.
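Incremental re-indexing needs a way to tell which documents changed since the last run. A minimal sketch, assuming the index stores a content hash per document ID: compare current hashes against stored ones and re-embed only the differences. `plan_reindex` and the sample documents are hypothetical.

```python
import hashlib


def plan_reindex(
    documents: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Return (ids to embed and upsert, ids to delete, new hash table)."""
    current_hashes: dict[str, str] = {}
    to_upsert: list[str] = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        current_hashes[doc_id] = digest
        # New document, or content changed since the last indexing run.
        if indexed_hashes.get(doc_id) != digest:
            to_upsert.append(doc_id)
    # Documents that were indexed before but no longer exist in the corpus.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_upsert, to_delete, current_hashes


previous = {
    "faq": hashlib.sha256("old faq".encode("utf-8")).hexdigest(),
    "pricing": hashlib.sha256("pricing v1".encode("utf-8")).hexdigest(),
}
docs = {"faq": "new faq", "pricing": "pricing v1", "guide": "getting started"}
to_upsert, to_delete, new_hashes = plan_reindex(docs, previous)
```

This only covers content changes; a change of embedding model or chunking strategy still calls for a full re-index under a new index version.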
Conclusion
Production RAG is an engineering discipline, not magic. The patterns are known; the execution varies. Successful teams invest in evaluation, observability, and iteration from the start. Failing teams deploy “basic RAG” and expect perfection. For new projects: applying these patterns from day one is cheaper than rearchitecting. For existing systems: audit against this list and prioritise improvements by expected impact.