After two years of running RAG in production, the patterns that separate successful deployments from disappointing ones are clear. This article compiles those lessons, working techniques and anti-patterns alike, for teams that have the basics in place and want to scale.
Chunking: Not Simple
Anti-pattern: naive chunking (fixed 500-token splits, regardless of content).
Working patterns:
- Semantic chunking: split by semantic boundaries, not tokens.
- Hierarchical chunks: document → sections → paragraphs.
- Strategic overlap: 10-20% overlap between chunks to preserve context.
- Metadata-rich: tags, author, date per chunk.
Libraries: LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter, or custom code.
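As an illustration of the overlap pattern, here is a minimal sentence-boundary chunker with a configurable overlap ratio. The `chunk_text` name and the whitespace word count are simplifications for this sketch; in production you would count tokens with a real tokenizer (e.g. tiktoken).

```python
import re

def chunk_text(text, max_tokens=500, overlap_ratio=0.15):
    """Split text on sentence boundaries with ~15% token overlap.

    Token counts are approximated by whitespace-separated words here;
    swap in a real tokenizer for production use.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the previous chunk forward as overlap.
            keep, kept = [], 0
            for s in reversed(current):
                kept += len(s.split())
                keep.insert(0, s)
                if kept >= max_tokens * overlap_ratio:
                    break
            current, current_len = keep, kept
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than raw token offsets keeps each chunk self-contained, and the carried-over tail gives the retriever context across chunk borders.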
Hybrid Search
Vector-only search loses exact-match cases (SKUs, error codes, proper names). The fix is a combination:
- BM25 (keyword) + vector (semantic).
- Reciprocal Rank Fusion (RRF): merge results.
- Result: significantly better recall.
Elastic, OpenSearch, Weaviate, Qdrant offer native hybrid.
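RRF itself is small enough to sketch in full. The `reciprocal_rank_fusion` helper below assumes each backend returns a ranked list of document ids (best first); k=60 is the conventional constant from the original RRF paper.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists via RRF: score(d) = sum of 1/(k + rank).

    result_lists: one ranked list of document ids per backend
    (e.g. one from BM25, one from vector search), best first.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["sku-123", "doc-a", "doc-b"]
vector_results = ["doc-a", "doc-c", "sku-123"]
fused = reciprocal_rank_fusion([bm25_results, vector_results])
```

Documents ranked well by both backends ("doc-a", "sku-123") rise above documents seen by only one, which is exactly the behaviour that recovers exact-match hits the vector side misses.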
Re-Ranking
Production patterns:
- Retrieval: top-100 with embeddings.
- Re-rank: cross-encoder (Cohere Rerank, BGE-reranker) → top-10.
- LLM generation with top-10.
Typical gain: 15-30% precision improvement.
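The retrieve-then-rerank step can be sketched as below. The `overlap_scorer` is only a stand-in so the example runs; in production the scorer would call Cohere Rerank or a local BGE-reranker cross-encoder.

```python
def rerank(query, candidates, score_pairs, top_k=10):
    """Re-rank retrieved candidates with a cross-encoder-style scorer.

    score_pairs(query, docs) -> list of relevance floats. In production
    this calls a real re-ranker; here it is any callable.
    """
    scores = score_pairs(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def overlap_scorer(query, docs):
    """Illustrative stand-in: fraction of query terms present in each doc."""
    q = set(query.lower().split())
    return [len(q & set(d.lower().split())) / (len(q) or 1) for d in docs]
```

The shape is the point: score the top-100 candidates jointly with the query, keep the top-10, and only those reach the LLM.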
Query Transformation
Bad user queries → bad retrieval. Techniques:
- HyDE: generate hypothetical answer, embed that, search.
- Query expansion: synonyms, related terms.
- Sub-queries: decompose complex query into multiple.
- Query classification: route per type.
An LLM can perform all of these transformations before retrieval.
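HyDE is the least obvious of these, so here is a sketch. The `llm(prompt) -> str` callable and the `vectorstore.similarity_search(text, k)` method are assumed interfaces; adapt them to your client library.

```python
def hyde_search(query, llm, vectorstore, top_k=5):
    """HyDE: ask the LLM for a hypothetical answer, then search with it.

    A fabricated answer is often closer in embedding space to the real
    answer documents than the raw question is.
    """
    hypothetical = llm(
        f"Write a short passage that would answer this question:\n{query}"
    )
    # Search with the hypothetical passage instead of the raw query.
    return vectorstore.similarity_search(hypothetical, k=top_k)
```

The same wrapper shape works for query expansion or sub-query decomposition: transform first, retrieve second.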
Metadata Filtering
To pre-filter retrieval space:
results = vectorstore.similarity_search(
    query,
    filter={"department": "engineering", "date": {"$gt": "2024-01-01"}}
)
This is faster and more relevant than searching everything and filtering afterwards.
Evaluation
Continuous eval essential:
- Golden dataset: 100-500 curated query-answer pairs.
- Ragas metrics: faithfulness, answer relevancy, context precision.
- Periodic human review.
- A/B testing when changing components.
Without eval, improvements are placebo.
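The golden-dataset check can be as small as a recall loop. `retrieval_recall` below assumes `retrieve(query, k)` returns document ids and each golden pair names one expected document; real suites track several metrics per query.

```python
def retrieval_recall(golden, retrieve, k=5):
    """Fraction of golden queries whose expected doc appears in the top-k.

    golden: list of (query, expected_doc_id) pairs.
    retrieve: callable (query, k) -> list of document ids.
    """
    hits = sum(
        1 for query, expected in golden if expected in retrieve(query, k)
    )
    return hits / len(golden)
```

Run it on every component change; a number that moves is what turns "this feels better" into a real comparison.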
Anti-Patterns
1. Context Stuffing
Throwing 20 chunks at the LLM. Result: “lost in the middle” failures and hallucinations.
Better: pass only the top-5 relevant chunks and force citations.
2. Embedding Staleness
Changing the embedding model without re-embedding the corpus breaks retrieval: query vectors and stored vectors end up in incompatible spaces. Either keep the old model or do a complete reindex; never mix.
3. No Caching
The same queries repeated mean paying the same LLM costs repeatedly. Cache:
- Embedding cache: Redis by query text.
- Result cache: by query + filters.
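A minimal sketch of the embedding cache, keyed by a hash of the query text. The in-memory dict is a stand-in so the example is self-contained; in production the same key scheme would go to Redis, and `embed(text) -> vector` is whatever embedding callable you use.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by query text; dict here, Redis in production."""

    def __init__(self, embed):
        self.embed = embed
        self.store = {}  # swap for a Redis client with the same key scheme

    def get(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.store:
            # Only pay for the embedding call on a cache miss.
            self.store[key] = self.embed(text)
        return self.store[key]
```

The result cache is the same pattern with the filters folded into the key.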
4. Ignoring Latency
Production RAG should target <2s end-to-end; retrieve + rerank + LLM generation can easily exceed that.
Optimisation:
- Parallelise retrieval + LLM prefetch.
- Streaming responses.
- Aggressive cache.
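The parallelisation bullet can be sketched with asyncio: run the retrieval backends concurrently instead of in series. `retrieve_all` assumes each backend is exposed as an async callable, which is a simplification of real client APIs.

```python
import asyncio

async def retrieve_all(query, retrievers):
    """Run several retrieval backends concurrently, then flatten results.

    retrievers: async callables query -> list of docs (e.g. vector search
    and keyword search clients). Total latency is the slowest backend,
    not the sum of all of them.
    """
    results = await asyncio.gather(*(r(query) for r in retrievers))
    return [doc for batch in results for doc in batch]
```

The fused latency win compounds with streaming: retrieval overlaps, and the first tokens reach the user while the rest generate.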
5. No Observability
Log every query, its retrieved chunks, and the answer. Review weekly. Without this data, iteration is blind.
Architecture Pattern
Typical production stack:
[User query]
↓
[Query analysis + transformation]
↓
[Hybrid search] → [vector DB + keyword]
↓
[Re-ranker]
↓
[LLM with context + citations]
↓
[Response]
Cache at each step if applicable.
Vector DB Choice
For production:
- pgvector: if already using Postgres (most teams are).
- Qdrant: purpose-built, great performance.
- Weaviate: native hybrid, more features.
- Pinecone: managed, simple, pricier.
- Elasticsearch/OpenSearch: mature hybrid.
Decision: pgvector default; specialised if scale requires.
Costs
- Embeddings: one-time + updates.
- LLM generation: per query.
- Vector DB: storage + queries.
- Re-ranker: per query.
Optimise: cache hits, smaller models where sufficient, batch processing.
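A back-of-the-envelope helper makes the per-query sum concrete. Component names and any prices used with it are placeholders, not real vendor rates.

```python
def cost_per_query(usage, prices_per_1k):
    """Total per-query cost from token usage and per-1K-token prices.

    usage and prices_per_1k are dicts keyed by component, e.g.
    {"llm_in": ..., "llm_out": ..., "embedding": ..., "rerank": ...}.
    """
    return sum(
        usage[component] / 1000 * prices_per_1k[component]
        for component in usage
    )
```

Multiplying this by expected query volume (and by your cache miss rate) is usually the fastest way to see which optimisation pays off first.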
Streaming Responses
# Retrieve first
chunks = retriever.get_relevant_documents(query)
# Stream the LLM response token by token
for token in llm.stream(prompt_with_context):
    yield token
Significantly better perceived latency.
Production Monitoring
Track:
- Retrieval recall: fraction of queries that retrieved a relevant doc (requires ground truth).
- Answer accuracy: via Ragas.
- p50/p95/p99 latency.
- Token usage per query.
- Cost per query.
- User satisfaction: thumbs up/down.
Keep a continuous dashboard to spot drift.
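The latency percentiles above can be computed from collected samples with a nearest-rank sketch; monitoring systems do this for you, but the definition is worth having in hand.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of the distribution."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

p95 and p99 matter more than the mean here: RAG latency is long-tailed, and the tail is what users complain about.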
Iteration
Production RAG isn’t “deploy + forget”:
- Add new docs continuously.
- Re-evaluate periodically.
- Update chunking strategy based on failures.
- Swap components (LLM, embeddings, re-ranker) as better options emerge.
Conclusion
Production RAG is an engineering discipline, not magic. The patterns are known; execution varies. Successful teams invest in evaluation, observability, and iteration. Failing teams deploy “basic RAG” and expect perfection. For new projects, start with these patterns from day one; it is cheaper than rearchitecting later. For existing systems, audit against this list and prioritise the optimisations.
Follow us on jacar.es for more on RAG patterns, production ML, and AI architectures.