RAG in Production: Patterns That Work and Those That Don’t

September 26, 2024

Table of contents

Key takeaways
Chunking: not simple
Hybrid search: BM25 + vector
Re-ranking: from 100 candidates to 10 relevant ones
Query transformation: attacking the root of bad retrieval
Metadata filtering: pre-filter the search space
Continuous evaluation: without this, everything else is placebo
Most common anti-patterns
1. Context stuffing
2. Embedding staleness
3. No caching
4. Ignoring latency
5. No observability
Typical production stack architecture
Vector database choice
Streaming responses
Production monitoring
Iteration: RAG is not “deploy and forget”
Conclusion

Updated: 2026-05-03

After two years of RAG in production, the patterns that separate successful from disappointing deployments are clear. This article compiles the lessons — working techniques and anti-patterns to avoid — for teams that already have a basic RAG running and want to scale it reliably.

Key takeaways

Naive 500-token fixed-split chunking is the most common error and the one with the highest impact on retrieval quality.
Hybrid search (BM25 + vector) significantly improves recall versus vector-only — especially for proper nouns, codes, and exact terms.
A re-ranking pipeline reduces the context sent to the LLM from top-100 to top-10 with a 15-30% precision improvement.
Without a golden dataset and continuous evaluation metrics, improvements are placebos — you cannot optimise what you don’t measure.
Production RAG is an engineering discipline, not magic: the patterns are known, the execution varies.

Chunking: not simple

Anti-pattern: naive chunking with fixed 500-token splits, ignoring document structure.

Working patterns:

Semantic chunking: split by semantic boundaries, not token count.
Hierarchical chunks: document → sections → paragraphs, enabling retrieval at multiple granularity levels.
Strategic overlap: 10-20% overlap between chunks to preserve context at boundaries.
Metadata-rich: tags, author, date, section per chunk — enables effective filtering.

Libraries: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter, custom implementations depending on document type.
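As a minimal sketch of structure-aware chunking with overlap, the function below groups whole paragraphs into chunks and carries the tail of each chunk into the next. The name `chunk_paragraphs` and its parameters are hypothetical, and word counts stand in for token counts, which a real pipeline would measure with the embedding model's tokenizer.

```python
def chunk_paragraphs(text, max_words=200, overlap_words=30):
    """Group paragraphs into chunks of roughly max_words, with a word overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Close the current chunk if this paragraph would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the previous one (the overlap).
            current = current[-overlap_words:] if overlap_words > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "a b c\n\nd e f\n\ng h i"
print(chunk_paragraphs(sample, max_words=6, overlap_words=2))
```

Paragraphs are never split here, so a single paragraph longer than `max_words` becomes one oversized chunk; handling that case (for example by falling back to sentence splits) is left out of the sketch.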

Hybrid search: BM25 + vector

Vector-only search fails on exact-match cases: product SKUs, proper nouns, error codes, infrequent technical terms. The combination:

BM25 (keyword, exact match) + vector (semantic).
Reciprocal Rank Fusion (RRF): merging results from both paths.
Result: significantly better recall than either path alone.
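Reciprocal Rank Fusion can be sketched in a few lines: each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The function name and the sample document IDs are illustrative; k = 60 is the constant commonly used with RRF, not a value taken from this article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked high in any list gets a larger contribution.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_hits = ["sku-123", "doc-a", "doc-b"]    # keyword path
vector_hits = ["doc-a", "doc-c", "sku-123"]  # semantic path
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.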

Elasticsearch, OpenSearch, Weaviate and Qdrant offer native hybrid search. For vector database selection context, see pgvector: maturity in 2024.

Re-ranking: from 100 candidates to 10 relevant ones

The standard pipeline in mature production:

Retrieval: top-100 with embeddings (high recall, moderate precision).
Re-ranking: cross-encoder (Cohere Rerank, BGE-reranker) → top-10 (high precision).
LLM generation with the top-10 in context.

Typical precision improvement: 15-30%. Re-ranking adds latency (50-150ms), but the trade-off is worth it for critical responses.
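The two-stage shape of this pipeline might look like the sketch below. `retrieve_then_rerank` is a hypothetical helper, and `word_overlap` is only a toy stand-in for a cross-encoder so the example runs without a model; in practice `score_pair` would call the re-ranker.

```python
def retrieve_then_rerank(query, candidates, score_pair, top_k=10):
    """Re-score first-stage candidates with a (query, text) scorer, keep top_k."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def word_overlap(query, text):
    # Toy scorer: fraction of query words that appear in the candidate text.
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(text.lower().split())) / len(query_words)


candidates = [
    "reset your password in settings",
    "billing plans overview",
    "password reset email not arriving",
]
top = retrieve_then_rerank("password reset email", candidates, word_overlap, top_k=2)
print(top)
```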

Query transformation: attacking the root of bad retrieval

Poorly formulated user queries produce bad retrieval. Techniques:

HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, and search with it — the answer embedding is closer to relevant document embeddings than the question embedding.
Query expansion: add synonyms and related terms.
Sub-queries: decompose complex questions into multiple sub-searches.
Query classification: route to different indexes or strategies depending on question type.

The LLM can do all these transformations in a single pre-retrieval step.
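A rough sketch of HyDE, assuming the caller supplies a `generate` function (an LLM call) and an `embed` function (an embedding call); both are placeholders, as is the in-memory `index` of `(doc_id, vector)` pairs that stands in for a vector database.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(question, generate, embed, index, top_k=3):
    """Search with the embedding of a generated hypothetical answer."""
    hypothetical = generate(question)   # LLM drafts a plausible answer
    query_vec = embed(hypothetical)     # embed the draft, not the question
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```

The hypothetical answer may be factually wrong; it is used only as a search probe and is never shown to the user.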

Metadata filtering: pre-filter the search space

python

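# Restrict the search space by metadata, then rank what remains by similarity.
# Filter operators such as $gt depend on the vector store in use.
results = vectorstore.similarity_search(
    query,
    filter={"department": "engineering", "date": {"$gt": "2024-01-01"}}
)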

Filtering before retrieval is typically faster and returns more relevant results than searching the full index and filtering afterwards. Well-designed metadata at chunking time is the foundation: you cannot filter by what was not indexed.
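To illustrate why the order matters, here is a small in-memory comparison. All names and the sample data are hypothetical, and a dot product stands in for the vector store's similarity function. When the nearest neighbours all belong to another department, filtering after a top-k search returns nothing, while filtering first still fills the result set.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matches(chunk, where):
    return all(chunk["metadata"].get(key) == value for key, value in where.items())


def top_k(chunks, query_vec, k):
    return sorted(chunks, key=lambda c: dot(query_vec, c["vector"]), reverse=True)[:k]


def post_filter_search(chunks, query_vec, where, k):
    # Search everything, then discard non-matching hits.
    return [c for c in top_k(chunks, query_vec, k) if matches(c, where)]


def pre_filter_search(chunks, query_vec, where, k):
    # Shrink the search space first, then rank what is left.
    return top_k([c for c in chunks if matches(c, where)], query_vec, k)


chunks = [
    {"id": "hr-1", "vector": [0.9, 0.1], "metadata": {"department": "hr"}},
    {"id": "hr-2", "vector": [0.8, 0.2], "metadata": {"department": "hr"}},
    {"id": "eng-1", "vector": [0.7, 0.3], "metadata": {"department": "engineering"}},
    {"id": "eng-2", "vector": [0.6, 0.4], "metadata": {"department": "engineering"}},
]
where = {"department": "engineering"}
post = post_filter_search(chunks, [1.0, 0.0], where, k=2)
pre = pre_filter_search(chunks, [1.0, 0.0], where, k=2)
print(len(post), [c["id"] for c in pre])
```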

Continuous evaluation: without this, everything else is placebo

Golden dataset: 100-500 manually curated query-answer pairs, representative of real use cases.

Ragas metrics:

Faithfulness: is the answer supported by the retrieved context?
Answer relevance: does the answer address the question?
Context precision: are the retrieved chunks relevant?

Process:

Evaluate against golden dataset before any change.
A/B testing when changing components (embedding model, chunking strategy, re-ranker).
Periodic human review of a production answer sample.
Continuous metrics dashboard — retrieval quality drift is silent.

Without evaluation, improvements are placebos. Many teams that report “improving the RAG” have no baseline metrics to compare against.
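A baseline for the retrieval side can be as simple as the sketch below. It assumes each golden example also lists the IDs of the chunks that should be retrieved (the `relevant_ids` field is an assumption, not part of the query-answer pairs described above) and that `retrieve` returns ranked chunk IDs. It does not replace Ragas; it only measures whether the right chunks come back.

```python
def evaluate_retrieval(golden, retrieve, k=10):
    """Compute hit rate and mean reciprocal rank over a golden dataset."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits, reciprocal_ranks = 0, []
    for example in golden:
        retrieved = retrieve(example["query"])[:k]
        relevant = set(example["relevant_ids"])
        # Rank (1-based) of the first relevant chunk, or None if none was found.
        rank = next(
            (i for i, doc_id in enumerate(retrieved, start=1) if doc_id in relevant),
            None,
        )
        hits += rank is not None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(golden)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}
```

Running this before and after each change to chunking, embeddings or the re-ranker gives the comparison point that the process above relies on.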

Most common anti-patterns

1. Context stuffing

Sending 20 chunks to the LLM invites the “lost in the middle” effect: models tend to overlook relevant information placed in the middle of a long context. Better: send only the top-5 most relevant chunks and force explicit citation.
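One way to cap the context and request citations is a prompt builder along these lines. The function name, the wording of the instruction and the `[n]` citation format are illustrative choices, not a prescribed template.

```python
def build_cited_prompt(question, chunks, max_chunks=5):
    """Build a prompt from the best-ranked chunks and ask for numbered citations."""
    selected = chunks[:max_chunks]  # chunks are assumed to be sorted by relevance
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(selected, start=1))
    return (
        "Answer using only the sources below. Cite the source number, "
        "for example [1], after every claim. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

Numbered sources also make the answer auditable: a reviewer can check each cited chunk against the claim it supports.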

2. Embedding staleness

Changing the embedding model without re-indexing the corpus produces silent inconsistencies. Better: full re-index or strict model consistency per index version.

3. No caching

Identical queries that repeat incur the same embedding and LLM cost every time unless they are cached:

Embedding cache: Redis keyed by query text, since the embedding of “how much does the basic plan cost?” is always the same for a given model.
Result cache: keyed by query + active filters.
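A minimal embedding cache could look like this. A plain dict stands in for Redis, `embed` is whatever embedding call the pipeline uses, and the lowercase and whitespace normalisation is an assumption that may not suit case-sensitive corpora. Including the model name in the key keeps vectors from different embedding models from being mixed.

```python
import hashlib


class EmbeddingCache:
    def __init__(self, embed, model_name):
        self._embed = embed
        self._model_name = model_name
        self._store = {}   # stand-in for Redis
        self.misses = 0

    def get(self, text):
        # Normalise so trivially different spellings share one cache entry.
        normalized = " ".join(text.lower().split())
        raw_key = f"{self._model_name}:{normalized}".encode("utf-8")
        key = hashlib.sha256(raw_key).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(normalized)
        return self._store[key]
```

A result cache follows the same pattern, with the active filters serialised into the key alongside the query.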

4. Ignoring latency

A common production target is under 2 s end to end. Retrieve + re-rank + LLM can easily exceed it:

Parallelise retrieval with an LLM context prefetch.
Stream the LLM response to improve perceived latency.
Aggressive caching on the slowest steps.
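As one example of parallelising, the two paths of a hybrid search are independent and can run concurrently under a shared time budget. The sketch assumes `keyword_search` and `vector_search` are async callables supplied by the caller; the 0.5 s timeout is an arbitrary illustration.

```python
import asyncio


async def hybrid_retrieve(query, keyword_search, vector_search, timeout_s=0.5):
    """Run both retrieval paths concurrently, bounded by one shared timeout."""
    # gather() starts both coroutines at once; wait_for() raises a timeout
    # error (and cancels both) if they do not finish within timeout_s.
    keyword_hits, vector_hits = await asyncio.wait_for(
        asyncio.gather(keyword_search(query), vector_search(query)),
        timeout=timeout_s,
    )
    return keyword_hits, vector_hits
```

With this shape the retrieval step costs roughly the slower of the two searches rather than their sum.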

5. No observability

Log query + retrieved chunks + generated answer. Weekly review. Without data, iteration is blind. Tools like Langfuse or Arize Phoenix enable full pipeline traceability.
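Where a dedicated tracing tool is not yet in place, a structured log line per request already covers the basics. The field names below are hypothetical, and each chunk is assumed to be a dict with `id` and `score` keys.

```python
import json
import time
import uuid


def build_trace(query, chunks, answer, latency_ms):
    """Assemble one JSON-serialisable record per RAG request."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "chunk_ids": [c["id"] for c in chunks],
        "chunk_scores": [c["score"] for c in chunks],
        "answer": answer,
        "latency_ms": latency_ms,
    }


trace = build_trace("how do I reset my password?", [{"id": "c1", "score": 0.9}], "Go to settings.", 120.0)
print(json.dumps(trace))
```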

Typical production stack architecture

[User query]
    ↓
[Query analysis + transformation (HyDE, expansion)]
    ↓
[Hybrid search] → [Vector DB + BM25]
    ↓
[Re-ranker (cross-encoder)]
    ↓
[LLM with top-10 context + forced citation]
    ↓
[Response]

Cache at each step where applicable. For the agent orchestration component that queries the RAG, see CrewAI: agent teams.

Vector database choice

Database	When to choose it
pgvector	If already using Postgres — most cases
Qdrant	Purpose-built, high performance, hybrid support
Weaviate	Native hybrid, more features out-of-the-box
Pinecone	Managed serverless, simple, pricier
Elasticsearch / OpenSearch	Mature hybrid, already in the stack

Practical decision: pgvector by default; specialised vector database if scale requires it or native hybrid search is a hard requirement.

Streaming responses

python

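def stream_answer(query):
    # Retrieve first (LangChain-style retriever assumed, returning Documents)
    chunks = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in chunks)
    prompt_with_context = f"Context:\n{context}\n\nQuestion: {query}"

    # Stream the LLM response as it is generated
    for token in llm.stream(prompt_with_context):
        yield token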

Perceived latency improves significantly with streaming — the user sees the first word before the complete answer is ready.

Production monitoring

Metrics to track:

Retrieval recall: % of queries for which at least one relevant document was retrieved (strictly a hit rate; requires ground truth).
Answer accuracy: via Ragas on a sample.
p50/p95/p99 latency of the full pipeline.
Token usage per query.
Cost per query.
User satisfaction: thumbs up/down where applicable.

Continuous dashboard to detect drift. RAG quality is not stable — corpus documents change, user queries evolve, models update.
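For the latency percentiles, the standard library is enough for a first dashboard. This sketch assumes the per-request latencies have already been collected into a list of at least two values; the function name is illustrative.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 from a list of request latencies in milliseconds."""
    # n=100 yields 99 cut points; index i holds the (i + 1)th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


report = latency_percentiles(list(range(1, 102)))
print(report)
```

Tracking p95 and p99 alongside the median matters because re-ranking and LLM calls tend to produce a long tail that averages hide.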

Iteration: RAG is not “deploy and forget”

Add new documents continuously with incremental re-indexing.
Re-evaluate periodically against the golden dataset.
Update chunking strategy based on the most common failures.
Swap components (LLM, embedding model, re-ranker) as better options emerge — see OpenAI text-embedding-3 for the impact on retrieval quality.
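Incremental re-indexing is often driven by content hashes, as in the sketch below. It assumes `documents` maps document IDs to their current text and `indexed_hashes` maps IDs to the hash stored at the last indexing run; both structures and the function name are hypothetical.

```python
import hashlib


def plan_reindex(documents, indexed_hashes):
    """Return (ids to embed again, ids to delete) by comparing content hashes."""
    current = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in documents.items()
    }
    # New or changed documents need fresh embeddings.
    to_upsert = sorted(d for d, h in current.items() if indexed_hashes.get(d) != h)
    # Documents that disappeared from the corpus should leave the index.
    to_delete = sorted(d for d in indexed_hashes if d not in current)
    return to_upsert, to_delete
```

This only covers content changes. Swapping the embedding model still calls for a full re-index, as noted under embedding staleness.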

Conclusion

Production RAG is an engineering discipline, not magic. The patterns are known; the execution varies. Successful teams invest in evaluation, observability, and iteration from the start. Failing teams deploy “basic RAG” and expect perfection. For new projects: applying these patterns from day one is cheaper than rearchitecting. For existing systems: audit against this list and prioritise improvements by expected impact.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.