After two years of running RAG in production, the patterns that separate successful deployments from disappointing ones are clear. This article compiles those lessons, both the techniques that work and the anti-patterns to avoid, for teams that already have the basics running and want to scale.
Chunking: not as simple as it looks
Anti-pattern: naive chunking (fixed 500-token splits).
Patterns that work:
- Semantic chunking: split on semantic boundaries, not token counts.
- Hierarchical chunks: document → sections → paragraphs.
- Strategic overlap: 10-20% overlap between chunks to preserve context.
- Metadata-rich chunks: tags, author, date attached to each chunk.
Libraries: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceSplitter, or a custom splitter.
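The overlap idea fits in a few lines. A minimal sketch, assuming sentences are already split; `chunk_sentences`, the size limit, and the overlap window are illustrative choices, not the LangChain or LlamaIndex implementation:

```python
def chunk_sentences(sentences, max_chars=500, overlap_sents=2):
    """Group sentences into chunks, carrying the last `overlap_sents`
    sentences into the next chunk to preserve context across boundaries."""
    chunks, current, size = [], [], 0
    for sent in sentences:
        if current and size + len(sent) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # overlap window
            size = sum(len(s) for s in current)
        current.append(sent)
        size += len(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentences rather than raw character offsets is what keeps a chunk from starting mid-thought; the overlap window gives the retriever a second chance at context that straddles a boundary.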
Hybrid search
Vector search alone misses exact-match cases (SKUs, codes, names). The combination:
- BM25 (keyword) + vector (semantic) search.
- Reciprocal Rank Fusion (RRF) to merge the two result lists.
- Result: significantly better recall.
Elastic, OpenSearch, Weaviate, and Qdrant offer hybrid search natively.
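RRF itself is simple enough to write by hand if your store does not provide it. A minimal sketch, with string doc IDs standing in for real result objects; k=60 is the commonly used constant:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: score each doc as the sum of 1/(k + rank)
    over every ranked list it appears in (rank is 1-based), then sort."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists accumulate score from each, which is why a doc ranked second in both rankings can beat one ranked first in only one of them.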
Re-ranking
The production pattern:
- Retrieval: top-100 candidates via embeddings.
- Re-rank: cross-encoder (Cohere Rerank, BGE-reranker) → top-10.
- LLM generation with the top-10.
Typical precision improvement: 15-30%.
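The two-stage pipeline above can be sketched as follows. `retrieve` and `score_pair` are hypothetical stand-ins for your vector store and cross-encoder; with sentence-transformers, for example, `score_pair` would wrap `CrossEncoder.predict` on (query, text) pairs:

```python
def rerank_pipeline(query, retrieve, score_pair, retrieve_k=100, final_k=10):
    """Stage 1: cheap vector retrieval of top-100 candidates.
    Stage 2: expensive cross-encoder scoring; keep only the top-10."""
    candidates = retrieve(query, k=retrieve_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]
```

The design point is cost asymmetry: the cross-encoder reads query and document together, so it is far more accurate but far too slow to run over the whole corpus; running it over only 100 candidates keeps latency bounded.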
Query transformation
Bad user queries → bad retrieval. Techniques:
- HyDE: generate a hypothetical answer, embed that, and search with it.
- Query expansion: synonyms, related terms.
- Sub-queries: decompose a complex query into several simpler ones.
- Query classification: route by query type.
An LLM can perform all of these before retrieval.
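As a sketch of the sub-query technique: `llm_decompose` is a hypothetical stand-in for an LLM call that splits a complex query, and `retrieve` for your search backend; only the merge-and-deduplicate logic is shown:

```python
def multi_query_retrieve(query, llm_decompose, retrieve, k_per_query=5):
    """Decompose a complex query into sub-queries, retrieve for each,
    and return the deduplicated union, preserving retrieval order."""
    sub_queries = llm_decompose(query) or [query]
    seen, results = set(), []
    for sub in sub_queries:
        for doc in retrieve(sub, k=k_per_query):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results
```

Falling back to `[query]` when decomposition returns nothing keeps simple queries on the fast path.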
Metadata filtering
Use metadata to pre-filter the retrieval space:
results = vectorstore.similarity_search(
    query,
    filter={"department": "engineering", "date": {"$gt": "2024-01-01"}}
)
Faster and more relevant than retrieving everything and filtering afterwards.
Evaluation
Continuous evaluation is essential:
- Golden dataset: 100-500 curated query-answer pairs.
- Ragas metrics: faithfulness, relevance, precision.
- Periodic human review.
- A/B testing whenever you change a component.
Without evaluation, "improvements" are placebo.
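A minimal eval loop over a golden dataset might look like this; the `(query, expected_doc_id)` pair format and the `retrieve` callable are assumptions about your setup:

```python
def retrieval_recall_at_k(golden, retrieve, k=5):
    """Fraction of golden queries whose expected document appears in
    the top-k retrieved results. `golden` is a list of
    (query, expected_doc_id) pairs."""
    hits = 0
    for query, expected in golden:
        if expected in retrieve(query, k=k):
            hits += 1
    return hits / len(golden)
```

Run this before and after every component change: if recall@k did not move, the change to your generator's output is noise, not improvement.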
Anti-patterns
1. Context stuffing
Throwing 20 chunks at the LLM. Result: "lost in the middle", hallucinations.
Better: the top-5 relevant chunks, with citations enforced.
2. Embedding staleness
Changing the embedding model? Vectors from different models are not comparable: either keep the model consistent or reindex the entire corpus. Never mix old and new embeddings in one index.
3. No caching
The same queries repeated = the same LLM costs paid again and again. Cache:
- Embedding cache: Redis, keyed by query text.
- Result cache: keyed by query + filters.
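A sketch of the embedding cache, with an in-memory dict standing in for Redis (in production you would swap `self.store` for a Redis client's get/set):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a hash of the query text."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the real (expensive) embedding call
        self.store = {}           # stand-in for Redis
        self.hits = 0

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        vector = self.embed_fn(text)
        self.store[key] = vector
        return vector
```

Hashing the text keeps keys fixed-length; for the result cache, hash the query together with the serialized filters.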
4. Ignoring latency
Production RAG target: under 2 s end to end. Retrieve + rerank + LLM can easily exceed it.
Optimizations:
- Parallelize retrieval and LLM prefetching.
- Stream responses.
- Cache aggressively.
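The parallelization point can be sketched with threads: run the vector and keyword retrievers concurrently instead of one after the other, so total retrieval latency is roughly that of the slowest leg rather than the sum:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_retrieve(query, retrievers):
    """Run several retrievers (e.g. vector + keyword) concurrently.
    Each retriever is a callable taking the query string."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(r, query) for r in retrievers]
        return [f.result() for f in futures]
```

Threads are fine here because retrieval calls are I/O-bound (network waits to the vector DB and search engine); feed the per-retriever result lists into your RRF merge afterwards.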
5. No observability
Log every query, the retrieved chunks, and the answer. Review them weekly. Without that data, iteration is blind.
Architecture pattern
A typical production stack:
[User query]
↓
[Query analysis + transformation]
↓
[Hybrid search] → [vector DB + keyword]
↓
[Re-ranker]
↓
[LLM with context + citations]
↓
[Response]
Cache at each step where applicable.
Vector DB choice
For production:
- pgvector: if you already run Postgres (most teams do).
- Qdrant: purpose-built, great performance.
- Weaviate: native hybrid search, more features.
- Pinecone: managed, simple, pricier.
- Elasticsearch/OpenSearch: mature hybrid search.
Decision rule: pgvector by default; a specialized engine when scale demands it.
Costs
- Embeddings: one-time, plus updates.
- LLM generation: per query.
- Vector DB: storage plus queries.
- Re-ranker: per query.
Optimize with cache hits, smaller models where they suffice, and batch processing.
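A back-of-envelope cost model ties these levers together. All rates below are inputs you supply, not real vendor prices; the sketch assumes a cache hit skips both the LLM call and the re-ranker:

```python
def cost_per_query(prompt_tokens, output_tokens,
                   llm_in_per_1k, llm_out_per_1k,
                   rerank_cost, cache_hit_rate=0.0):
    """Rough expected cost per query in USD: LLM input + output tokens
    at their per-1k rates, plus the re-ranker fee, discounted by the
    fraction of queries served from cache."""
    llm = (prompt_tokens / 1000) * llm_in_per_1k \
        + (output_tokens / 1000) * llm_out_per_1k
    return (1 - cache_hit_rate) * (llm + rerank_cost)
```

Plugging in your own numbers makes the trade-offs concrete: halving prompt tokens (fewer, better chunks) often saves more than switching LLM providers.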
Streaming responses
# Retrieve first
chunks = retriever.get_relevant_documents(query)
# Then stream the LLM response token by token
for token in llm.stream(prompt_with_context):
    yield token
Perceived latency improves significantly: users see the first tokens while the rest of the answer is still being generated.
Monitoring in production
Track:
- Retrieval recall: fraction of queries that retrieved a relevant doc (requires ground truth).
- Answer accuracy: via Ragas.
- Latency p50/p95/p99.
- Token usage per query.
- Cost per query.
- User satisfaction: thumbs up/down.
Keep a live dashboard so drift is visible as it happens.
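For the latency percentiles, the standard library is enough; a sketch over a batch of per-query timings:

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p95/p99 from per-query latencies in milliseconds,
    via statistics.quantiles with 100 cut points (99 values returned)."""
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```

Watch p95/p99 rather than the average: a cache that serves half your queries instantly can hide a retrieval path that blows the 2 s budget for everyone else.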
Iteration
Production RAG is not "deploy and forget":
- Add new documents continuously.
- Re-evaluate periodically.
- Update the chunking strategy based on observed failures.
- Swap components (LLM, embeddings, re-ranker) when better options emerge.
Conclusion
Production RAG is an engineering discipline, not magic. The patterns are known; execution is what varies. Successful teams invest in evaluation, observability, and iteration. Failing teams deploy "basic RAG" and expect perfection. For new projects, start with these patterns from day one: it is cheaper than rearchitecting later. For existing systems, audit against this list and prioritize the gaps.
Follow us at jacar.es for more on RAG patterns, production ML, and AI architectures.