Hybrid Search: Taking BM25 and Vectors Seriously

Magnifying glass examining documents, representing multi-modal search

For a couple of years the dominant RAG narrative was that vectors won everything. Chunk the docs, compute embeddings, drop them into any vector store, and semantic recall would take care of the rest. Production breaks that story quickly. Someone searches for a part number, a national ID, a ticket ID, a three-letter acronym, and the vector retriever returns semantically adjacent documents that are not the exact one. Meanwhile, anyone coming from pure BM25 ran into the opposite frustration: reformulated queries that shared meaning but not vocabulary fell out of the top-k. Hybrid search is the operational answer. Both retrievers run in parallel and their rankings are fused. By late 2024 it is the default in Elasticsearch, OpenSearch, Weaviate, Qdrant, Vespa and several layers on top of pgvector, with typical recall gains between twenty and forty percent over either signal alone.

Why hybrid wins where it wins

A dense vector captures semantic similarity learned during model training. That makes it excellent when query and document use different vocabulary for the same idea, when there are synonyms, paraphrases or even language switches. What it does not do well is distinguish rare tokens that barely appeared in the training corpus. A SKU like MZ-VL2T0B/AM, a CVE, a court case number or an uncommon proper noun end up projected into a generic semantic neighbourhood and, on a short query, the retriever returns documents that share topic but not the identifier. BM25, on the other hand, rewards exactly that literalness. The classic Okapi formula weighs term frequency and document rarity, so a rare token that matches word for word jumps to the top of the ranking. The price is that BM25 knows nothing about semantics: if the user types car and the document says automobile, the lexical overlap is zero.
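The lexical half of that contrast can be made concrete. Below is a minimal, illustrative sketch of Okapi BM25 scoring; real engines precompute corpus statistics in the inverted index, and the example corpus and SKU query are invented for the demonstration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenised document for a query.

    corpus is a list of tokenised documents. Sketch only: real engines
    precompute df, avgdl and term frequencies in the inverted index.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # Rarer terms get higher inverse document frequency.
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        # Term frequency, saturated by k1 and length-normalised by b.
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score

corpus = [
    ["firmware", "update", "for", "mz-vl2t0b/am"],
    ["general", "guide", "to", "ssd", "firmware", "updates"],
]
# The rare SKU token pushes the exact-match document to the top;
# the on-topic document without the identifier scores zero.
scores = [bm25_score(["mz-vl2t0b/am"], d, corpus) for d in corpus]
```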

Hybrid is not a new technique. It is the admission that the two signals are complementary and that the sane thing is to combine them at query time instead of picking one and praying. In golden-set evaluations over technical corpora the pattern repeats: queries with exact terms, numbers or acronyms improve dramatically, purely conceptual queries improve little or not at all, and very few degrade.

Fusing rankings without wrestling with weights

The interesting part of modern hybrid is that it no longer depends on weighted score combinations. Mixing a BM25 score whose range depends on corpus and language with a cosine similarity bounded in zero-to-one was an inexhaustible source of brittle calibration. Reciprocal Rank Fusion, proposed by Cormack, Clarke and Büttcher in 2009, reframes it. Instead of adding scores it adds contributions of the form one over a constant plus the document rank in each list. The constant, usually sixty, dampens the gap between first and second place and prevents a dominant retriever from crushing the other. Because it only uses ranks, RRF is insensitive to the original score scale, which means it works equally well fusing BM25 with dense vectors, two different dense models, or adding a cross-encoder reranker as a third retriever.
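The fusion itself is a few lines. A minimal sketch, with invented document IDs standing in for the two retrievers' outputs:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion (Cormack, Clarke & Buttcher, 2009).

    rankings: ranked lists of doc IDs, best first. Only positions are
    used, so BM25 scores and cosine similarities never need to share
    a scale; k=60 dampens the gap between the top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # exact-token hits
dense_top = ["d1", "d9", "d3"]  # semantic neighbours
fused = rrf_fuse([bm25_top, dense_top])
```

Documents that appear in both lists (`d1`, `d3`) accumulate two contributions and outrank documents that only one retriever found, which is exactly the behaviour you want from complementary signals. A third list, say from a reranker, fuses the same way.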

The other common option is alpha fusion, which does normalise scores and compute a linear combination, typically implemented in Weaviate with a parameter between zero and one. It offers finer control when you want to deliberately shift weight towards one signal or the other, for example in e-commerce catalogues where keyword weight must stay high. In exchange it demands per-query or per-collection tuning of that alpha, and any change in the embedding model can force recalibration. RRF is the reasonable starting point.
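For contrast, a sketch of alpha fusion under one common convention: min-max normalise each score list, then blend. Engines differ in normalisation details, and the scores here are invented.

```python
def alpha_fuse(dense_scores, bm25_scores, alpha=0.5):
    """Alpha-style hybrid fusion over raw scores per doc ID.

    alpha=1.0 is pure vector, alpha=0.0 pure keyword. Sketch only:
    normalisation conventions vary by engine.
    """
    def normalise(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on ties
        return {d: (s - lo) / span for d, s in scores.items()}

    dn, bn = normalise(dense_scores), normalise(bm25_scores)
    fused = {d: alpha * dn.get(d, 0.0) + (1 - alpha) * bn.get(d, 0.0)
             for d in set(dn) | set(bn)}
    return sorted(fused, key=fused.get, reverse=True)

# Shift weight towards keywords, catalogue-style.
ranked = alpha_fuse({"d1": 0.91, "d2": 0.88},
                    {"d2": 14.2, "d3": 9.1}, alpha=0.25)
```

Note what the min-max step hides: the blend depends on the score distribution of each result set, which is why a new embedding model can silently move the effective weight and force recalibration.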

Where it is implemented and how ergonomic it feels

Elasticsearch and OpenSearch have offered native hybrid in the search API for months, with a dense field configured in the mapping and a block that combines a match clause for text with a knn clause for the vector inside the same query. Weaviate exposes a hybrid operation with the alpha mentioned above. Qdrant introduced multi-vector collections with sparse vectors next to dense ones and a FusionQuery that applies RRF over prefetches. Vespa goes further and lets you express the fusion as a ranking expression. In pgvector the story is more hand-crafted but fully viable: combine a CTE using the distance operator with another using ts_rank over a tsvector, compute ranks with RANK() OVER, and sum the reciprocals of rank plus the constant in a final select. It is ugly, but it is one database.

WITH vec AS (
  SELECT id, RANK() OVER (ORDER BY embedding <=> $1) AS r
  FROM docs ORDER BY embedding <=> $1 LIMIT 50
),
kw AS (
  -- The ORDER BY matters: LIMIT without it returns an arbitrary 50 rows.
  SELECT id, RANK() OVER (ORDER BY ts_rank(tsv, to_tsquery($2)) DESC) AS r
  FROM docs WHERE tsv @@ to_tsquery($2)
  ORDER BY ts_rank(tsv, to_tsquery($2)) DESC LIMIT 50
)
SELECT id, SUM(1.0 / (60 + r)) AS rrf  -- RRF with constant k = 60
FROM (SELECT * FROM vec UNION ALL SELECT * FROM kw) u
GROUP BY id ORDER BY rrf DESC LIMIT 10;

The pattern is literally the same across engines. What varies is how much work remains underneath. In Weaviate or Qdrant you get it in one call without worrying about BM25 tokenisation. In Postgres you must pick a text-search configuration, stemmer, dictionaries and stop-words, and keep them consistent with the content language. On multilingual corpora that starts to weigh.

The real cost and when it does not pay off

Two costs tend to be forgotten. The first is indexing: maintaining both an inverted index and a vector index doubles storage and ingestion work, although nothing stops you from reusing the same document store. The second is latency, but it is less serious than it looks if both retrievers run in parallel and fusion happens in memory. The fanout adds milliseconds, not hundreds. Where hybrid gets expensive is when you start tuning it per query, stacking cross-encoder reranking on top, and adding caches at every layer.

Not every case needs it. Support chatbots with short, conversational queries, FAQ-style search where the user asks in natural language and the corpus is written in the same register, cross-lingual search where multilingual embeddings already do the job, all these scenarios usually live comfortably with pure vector. The clear symptom that hybrid is needed is recurring complaints of the form "I searched for this exact code and it did not come up", or a golden set where queries with literal terms show precision@10 far below the average.

Tuning and evaluation

The operational recipe is mechanical. Start with RRF at constant sixty, top-50 per retriever, and measure against a golden set with queries labelled by type. If one retriever contributes little, raise its top-k or revisit the embedding model. If the high ranks fill up with duplicates, add deduplication by base document before fusion. On top of the hybrid it is worth placing a cross-encoder reranker, such as Cohere Rerank or bge-reranker, which takes the top-100 and returns top-10 reordered by a slower but more precise model. That layer absorbs much of the noise any fusion leaves behind. And you measure in production, not on a generic benchmark, because real queries have a distribution of lengths and types that no MS MARCO replicates.
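The per-type breakdown is the part worth automating, because an aggregate number hides exactly the failure mode hybrid fixes. A minimal sketch, where the golden-set entries, query types, and the `retrieve` callable are all placeholders for your own system:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved IDs that are labelled relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def report_by_type(golden_set, retrieve, k=10):
    """Mean precision@k per query type.

    golden_set: iterable of (query, query_type, relevant_id_set).
    retrieve: callable returning a ranked list of doc IDs (the system
    under test -- hypothetical stand-in for your hybrid pipeline).
    """
    by_type = {}
    for query, qtype, relevant in golden_set:
        by_type.setdefault(qtype, []).append(
            precision_at_k(retrieve(query), relevant, k))
    return {t: sum(v) / len(v) for t, v in by_type.items()}

# Toy example: one exact-term query answered, one conceptual query missed.
golden = [("q1", "exact", {"d1"}), ("q2", "conceptual", {"d9"})]
fake_retrieve = lambda q: ["d1", "d2"] if q == "q1" else ["d3", "d4"]
report = report_by_type(golden, fake_retrieve, k=2)
```

A large gap between the "exact" and "conceptual" rows of that report is the quantitative version of the complaint from the previous section, and the first number to watch after switching fusion parameters.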

Hybrid search is not a cutting-edge technique, it is the new minimum viable baseline for serious RAG. Vector alone leaves out the literalness that humans still use when they know what they are looking for. BM25 alone leaves out the semantic flexibility that makes the LLM look smart. Combining them with RRF is cheap, is as easy to maintain as either signal alone, and on ninety percent of mixed corpora improves recall without degrading precision. The hard part is not the fusion. The hard part is accepting that classic lexical search, which we had been calling dead for years, is still half of the answer.
