Hybrid Search: Combining BM25 and Vectors Seriously

Updated: 2026-05-03

For a couple of years the dominant RAG narrative was that vectors won everything. Chunk the docs, compute embeddings, drop them into any vector store, and semantic recall would take care of the rest. Production breaks that story quickly. Someone searches for a part number, a national ID, a ticket ID or a three-letter acronym, and the vector retriever returns semantically adjacent documents that are not the exact one. Meanwhile, anyone coming from pure BM25 ran into the opposite frustration: reformulated queries that shared meaning but not vocabulary fell out of the top-k. Hybrid search is the operational answer: both retrievers run in parallel and their rankings are fused. By late 2024 it was natively supported in Elasticsearch, OpenSearch, Weaviate, Qdrant, Vespa and several layers on top of pgvector, with recall gains commonly reported in the 20% to 40% range over either signal alone.

Key takeaways

  • Dense vectors capture semantic similarity; BM25 captures exact literalness. The two are complementary.
  • Reciprocal Rank Fusion (RRF) fuses rankings without depending on score scale — it is the reasonable starting point.
  • Hybrid search is already supported natively in Elasticsearch, OpenSearch, Weaviate, Qdrant and Vespa.
  • The additional cost is double indexing and a few milliseconds of fan-out latency, not hundreds.
  • Not every case needs it: conversational support chatbots and natural-language FAQ usually live fine with pure vector.

Why hybrid wins where it wins

A dense vector captures semantic similarity learned during model training. That makes it excellent when query and document use different vocabulary for the same idea, when there are synonyms, paraphrases or even language switches. What it does not do well is distinguish rare tokens that barely appeared in the training corpus. A SKU like MZ-VL2T0B/AM, a CVE, a court case number or an uncommon proper noun end up projected into a generic semantic neighbourhood and, on a short query, the retriever returns documents that share topic but not the identifier.

BM25, on the other hand, rewards exactly that literalness. The classic Okapi formula weighs term frequency and document rarity, so a rare token that matches word for word jumps to the top of the ranking. The price is that BM25 knows nothing about semantics: if the user types “car” and the document says “automobile”, the lexical overlap is zero.
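
To make the literalness concrete, here is a minimal sketch of Okapi BM25 over a toy corpus. The parameter values `k1=1.2` and `b=0.75`, the whitespace tokenisation, the sample documents and the IDF variant with `+1` inside the logarithm are all assumptions for illustration, not something a particular engine is guaranteed to use.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            # Rare terms get a high IDF; the +1 keeps the value positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenised by whitespace.
docs = [
    "the automobile needs a new battery".split(),
    "replace ssd mz-vl2t0b/am in the rack".split(),
    "the car battery is flat".split(),
]
sku_scores = bm25_scores(["mz-vl2t0b/am"], docs)
car_scores = bm25_scores(["car"], docs)
```

The SKU query scores only the document that contains the token verbatim, while the query "car" gives the "automobile" document a score of exactly zero, which is the gap the dense retriever is there to cover.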

Hybrid is not a new technique. It is the admission that the two signals are complementary and that the sane thing is to combine them at query time. In golden-set evaluations over technical corpora the pattern repeats: queries with exact terms, numbers or acronyms improve dramatically; purely conceptual queries improve little or not at all; very few degrade.

Fusing rankings without wrestling with weights

The interesting part of modern hybrid is that it no longer depends on weighted score combinations. Mixing a BM25 score, which is unbounded and whose range depends on corpus and language, with a cosine similarity bounded between -1 and 1 was an inexhaustible source of brittle calibration.

Reciprocal Rank Fusion, proposed by Cormack, Clarke and Büttcher in 2009, reframes it. Instead of adding scores it adds contributions of the form 1 / (k + rank) where k is usually 60, so a document ranked first in one list contributes 1/61 and one ranked second contributes 1/62. The constant dampens the gap between first and second place and prevents a dominant retriever from crushing the other. Because it only uses ranks, RRF is insensitive to the original score scale, which means it applies unchanged whether you fuse BM25 with dense vectors, combine two different dense models, or add the ordering produced by a cross-encoder reranker as a third ranked list.
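
The formula fits in a few lines. The sketch below assumes each retriever returns an ordered list of document ids; the function name `rrf_fuse` and the sample ids are hypothetical.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the first hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

bm25_ranking = ["doc_sku", "doc_a", "doc_b"]
dense_ranking = ["doc_a", "doc_c", "doc_sku"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `doc_a` (second and first) edges out `doc_sku` (first and third), and both beat documents that appear in only one list. No score from either retriever was ever inspected.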

The other common option is alpha fusion, which normalises scores and computes a linear combination — implemented in Weaviate with a parameter between 0 and 1. It offers finer control when you want to deliberately shift weight towards one signal. In exchange it demands per-query or per-collection tuning. RRF is the reasonable starting point.
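
For comparison, a rough sketch of alpha fusion. It assumes min-max normalisation per result list and the convention that alpha = 1 means pure vector and alpha = 0 means pure keyword. The function names and scores are made up for illustration, and a real engine may normalise differently.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def alpha_fuse(dense_scores, bm25_scores, alpha=0.5):
    """Linear blend of normalised scores; a missing document counts as 0."""
    dense_n, bm25_n = min_max(dense_scores), min_max(bm25_scores)
    doc_ids = set(dense_n) | set(bm25_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical raw scores: cosine similarities and BM25 values.
dense_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_sku": 0.40}
bm25_scores = {"doc_sku": 14.2, "doc_a": 3.1, "doc_b": 2.0}

keyword_heavy = alpha_fuse(dense_scores, bm25_scores, alpha=0.2)
vector_only = alpha_fuse(dense_scores, bm25_scores, alpha=1.0)
```

Moving alpha from 1.0 to 0.2 lifts `doc_sku` from last place to first, which shows both the control the parameter offers and why it needs tuning.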

Where it is implemented and how ergonomic it feels

Elasticsearch and OpenSearch offer native hybrid in the search API, with a dense vector field configured in the mapping and a request that combines a match clause for text with a knn clause for the vector. Weaviate exposes a hybrid operation with the alpha mentioned above. Qdrant introduced collections that hold sparse vectors next to dense ones and a FusionQuery that applies RRF over prefetches. Vespa goes further and lets you express the fusion as a ranking expression.

In pgvector the story is more hand-crafted but fully viable:

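```sql
-- $1: query embedding; $2: query text, already in to_tsquery syntax
WITH vec AS (
  -- Top 50 by vector distance (<=> is pgvector's cosine-distance operator)
  SELECT id, RANK() OVER (ORDER BY embedding <=> $1) AS r
  FROM docs
  ORDER BY embedding <=> $1
  LIMIT 50
),
kw AS (
  -- Top 50 by lexical rank; without the ORDER BY, LIMIT keeps arbitrary rows
  SELECT id, RANK() OVER (ORDER BY ts_rank(tsv, to_tsquery($2)) DESC) AS r
  FROM docs
  WHERE tsv @@ to_tsquery($2)
  ORDER BY ts_rank(tsv, to_tsquery($2)) DESC
  LIMIT 50
)
-- RRF: sum 1 / (k + rank) across both lists, with k = 60
SELECT id, SUM(1.0 / (60 + r)) AS rrf
FROM (SELECT * FROM vec UNION ALL SELECT * FROM kw) u
GROUP BY id
ORDER BY rrf DESC
LIMIT 10;
```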

It is ugly, but it is one database. On multilingual corpora, keeping text-search configuration, stemmer, dictionaries and stop-words coherent starts to weigh.

The real cost and when it does not pay off

Two costs tend to be forgotten:

  • Double indexing: maintaining both an inverted index and a vector index means two indexes to store and two ingestion paths to keep in sync.
  • Latency: if both retrievers run in parallel and fusion happens in memory, the fanout adds milliseconds, not hundreds. Where hybrid gets expensive is when you try to tune per-query, adding cross-encoder reranking on top and caches at every layer.
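
The latency point can be sketched with a thread pool: when both retrievers are I/O-bound calls issued concurrently, the wall-clock time is close to the slower of the two rather than their sum. `lexical_search` and `vector_search` are hypothetical stand-ins that sleep for 50 ms to mimic a network round trip.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def lexical_search(query, top_k=50):
    """Hypothetical stand-in for a BM25 call; sleeps to mimic I/O."""
    time.sleep(0.05)
    return ["doc_sku", "doc_a"][:top_k]

def vector_search(query, top_k=50):
    """Hypothetical stand-in for a vector-store call; sleeps to mimic I/O."""
    time.sleep(0.05)
    return ["doc_a", "doc_b"][:top_k]

def fan_out(query, top_k=50):
    """Run both retrievers concurrently and return their two rankings."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical = pool.submit(lexical_search, query, top_k)
        dense = pool.submit(vector_search, query, top_k)
        return lexical.result(), dense.result()

start = time.perf_counter()
lexical_hits, dense_hits = fan_out("MZ-VL2T0B/AM")
elapsed = time.perf_counter() - start  # roughly one sleep, not two
```

The two ranked lists can then be fused in memory, which is cheap compared with either retrieval call.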

Not every case needs it. Support chatbots with short, conversational queries, FAQ-style search where the user asks in natural language and the corpus is written in the same register — all these scenarios usually live comfortably with pure vector. The clear symptom that hybrid is needed is recurring complaints of the form “I searched for this exact code and it did not come up”.

Tuning and evaluation

The operational recipe is mechanical. Start with RRF at constant 60, top-50 per retriever, and measure against a golden set with queries labelled by type. If one retriever contributes little, raise its top-k or revisit the embedding model. If the high ranks fill up with duplicates, add deduplication by base document before fusion.
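
A minimal sketch of the two mechanical pieces of that recipe: recall per query type over a golden set, and deduplication by base document. The golden-set layout, the `base#chunk` id scheme and the canned retriever are assumptions made up for the example.

```python
from collections import defaultdict

def dedupe_by_base_doc(chunk_ids, sep="#"):
    """Keep the best-ranked chunk per base document ('ssd#2' -> 'ssd')."""
    seen, kept = set(), []
    for chunk_id in chunk_ids:
        base = chunk_id.split(sep, 1)[0]
        if base not in seen:
            seen.add(base)
            kept.append(chunk_id)
    return kept

def recall_at_k_by_type(golden_set, retrieve, k=10):
    """Average recall@k per query type over a labelled golden set."""
    per_type = defaultdict(list)
    for item in golden_set:
        hits = set(retrieve(item["query"])[:k])
        relevant = set(item["relevant"])
        per_type[item["type"]].append(len(hits & relevant) / len(relevant))
    return {t: sum(v) / len(v) for t, v in per_type.items()}

# Hypothetical golden set: each query is labelled with its type.
golden_set = [
    {"query": "MZ-VL2T0B/AM", "type": "exact", "relevant": ["ssd#1"]},
    {"query": "disk keeps failing", "type": "conceptual",
     "relevant": ["ssd#1", "raid#2"]},
]
# Canned results standing in for a real retriever.
canned = {
    "MZ-VL2T0B/AM": ["ssd#1", "ssd#2"],
    "disk keeps failing": ["raid#2", "nas#1"],
}
report = recall_at_k_by_type(golden_set, lambda q: canned[q], k=10)
deduped = dedupe_by_base_doc(["ssd#1", "ssd#2", "raid#2", "ssd#3"])
```

Splitting the report by type is what makes the comparison useful: an average over all queries can hide the fact that exact-match queries improved sharply while conceptual ones did not move.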

On top of the hybrid it is worth placing a cross-encoder reranker — such as Cohere Rerank or bge-reranker — which takes the top-100 and returns top-10 reordered by a slower but more precise model. That layer absorbs much of the noise any fusion leaves behind.
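
The shape of that layer can be sketched without committing to a vendor. Below, `score_pair` is any callable that scores a (query, text) pair. `toy_overlap_scorer` is a deliberately crude placeholder and not a cross-encoder; in practice it would be replaced by a call to the reranking model.

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Reorder (doc_id, text) candidates by relevance and keep top_n."""
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    # The sort is stable, so ties keep the order the fusion step produced.
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in scored[:top_n]]

def toy_overlap_scorer(query, text):
    """Placeholder scorer: fraction of query tokens present in the text."""
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(text.lower().split())) / len(q_tokens)

# Hypothetical fused candidates, in the order the fusion step returned them.
candidates = [
    ("doc_a", "battery replacement guide"),
    ("doc_sku", "firmware for ssd mz-vl2t0b/am"),
    ("doc_b", "ssd overview"),
]
top = rerank("ssd mz-vl2t0b/am firmware", candidates, toy_overlap_scorer, top_n=2)
```

Because the reranker only sees the fused candidates, its cost scales with the size of that list, not with the size of the corpus.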

Hybrid search is not a cutting-edge technique; it is the new minimum viable baseline for serious RAG. The hard part is not the fusion. The hard part is accepting that classic lexical search, which we had been calling dead for years, is still half of the answer.

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.