A basic RAG pipeline does top-k embedding retrieval and passes the documents straight to the LLM. It works, but it hides a structural weakness. Embeddings are bi-encoders: they encode the query and the document independently and compare them with a dot product. That makes them lightning fast over millions of vectors, but mediocre at discriminating between the ten or twenty candidates that look most similar. This is where a reranker comes in, and where most RAG projects move from “decent” to “actually useful”.
Bi-encoder versus cross-encoder
A bi-encoder produces two independent vectors, one for the query and one for the document, and compares them with cosine or dot product. All the context the model can capture has to be compressed into those vectors without knowing what will be asked. That is an approximation, not a fine-grained relevance measure.
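The bi-encoder comparison can be sketched with plain vectors. The embeddings below are made-up three-dimensional stand-ins for real model output, just to show the geometry:

```python
import math

def cosine(u, v):
    # Dot product over the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; a real bi-encoder would produce each one independently,
# with no knowledge of what the query will be.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.1, 0.9, 0.3]}

scores = {name: cosine(query_vec, v) for name, v in doc_vecs.items()}
best = max(scores, key=scores.get)  # the document pointing in the closest direction wins
```

Everything the model knows about a document has to survive this compression into one fixed vector, which is exactly why the fine distinctions get lost.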
A cross-encoder does the opposite. It concatenates query and document and runs them together through the transformer, letting attention heads look token by token at which query words light up which parts of the document. The cost is enormous, since you need one model call per pair, but the signal is much richer. A bi-encoder tells you “this document looks like the kind of thing someone usually asks about”; a cross-encoder tells you “this document answers this specific question”.
The practical consequence is obvious: you cannot run a cross-encoder over millions of documents, but you can run it over a hundred. That is why the standard pattern combines both. The bi-encoder does cheap recall over the whole corpus, and the cross-encoder does expensive precision over the shortlist.
The canonical flow
The pipeline that works in production has three clearly separated stages. First a vector retrieval that returns the top-100 from the index in under a hundred milliseconds. Second a reranker that reorders those hundred documents and keeps the top-10, somewhere around three to five hundred milliseconds depending on the model. Third the LLM that generates the answer with those ten documents in context.
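The three stages chain into a straight function pipeline. In this sketch the retriever and reranker are word-overlap stubs standing in for the vector index and the cross-encoder; only the shape of the flow is the point:

```python
def retrieve(query, corpus, k=100):
    # Stage 1 stub: cheap recall over the whole corpus.
    # Stands in for a vector index lookup (e.g. HNSW top-100).
    scored = [(doc, sum(w in doc for w in query.split())) for doc in corpus]
    return [doc for doc, _ in sorted(scored, key=lambda x: -x[1])[:k]]

def rerank(query, candidates, k=10):
    # Stage 2 stub: expensive precision over the shortlist.
    # Stands in for a cross-encoder scoring each (query, doc) pair.
    q_words = set(query.split())
    scored = [(doc, len(q_words & set(doc.split()))) for doc in candidates]
    return [doc for doc, _ in sorted(scored, key=lambda x: -x[1])[:k]]

def build_context(query, corpus):
    candidates = retrieve(query, corpus, k=100)  # wide and fast
    context = rerank(query, candidates, k=10)    # narrow and precise
    return context  # stage 3 passes this to the LLM

docs = [
    "rerankers improve rag precision",
    "vector search is fast",
    "llms hallucinate less with good context",
]
top = build_context("rerankers precision", docs)
```

The key design point is that the two shortlist sizes are independent knobs: the retriever's k controls recall, the reranker's k controls how much lands in the LLM's context.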
The reranker adds a few hundred milliseconds, but the quality of the top-10 improves so much that the LLM hallucinates less and answers with more precision. The user perceives a system that understands the question, not one that returns “something similar”.
Rerankers that matter today
Cohere Rerank is the hosted reference. The rerank-english-v3.0 model and its multilingual sibling rerank-multilingual-v3.0 still set the ceiling on most public benchmarks in 2024. The API is a single call, they charge per thousand documents processed, and the cost sits around two dollars per thousand queries with a hundred candidates. The downside is the classic one: external dependency, network latency, and data leaving your infrastructure.
BGE-reranker, from BAAI, is the de facto open-source standard. The bge-reranker-large variant with 568 million parameters comes very close to Cohere in quality, and the v2 generation —bge-reranker-v2-m3 and its multilingual cousins— already offers competitive options across several languages. It needs a GPU to serve with reasonable latency, but a modest T4 or L4 is enough for medium loads.
As a lightweight baseline there are the classic sentence-transformers cross-encoders like cross-encoder/ms-marco-MiniLM-L-12-v2. They do not beat BGE large, but they run on CPU and are good enough for prototypes. Jina AI also publishes a commercial multilingual reranker worth benchmarking if your domain spans European languages.
A minimal example with BGE, to anchor the idea (query and candidates are placeholders for your retrieval output):
from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-large", max_length=512)
query = "example question"                          # placeholder
candidates = ["first doc text", "second doc text"]  # the top-100 from vector retrieval
# Score each (query, document) pair jointly, then keep the best 10.
pairs = [(query, doc) for doc in candidates]
scores = model.predict(pairs, batch_size=32)
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]
When it actually pays off
A reranker is not free, neither in latency nor in operational complexity, so it helps to know when it adds value and when it does not. It clearly helps on large corpora, above one hundred thousand documents, where embedding recall is good but the top-20 ordering is noisy. It helps with ambiguous queries, where several documents are plausible and the gap between the good one and the mediocre one does not fit in the geometry of the vector space. It helps on long-tail topics, where general-purpose embeddings lose nuance.
It does not pay off on small, well-curated corpora under ten thousand documents, where a decent embedding usually nails the top-1. It does not pay off on exact queries like code search or identifier lookup. And it does not pay off if your end-to-end latency budget is two hundred milliseconds: at that point there is simply no room for an extra half second.
On typical internal QA benchmarks with a two-hundred-thousand-document corpus, moving from embeddings-only to embeddings plus Cohere Rerank lifts Precision@10 from 45 to 68 percent, with Recall@10 climbing from 68 to 72. With an open-source cross-encoder the gain is somewhat smaller, from 45 to 62. The jumps are significant and they translate into less LLM hallucination.
A realistic latency budget
It is worth thinking in slots. Vector retrieval on a well-sized HNSW index sits between 20 and 80 ms. The reranker over a hundred candidates with BGE large on GPU takes around 300 to 500 ms; with hosted Cohere, between 200 and 400 ms depending on region. LLM generation is the dominant slot, almost always above one second. Put together, a RAG pipeline with a reranker lands somewhere between 1.5 and 3 seconds of total latency, which for asynchronous chat or enriched search is perfectly acceptable.
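The slot arithmetic is worth making explicit. The figures below are the mid-range estimates from the paragraph above, not measurements from any particular deployment:

```python
# Mid-range latency per stage, in milliseconds (estimates, not benchmarks).
budget = {
    "vector_retrieval": 50,   # HNSW top-100
    "reranker": 400,          # BGE large on GPU, 100 pairs
    "llm_generation": 1500,   # the dominant slot
}

total_ms = sum(budget.values())
rerank_share = budget["reranker"] / total_ms  # roughly a fifth of the wall time
```

Seen this way, the reranker is a modest fraction of a budget that generation dominates anyway, which is why it is usually the last slot worth squeezing.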
If the target is sub-second, there is room to manoeuvre: shrink the initial top-k to 50, use the base reranker instead of the large one, cache frequent queries, and batch whenever the usage pattern allows it. Caching over normalised queries is probably the best-ratio optimisation; in narrow domains the hit rate goes above 30 percent.
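The cache over normalised queries is a few lines. The normalisation rule here (lowercase, collapse whitespace) is a minimal example; real systems often add stemming or stopword removal:

```python
import re

cache = {}

def normalise(query):
    # Lowercase and collapse whitespace so trivially different
    # phrasings of the same query share one cache key.
    return re.sub(r"\s+", " ", query.strip().lower())

def reranked_results(query, compute):
    key = normalise(query)
    if key not in cache:
        cache[key] = compute(query)  # only pay the reranker cost on a miss
    return cache[key]

# Two surface forms, one cache entry: the second call never hits the reranker.
first = reranked_results("  What is RAG? ", lambda q: ["doc1"])
second = reranked_results("what is rag?", lambda q: ["doc2"])
```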
Common mistakes
Retrieving only the top-10 before reranking kills the benefit: if the bi-encoder already dropped the good document, the reranker cannot resurrect it. You need to reach up to top-50 or top-100. Not measuring before and after on a golden set is the other classic error; without Precision@5, Recall@10 and MRR computed over real queries, the intuition of “it improved” does not hold. And reranking a bad initial retrieval —chunks that are too large, embeddings mismatched to the domain— fixes nothing: the reranker can pick the best of what it receives, not invent what is missing.
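The before/after measurement takes only a handful of lines per query. Here ranked is the system's output for one query and golden is the hand-labelled set of relevant ids; averaging over the golden set gives the corpus-level numbers:

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are in the golden set.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of the golden set that shows up in the top-k.
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant hit; 0 if none appears.
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]  # system output, best first
golden = {"d1", "d2"}              # hand-labelled relevant ids
```

Run these over the same golden queries with and without the reranker, and the “it improved” intuition becomes a number you can defend.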
Closing
Reranking is one of the highest leverage interventions in the whole RAG stack. Wiring in Cohere Rerank takes a few hours and one endpoint; deploying BGE-reranker on your own GPU takes a couple of days and one more service to look after. The hosted-versus-self-hosted choice almost always comes down to volume: below ten thousand queries a day Cohere is cheaper, above one hundred thousand the GPU pays for itself. The choice of whether to rerank at all, on the other hand, has largely stopped being a real choice in 2024: any RAG pipeline aiming at measurable quality does it. The interesting work has moved one level down —how to chunk, how to evaluate, how to cache— and that is where serious projects are putting their effort.