A basic RAG pipeline does top-k embedding retrieval and passes the documents straight to the LLM. It works, but it hides a structural weakness. Embeddings are bi-encoders: they encode the query and the document independently and compare them with a dot product. That makes them lightning fast over millions of vectors, but mediocre at discriminating between the ten or twenty candidates that look most similar. This is where a reranker comes in, and where most RAG projects move from “decent” to “actually useful”.
Bi-encoder versus cross-encoder
A bi-encoder produces two independent vectors, one for the query and one for the document, and compares them with cosine or dot product. All the context the model can capture has to be compressed into those vectors without knowing what will be asked. That is an approximation, not a fine-grained relevance measure.
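The bi-encoder comparison can be sketched with plain vectors. The embeddings below are made-up three-dimensional stand-ins for real model output, just to show the geometry:

```python
import math

def cosine(u, v):
    # Dot product over the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; a real bi-encoder would produce each one independently,
# with no knowledge of what the query will be.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.1, 0.9, 0.3]}

scores = {name: cosine(query_vec, v) for name, v in doc_vecs.items()}
best = max(scores, key=scores.get)  # the document pointing in the closest direction wins
```

Everything the model knows about a document has to survive this compression into one fixed vector, which is exactly why the fine distinctions get lost.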
A cross-encoder does the opposite. It concatenates query and document and runs them together through the transformer, letting attention heads look token by token at which query words light up which parts of the document. The cost is enormous, since you need one model call per pair, but the signal is much richer. A bi-encoder tells you “this document looks like the kind of thing someone usually asks about”; a cross-encoder tells you “this document answers this specific question”.
The practical consequence is obvious: you cannot run a cross-encoder over millions of documents, but you can run it over a hundred. That is why the standard pattern combines both. The bi-encoder does cheap recall over the whole corpus, and the cross-encoder does expensive precision over the shortlist.
The canonical flow
The pipeline that works in production has three clearly separated stages. First a vector retrieval that returns the top-100 from the index in under a hundred milliseconds. Second a reranker that reorders those hundred documents and keeps the top-10, somewhere around three to five hundred milliseconds depending on the model. Third the LLM that generates the answer with those ten documents in context.
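The three stages chain into a straight function pipeline. In this sketch the retriever and reranker are word-overlap stubs standing in for the vector index and the cross-encoder; only the shape of the flow is the point:

```python
def retrieve(query, corpus, k=100):
    # Stage 1 stub: cheap recall over the whole corpus.
    # Stands in for a vector index lookup (e.g. HNSW top-100).
    scored = [(doc, sum(w in doc for w in query.split())) for doc in corpus]
    return [doc for doc, _ in sorted(scored, key=lambda x: -x[1])[:k]]

def rerank(query, candidates, k=10):
    # Stage 2 stub: expensive precision over the shortlist.
    # Stands in for a cross-encoder scoring each (query, doc) pair.
    q_words = set(query.split())
    scored = [(doc, len(q_words & set(doc.split()))) for doc in candidates]
    return [doc for doc, _ in sorted(scored, key=lambda x: -x[1])[:k]]

def build_context(query, corpus):
    candidates = retrieve(query, corpus, k=100)  # wide and fast
    context = rerank(query, candidates, k=10)    # narrow and precise
    return context  # stage 3 passes this to the LLM

docs = [
    "rerankers improve rag precision",
    "vector search is fast",
    "llms hallucinate less with good context",
]
top = build_context("rerankers precision", docs)
```

The key design point is that the two shortlist sizes are independent knobs: the retriever's k controls recall, the reranker's k controls how much lands in the LLM's context.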
The reranker adds a few hundred milliseconds, but the quality of the top-10 improves so much that the LLM hallucinates less and answers with more precision. The user perceives a system that understands the question, not one that returns “something similar”.
Rerankers that matter today
Cohere Rerank is the hosted reference. The rerank-english-v3.0 model and its multilingual sibling rerank-multilingual-v3.0 still set the ceiling on most public benchmarks in 2024. The API is a single call, they charge per thousand documents processed, and the cost sits around two dollars per thousand queries with a hundred candidates. The downside is the classic one: external dependency, network latency, and data leaving your infrastructure.
BGE-reranker, from BAAI, is the de facto open-source standard. The bge-reranker-large variant with 568 million parameters comes very close to Cohere in quality, and the v2 generation —bge-reranker-v2-m3 and its multilingual cousins— already offers competitive options across several languages. It needs a GPU to serve with reasonable latency, but a modest T4 or L4 is enough for medium loads.
As a lightweight baseline there are the classic sentence-transformers cross-encoders like cross-encoder/ms-marco-MiniLM-L-12-v2. They do not beat BGE large, but they run on CPU and are good enough for prototypes. Jina AI also publishes a commercial multilingual reranker worth benchmarking if your domain spans European languages.
A minimal example with BGE, to anchor the idea (query and candidates are placeholders for your retrieval output):
from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-large", max_length=512)
query = "example question"                          # placeholder
candidates = ["first doc text", "second doc text"]  # the top-100 from vector retrieval
# Score each (query, document) pair jointly, then keep the best 10.
pairs = [(query, doc) for doc in candidates]
scores = model.predict(pairs, batch_size=32)
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]
When it actually pays off
A reranker is not free, neither in latency nor in operational complexity, so it helps to know when it adds value and when it does not. It clearly helps on large corpora, above one hundred thousand documents, where embedding recall is good but the top-20 ordering is noisy. It helps with ambiguous queries, where several documents are plausible and the gap between the good one and the mediocre one does not fit in the geometry of the vector space. It helps on long-tail topics, where general-purpose embeddings lose nuance.
It does not pay off on small, well-curated corpora under ten thousand documents, where a decent embedding usually nails the top-1. It does not pay off on exact queries like code search or identifier lookup. And it does not pay off if your end-to-end latency budget is two hundred milliseconds: at that point there is simply no room for an extra half second.
On typical internal QA benchmarks with a two-hundred-thousand-document corpus, moving from embeddings-only to embeddings plus Cohere Rerank lifts Precision@10 from 45 to 68 percent, with Recall@10 climbing from 68 to 72. With an open-source cross-encoder the gain is somewhat smaller, from 45 to 62. The jumps are significant and they translate into less LLM hallucination.
A realistic latency budget
It is worth thinking in slots. Vector retrieval on a well-sized HNSW index sits between 20 and 80 ms. The reranker over a hundred candidates with BGE large on GPU takes around 300 to 500 ms; with hosted Cohere, between 200 and 400 ms depending on region. LLM generation is the dominant slot, almost always above one second. Put together, a RAG pipeline with a reranker lands somewhere between 1.5 and 3 seconds of total latency, which for asynchronous chat or enriched search is perfectly acceptable.
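The slot arithmetic is worth making explicit. The figures below are the mid-range estimates from the paragraph above, not measurements from any particular deployment:

```python
# Mid-range latency per stage, in milliseconds (estimates, not benchmarks).
budget = {
    "vector_retrieval": 50,   # HNSW top-100
    "reranker": 400,          # BGE large on GPU, 100 pairs
    "llm_generation": 1500,   # the dominant slot
}

total_ms = sum(budget.values())
rerank_share = budget["reranker"] / total_ms  # roughly a fifth of the wall time
```

Seen this way, the reranker is a modest fraction of a budget that generation dominates anyway, which is why it is usually the last slot worth squeezing.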
If the target is sub-second, there is room to manoeuvre: shrink the initial top-k to 50, use the base reranker instead of the large one, cache frequent queries, and batch whenever the usage pattern allows it. Caching over normalised queries is probably the best-ratio optimisation; in narrow domains the hit rate goes above 30 percent.
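The cache over normalised queries is a few lines. The normalisation rule here (lowercase, collapse whitespace) is a minimal example; real systems often add stemming or stopword removal:

```python
import re

cache = {}

def normalise(query):
    # Lowercase and collapse whitespace so trivially different
    # phrasings of the same query share one cache key.
    return re.sub(r"\s+", " ", query.strip().lower())

def reranked_results(query, compute):
    key = normalise(query)
    if key not in cache:
        cache[key] = compute(query)  # only pay the reranker cost on a miss
    return cache[key]

# Two surface forms, one cache entry: the second call never hits the reranker.
first = reranked_results("  What is RAG? ", lambda q: ["doc1"])
second = reranked_results("what is rag?", lambda q: ["doc2"])
```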
Common mistakes
Retrieving only the top-10 before reranking kills the benefit: if the bi-encoder already dropped the good document, the reranker cannot resurrect it. You need to reach up to top-50 or top-100. Not measuring before and after on a golden set is the other classic error; without Precision@5, Recall@10 and MRR computed over real queries, the intuition of “it improved” does not hold. And reranking a bad initial retrieval —chunks that are too large, embeddings mismatched to the domain— fixes nothing: the reranker can pick the best of what it receives, not invent what is missing.
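The before/after measurement takes only a handful of lines per query. Here ranked is the system's output for one query and golden is the hand-labelled set of relevant ids; averaging over the golden set gives the corpus-level numbers:

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are in the golden set.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of the golden set that shows up in the top-k.
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant hit; 0 if none appears.
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]  # system output, best first
golden = {"d1", "d2"}              # hand-labelled relevant ids
```

Run these over the same golden queries with and without the reranker, and the “it improved” intuition becomes a number you can defend.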
Closing
Reranking is one of the highest leverage interventions in the whole RAG stack. Wiring in Cohere Rerank takes a few hours and one endpoint; deploying BGE-reranker on your own GPU takes a couple of days and one more service to look after. The hosted-versus-self-hosted choice almost always comes down to volume: below ten thousand queries a day Cohere is cheaper, above one hundred thousand the GPU pays for itself. The choice of whether to rerank at all, on the other hand, has largely stopped being a real choice in 2024: any RAG pipeline aiming at measurable quality does it. The interesting work has moved one level down —how to chunk, how to evaluate, how to cache— and that is where serious projects are putting their effort.