Reranking and Relevance Scoring for RAG
After retrieval, you have a list of candidate documents that are semantically or lexically related to the query, but they are not all equally useful for answering the user's question. A reranker is a specialized model that re-scores each retrieved document against the query with a fine-grained relevance score, so you can filter out noise and pass only the best results to the LLM.
Reranking is one of the highest-impact optimizations in production RAG systems. Adding a reranker typically improves answer quality by 10–20% while reducing token consumption (irrelevant documents are discarded) and generation latency (the LLM has fewer tokens to process), at the cost of one extra scoring step.
Why Reranking Matters
In a basic RAG pipeline, you retrieve the top-K results from your vector or keyword index and feed all of them to the LLM. If K=10, the LLM reads 10 documents, which may total 4,000–6,000 tokens. Often only 3–5 of those documents are actually relevant to the query; the others introduce noise. A reranker solves this by computing a relevance score (usually 0–1, or a label like "relevant", "partially relevant", "irrelevant") for each document relative to the query. You then keep only the top-N by reranker score and pass those to the LLM. The result: fewer tokens, faster inference, and more accurate answers because the LLM is less distracted by noise.
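The "keep only the top-N by reranker score" step can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `keep_top_n` helper, the optional `min_score` cutoff, and the sample documents are all hypothetical, and the only assumption about your data is that each document dict carries a `rerank_score` field.

```python
from typing import Optional


def keep_top_n(scored_docs: list[dict], n: int = 3,
               min_score: Optional[float] = None) -> list[dict]:
    """Return the n highest-scoring documents, optionally dropping low scores."""
    ranked = sorted(scored_docs, key=lambda d: d["rerank_score"], reverse=True)
    if min_score is not None:
        # Optional absolute cutoff (hypothetical; tune it on your own data)
        ranked = [d for d in ranked if d["rerank_score"] >= min_score]
    return ranked[:n]


# Hypothetical reranker output for five retrieved documents
docs = [
    {"id": "a", "rerank_score": 0.91},
    {"id": "b", "rerank_score": 0.12},
    {"id": "c", "rerank_score": 0.78},
    {"id": "d", "rerank_score": 0.40},
    {"id": "e", "rerank_score": 0.05},
]

top3 = keep_top_n(docs, n=3)
confident_only = keep_top_n(docs, n=3, min_score=0.5)
print([d["id"] for d in top3])            # ['a', 'c', 'd']
print([d["id"] for d in confident_only])  # ['a', 'c']
```

Combining a count limit with a score floor means a query with few good matches sends fewer documents to the LLM instead of padding the context with weak ones.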
A 2024 benchmark by Cohere found that retrieval followed by reranking achieved 94% answer correctness on customer support QA, compared to 87% with retrieval alone (using the same LLM).
Reranking Models and Architectures
Two main architectures exist for reranking:
Bi-Encoders (aka "embedding models") encode the query and document independently, then compute their similarity. Fast (query encoded once, documents pre-encoded), but less precise because they cannot model query-document interactions.
Cross-Encoders encode the query and document together, allowing the model to see how they interact. Slower (must encode each query-document pair), but significantly more accurate. For most production systems, the accuracy gain justifies the latency cost.
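The structural difference between the two architectures can be made concrete with toy stand-ins. The sketch below does not use real models: `toy_embed` and `toy_cross_score` are hypothetical functions built on word counts and bigram overlap, chosen only so the example runs without any dependencies. The word-order blindness of the toy embedding is exaggerated on purpose; what carries over to real systems is the call pattern (documents encoded once versus one scoring call per query-document pair).

```python
from collections import Counter
from math import sqrt

calls = {"bi": 0, "cross": 0}  # counts how often each toy "model" is invoked


def toy_embed(text: str) -> Counter:
    """Toy bi-encoder stand-in: encodes ONE text, with no view of the other side."""
    calls["bi"] += 1
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_cross_score(query: str, doc: str) -> float:
    """Toy cross-encoder stand-in: sees the query and document TOGETHER."""
    calls["cross"] += 1
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams, d_bigrams = set(zip(q, q[1:])), set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / max(len(q_bigrams), 1)


docs = ["dog bites man", "man bites dog", "cat sleeps all day"]
queries = ["dog bites man", "man bites dog"]

# Bi-encoder pattern: embed documents once (offline), embed each query once.
doc_vecs = [toy_embed(d) for d in docs]
bi_scores = []
for q in queries:
    q_vec = toy_embed(q)
    bi_scores.append([cosine(q_vec, dv) for dv in doc_vecs])

# Cross-encoder pattern: one scoring call per (query, document) pair.
cross_scores = [[toy_cross_score(q, d) for d in docs] for q in queries]

print(calls)  # {'bi': 5, 'cross': 6}
```

With 3 documents and 2 queries the gap is small (5 calls versus 6), but the bi-encoder cost grows as documents plus queries while the cross-encoder cost grows as documents times queries. That is why cross-encoders are applied to a short candidate list rather than the whole index. In this toy, the joint scorer also separates "dog bites man" from "man bites dog", which the independent word-count embedding cannot.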
| Model | Type | Latency per doc | Quality | Best For |
|---|---|---|---|---|
| Cohere rerank-english-v3.0 | Cross-encoder | 15–20ms | Excellent | Production, general domains |
| LLM-based (e.g., GPT-4o mini) | Cross-encoder | 50–100ms | Excellent but costly | High-stakes, limited result sets |
| mmarco-mMiniLMv2-L12-H384-v1 (HuggingFace) | Cross-encoder | 5–10ms | Good | Speed-critical, local deployment |
| Open-source ms-marco-cross-encoder | Cross-encoder | 10–15ms | Good | Cost-free, local |
For production systems in 2026, Cohere's rerank API is a common default because it balances quality and speed. For cost-sensitive systems, open-source cross-encoders can run locally with low per-document latency on suitable hardware.
Implementing a Reranking Pipeline
Here is a complete pipeline: retrieve candidates, rerank, and pass top results to an LLM:
from cohere import Client
from openai import OpenAI

cohere_client = Client(api_key="YOUR_COHERE_API_KEY")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def retrieve_candidates(query: str, retriever, k: int = 20) -> list[dict]:
    """Retrieve top-K candidates from your hybrid retriever (see article 4)."""
    # This calls your hybrid retriever and returns top-K results
    # Each result is a dict with keys: {"id", "text", "source", "score"}
    return retriever.search(query, k=k)


def rerank_results(query: str, candidates: list[dict], k_final: int = 5) -> list[dict]:
    """Rerank candidates using Cohere's cross-encoder; return top-K."""
    # Extract texts for reranking
    texts = [doc["text"] for doc in candidates]

    # Call Cohere rerank API
    response = cohere_client.rerank(
        query=query,
        documents=texts,
        model="rerank-english-v3.0",
        top_n=k_final,
    )

    # Map reranked results back to original documents
    # (result.index is the position of the document in the input list)
    reranked_docs = []
    for position, result in enumerate(response.results, start=1):
        original_doc = candidates[result.index]
        reranked_docs.append({
            **original_doc,
            "rerank_score": result.relevance_score,
            "rerank_position": position,
        })
    return reranked_docs


def rag_pipeline(query: str, retriever, llm_model: str = "gpt-4o-mini") -> dict:
    """Full RAG pipeline: retrieve, rerank, prompt, generate."""
    # Step 1: Retrieve top-20 candidates
    candidates = retrieve_candidates(query, retriever, k=20)
    print(f"Retrieved {len(candidates)} candidates")

    # Step 2: Rerank and keep top-5
    top_results = rerank_results(query, candidates, k_final=5)
    print(f"After reranking: {len(top_results)} results")

    # Step 3: Format context for LLM
    context = "\n\n".join(
        f"Source: {doc['source']}\n{doc['text']}"
        for doc in top_results
    )

    # Step 4: Prompt and generate
    system_prompt = """You are a helpful assistant answering questions based on provided documents.
Always cite your sources by including the document name in brackets, e.g., [source_name].
If the provided documents do not contain the answer, say so clearly."""

    response = openai_client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.2,
    )

    return {
        "query": query,
        "retrieved_count": len(candidates),
        "reranked_count": len(top_results),
        "answer": response.choices[0].message.content,
        # Several chunks can come from one file, so list each source once, in rank order
        "sources": list(dict.fromkeys(doc["source"] for doc in top_results)),
    }


# Example usage
# your_retriever is a placeholder for any object exposing .search(query, k=...)
query = "How do I configure async/await in Python?"
result = rag_pipeline(query, retriever=your_retriever)
print(f"\nAnswer: {result['answer']}")
print(f"Sources: {result['sources']}")
Output:
Retrieved 20 candidates
After reranking: 5 results
Answer: To configure async/await in Python, you use the `asyncio` module...
Sources: ['python-async-guide.md', 'asyncio-api-docs.md']
Reranking Strategies and Cost Trade-Offs
Reranking adds latency and cost. Cohere's rerank API is billed per search (one query scored against a batch of documents) rather than per token; reranking a 20-candidate list typically costs $0.001–0.003. The cost is small per query but adds up across thousands of queries. Consider these strategies:
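To see how the per-query figure multiplies, here is a back-of-the-envelope calculation using the $0.001–0.003 range quoted above. The traffic volume (50,000 queries per day) and the `rerank_fraction` parameter are hypothetical values for illustration; substitute your own numbers and your provider's current pricing.

```python
def monthly_rerank_cost(queries_per_day: int, cost_per_rerank: float,
                        rerank_fraction: float = 1.0, days: int = 30) -> float:
    """Estimated monthly spend; rerank_fraction is the share of queries reranked."""
    return queries_per_day * rerank_fraction * cost_per_rerank * days


# Hypothetical workload: 50,000 queries per day, every query reranked
low = monthly_rerank_cost(50_000, 0.001)
high = monthly_rerank_cost(50_000, 0.003)
print(f"Always rerank: ${low:,.0f} to ${high:,.0f} per month")

# Same workload if only 40% of queries reach the reranker
low_partial = monthly_rerank_cost(50_000, 0.001, rerank_fraction=0.4)
high_partial = monthly_rerank_cost(50_000, 0.003, rerank_fraction=0.4)
print(f"Rerank 40%:    ${low_partial:,.0f} to ${high_partial:,.0f} per month")
```

Under these assumed numbers, always reranking costs roughly $1,500 to $4,500 per month, and reranking 40% of queries costs roughly $600 to $1,800.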
Strategy 1: Always Rerank (High Quality, Moderate Cost) Retrieve K candidates, rerank all, keep top-N. Best for high-stakes QA (legal, medical, support) where accuracy is critical.
Strategy 2: Threshold-Based Reranking (Balanced) Retrieve K candidates. If the top result's initial score (from hybrid retrieval) exceeds a threshold (e.g., 0.85), skip reranking; it is likely correct. Otherwise, rerank. This balances cost and quality.
Strategy 3: Two-Stage Reranking (High Throughput) Use a lightweight, fast reranker (e.g., ms-marco-cross-encoder) as a first pass to filter to 10 results, then use Cohere's premium reranker for final ranking. Costs less than reranking all K.
Strategy 4: LLM-Based Reranking (Premium Quality) Use an LLM (e.g., GPT-4o mini) as the reranker: prompt it to score each result on relevance. Highest quality but slowest and most expensive. Reserve for high-value queries.
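Strategies 2 and 3 are mostly control flow, so a short sketch may help. The code below assumes each candidate dict has a `score` field from first-stage retrieval and that a reranker is any callable taking `(query, docs, top_n)`; the function names, the `overlap_reranker` stub, and the sample documents are hypothetical. In a real system the stub would be replaced by a local cross-encoder or an API call.

```python
from typing import Callable

Reranker = Callable[[str, list[dict], int], list[dict]]


def threshold_rerank(query: str, candidates: list[dict], reranker: Reranker,
                     k_final: int = 5, skip_threshold: float = 0.85) -> list[dict]:
    """Strategy 2: skip reranking when the top retrieval score is already high."""
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda d: d["score"], reverse=True)
    if ranked[0]["score"] > skip_threshold:
        return ranked[:k_final]  # trust first-stage ordering, no reranker call
    return reranker(query, ranked, k_final)


def two_stage_rerank(query: str, candidates: list[dict], fast_reranker: Reranker,
                     premium_reranker: Reranker, k_mid: int = 10,
                     k_final: int = 5) -> list[dict]:
    """Strategy 3: cheap reranker trims the list, premium reranker orders the rest."""
    shortlist = fast_reranker(query, candidates, k_mid)
    return premium_reranker(query, shortlist, k_final)


rerank_calls = []  # records how many documents each reranker call received


def overlap_reranker(query: str, docs: list[dict], top_n: int) -> list[dict]:
    """Stub reranker (assumption): orders documents by shared query terms."""
    rerank_calls.append(len(docs))
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: len(terms & set(d["text"].lower().split())),
                  reverse=True)[:top_n]


confident = [{"id": "a", "text": "reset password steps", "score": 0.91},
             {"id": "b", "text": "billing faq", "score": 0.40}]
uncertain = [{"id": "a", "text": "billing faq", "score": 0.60},
             {"id": "b", "text": "reset password steps", "score": 0.55}]

first = threshold_rerank("reset password", confident, overlap_reranker)
second = threshold_rerank("reset password", uncertain, overlap_reranker)
print([d["id"] for d in first], [d["id"] for d in second], rerank_calls)
# ['a', 'b'] ['b', 'a'] [2]
```

Only the second query reaches the reranker, which is where the cost saving comes from. The 0.85 threshold is the example value from the text; whether first-stage scores are comparable across queries depends on your retriever, so validate the threshold before relying on it.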
Common Reranking Pitfalls
Overconfidence in Scores: Reranker scores are not calibrated probabilities. A document with score 0.9 is not guaranteed to be relevant, and the same score can mean different things for different queries or domains, so scores are most reliable for ordering documents within a single query. Calibrate any cutoff against human-labeled examples rather than trusting a fixed value.
Reranking Incompatible Formats: If your retrieved documents are noisy (e.g., truncated, malformed), reranker quality degrades. Clean up retrieval quality first (article 4: hybrid retrieval, article 2: good chunking).
Forgetting Rerank Costs in Latency Budget: Reranking adds 20–50ms. In a latency budget of 100ms, this is significant. Profile your p95 latency; if it is critical, consider caching rerank results or using a faster model.
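One way to act on the caching suggestion is an in-memory cache keyed by the query and the set of candidate IDs. This is a sketch under several assumptions: the `RerankCache` class is hypothetical, each candidate has a string `id` field, the wrapped reranker is any callable taking `(query, candidates, top_n)`, and entries are never invalidated when documents change (a real deployment would need that, plus a shared store if you run more than one process).

```python
import hashlib
import json
from typing import Callable


class RerankCache:
    """Memoizes rerank results per (normalized query, candidate IDs, top_n)."""

    def __init__(self, rerank_fn: Callable[[str, list[dict], int], list[dict]],
                 max_entries: int = 10_000):
        self._rerank_fn = rerank_fn
        self._store: dict[str, list[dict]] = {}
        self._max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str, candidates: list[dict], top_n: int) -> str:
        # Candidate order does not matter for the key, so IDs are sorted
        payload = json.dumps([query.strip().lower(),
                              sorted(d["id"] for d in candidates), top_n])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def rerank(self, query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
        key = self._key(query, candidates, top_n)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._rerank_fn(query, candidates, top_n)
        if len(self._store) >= self._max_entries:
            # Evict the oldest inserted entry (dicts preserve insertion order)
            self._store.pop(next(iter(self._store)))
        self._store[key] = result
        return result


def slow_reranker(query: str, docs: list[dict], top_n: int) -> list[dict]:
    """Stand-in for an expensive reranker call (assumption for the demo)."""
    return sorted(docs, key=lambda d: d["id"])[:top_n]


cache = RerankCache(slow_reranker)
docs = [{"id": "b", "text": "..."}, {"id": "a", "text": "..."}]
cache.rerank("How do I reset my password?", docs, top_n=1)
cache.rerank("how do i reset my password?  ", list(reversed(docs)), top_n=1)
print(cache.hits, cache.misses)  # 1 1
```

Caching only pays off when the same query and candidate set recur, which is more common for support-style workloads with repeated questions than for open-ended search.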
Key Takeaways
- Reranking re-scores retrieved documents with fine-grained relevance labels, filtering noise before the LLM reads them.
- Cross-encoders (like Cohere's rerank) are more accurate than bi-encoders but require encoding each query-document pair.
- Always rerank in production for high-stakes QA; use threshold-based reranking to save costs in throughput-heavy systems.
- Reranking improves answer quality by 10–20% and reduces token consumption by filtering irrelevant documents.
- Monitor rerank score distributions to tune thresholds; miscalibrated thresholds lead to high variance in answer quality.
Frequently Asked Questions
Should I rerank all K results or only the top N?
Rerank all K. The reranker may find a highly relevant document ranked 15th by initial retrieval. Reranking only top-N risks missing it. The cost to rerank all 20 is minimal (0.1–0.3 cents); the quality gain is significant.
What is a good reranker score threshold?
Thresholds are domain-specific. For customer support, a threshold of 0.5–0.6 is typical: documents scoring above it are treated as relevant, and those below it are filtered out. Tune the threshold on your evaluation set (article 8) rather than adopting a fixed value.
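Tuning a threshold on an evaluation set can be as simple as a sweep over candidate values. The sketch below assumes you have (reranker score, human relevance label) pairs and picks the threshold that maximizes F1; the sample data, the 0.05-step grid, and the choice of F1 as the objective are all illustrative assumptions.

```python
def f1_at_threshold(scored: list[tuple[float, bool]], threshold: float) -> float:
    """F1 of the rule 'keep the document if score >= threshold'."""
    tp = sum(1 for score, relevant in scored if score >= threshold and relevant)
    fp = sum(1 for score, relevant in scored if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in scored if score < threshold and relevant)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored: list[tuple[float, bool]]) -> float:
    """Sweep thresholds 0.05, 0.10, ..., 0.95; ties go to the lowest value."""
    grid = [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: f1_at_threshold(scored, t))


# Hypothetical labeled sample: (reranker score, judged relevant?)
labeled = [
    (0.92, True), (0.81, True), (0.64, True), (0.58, False),
    (0.47, True), (0.35, False), (0.22, False), (0.10, False),
]

threshold = best_threshold(labeled)
print(threshold, round(f1_at_threshold(labeled, threshold), 3))  # 0.4 0.889
```

On this toy sample the sweep lands below the 0.5–0.6 range because one relevant document scores 0.47, which illustrates why the cutoff should come from your own labels. With only eight examples the result is noisy; a real evaluation set should be much larger.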
Can I rerank without calling an external API?
Yes. Open-source cross-encoders (the ms-marco cross-encoder family on HuggingFace) run locally with minimal setup. Throughput depends on your hardware: on CPU they can be noticeably slower per document than a hosted API, while on a GPU they are often competitive. They carry no per-query fee and keep data on-premise. Trade-off: speed vs. cost/privacy.
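A local reranker can keep the same shape as the API-based one if the scoring model is passed in as a function. In the sketch below, `rerank_locally` and `keyword_overlap_scorer` are hypothetical names, and the scorer is a dependency-free stub so the example runs anywhere. The commented lines show how a real model could be plugged in, assuming the `sentence-transformers` package is installed and the `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint is available.

```python
from typing import Callable, Sequence

PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank_locally(query: str, candidates: list[dict], score_pairs: PairScorer,
                   top_n: int = 5) -> list[dict]:
    """Score every (query, document text) pair and return the top_n documents."""
    pairs = [(query, doc["text"]) for doc in candidates]
    scores = score_pairs(pairs)  # one score per pair, in input order
    ranked = sorted(zip(candidates, scores), key=lambda p: float(p[1]), reverse=True)
    return [{**doc, "rerank_score": float(score)} for doc, score in ranked[:top_n]]


def keyword_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stub scorer (assumption): fraction of query terms found in the text."""
    scores = []
    for query, text in pairs:
        q_terms = set(query.lower().split())
        scores.append(len(q_terms & set(text.lower().split())) / max(len(q_terms), 1))
    return scores


candidates = [
    {"id": "1", "text": "configure asyncio event loop"},
    {"id": "2", "text": "python asyncio tutorial"},
    {"id": "3", "text": "cooking pasta"},
]
top = rerank_locally("python asyncio", candidates, keyword_overlap_scorer, top_n=2)
print([(d["id"], d["rerank_score"]) for d in top])  # [('2', 1.0), ('1', 0.5)]

# Plugging in a real cross-encoder (requires sentence-transformers):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top = rerank_locally("python asyncio", candidates, model.predict, top_n=2)
```

Depending on the model and library version, a local cross-encoder may return raw logits rather than values between 0 and 1, so a threshold tuned on API scores should not be reused without checking.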
Does reranking work with non-English queries?
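Cohere's rerank-english-v3.0 is English-specific. Use rerank-multilingual-v3.0 for non-English. Among open-source cross-encoders, many (including most ms-marco models) are trained on English data only, so for non-English queries pick a model trained on multilingual data, such as mmarco-mMiniLMv2-L12-H384-v1.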
How do I measure if reranking is helping my RAG system?
Compare answer quality with and without reranking on a labeled evaluation set (article 8). Metrics: exact match, F1, BLEU. If reranking improves your metric by >5%, it is worth the cost/latency.
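The with/without comparison can be scripted once you have gold answers and the two sets of system outputs. The sketch below implements exact match and token-level F1 with simple normalization; the helper names and the two-question evaluation set are hypothetical, and BLEU is omitted because it is best computed with an established library.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred), normalize(gold)
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: list[str], golds: list[str]) -> dict:
    n = len(golds)
    return {
        "exact_match": sum(exact_match(p, g) for p, g in zip(predictions, golds)) / n,
        "f1": sum(token_f1(p, g) for p, g in zip(predictions, golds)) / n,
    }


# Hypothetical evaluation set and system outputs
golds = ["use asyncio.run", "set top_n to 5"]
without_rerank = ["use threads", "set top_n to 5"]
with_rerank = ["use asyncio.run", "set top_n to 5"]

baseline = evaluate(without_rerank, golds)
reranked = evaluate(with_rerank, golds)
gain = (reranked["f1"] - baseline["f1"]) / baseline["f1"]
print(baseline, reranked, f"relative F1 gain: {gain:.0%}")
```

Two questions are far too few to draw conclusions; the point is the shape of the comparison. Run both configurations over the same questions with the same LLM settings so that the reranker is the only variable.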
Further Reading
- Cohere Rerank API Documentation — official API guide with benchmarks.
- Cross-Encoder Information Retrieval — foundational paper on cross-encoder ranking.
- Learning to Rank: From Theory to Practice — comprehensive survey of ranking models.
- MS MARCO Dataset and Leaderboard — standard benchmark for reranking models.