
Reranking and Relevance Scoring for RAG

After retrieval, you have a list of candidate documents that are semantically or lexically related to the query, but not all of them are equally useful for answering the user's question. A reranker is a specialized model that re-scores each retrieved document with a fine-grained relevance score, so you can filter out noise and pass only the best results to the LLM.

Reranking is one of the highest-impact optimizations in production RAG systems. Adding a reranker typically improves answer quality by 10–20% while reducing token consumption, because irrelevant documents are discarded, and generation latency, because the LLM has fewer tokens to process. The reranking call adds some latency of its own, so the net effect depends on your pipeline.

Why Reranking Matters

In a basic RAG pipeline, you retrieve the top-K results from your vector or keyword index and feed all of them to the LLM. If K=10, the LLM reads 10 documents, which may total 4,000–6,000 tokens. Often only 3–5 of those documents are actually relevant to the query; the rest introduce noise.

A reranker addresses this by computing a relevance score (usually 0–1, or a label such as "relevant", "partially relevant", or "irrelevant") for each document relative to the query. You then keep only the top-N documents by reranker score and pass those to the LLM. The result is fewer tokens, faster inference, and more accurate answers, because the LLM is less distracted by noise.

A 2024 benchmark by Cohere found that retrieval followed by reranking achieved 94% answer correctness on customer support QA, compared to 87% with retrieval alone (using the same LLM).
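The filtering step described in this section can be sketched in a few lines. This is a minimal illustration, not a full reranker: the function name `keep_top_n`, the optional `min_score` cutoff, and the scores themselves are all made up for the example, and the scores are assumed to come from whatever reranker you use.

```python
def keep_top_n(scored_docs, n=5, min_score=None):
    """Return the n highest-scoring documents, best first.

    scored_docs: list of (doc_id, relevance_score) pairs from any reranker.
    min_score:   optional cutoff; documents below it are dropped even if
                 fewer than n documents remain.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return ranked[:n]


# Hypothetical reranker scores for six retrieved chunks.
scored = [("a", 0.12), ("b", 0.91), ("c", 0.47), ("d", 0.88), ("e", 0.05), ("f", 0.63)]
top = keep_top_n(scored, n=3)
print([doc_id for doc_id, _ in top])  # ['b', 'd', 'f']
```

Only the three highest-scoring chunks reach the LLM; the other three are never added to the prompt.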

Reranking Models and Architectures

Two main architectures exist for reranking:

Bi-Encoders (aka "embedding models") encode the query and document independently, then compute their similarity. Fast (query encoded once, documents pre-encoded), but less precise because they cannot model query-document interactions.

Cross-Encoders encode the query and document together, allowing the model to see how they interact. Slower (must encode each query-document pair), but significantly more accurate. For most production systems, the accuracy gain justifies the latency cost.

| Model | Type | Latency per doc | Quality | Best For |
|---|---|---|---|---|
| Cohere rerank-english-v3.0 | Cross-encoder | 15–20ms | Excellent | Production, general domains |
| LLM-based (e.g., GPT-4o mini) | Cross-encoder | 50–100ms | Excellent but costly | High-stakes, limited result sets |
| mmarco-mMiniLMv2-L12-H384-v1 (HuggingFace) | Cross-encoder | 5–10ms | Good | Speed-critical, local deployment |
| Open-source ms-marco-cross-encoder | Cross-encoder | 10–15ms | Good | Cost-free, local |

For 2026 production systems, Cohere's rerank API is a common default because it balances quality and speed. For cost-sensitive systems, open-source cross-encoders can run locally with low latency.

Implementing a Reranking Pipeline

Here is a complete pipeline that retrieves candidates, reranks them, and passes the top results to an LLM:

from cohere import Client
from openai import OpenAI

cohere_client = Client(api_key="YOUR_COHERE_API_KEY")
openai_client = OpenAI()

def retrieve_candidates(query: str, retriever, k: int = 20) -> list[dict]:
    """Retrieve top-K candidates from your hybrid retriever (see article 4)."""
    # This calls your hybrid retriever and returns top-K results
    # Each result is a dict with keys: {"id", "text", "source", "score"}
    return retriever.search(query, k=k)

def rerank_results(query: str, candidates: list[dict], k_final: int = 5) -> list[dict]:
    """Rerank candidates using Cohere's cross-encoder; return top-K."""
    # Nothing to rerank; avoid sending an empty document list to the API
    if not candidates:
        return []

    # Extract texts for reranking
    texts = [doc["text"] for doc in candidates]

    # Call Cohere rerank API
    response = cohere_client.rerank(
        query=query,
        documents=texts,
        model="rerank-english-v3.0",
        top_n=k_final
    )

    # Map reranked results back to original documents
    # (result.index is the document's position in `texts`)
    reranked_docs = []
    for result in response.results:
        original_doc = candidates[result.index]
        reranked_docs.append({
            **original_doc,
            "rerank_score": result.relevance_score,
            "rerank_position": len(reranked_docs) + 1
        })

    return reranked_docs

def rag_pipeline(query: str, retriever, llm_model: str = "gpt-4o-mini") -> dict:
    """Full RAG pipeline: retrieve, rerank, prompt, generate."""

    # Step 1: Retrieve top-20 candidates
    candidates = retrieve_candidates(query, retriever, k=20)
    print(f"Retrieved {len(candidates)} candidates")

    # Step 2: Rerank and keep top-5
    top_results = rerank_results(query, candidates, k_final=5)
    print(f"After reranking: {len(top_results)} results")

    # Step 3: Format context for LLM
    context = "\n\n".join([
        f"Source: {doc['source']}\n{doc['text']}"
        for doc in top_results
    ])

    # Step 4: Prompt and generate
    system_prompt = (
        "You are a helpful assistant answering questions based on provided documents.\n"
        "Always cite your sources by including the document name in brackets, e.g., [source_name].\n"
        "If the provided documents do not contain the answer, say so clearly."
    )

    response = openai_client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        temperature=0.2
    )

    return {
        "query": query,
        "retrieved_count": len(candidates),
        "reranked_count": len(top_results),
        "answer": response.choices[0].message.content,
        # Unique sources in rank order (several chunks may share one source)
        "sources": list(dict.fromkeys(doc["source"] for doc in top_results))
    }

# Example usage (replace your_retriever with your own retriever instance)
query = "How do I configure async/await in Python?"
result = rag_pipeline(query, retriever=your_retriever)
print(f"\nAnswer: {result['answer']}")
print(f"Sources: {result['sources']}")

Output:

Retrieved 20 candidates
After reranking: 5 results

Answer: To configure async/await in Python, you use the `asyncio` module...
Sources: ['python-async-guide.md', 'asyncio-api-docs.md']

Reranking Strategies and Cost Trade-Offs

Reranking adds latency and cost. Cohere's rerank API bills per search request rather than per token (check the current pricing page for exact rates); for a 20-candidate list, a typical rerank costs roughly $0.001–0.003. The cost is small but multiplies across thousands of queries. Consider these strategies:

Strategy 1: Always Rerank (High Quality, Moderate Cost) Retrieve K candidates, rerank all, keep top-N. Best for high-stakes QA (legal, medical, support) where accuracy is critical.

Strategy 2: Threshold-Based Reranking (Balanced) Retrieve K candidates. If the top result's initial score (from hybrid retrieval) exceeds a threshold (e.g., 0.85), skip reranking; it is likely correct. Otherwise, rerank. This balances cost and quality.
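Strategy 2 can be sketched as a small gate in front of the reranker. This is a minimal sketch under stated assumptions: `maybe_rerank` and `fake_rerank` are hypothetical names, the retrieval `score` field is assumed to be normalized to 0–1 (raw hybrid or rank-fusion scores often are not), and `rerank_fn` is any callable with the same `(query, candidates, k_final)` shape as `rerank_results` in the pipeline above.

```python
def maybe_rerank(query, candidates, rerank_fn, skip_threshold=0.85, k_final=5):
    """Skip the reranker when the retriever already looks confident.

    Returns (documents, reranked_flag) so callers can log how often the
    reranker was actually invoked.
    """
    if not candidates:
        return [], False

    top_score = max(doc["score"] for doc in candidates)
    if top_score >= skip_threshold:
        # Trust the retriever's ordering and avoid the rerank call.
        ranked = sorted(candidates, key=lambda d: d["score"], reverse=True)
        return ranked[:k_final], False

    return rerank_fn(query, candidates, k_final), True


# Stand-in reranker for the demo: reverses the order so its effect is visible.
def fake_rerank(query, candidates, k_final):
    return list(reversed(candidates))[:k_final]


confident = [{"id": 1, "score": 0.93}, {"id": 2, "score": 0.40}]
uncertain = [{"id": 3, "score": 0.61}, {"id": 4, "score": 0.58}]

docs, used = maybe_rerank("q", confident, fake_rerank)
print(used)   # False
docs2, used2 = maybe_rerank("q", uncertain, fake_rerank)
print(used2)  # True
```

Returning the flag alongside the documents makes it easy to track the skip rate, which tells you how much cost the threshold is actually saving.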

Strategy 3: Two-Stage Reranking (High Throughput) Use a lightweight, fast reranker (e.g., ms-marco-cross-encoder) as a first pass to filter to 10 results, then use Cohere's premium reranker for final ranking. Costs less than reranking all K.
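A two-stage setup like Strategy 3 might look like the sketch below. The scorers are injected as plain callables so the example runs without any model: `overlap_score` and `premium_stub` are toy stand-ins (word overlap and a hand-assigned `gold` field), not real rerankers, and all names here are hypothetical.

```python
def two_stage_rerank(query, candidates, cheap_score, premium_rerank,
                     k_mid=10, k_final=5):
    """Filter with a cheap scorer, then rank the shortlist with a costly one.

    cheap_score(query, texts) -> list of floats, one per text.
    premium_rerank(query, docs, k_final) -> list of docs, best first.
    """
    cheap_scores = cheap_score(query, [doc["text"] for doc in candidates])
    order = sorted(range(len(candidates)),
                   key=lambda i: cheap_scores[i], reverse=True)
    shortlist = [candidates[i] for i in order[:k_mid]]
    return premium_rerank(query, shortlist, k_final)


# Toy first stage: number of query words that appear in the text.
def overlap_score(query, texts):
    q = set(query.lower().split())
    return [len(q & set(t.lower().split())) for t in texts]


# Toy second stage: ranks by a hand-assigned "gold" relevance value.
def premium_stub(query, docs, k_final):
    return sorted(docs, key=lambda d: d["gold"], reverse=True)[:k_final]


docs = [
    {"id": "a", "text": "python asyncio event loop", "gold": 0.9},
    {"id": "b", "text": "cooking pasta at home", "gold": 0.1},
    {"id": "c", "text": "python async await syntax", "gold": 0.8},
    {"id": "d", "text": "java threads", "gold": 0.95},
]
result = two_stage_rerank("python async await", docs, overlap_score,
                          premium_stub, k_mid=2, k_final=1)
print([d["id"] for d in result])  # ['a']
```

Document `d` has the highest second-stage value but never reaches the second stage, because the cheap scorer filtered it out. That is the trade-off of this strategy: the first pass caps the recall of the second.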

Strategy 4: LLM-Based Reranking (Premium Quality) Use an LLM (e.g., GPT-4o mini) as the reranker: prompt it to score each result on relevance. Highest quality but slowest and most expensive. Reserve for high-value queries.

Common Reranking Pitfalls

Overconfidence in Scores: Reranker scores are not calibrated probabilities. A cross-encoder scores each query-document pair on its own, so a score of 0.9 does not guarantee relevance, and the same score can mean different things across queries and domains. Use scores mainly to order documents within one query's results, and pair reranking with human feedback to calibrate thresholds.

Reranking Incompatible Formats: If your retrieved documents are noisy (e.g., truncated, malformed), reranker quality degrades. Clean up retrieval quality first (article 4: hybrid retrieval, article 2: good chunking).

Forgetting Rerank Costs in Latency Budget: Reranking adds 20–50ms. In a 100ms latency budget, that is significant. Profile your p95 latency; if the budget is tight, consider caching rerank results or using a faster model.
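Caching rerank results, as suggested above, could be sketched like this. It is a minimal in-memory illustration with no eviction or expiry: `RerankCache` and `stub_rerank` are hypothetical names, and the cache key assumes that each candidate has a stable `id` and that lowercasing and trimming the query is an acceptable normalization for your traffic.

```python
import hashlib


class RerankCache:
    """Memoize rerank calls keyed on query text, candidate ids, and k_final."""

    def __init__(self, rerank_fn):
        self._rerank_fn = rerank_fn
        self._store = {}   # unbounded dict; use an LRU or TTL cache in practice
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query, candidates, k_final):
        ids = ",".join(str(doc["id"]) for doc in candidates)
        raw = f"{query.strip().lower()}|{ids}|{k_final}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def rerank(self, query, candidates, k_final=5):
        key = self._key(query, candidates, k_final)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._rerank_fn(query, candidates, k_final)
        self._store[key] = result
        return result


# Stand-in reranker that records how often it is really called.
calls = []
def stub_rerank(query, candidates, k_final):
    calls.append(query)
    return candidates[:k_final]


cache = RerankCache(stub_rerank)
docs = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
cache.rerank("What is asyncio?", docs, k_final=1)
cache.rerank("what is asyncio? ", docs, k_final=1)  # same key after normalization
print(cache.hits, cache.misses, len(calls))  # 1 1 1
```

The key includes the candidate ids because the same query can retrieve different candidates after the index is updated, and a stale ranking should not be served in that case.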

Key Takeaways

  • Reranking re-scores retrieved documents with fine-grained relevance scores, filtering noise before the LLM reads them.
  • Cross-encoders (like Cohere's rerank) are more accurate than bi-encoders but require encoding each query-document pair.
  • Always rerank in production for high-stakes QA; use threshold-based reranking to save costs in throughput-heavy systems.
  • Reranking improves answer quality by 10–20% and reduces token consumption by filtering irrelevant documents.
  • Monitor rerank score distributions to tune thresholds; miscalibrated thresholds lead to high variance in answer quality.

Frequently Asked Questions

Should I rerank all K results or only the top N?

Rerank all K. The reranker may find a highly relevant document ranked 15th by initial retrieval. Reranking only top-N risks missing it. The cost to rerank all 20 is minimal (0.1–0.3 cents); the quality gain is significant.
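To see how that per-query cost adds up at volume, here is a back-of-the-envelope calculation. The traffic figure of 50,000 queries per day is an arbitrary assumption for illustration, and the per-rerank costs are the article's $0.001–0.003 estimate rather than a quoted price.

```python
def monthly_rerank_cost(queries_per_day, cost_per_rerank_usd, days=30):
    """Estimated monthly spend if every query triggers one rerank call."""
    return queries_per_day * days * cost_per_rerank_usd


low = monthly_rerank_cost(50_000, 0.001)
high = monthly_rerank_cost(50_000, 0.003)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $1,500 to $4,500 per month
```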

What is a good reranker score threshold?

Thresholds are domain-specific. For customer support, a threshold of 0.5–0.6 is typical: documents scoring above it are treated as "relevant", while those below it are "maybe relevant" and filtered out. Test on your evaluation set (article 8) to find the value that works best for your data.
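One way to pick a threshold from a labeled evaluation set is to sweep a few candidate values and keep the one with the best F1 on the keep/drop decision. The sketch below assumes you have (rerank score, human relevance label) pairs; the data, the candidate grid, and the choice of F1 as the objective are all illustrative assumptions.

```python
def f1_at_threshold(scored, threshold):
    """F1 of 'keep if score >= threshold' against human relevance labels.

    scored: list of (rerank_score, is_relevant) pairs.
    """
    tp = sum(1 for s, rel in scored if s >= threshold and rel)
    fp = sum(1 for s, rel in scored if s >= threshold and not rel)
    fn = sum(1 for s, rel in scored if s < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored, grid=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return the grid value with the highest F1 (first one wins on ties)."""
    return max(grid, key=lambda t: f1_at_threshold(scored, t))


# Hypothetical labeled sample: (rerank score, judged relevant?)
labeled = [
    (0.92, True), (0.81, True), (0.64, False), (0.55, True),
    (0.48, False), (0.35, False), (0.22, False),
]
print(best_threshold(labeled))  # 0.5
```

If missing a relevant document is costlier than passing an irrelevant one, a recall-weighted objective may suit you better than plain F1.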

Can I rerank without calling an external API?

Yes. Open-source cross-encoders (for example, the ms-marco cross-encoder family on HuggingFace) run locally with minimal setup. Per-document speed depends on your hardware: on a GPU they can match or beat API latency, while on CPU they can be several times slower. They cost nothing per call and keep data on-premise. Trade-off: speed vs. cost/privacy.
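A local reranker could be wired up roughly as follows. The sketch keeps the scoring function pluggable, so the ranking logic can be exercised without downloading a model. `rerank_with_scorer` and `load_local_scorer` are hypothetical helper names; the loader assumes the `sentence-transformers` package and its `CrossEncoder` class, and the model id is one commonly used ms-marco checkpoint, named here as an example rather than a recommendation.

```python
def rerank_with_scorer(query, candidates, score_pairs, k_final=5):
    """Rank candidates with any callable that scores (query, text) pairs.

    score_pairs(pairs) must return one numeric score per pair, in order.
    """
    pairs = [(query, doc["text"]) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores),
                    key=lambda item: float(item[1]), reverse=True)
    return [{**doc, "rerank_score": float(score)}
            for doc, score in ranked[:k_final]]


def load_local_scorer(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Load a local cross-encoder and return its batch scoring function.

    Imported lazily so the rest of the module works without the package.
    """
    from sentence_transformers import CrossEncoder
    model = CrossEncoder(model_name)
    return model.predict


# Usage (requires `pip install sentence-transformers` and a model download):
# scorer = load_local_scorer()
# top = rerank_with_scorer("How do I use asyncio?", candidates, scorer)
```

Depending on the model and library version, the scores may be unbounded logits rather than values in 0–1, so a threshold tuned for an API reranker should not be reused without re-checking it.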

Does reranking work with non-English queries?

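Cohere's rerank-english-v3.0 is English-specific. Use rerank-multilingual-v3.0 for non-English. Open-source cross-encoders vary: the ms-marco family is trained on English data, while models such as mmarco-mMiniLMv2-L12-H384-v1 are multilingual, so check the model card before deploying.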

How do I measure if reranking is helping my RAG system?

Compare answer quality with and without reranking on a labeled evaluation set (article 8). Metrics: exact match, F1, BLEU. If reranking improves your metric by >5%, it is worth the cost/latency.
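The comparison can be scripted with simple answer-level metrics. The sketch below implements exact match and token-level F1 with whitespace tokenization, a deliberately crude normalization assumed for brevity. The gold answers and the two sets of predictions are invented for illustration.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplification)."""
    return text.lower().split()


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between two answers."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def mean_metric(metric, preds, golds):
    return sum(metric(p, g) for p, g in zip(preds, golds)) / len(golds)


# Hypothetical evaluation set with answers from both pipeline variants.
golds = ["use asyncio.run", "install with pip"]
baseline = ["use threads", "install with pip"]      # retrieval only
reranked = ["use asyncio.run", "install with pip"]  # retrieval + rerank

print(f"F1 without rerank: {mean_metric(token_f1, baseline, golds):.2f}")  # 0.75
print(f"F1 with rerank:    {mean_metric(token_f1, reranked, golds):.2f}")  # 1.00
```

Run both variants on the same questions with the same LLM settings, so that any difference in the metric can be attributed to the reranker.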

Further Reading