Cross-Encoder Reranking: Pairwise Relevance Scoring

Cross-encoder reranking is a neural retrieval method that scores each query-document pair directly for relevance. It takes the candidates returned by BM25 or dense retrieval and reorders them so that the most relevant ones come first. Unlike dense retrievers, which encode queries and documents independently into the same vector space, a cross-encoder receives the full query-document pair as input and outputs a single relevance score. This joint encoding lets the model attend to every interaction between query tokens and document tokens, capturing subtle relevance signals that single-vector methods miss. Reranking the top 50–100 candidates per query typically improves ranking accuracy by 5–15% over fusion-only hybrid retrieval, at a cost of under 500 ms of added latency. In RAG systems where answer quality is paramount, cross-encoder reranking is the final stage of a multi-stage ranking pipeline, ensuring the LLM receives the most relevant evidence to ground its answer.

Cross-Encoder vs. Bi-Encoder: Architectural Differences

Bi-Encoder (Dense Retrieval):

  • Separately encodes query and document into independent vectors.
  • Similarity is computed as geometric distance (e.g., cosine similarity) in vector space.
  • Supports fast approximate nearest neighbor search (ANN).
  • Efficient for large-scale retrieval (millions of documents) but coarse-grained ranking.
  • Example: all-MiniLM-L6-v2, text-embedding-3-small.

Cross-Encoder:

  • Concatenates query and document and passes them through a shared transformer.
  • Outputs a single relevance score per pair via a classifier head.
  • Requires computing scores for every candidate document (no ANN).
  • Slower but more accurate for re-ranking a fixed candidate set (50–100 documents).
  • Example: cross-encoder/ms-marco-MiniLM-L-6-v2, cross-encoder/qnli-distilroberta-base.

Workflow: Dense retrieval (bi-encoder) efficiently retrieves top-100 candidates → cross-encoder reranks top-100 to extract top-5 for LLM context.
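The two-stage workflow can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: `cosine`, `retrieve_then_rerank`, and `overlap_scorer` are hypothetical names, the embeddings are hand-written 2-d vectors, and `pair_scorer` stands in for a real cross-encoder call.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two independently computed vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    doc_texts: Dict[str, str],
    pair_scorer: Callable[[str, str], float],
    k_retrieve: int = 100,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1 (bi-encoder): rank by vector similarity only.
    by_similarity = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )
    candidates = by_similarity[:k_retrieve]
    # Stage 2 (cross-encoder): score each (query, document) pair jointly.
    scored = [(d, pair_scorer(query, doc_texts[d])) for d in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k_final]


def overlap_scorer(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)


# Toy corpus with hand-written vectors (assumptions for illustration only).
doc_texts = {
    "a": "attention weights input tokens",
    "b": "recurrent networks process tokens sequentially",
    "c": "transformer attention explained",
}
doc_vecs = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.7, 0.3]}

top = retrieve_then_rerank(
    "transformer attention", [1.0, 0.0], doc_vecs, doc_texts, overlap_scorer,
    k_retrieve=2, k_final=1,
)
```

In this toy data the vector stage places "a" ahead of "c", and the pairwise stage then promotes "c" because it matches both query words. That reordering is the effect reranking is meant to deliver.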

Cross-Encoder Models: Selection and Trade-offs

Popular open-source cross-encoder models and benchmarks:

| Model | Size | Speed (pairs/sec) | Accuracy | Use Case |
|---|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 33M | 500 | 8.1/10 | General-purpose, balanced |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 14M | 2,000 | 7.6/10 | Low-latency, mobile |
| cross-encoder/ms-marco-ELECTRA-Base | 110M | 100 | 8.7/10 | High accuracy, if latency permits |
| cross-encoder/qnli-distilroberta-base | 82M | 200 | 8.4/10 | QA-optimized |

For most RAG systems, cross-encoder/ms-marco-MiniLM-L-6-v2 is the default: 33M parameters, 500 document pairs scored per second on a single GPU, and strong accuracy. For mobile/serverless with <50ms latency budgets, use TinyBERT. For high-stakes applications (medical, legal), upgrade to ELECTRA-Base if 10ms latency per document is acceptable.
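The selection rule above can be expressed as a small helper. The sketch below assumes the throughput and accuracy figures in the table hold for your hardware; `MODELS`, `rerank_latency_ms`, and `pick_model` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Throughput (pairs/sec) and accuracy ratings copied from the table above.
MODELS = {
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": {"pairs_per_sec": 2000, "accuracy": 7.6},
    "cross-encoder/ms-marco-MiniLM-L-6-v2": {"pairs_per_sec": 500, "accuracy": 8.1},
    "cross-encoder/qnli-distilroberta-base": {"pairs_per_sec": 200, "accuracy": 8.4},
    "cross-encoder/ms-marco-ELECTRA-Base": {"pairs_per_sec": 100, "accuracy": 8.7},
}


def rerank_latency_ms(model_name: str, num_candidates: int) -> float:
    """Estimated time to score num_candidates pairs with the given model."""
    return 1000.0 * num_candidates / MODELS[model_name]["pairs_per_sec"]


def pick_model(num_candidates: int, budget_ms: float) -> Optional[str]:
    """Most accurate model whose estimated rerank time fits the budget, else None."""
    fitting = [
        name for name in MODELS
        if rerank_latency_ms(name, num_candidates) <= budget_ms
    ]
    if not fitting:
        return None
    return max(fitting, key=lambda name: MODELS[name]["accuracy"])


choice = pick_model(num_candidates=50, budget_ms=150)
```

Measure throughput on your own hardware before relying on the result, since the table values are approximate and depend on GPU, batch size, and document length.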

How Cross-Encoder Reranking Works

A cross-encoder processes the input as:

[CLS] <query tokens> [SEP] <document tokens> [SEP]

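The model passes this token sequence through its transformer layers (2 to 12 for the models listed above), takes the final representation of the [CLS] token, and feeds it to a classification head (typically a single linear layer). The head produces a raw logit, which is unbounded. Applying a sigmoid maps it to a score in [0, 1]. Whether you see logits or probabilities depends on the output activation configured for the model.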

For a query "What is transformer attention mechanism?" and document "Transformer models use attention to weight input tokens", the cross-encoder:

  1. Tokenizes both into subword tokens.
  2. Concatenates with special tokens: [CLS] what is transformer attention mechanism ? [SEP] transformer models use attention to weight input tokens [SEP]
  3. Forward-passes through transformer layers, allowing attention heads to compute query-document interactions.
  4. Outputs relevance score (e.g., 0.92), indicating high relevance.

This contrasts with dense retrieval, where the query and document are encoded in isolation and later compared geometrically. Cross-encoders' joint encoding captures interactions like token co-occurrence, syntactic alignment, and semantic coherence within the pair.
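The input layout and the score activation can be made concrete with a small sketch. It assumes BERT-style segment ids and uses whitespace splitting as a stand-in for real subword tokenization; `build_pair_input` and `sigmoid` are hypothetical helpers, not part of any library.

```python
import math
from typing import List, Tuple


def build_pair_input(query: str, document: str) -> Tuple[List[str], List[int]]:
    """Builds [CLS] query [SEP] document [SEP] plus one segment id per token."""
    q_tokens = query.lower().split()  # whitespace split stands in for subword tokenization
    d_tokens = document.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids


def sigmoid(logit: float) -> float:
    """Maps an unbounded logit from the classification head to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


tokens, segment_ids = build_pair_input(
    "what is attention", "attention weights input tokens"
)
```

Because both segments sit in one sequence, every self-attention layer can relate any query token to any document token, which is what a bi-encoder cannot do.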

Implementation: Reranking a Hybrid Retrieval Result

from sentence_transformers import CrossEncoder

# Initialize cross-encoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Hybrid-fused document candidates (from RRF)
candidates = [
    ('doc_1', 'Transformer attention mechanism is a core component of modern NLP.'),
    ('doc_2', 'The attention mechanism in transformer networks scales quadratically.'),
    ('doc_3', 'BERT uses transformer architecture for natural language processing.'),
    ('doc_4', 'Recurrent networks are an alternative to transformer-based models.'),
    ('doc_5', 'Attention is a key concept in machine learning broadly.'),
]

query = "What is transformer attention mechanism?"

# Prepare pairs: [(query, document1), (query, document2), ...]
pairs = [(query, doc_text) for _, doc_text in candidates]

# Score all pairs; returns a NumPy array of shape [num_candidates]
scores = model.predict(pairs)
# Illustrative values only: scores = array([0.92, 0.87, 0.85, 0.62, 0.45])
# Depending on the model's configured activation, real scores may be
# unbounded logits rather than values in [0, 1].

# Rerank: create tuples of (doc_id, original_text, rerank_score)
reranked = sorted(
    [(doc_id, text, float(score)) for (doc_id, text), score in zip(candidates, scores)],
    key=lambda x: x[2],
    reverse=True,
)

print("Reranked results:")
for rank, (doc_id, text, score) in enumerate(reranked, 1):
    print(f"{rank}. Score: {score:.3f}, Doc: {text[:60]}...")

# Example output (scores illustrative, text abbreviated):
# 1. Score: 0.920, Doc: Transformer attention mechanism is a core...
# 2. Score: 0.870, Doc: The attention mechanism in transformer...
# 3. Score: 0.850, Doc: BERT uses transformer architecture...
# 4. Score: 0.620, Doc: Recurrent networks are an alternative...
# 5. Score: 0.450, Doc: Attention is a key concept...

In this illustrative output, the cross-encoder ranks doc_1 (exact keyword match plus semantic alignment) highest, doc_2 (keyword match but less direct) second, and doc_5 (semantically related but vague, lacking "transformer") lowest.

Computational Cost and Latency Analysis

Reranking N documents with a cross-encoder costs:

Latency: ~2–4 ms per document pair (including tokenization and forward pass). Reranking 50 documents = 100–200 ms. Reranking 100 documents = 200–400 ms.

Memory: ~500 MB GPU VRAM for MiniLM variant in FP32; scales with batch size. Typical batch size (32–64) uses <2 GB VRAM.

Total RAG latency example:

  • BM25 retrieval: 10 ms
  • Dense retrieval: 50 ms
  • Fuse (RRF): 5 ms
  • Rerank top-50: 150 ms
  • Total: ~215 ms (acceptable for interactive applications)

If latency is critical (<100 ms), rerank only top-20 (skip the bottom-30 of the fused list) with minimal accuracy cost, or use a smaller cross-encoder model (TinyBERT: ~0.5 ms per pair).
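The budget arithmetic can be turned into a quick calculation of how many candidates fit. This sketch assumes the stages run sequentially with the per-stage timings from the example above; `rerank_depth_for_budget` and `FIXED_MS` are hypothetical names.

```python
import math


def rerank_depth_for_budget(budget_ms: float, fixed_ms: float, ms_per_pair: float) -> int:
    """Largest number of candidates that can be reranked within the end-to-end budget."""
    remaining = budget_ms - fixed_ms
    if remaining <= 0:
        return 0
    return math.floor(remaining / ms_per_pair)


# Fixed stages from the example above: BM25 (10) + dense (50) + RRF (5) = 65 ms.
FIXED_MS = 10 + 50 + 5

depth_interactive = rerank_depth_for_budget(215, FIXED_MS, ms_per_pair=3.0)
depth_tight_minilm = rerank_depth_for_budget(100, FIXED_MS, ms_per_pair=3.0)
depth_tight_tinybert = rerank_depth_for_budget(100, FIXED_MS, ms_per_pair=0.5)
```

Under these assumptions a 100 ms budget leaves only 35 ms for reranking, so the smaller model (or running BM25 and dense retrieval concurrently to shrink the fixed cost) matters more than trimming a few candidates.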

When to Apply Reranking

Reranking is most valuable when:

  1. Top-k is large (50+): Fusion alone produces a long ranked list, and reranking can reorder it substantially.
  2. The query is ambiguous: Multi-faceted queries (e.g., "impact of AI on society") benefit from neural reranking's ability to capture subtle semantic alignment.
  3. Answer quality is critical: In legal, medical, or financial RAG systems, a wrong document risks serious errors.
  4. The latency budget allows it (>200 ms): If you need <100 ms end-to-end, rerank only the top 10 or skip reranking.

Reranking provides diminishing returns on pre-ranked lists of <10 documents, especially if the pre-ranker (fusion) is already strong.

Fine-Tuning Cross-Encoders for Your Domain

If you have labeled relevance data (queries with human-annotated relevant/irrelevant documents), you can fine-tune a cross-encoder to your domain, improving accuracy by 5–10%.

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Training data: one InputExample per (query, document) pair with a relevance label in [0, 1]
train_samples = [
    InputExample(texts=["What is NLP?", "NLP is a subfield of AI..."], label=1.0),
    InputExample(texts=["What is NLP?", "Cooking is an art form..."], label=0.0),
    InputExample(texts=["Transformer attention?", "Attention mechanisms..."], label=0.95),
    # ... more examples
]

# Load pre-trained model (one output label, so it predicts a single relevance score)
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)

# Create training dataloader
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# Fine-tune with CrossEncoder.fit, which takes the dataloader directly.
# With one output label the default loss is binary cross-entropy on the logit;
# bi-encoder losses such as CosineSimilarityLoss do not apply to a cross-encoder.
# Raise `epochs` for more passes over the data.
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True,
)

# Save fine-tuned model
model.save('cross-encoder/my-domain-reranker')

For a production domain-specific RAG system with access to labeled data, fine-tuning a cross-encoder is a high-ROI investment: 100–500 labeled pairs can be enough for a meaningful improvement.
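Before fine-tuning, the labeled judgments have to be flattened into (query, document, label) triples and split so evaluation uses unseen queries. The sketch below is one way to do that. The `Judgments` layout and the helper names `build_pairs` and `split_by_query` are assumptions for illustration, not part of any library.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical judgment format: query -> {document text: relevance label in [0, 1]}.
Judgments = Dict[str, Dict[str, float]]
Triple = Tuple[str, str, float]


def build_pairs(judgments: Judgments) -> List[Triple]:
    """Flattens judgments into (query, document, label) triples."""
    return [
        (query, doc, float(label))
        for query, docs in judgments.items()
        for doc, label in docs.items()
    ]


def split_by_query(
    judgments: Judgments, dev_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Triple], List[Triple]]:
    """Holds out whole queries so no query appears in both train and dev."""
    queries = sorted(judgments)
    random.Random(seed).shuffle(queries)
    n_dev = max(1, int(len(queries) * dev_fraction))
    dev_queries, train_queries = queries[:n_dev], queries[n_dev:]
    train = build_pairs({q: judgments[q] for q in train_queries})
    dev = build_pairs({q: judgments[q] for q in dev_queries})
    return train, dev


# Toy judgments: five queries, one relevant and one irrelevant document each.
judgments = {
    f"query {i}": {f"relevant doc {i}": 1.0, f"irrelevant doc {i}": 0.0}
    for i in range(5)
}
train_triples, dev_triples = split_by_query(judgments)
```

Splitting by query rather than by pair avoids leaking a query's documents into the held-out set, which would overstate the gain from fine-tuning.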

Multi-Stage Reranking: Coarse-to-Fine

For very large candidate sets (500+ documents), apply reranking in multiple stages to control latency:

  1. Stage 1 (coarse): Rerank top-500 with a small, fast cross-encoder (TinyBERT, ~2,000 pairs/sec) → extract top-100.
  2. Stage 2 (fine): Rerank top-100 with a larger cross-encoder (MiniLM, ~500 pairs/sec) → extract top-10.
  3. Stage 3 (LLM context): Assemble top-10 as context for LLM.

Total latency: ~450–500 ms (about 250 ms for stage 1 plus 200 ms for stage 2), which is acceptable for RAG, with minimal accuracy loss versus single-stage reranking of all 500.
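A coarse-to-fine cascade reduces to a loop over (scorer, keep) stages. The sketch below uses toy scoring functions in place of real cross-encoders and takes its throughput figures from the model table earlier in the article; `cascade_rerank` and `estimated_latency_ms` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def cascade_rerank(
    query: str, docs: Sequence[str], stages: Sequence[Tuple[Scorer, int]]
) -> List[str]:
    """Applies each (scorer, keep) stage in order, shrinking the candidate list."""
    candidates = list(docs)
    for scorer, keep in stages:
        ranked = sorted(candidates, key=lambda d: scorer(query, d), reverse=True)
        candidates = ranked[:keep]
    return candidates


def estimated_latency_ms(num_docs: int, stages: Sequence[Tuple[float, int]]) -> float:
    """stages holds (pairs_per_sec, keep); each stage scores what the previous one kept."""
    total, remaining = 0.0, num_docs
    for pairs_per_sec, keep in stages:
        total += 1000.0 * remaining / pairs_per_sec
        remaining = min(remaining, keep)
    return total


# Throughput assumptions: TinyBERT ~2,000 pairs/sec, MiniLM ~500 pairs/sec.
latency = estimated_latency_ms(500, [(2000, 100), (500, 10)])


def coarse(query: str, doc: str) -> float:
    """Toy coarse scorer: counts shared words."""
    return float(len(set(query.split()) & set(doc.split())))


def fine(query: str, doc: str) -> float:
    """Toy fine scorer: shared words, penalized by document length."""
    return coarse(query, doc) / max(1, len(doc.split()))


docs = ["attention", "transformer attention", "transformer attention in depth", "cooking"]
final = cascade_rerank("transformer attention", docs, [(coarse, 3), (fine, 1)])
```

The cheap stage only has to avoid discarding good candidates, so it can be far less accurate than the final stage without hurting the end result much.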

Reranking Integration in LlamaIndex

# Import paths below match llama-index releases before 0.10. From 0.10 on, the same
# classes are exposed under llama_index.core; check the docs for your installed version.
from llama_index.indices import VectorStoreIndex
from llama_index.postprocessor import SentenceTransformerRerank

# Create reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,  # Keep top-5 after reranking
)

# Use in retrieval pipeline
# `documents` is assumed to be a list of Document objects loaded elsewhere
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=50)

# Rerank retrieved documents
query = "What is transformer attention?"
nodes = retriever.retrieve(query)
reranked_nodes = reranker.postprocess_nodes(nodes, query_str=query)

# Now reranked_nodes are sorted by cross-encoder relevance
for node in reranked_nodes:
    print(f"Score: {node.score:.3f}, Text: {node.text[:50]}...")

Key Takeaways

  • Cross-encoders score document-query pairs jointly, capturing interactions that bi-encoders (independent encoding) miss.
  • Cross-encoder reranking improves RAG answer accuracy by 5–15% by surfacing the most relevant documents from a fused candidate set.
  • cross-encoder/ms-marco-MiniLM-L-6-v2 is the recommended general-purpose model: small (33M params), fast (500 pairs/sec), and accurate.
  • Reranking 50 documents adds ~150 ms latency, acceptable for most interactive RAG applications.
  • Fine-tuning a cross-encoder on domain-specific labeled data (100+ relevance pairs) improves accuracy another 5–10% but requires data collection effort.

Frequently Asked Questions

Should I rerank all retrieved documents or just top-k?

Rerank top-50 to top-100 (the BM25 + dense fusion candidates). Reranking all 1,000+ documents is expensive (>1 sec) with minimal gain. Reranking <10 documents is wasteful (fusion is already strong). Sweet spot: top-50 candidates with ~150 ms latency.

What if I only have dense retrieval (no BM25)?

Rerank the top-100 from dense retrieval. Cross-encoder reranking is especially valuable post-dense retrieval because dense retrieval often ranks semantically similar but factually less relevant documents highly. Hybrid + reranking is optimal, but dense + reranking alone beats dense-only by 5–10%.

Can I use an LLM as a reranker instead of a cross-encoder?

Yes, but it is expensive. An LLM prompt like "Rate relevance of this document to the query 0–10" works but costs thousands of times more than a cross-encoder (API call per document, vs. batched GPU forward pass). Use cross-encoders for reranking; reserve LLM calls for final generation.

How do I handle very long documents with cross-encoders?

Cross-encoders have a max token limit (typically 512). For documents longer than 512 tokens, split them into overlapping chunks (e.g., 400-token windows with 100-token overlap), score each chunk, and take the max score for the document. Or, summarize documents to <512 tokens before reranking.
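The chunk-and-max strategy can be sketched as follows. Whitespace tokens only approximate the model's subword tokens, so in practice the window should be measured with the model's own tokenizer. `sliding_windows`, `score_long_document`, and `pair_scorer` are hypothetical names, with `pair_scorer` standing in for a cross-encoder call.

```python
from typing import Callable, List, Sequence


def sliding_windows(
    tokens: Sequence[str], window: int = 400, overlap: int = 100
) -> List[List[str]]:
    """Splits tokens into windows of `window` tokens that share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks


def score_long_document(
    query: str,
    document: str,
    pair_scorer: Callable[[str, str], float],
    window: int = 400,
    overlap: int = 100,
) -> float:
    """Scores every chunk against the query and returns the best chunk score."""
    tokens = document.split()  # whitespace tokens approximate subword tokens
    chunks = sliding_windows(tokens, window, overlap) or [[]]
    return max(pair_scorer(query, " ".join(chunk)) for chunk in chunks)


def toy_scorer(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: 1.0 if the query word appears in the chunk."""
    return 1.0 if query in chunk.split() else 0.0


long_doc = "hay " * 900 + "needle"
best = score_long_document("needle", long_doc, toy_scorer)
```

Taking the maximum treats a document as relevant if any one passage is relevant, which suits RAG because only the best passage usually needs to reach the LLM.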

What is the difference between cross-encoder and ColBERT?

ColBERT is a late-interaction retriever: it encodes documents and queries independently (like a bi-encoder) but computes relevance as fine-grained token-level interactions rather than single-vector similarity. It is faster than cross-encoders (supports approximate search) but more complex. For most RAG systems, cross-encoders are simpler and sufficient.
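Late interaction is easier to see in code. The sketch below implements the MaxSim scoring rule on hand-written token vectors: for each query token, take its best-matching document token and sum those maxima. The vectors and the names `dot` and `maxsim_score` are assumptions for illustration; a real ColBERT model produces the per-token embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two token embeddings."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sums, over query tokens, the similarity to the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query tokens pointing in different directions.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_covers_both = [[1.0, 0.0], [0.0, 1.0]]
doc_covers_one = [[1.0, 0.0], [1.0, 0.0]]

score_both = maxsim_score(query_vecs, doc_covers_both)
score_one = maxsim_score(query_vecs, doc_covers_one)
```

Document vectors can be computed and indexed ahead of time, as with a bi-encoder, while the token-level maximum recovers some of the fine-grained matching that a cross-encoder gets from joint encoding.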

Further Reading