Cross-Encoder Reranking: Pairwise Relevance Scoring
Cross-encoder reranking is a neural retrieval method that scores query-document pairs directly for relevance: it takes the documents retrieved by BM25 or dense retrieval and re-ranks them to surface the most relevant ones. Unlike dense retrievers (which encode queries and documents independently into the same vector space), a cross-encoder receives the full query-document pair as input and outputs a single relevance score. This pairwise architecture lets the model attend to every interaction between query tokens and document tokens, capturing subtle relevance signals that single-vector methods miss. Cross-encoder reranking typically adds a 5–15% accuracy improvement over fusion-only hybrid retrieval, at the computational cost of scoring 50–100 documents per query (under 500 ms of added latency). For RAG systems where answer quality is paramount, cross-encoder reranking is the final stage of multi-stage ranking, ensuring the LLM receives the most relevant grounding evidence.
Cross-Encoder vs. Bi-Encoder: Architectural Differences
Bi-Encoder (Dense Retrieval):
- Separately encodes query and document into independent vectors.
- Similarity is computed as geometric distance (e.g., cosine similarity) in vector space.
- Supports fast approximate nearest neighbor search (ANN).
- Efficient for large-scale retrieval (millions of documents), but the resulting ranking is coarse-grained.
- Example: all-MiniLM-L6-v2, text-embedding-3-small.
Cross-Encoder:
- Concatenates query and document and passes them through a shared transformer.
- Outputs a single relevance score per pair via a classifier head.
- Requires computing scores for every candidate document (no ANN).
- Slower but more accurate for re-ranking a fixed candidate set (50–100 documents).
- Example: cross-encoder/ms-marco-MiniLM-L-6-v2, cross-encoder/qnli-distilroberta-base.
Workflow: Dense retrieval (bi-encoder) efficiently retrieves top-100 candidates → cross-encoder reranks top-100 to extract top-5 for LLM context.
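The retrieve-then-rerank workflow can be sketched with two pluggable scoring functions. This is a minimal illustration, not a production pipeline: `retrieve_then_rerank`, `token_overlap`, and `overlap_ratio` are hypothetical names, and the two toy scorers merely stand in for a bi-encoder (cheap, applied to the whole corpus) and a cross-encoder (expensive, applied only to the survivors).

```python
from typing import Callable, List, Tuple

Doc = Tuple[str, str]  # (doc_id, text)


def retrieve_then_rerank(
    query: str,
    corpus: List[Doc],
    cheap_score: Callable[[str, str], float],
    pair_score: Callable[[str, str], float],
    retrieve_k: int = 100,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps retrieve_k docs by cheap_score; stage 2 re-scores them with pair_score."""
    # Stage 1: stand-in for ANN or BM25 retrieval over the full corpus.
    candidates = sorted(
        corpus, key=lambda d: cheap_score(query, d[1]), reverse=True
    )[:retrieve_k]
    # Stage 2: the expensive pairwise scorer only sees the candidates.
    rescored = [(doc_id, pair_score(query, text)) for doc_id, text in candidates]
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:final_k]


# Toy scorers for demonstration only (assumptions, not real models).
def token_overlap(query: str, text: str) -> float:
    """Number of distinct query words that appear in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def overlap_ratio(query: str, text: str) -> float:
    """Share of the text's distinct words that also appear in the query."""
    words = set(text.lower().split())
    return len(set(query.lower().split()) & words) / len(words) if words else 0.0


corpus = [
    ("d1", "transformer attention"),
    ("d2", "transformer attention in a very long unrelated sentence"),
    ("d3", "cooking pasta"),
]
top = retrieve_then_rerank(
    "transformer attention", corpus, token_overlap, overlap_ratio,
    retrieve_k=2, final_k=1,
)
print(top)
```

In a real system, `cheap_score` would be replaced by an index lookup and `pair_score` by a batched cross-encoder call; the control flow stays the same.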
Cross-Encoder Models: Selection and Trade-offs
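Popular open-source cross-encoder models, with indicative figures (throughput depends on hardware, batch size, and sequence length):
| Model | Parameters | Speed (pairs/sec, single GPU) | Relative accuracy (out of 10) | Use case |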
|---|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 33M | 500 | 8.1/10 | General-purpose, balanced |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 14M | 2,000 | 7.6/10 | Low-latency, mobile |
| cross-encoder/ms-marco-ELECTRA-Base | 110M | 100 | 8.7/10 | High accuracy, if latency permits |
| cross-encoder/qnli-distilroberta-base | 82M | 200 | 8.4/10 | QA-optimized |
For most RAG systems, cross-encoder/ms-marco-MiniLM-L-6-v2 is the default: 33M parameters, about 500 query-document pairs scored per second on a single GPU, and strong accuracy. For mobile or serverless deployments with latency budgets under 50 ms, use TinyBERT. For high-stakes applications (medical, legal), upgrade to ELECTRA-Base if roughly 10 ms per pair (100 pairs/sec) is acceptable.
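The selection rule above amounts to simple arithmetic: estimated latency is the candidate count divided by throughput. The sketch below encodes that rule, assuming the pairs-per-second figures from the table (which are indicative only) and a hypothetical helper name `pick_model`. Models are listed from most to least accurate, and the first one that fits the budget wins.

```python
from typing import Optional

# (model name, pairs/sec) copied from the table above, most accurate first.
MODELS = [
    ("cross-encoder/ms-marco-ELECTRA-Base", 100),
    ("cross-encoder/ms-marco-MiniLM-L-6-v2", 500),
    ("cross-encoder/ms-marco-TinyBERT-L-2-v2", 2000),
]


def estimated_latency_ms(num_candidates: int, pairs_per_sec: float) -> float:
    """Time to score num_candidates pairs at the given throughput."""
    return 1000.0 * num_candidates / pairs_per_sec


def pick_model(num_candidates: int, budget_ms: float) -> Optional[str]:
    """Return the most accurate model whose estimated latency fits the budget."""
    for name, pairs_per_sec in MODELS:
        if estimated_latency_ms(num_candidates, pairs_per_sec) <= budget_ms:
            return name
    return None  # nothing fits: rerank fewer candidates or skip reranking


print(pick_model(50, 150))  # MiniLM: 50 pairs at 500 pairs/sec is 100 ms
print(pick_model(50, 40))   # TinyBERT: 50 pairs at 2,000 pairs/sec is 25 ms
```

Measured throughput on your own hardware should replace the table values before relying on a rule like this.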
How Cross-Encoder Reranking Works
A cross-encoder processes the input as:
[CLS] <query tokens> [SEP] <document tokens> [SEP]
The model passes this token sequence through its transformer layers (2 to 12 in the models listed above), takes the [CLS] token's final representation, and feeds it to a classification head (typically a single linear layer). The output is either a raw, unbounded logit or, when a sigmoid activation is applied, a score in [0, 1].
For a query "What is transformer attention mechanism?" and document "Transformer models use attention to weight input tokens", the cross-encoder:
- Tokenizes both into subword tokens.
- Concatenates with special tokens: [CLS] what is transformer attention [SEP] transformer models use attention [SEP]
- Forward-passes through transformer layers, allowing attention heads to compute query-document interactions.
- Outputs relevance score (e.g., 0.92), indicating high relevance.
This contrasts with dense retrieval, where the query and document are encoded in isolation and later compared geometrically. Cross-encoders' joint encoding captures interactions like token co-occurrence, syntactic alignment, and semantic coherence within the pair.
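The input layout and the final activation can be illustrated without loading a model. This is a minimal sketch under stated assumptions: whitespace-split words stand in for subword tokens, and `build_pair_input` and `sigmoid` are hypothetical helpers rather than part of any library. It shows how the pair shares one sequence (and therefore one length limit), and how a raw logit maps to a score in [0, 1].

```python
import math
from typing import List


def build_pair_input(
    query_tokens: List[str], doc_tokens: List[str], max_len: int = 512
) -> List[str]:
    """Assemble [CLS] query [SEP] document [SEP], truncating the document to fit max_len."""
    room = max_len - len(query_tokens) - 3  # three special tokens share the budget
    if room < 1:
        raise ValueError("query leaves no room for document tokens")
    return ["[CLS]", *query_tokens, "[SEP]", *doc_tokens[:room], "[SEP]"]


def sigmoid(logit: float) -> float:
    """Map an unbounded logit to (0, 1); written to avoid overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


tokens = build_pair_input(
    "what is attention".split(),
    "attention weights tokens".split(),
    max_len=8,
)
print(tokens)        # the last document token is truncated to fit max_len=8
print(sigmoid(0.0))  # 0.5
```

Because query and document occupy the same sequence, a longer query leaves less room for document text.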
Implementation: Reranking a Hybrid Retrieval Result
from sentence_transformers import CrossEncoder

# Initialize cross-encoder (the model is downloaded on first use)
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Hybrid-fused document candidates (from RRF)
candidates = [
    ('doc_1', 'Transformer attention mechanism is a core component of modern NLP.'),
    ('doc_2', 'The attention mechanism in transformer networks scales quadratically.'),
    ('doc_3', 'BERT uses transformer architecture for natural language processing.'),
    ('doc_4', 'Recurrent networks are an alternative to transformer-based models.'),
    ('doc_5', 'Attention is a key concept in machine learning broadly.'),
]
query = "What is transformer attention mechanism?"

# Prepare pairs: [(query, document1), (query, document2), ...]
pairs = [(query, doc_text) for _, doc_text in candidates]

# Score all pairs in one batched call; returns an array of shape [num_candidates]
scores = model.predict(pairs)
# Illustrative values on a 0-1 scale: array([0.92, 0.87, 0.85, 0.62, 0.45])
# This model returns raw logits by default, so real values are unbounded
# unless a sigmoid is applied; only the ordering matters for reranking.

# Rerank: build (doc_id, original_text, rerank_score) tuples, highest score first
reranked = sorted(
    [(doc_id, text, score) for (doc_id, text), score in zip(candidates, scores)],
    key=lambda x: x[2],
    reverse=True,
)

print("Reranked results:")
for rank, (doc_id, text, score) in enumerate(reranked, 1):
    print(f"{rank}. Score: {score:.3f}, Doc: {text[:60]}...")

# Illustrative output (scores and truncation shown schematically):
# 1. Score: 0.920, Doc: Transformer attention mechanism is a core...
# 2. Score: 0.870, Doc: The attention mechanism in transformer...
# 3. Score: 0.850, Doc: BERT uses transformer architecture...
# 4. Score: 0.620, Doc: Recurrent networks are an alternative...
# 5. Score: 0.450, Doc: Attention is a key concept...
In this illustrative output, the cross-encoder ranks doc_1 (exact keyword match plus semantic alignment) highest, doc_2 (keyword match but less direct) second, and doc_5 (semantically related but vague, lacking "transformer") lowest.
Computational Cost and Latency Analysis
Reranking N documents with a cross-encoder costs:
Latency: ~2–4 ms per document pair (including tokenization and forward pass). Reranking 50 documents = 100–200 ms. Reranking 100 documents = 200–400 ms.
Memory: ~500 MB GPU VRAM for MiniLM variant in FP32; scales with batch size. Typical batch size (32–64) uses <2 GB VRAM.
Total RAG latency example:
- BM25 retrieval: 10 ms
- Dense retrieval: 50 ms
- Fuse (RRF): 5 ms
- Rerank top-50: 150 ms
- Total: ~215 ms (acceptable for interactive applications)
If the reranking stage itself must stay under 100 ms, rerank only the top-20 of the fused list (about 40–80 ms at 2–4 ms per pair, skipping the bottom 30) at a small accuracy cost, or switch to a smaller cross-encoder model (TinyBERT: ~0.5 ms per pair).
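The latency arithmetic above is easy to check in code. The sketch below is a rough calculator, assuming the stage timings from the example (10 ms BM25, 50 ms dense, 5 ms fusion, run sequentially) and a fixed per-pair cost; `pipeline_latency_ms` and `max_rerank_depth` are hypothetical helper names.

```python
from typing import Dict


def pipeline_latency_ms(
    stage_ms: Dict[str, float], rerank_n: int, ms_per_pair: float
) -> float:
    """Sequential retrieval stages plus rerank_n pairs at ms_per_pair each."""
    return sum(stage_ms.values()) + rerank_n * ms_per_pair


def max_rerank_depth(
    budget_ms: float, stage_ms: Dict[str, float], ms_per_pair: float
) -> int:
    """Largest number of candidates that can be reranked within an end-to-end budget."""
    remaining = budget_ms - sum(stage_ms.values())
    return max(0, int(remaining // ms_per_pair))


stages = {"bm25": 10, "dense": 50, "rrf": 5}

print(pipeline_latency_ms(stages, rerank_n=50, ms_per_pair=3.0))  # 215.0
print(max_rerank_depth(100, stages, ms_per_pair=3.0))             # 11
print(max_rerank_depth(100, stages, ms_per_pair=0.5))             # 70
```

With a 100 ms end-to-end budget, only about ten candidates fit at ~3 ms per pair, while a ~0.5 ms-per-pair model leaves room for around seventy.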
When to Apply Reranking
Reranking is most valuable when:
- Top-k is large (50+): Fusion alone produces a ranked list with many documents; reranking refines ranking significantly.
- Query is ambiguous: Multi-faceted queries (e.g., "impact of AI on society") benefit from neural reranking's ability to capture subtle semantic alignment.
- Answer quality is critical: Legal/medical/financial RAG systems where wrong documents risk serious errors.
- The latency budget allows it (>200 ms): if you need <100 ms end-to-end, rerank only the top-10 or skip reranking.
Reranking provides diminishing returns on pre-ranked lists of <10 documents, especially if the pre-ranker (fusion) is already strong.
Fine-Tuning Cross-Encoders for Your Domain
If you have labeled relevance data (queries with human-annotated relevant/irrelevant documents), you can fine-tune a cross-encoder to your domain, improving accuracy by 5–10%.
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Training data: one InputExample per (query, document) pair, with a relevance label in [0, 1]
train_samples = [
    InputExample(texts=["What is NLP?", "NLP is a subfield of AI..."], label=1.0),
    InputExample(texts=["What is NLP?", "Cooking is an art form..."], label=0.0),
    InputExample(texts=["Transformer attention?", "Attention mechanisms..."], label=0.95),
    # ... more examples
]

# Load pre-trained model (num_labels=1: one relevance score per pair)
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)

# Create training dataloader
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# Fine-tune (a few epochs) with the legacy CrossEncoder.fit API.
# With num_labels=1 and no loss_fct given, fit uses binary cross-entropy on the logits.
# Newer sentence-transformers releases also provide a CrossEncoderTrainer.
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True,
)

# Save fine-tuned model to a local directory
model.save('cross-encoder/my-domain-reranker')
For a production domain-specific RAG system with access to labeled data, fine-tuning a cross-encoder is a high-ROI investment: 100–500 labeled pairs can already yield a meaningful improvement.
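Labeled relevance data usually starts as per-query judgments rather than flat pairs. The sketch below shows one way to flatten such judgments into (query, document, label) triples of the shape used for fine-tuning. The dictionary layout and the names `build_training_pairs` and `label_balance` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Judgments = Dict[str, Dict[str, List[str]]]


def build_training_pairs(judgments: Judgments) -> List[Tuple[str, str, float]]:
    """Flatten {query: {"relevant": [...], "irrelevant": [...]}} into labeled pairs."""
    pairs = []
    for query, docs in judgments.items():
        for text in docs.get("relevant", []):
            pairs.append((query, text, 1.0))
        for text in docs.get("irrelevant", []):
            pairs.append((query, text, 0.0))
    return pairs


def label_balance(pairs: List[Tuple[str, str, float]]) -> Tuple[int, int]:
    """Count (positives, negatives) to spot a badly skewed training set."""
    positives = sum(1 for _, _, label in pairs if label >= 0.5)
    return positives, len(pairs) - positives


judgments = {
    "What is NLP?": {
        "relevant": ["NLP is a subfield of AI..."],
        "irrelevant": ["Cooking is an art form..."],
    },
}
pairs = build_training_pairs(judgments)
print(pairs)
print(label_balance(pairs))  # (1, 1)
```

Each triple maps directly onto one training example: the first two fields become the text pair and the third becomes the label.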
Multi-Stage Reranking: Coarse-to-Fine
For very large candidate sets (500+ documents), apply reranking in multiple stages to control latency:
- Stage 1 (coarse): Rerank top-500 with a small, fast cross-encoder (TinyBERT, ~2,000 pairs/sec) → extract top-100.
- Stage 2 (fine): Rerank top-100 with a larger cross-encoder (MiniLM, ~500 pairs/sec) → extract top-10.
- Stage 3 (LLM context): Assemble top-10 as context for LLM.
Total latency: ~450–500 ms (about 250 ms for stage 1 plus 200 ms for stage 2), which is acceptable for RAG and roughly half the ~1 s needed to rerank all 500 with MiniLM alone, with minimal accuracy loss.
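A coarse-to-fine cascade is a loop over (scorer, keep-k) stages. The sketch below is a minimal illustration with toy scorers; `cascade_rerank` and `cascade_latency_ms` are hypothetical names, and the latency estimate assumes the throughput figures from the model table earlier in this article (2,000 pairs/sec for TinyBERT, 500 pairs/sec for MiniLM).

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def cascade_rerank(
    query: str, docs: Sequence[str], stages: Sequence[Tuple[Scorer, int]]
) -> List[str]:
    """Apply each (score_fn, keep_k) stage in order, shrinking the candidate set."""
    current = list(docs)
    for score_fn, keep_k in stages:
        current = sorted(
            current, key=lambda d: score_fn(query, d), reverse=True
        )[:keep_k]
    return current


def cascade_latency_ms(n_docs: int, stages: Sequence[Tuple[float, int]]) -> float:
    """Estimate latency for stages given as (pairs_per_sec, keep_k)."""
    total, n = 0.0, n_docs
    for pairs_per_sec, keep_k in stages:
        total += 1000.0 * n / pairs_per_sec
        n = min(n, keep_k)
    return total


# Two-stage cascade versus single-stage reranking of all 500 candidates.
print(cascade_latency_ms(500, [(2000, 100), (500, 10)]))  # 450.0
print(cascade_latency_ms(500, [(500, 10)]))               # 1000.0

# Toy scorers (assumptions): stage 1 prefers longer docs, stage 2 prefers shorter ones.
survivors = cascade_rerank(
    "q",
    ["a", "bb", "ccc", "dddd"],
    [(lambda q, d: float(len(d)), 3), (lambda q, d: -float(len(d)), 1)],
)
print(survivors)  # ['bb']
```

The toy example also shows the main risk of a cascade: a document dropped by the coarse stage ("a" here) can never be recovered by the fine stage.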
Reranking Integration in LlamaIndex
# Import paths below are for llama-index 0.10+; older releases used different module paths.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank

# Create reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=5,  # Keep top-5 after reranking
)

# Build the index ("data" is a placeholder directory holding your source files)
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Use in retrieval pipeline
retriever = index.as_retriever(similarity_top_k=50)

# Rerank retrieved documents
query = "What is transformer attention?"
nodes = retriever.retrieve(query)
reranked_nodes = reranker.postprocess_nodes(nodes, query_str=query)

# reranked_nodes are now sorted by cross-encoder relevance
for node in reranked_nodes:
    print(f"Score: {node.score:.3f}, Text: {node.text[:50]}...")
Key Takeaways
- Cross-encoders score document-query pairs jointly, capturing interactions that bi-encoders (independent encoding) miss.
- Cross-encoder reranking improves RAG answer accuracy by 5–15% by surfacing the most relevant documents from a fused candidate set.
- cross-encoder/ms-marco-MiniLM-L-6-v2 is the recommended general-purpose model: small (33M params), fast (500 pairs/sec), and accurate.
- Reranking 50 documents adds ~150 ms latency, acceptable for most interactive RAG applications.
- Fine-tuning a cross-encoder on domain-specific labeled data (100+ relevance pairs) improves accuracy another 5–10% but requires data collection effort.
Frequently Asked Questions
Should I rerank all retrieved documents or just top-k?
Rerank top-50 to top-100 (the BM25 + dense fusion candidates). Reranking all 1,000+ documents is expensive (>1 sec) with minimal gain. Reranking <10 documents is wasteful (fusion is already strong). Sweet spot: top-50 candidates with ~150 ms latency.
What if I only have dense retrieval (no BM25)?
Rerank the top-100 from dense retrieval. Cross-encoder reranking is especially valuable post-dense retrieval because dense retrieval often ranks semantically similar but factually less relevant documents highly. Hybrid + reranking is optimal, but dense + reranking alone beats dense-only by 5–10%.
Can I use an LLM as a reranker instead of a cross-encoder?
Yes, but it is expensive. An LLM prompt like "Rate relevance of this document to the query 0–10" works but costs thousands of times more than a cross-encoder (API call per document, vs. batched GPU forward pass). Use cross-encoders for reranking; reserve LLM calls for final generation.
How do I handle very long documents with cross-encoders?
Cross-encoders have a maximum input length (typically 512 tokens) that the query, the document, and the special tokens share. For documents that do not fit, split them into overlapping chunks (e.g., 400-token windows with 100-token overlap), score each chunk, and take the max score for the document. Alternatively, summarize documents to under 512 tokens before reranking.
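The chunk-and-max strategy can be sketched as follows. This is a minimal illustration assuming the document is already tokenized into a list of strings and that `score_fn` is any callable returning a relevance score for a (query, text) pair; in practice that would be a cross-encoder call. The names `chunk_tokens` and `score_long_document` are hypothetical.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], window: int = 400, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of `window` tokens, each sharing `overlap` tokens with the previous one."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def score_long_document(
    query: str,
    tokens: List[str],
    score_fn: Callable[[str, str], float],
    window: int = 400,
    overlap: int = 100,
) -> float:
    """Score every chunk and return the maximum as the document score."""
    chunks = chunk_tokens(tokens, window, overlap)
    return max(
        (score_fn(query, " ".join(chunk)) for chunk in chunks),
        default=float("-inf"),  # empty document
    )


# Toy scorer (assumption): counts exact occurrences of the query word.
def count_scorer(query: str, text: str) -> float:
    return float(text.split().count(query))


doc = ["x"] * 500 + ["needle"] + ["x"] * 100
print(len(chunk_tokens(doc)))                            # 2
print(score_long_document("needle", doc, count_scorer))  # 1.0
```

Taking the maximum favors documents with one highly relevant passage; averaging chunk scores would instead favor documents that are uniformly on-topic.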
What is the difference between cross-encoder and ColBERT?
ColBERT is a late-interaction retriever: it encodes documents and queries independently (like a bi-encoder) but computes relevance as fine-grained token-level interactions rather than single-vector similarity. It is faster than cross-encoders (supports approximate search) but more complex. For most RAG systems, cross-encoders are simpler and sufficient.
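The difference between single-vector similarity and late interaction is easiest to see in the scoring function. The sketch below uses tiny hand-written vectors in place of real token embeddings (an assumption for illustration) and implements the MaxSim operator described in the ColBERT paper: for each query token, take the maximum similarity over all document tokens, then sum. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder style: one vector per text, one dot product."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction: sum over query tokens of the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings: two query tokens, two document tokens.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]

print(maxsim_score(query_vecs, doc_vecs))  # 1.5
```

Document token vectors can be precomputed and indexed, which is what lets late interaction avoid a full transformer pass per query-document pair.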
Further Reading
- Sentence Transformers CrossEncoder Documentation — Official library and model hub
- MS MARCO Cross-Encoder Leaderboard — Benchmark and model card
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (Khattab & Zaharia, 2020) — Alternative architecture for efficient reranking
- LlamaIndex Postprocessor Documentation — Integration with RAG pipelines