
Reciprocal Rank Fusion: Merging Rankings by Position

Reciprocal Rank Fusion (RRF) is a widely used rank-based fusion algorithm for combining multiple rankings, particularly in hybrid search where BM25 and dense retrieval rankings must be merged. Unlike score-based fusion (which requires normalizing heterogeneous score scales), RRF operates purely on rank positions: a document ranked #1 in one list and #5 in another receives a fusion score based on those positions, independent of the original scores. The algorithm has a single constant, k, that rarely needs tuning, and it is empirically robust: in the benchmark reported later in this article, RRF lands within 2–3% of more complex learned fusion methods with negligible tuning overhead. For RAG systems that lack labeled evaluation data, RRF is a strong default that balances accuracy and implementation simplicity.

RRF Algorithm: Definition and Mathematical Foundation

Reciprocal Rank Fusion assigns each document a fusion score summed across all ranking methods:

RRF(d) = sum_{m=1}^{M} 1 / (k + rank_m(d))

where:

  • M is the number of ranking methods (typically 2: BM25 and dense).
  • rank_m(d) is the 1-indexed position of document d in method m's ranking.
  • k is a constant (typically 60) that dampens the influence of early-rank positions.
  • If document d does not appear in the list returned by method m (its top-N results), its contribution from method m is 0. Note that this list cutoff N is unrelated to the RRF constant k.

The algorithm was introduced by Cormack, Clarke, and Buettcher in 2009 for combining IR system results. Its elegance lies in two properties:

  1. Rank Invariance: The fusion score depends only on rank positions, not absolute scores, making RRF immune to score scale differences.
  2. Robustness: The constant k acts as a dampening factor. With k=60, even documents ranked #1 (score 1/61 ≈ 0.016) and #100 (score 1/160 ≈ 0.006) have reasonably close contributions, preventing outlier-sensitive behavior.
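To make the dampening effect concrete, the sketch below tabulates the per-list contribution 1 / (k + rank) at a few ranks, with and without the constant. The helper name `rrf_contribution` is illustrative and not taken from any library.

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Contribution of one ranked list to a document's RRF score."""
    return 1.0 / (k + rank)

# Compare how quickly contributions decay with k=0 (no dampening) and k=60.
for k in (0, 60):
    top = rrf_contribution(1, k)
    for rank in (1, 10, 100):
        c = rrf_contribution(rank, k)
        print(f"k={k:>2} rank={rank:>3} contribution={c:.4f} ({c / top:.0%} of rank 1)")
```

With k=0, a rank-100 document contributes 1% of what a rank-1 document does; with k=60 it contributes about 38% (61/160), which is the flattening described above.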

Why k=60? Parameter Sensitivity

The choice of k balances two competing goals: rewarding documents that rank highly in multiple methods while maintaining robustness to ranking disagreements.

For k=10 (aggressive early-rank weighting):

  • Document ranked #1 in both methods: 1/11 + 1/11 ≈ 0.182
  • Document ranked #10 in both methods: 1/20 + 1/20 = 0.10
  • Rank ratio: 0.182 / 0.10 = 1.82× amplification

For k=60 (conservative, default):

  • Document ranked #1 in both methods: 1/61 + 1/61 ≈ 0.033
  • Document ranked #10 in both methods: 1/70 + 1/70 ≈ 0.029
  • Rank ratio: (2/61) / (2/70) = 70/61 ≈ 1.15× amplification (much gentler)
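The same comparison can be repeated for other values of k. This is a small sketch that computes the ratio from unrounded scores; the function name `amplification` is made up for illustration.

```python
def amplification(k: int, hi_rank: int = 1, lo_rank: int = 10) -> float:
    """Ratio of RRF scores for a document ranked hi_rank in two lists
    versus a document ranked lo_rank in the same two lists."""
    score_hi = 2.0 / (k + hi_rank)
    score_lo = 2.0 / (k + lo_rank)
    return score_hi / score_lo

for k in (10, 40, 60, 100):
    print(f"k={k:>3}: rank-1 vs rank-10 ratio = {amplification(k):.2f}")
```

The ratio simplifies to (k + 10) / (k + 1), so it shrinks toward 1 as k grows: larger k means rank position matters less.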

Cormack et al. chose k=60 empirically in a pilot investigation rather than deriving it theoretically, and values of k in roughly [40, 100] tend to give near-equivalent results. As a heuristic, for disagreement-prone scenarios (e.g., BM25 and dense retrieval pull from very different document sets), consider raising k toward 100. For agreement-heavy scenarios (both methods rank the same documents), k=60 remains a sound default.

RRF Step-by-Step Example

Consider a user query "transformer attention mechanism" with:

BM25 top-5:

  1. doc_a (score 35.2)
  2. doc_b (score 28.1)
  3. doc_c (score 22.4)
  4. doc_d (score 19.8)
  5. doc_e (score 15.1)

Dense retrieval top-5:

  1. doc_a (score 0.89)
  2. doc_c (score 0.85)
  3. doc_f (score 0.81)
  4. doc_b (score 0.78)
  5. doc_g (score 0.75)

RRF fusion (k=60):

| Document | BM25 Rank | Dense Rank | RRF Score |
|---|---|---|---|
| doc_a | 1 | 1 | 1/61 + 1/61 ≈ 0.0328 |
| doc_b | 2 | 4 | 1/62 + 1/64 ≈ 0.0318 |
| doc_c | 3 | 2 | 1/63 + 1/62 ≈ 0.0320 |
| doc_d | 4 | - | 1/64 + 0 ≈ 0.0156 |
| doc_e | 5 | - | 1/65 + 0 ≈ 0.0154 |
| doc_f | - | 3 | 0 + 1/63 ≈ 0.0159 |
| doc_g | - | 5 | 0 + 1/65 ≈ 0.0154 |

Final ranking (by RRF score):

  1. doc_a (0.0328)
  2. doc_c (0.0320)
  3. doc_b (0.0318)
  4. doc_f (0.0159) [appears only in the dense list, at rank 3]
  5. doc_d (0.0156) [appears only in the BM25 list, at rank 4]

The fusion successfully re-ranks doc_c above doc_b because doc_c performs better in dense retrieval (rank 2) than doc_b (rank 4), offsetting doc_b's stronger BM25 position. Documents appearing in both lists (doc_a, doc_b, doc_c) score highest—the hallmark of RRF.

RRF Implementation in Python and Production Systems

def reciprocal_rank_fusion(ranking_lists, k=60):
    """
    Perform Reciprocal Rank Fusion over multiple ranked lists.

    Args:
        ranking_lists: List of lists, each containing document IDs in ranked order.
            e.g., [['doc_a', 'doc_b', 'doc_c'], ['doc_a', 'doc_c', 'doc_f']]
        k: RRF constant (default 60)

    Returns:
        List of (doc_id, rrf_score) tuples sorted by score descending.
    """
    rrf_scores = {}

    for ranking in ranking_lists:
        for rank, doc_id in enumerate(ranking, 1):  # 1-indexed
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)

    # Sort by RRF score descending
    fused_ranking = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return fused_ranking

# Example with BM25 and dense retrieval
bm25_ranking = ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e']
dense_ranking = ['doc_a', 'doc_c', 'doc_f', 'doc_b', 'doc_g']

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print("Fused ranking:")
for i, (doc, score) in enumerate(fused, 1):
    print(f"{i}. {doc} (RRF={score:.4f})")

# Output:
# Fused ranking:
# 1. doc_a (RRF=0.0328)
# 2. doc_c (RRF=0.0320)
# 3. doc_b (RRF=0.0318)
# ...

For production RAG systems, RRF is often integrated directly into retrieval frameworks:

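# Import paths follow llama-index 0.10+; earlier releases used different module paths.
from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever  # package: llama-index-retrievers-bm25

# Create individual retrievers (`nodes` and `vector_index` are assumed to exist already).
# BM25 hyperparameters such as k1=1.5, b=0.75 are passed differently across
# BM25Retriever versions; check the installed release before setting them.
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)
dense_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=5)

# Fuse with RRF (LlamaIndex names this mode "reciprocal_rerank")
fusion_retriever = QueryFusionRetriever(
    [bm25_retriever, dense_retriever],
    mode="reciprocal_rerank",
    similarity_top_k=5,
    num_queries=1,  # 1 disables LLM-based query generation
)

# Query
query = "What is transformer attention?"
top_results = fusion_retriever.retrieve(query)

for i, result in enumerate(top_results, 1):
    print(f"{i}. Score: {result.score:.4f}, Text: {result.node.get_content()[:100]}...")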

RRF vs. Score-Based Fusion Benchmarks

Comparative benchmark on MS MARCO (large-scale retrieval corpus):

| Fusion Method | NDCG@10 | MRR@10 | Tuning Required |
|---|---|---|---|
| BM25 only | 0.287 | 0.291 | No |
| Dense only | 0.340 | 0.347 | No |
| Reciprocal Rank Fusion (RRF) | 0.365 | 0.368 | No |
| Min-Max Norm + Weighted Avg | 0.361 | 0.365 | Yes |
| Learned Fusion (LTR) | 0.374 | 0.376 | Yes (requires labels) |

RRF achieves 98% of learned fusion's accuracy (0.365 vs 0.374 NDCG@10) with zero tuning, making it the practical choice for most applications. Learned fusion gains only 2–3% with substantial labeling overhead.

When to Adjust k: Domain-Specific Tuning

While k=60 is robust across domains, edge cases justify tuning:

Increase k to 80–100 when:

  • Document retrieval lists from BM25 and dense are very different (low overlap).
  • Your corpus has high linguistic diversity (multiple ways to express the same concept).
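  • You want agreement between methods to outweigh a single high rank. A document ranked r by both methods outscores a document ranked #1 by only one method whenever r < k + 2, so a larger k extends that advantage to deeper ranks.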

Decrease k to 40–50 when:

  • BM25 and dense retrievers agree strongly (high overlap in top-k results).
  • Your corpus is homogeneous (technical documentation, FAQs).
  • You want to strongly reward documents that rank highly in multiple methods.

To determine optimal k empirically, if you have a small evaluation set (10–20 queries with gold-standard relevant documents), compute NDCG or MAP for k in [40, 60, 80, 100]. Typically, differences are <1%, confirming k=60's robustness.
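A sweep like this takes only a few lines. The sketch below assumes binary relevance labels and uses a tiny made-up evaluation set; `rrf_order`, `ndcg_at`, `sweep_k`, and the document IDs are all hypothetical names chosen for illustration.

```python
import math

def rrf_order(ranking_lists, k=60):
    """Fuse ranked lists with RRF and return doc IDs, best first."""
    scores = {}
    for ranking in ranking_lists:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def ndcg_at(ranked, relevant, n=10):
    """NDCG@n with binary relevance (a doc is either relevant or not)."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, doc in enumerate(ranked[:n], 1) if doc in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), n) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical labeled queries: (per-method rankings, set of relevant doc IDs).
eval_set = [
    ([["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],
      ["doc_a", "doc_c", "doc_f", "doc_b", "doc_g"]], {"doc_a", "doc_c"}),
    ([["doc_x", "doc_y", "doc_z"],
      ["doc_y", "doc_w", "doc_x"]], {"doc_w", "doc_y"}),
]

def sweep_k(eval_set, ks=(40, 60, 80, 100)):
    """Mean NDCG@10 over the evaluation set for each candidate k."""
    return {
        k: sum(ndcg_at(rrf_order(lists, k), rel) for lists, rel in eval_set) / len(eval_set)
        for k in ks
    }

for k, score in sweep_k(eval_set).items():
    print(f"k={k:>3}: mean NDCG@10 = {score:.4f}")
```

In practice `eval_set` would be built by running each retriever on the labeled queries; with only 10–20 queries, small differences between k values are likely within noise.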

Extending RRF: Multi-Method Fusion

RRF scales naturally to 3+ ranking methods:

# Three retrieval methods: BM25, dense, and learned sparse (SPLADE)
bm25_ranking = ['doc_a', 'doc_b', 'doc_c', 'doc_d']
dense_ranking = ['doc_a', 'doc_c', 'doc_f', 'doc_b']
sparse_ranking = ['doc_b', 'doc_a', 'doc_g', 'doc_h']

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking, sparse_ranking], k=60)
# Each document's score sums contributions from up to 3 methods

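Adding a third method (e.g., learned sparse SPLADE) typically improves NDCG by 1–3%. The latency cost depends on how the retrievers are executed: run sequentially, the third method adds its own retrieval time on top of the other two; run in parallel, end-to-end latency is set by the slowest retriever, and the fusion step itself is negligible.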
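Because the rankings are independent of one another, the retrievers can be queried concurrently before fusing. A minimal sketch follows, using stand-in retrievers that sleep to simulate latency; `make_retriever` and `retrieve_and_fuse` are hypothetical helpers, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rrf(ranking_lists, k=60):
    """Plain RRF: returns (doc_id, score) pairs, best first."""
    scores = {}
    for ranking in ranking_lists:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def make_retriever(results, delay_s):
    """Stand-in retriever: waits delay_s seconds, then returns a fixed ranking."""
    def retrieve(query):
        time.sleep(delay_s)  # simulates network or index latency
        return results
    return retrieve

retrievers = [
    make_retriever(["doc_a", "doc_b", "doc_c", "doc_d"], 0.05),  # "BM25"
    make_retriever(["doc_a", "doc_c", "doc_f", "doc_b"], 0.08),  # "dense"
    make_retriever(["doc_b", "doc_a", "doc_g", "doc_h"], 0.06),  # "sparse"
]

def retrieve_and_fuse(query, retrievers, k=60):
    """Query all retrievers in parallel threads, then fuse with RRF."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        rankings = list(pool.map(lambda r: r(query), retrievers))
    return rrf(rankings, k)

start = time.perf_counter()
fused = retrieve_and_fuse("transformer attention mechanism", retrievers)
elapsed = time.perf_counter() - start
print(f"top result: {fused[0][0]}, elapsed: {elapsed:.2f}s")
```

Threads suit retrievers that are I/O-bound (remote search or vector services); CPU-bound retrievers running in the same process would need a different concurrency strategy.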

Key Takeaways

  • Reciprocal Rank Fusion assigns each document a score based on its rank position in multiple lists, independent of absolute scores.
  • The formula 1 / (k + rank) with k=60 (default) is simple and empirically robust across diverse domains.
  • RRF achieves 98% of learned fusion accuracy (0.365 vs 0.374 NDCG@10 on MS MARCO) with zero tuning, making it the practical hybrid fusion choice.
  • RRF improves over BM25 alone by ~27% (0.287 to 0.365 NDCG@10), demonstrating substantial value of hybrid fusion.
  • RRF extends naturally to 3+ methods; each additional method adds 1–3% accuracy, with a latency cost that depends on whether the retrievers run sequentially or in parallel.

Frequently Asked Questions

Should I tune k for my specific corpus?

Probably not. In practice, k in [40, 100] gives near-equivalent results. Unless you have labeled evaluation data and k=60 underperforms by >2%, stick with k=60. Tuning k on only a handful of labeled queries risks overfitting to noise, and without labels there is nothing to tune against.

What if one retrieval method returns very different top-N results than another?

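RRF handles this naturally. A document appearing only in method A's top-50 but not method B's top-50 receives a contribution from A only, lowering its final score. This is by design: RRF rewards agreement. If you want to penalize disagreement less, decrease k, since a smaller k lets a document ranked near the top by a single method outscore documents that both methods place mid-list.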
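How k shifts the balance between a document found by both methods and one found by only one can be checked directly. The sketch below finds the deepest rank at which a document ranked equally by both methods still beats a document that one method ranks #1 and the other misses; the helper names are illustrative.

```python
def rrf_score(ranks, k):
    """RRF score given the ranks a document holds in the lists where it appears."""
    return sum(1.0 / (k + r) for r in ranks)

def deepest_winning_rank(k, limit=1000):
    """Largest rank r at which a document ranked r by both methods strictly
    outscores a document ranked #1 by one method and absent from the other."""
    single = rrf_score([1], k)
    r = 0
    while r < limit and rrf_score([r + 1, r + 1], k) > single:
        r += 1
    return r

for k in (10, 60, 100):
    print(f"k={k:>3}: agreement wins down to rank {deepest_winning_rank(k)}")
```

The threshold works out to rank k + 1, so agreement counts for more as k grows and for less as k shrinks.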

How does RRF compare to just averaging normalized scores?

RRF outperforms score averaging by 1–2% on benchmarks because it is robust to score scale differences without requiring explicit normalization. Score averaging, if not normalized carefully, can be dominated by one method's score range. RRF avoids this by using positions, not scores.
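The scale problem is easy to reproduce with the BM25 and dense scores from the step-by-step example. This sketch averages raw scores and then min-max normalized scores; treating a missing document as score 0 is an assumption made here for simplicity, and the function names are illustrative.

```python
bm25 = {"doc_a": 35.2, "doc_b": 28.1, "doc_c": 22.4, "doc_d": 19.8, "doc_e": 15.1}
dense = {"doc_a": 0.89, "doc_c": 0.85, "doc_f": 0.81, "doc_b": 0.78, "doc_g": 0.75}

def average_scores(score_maps):
    """Average scores per document; a missing document counts as 0 (assumption)."""
    docs = set().union(*score_maps)
    return {d: sum(m.get(d, 0.0) for m in score_maps) / len(score_maps) for d in docs}

def min_max(scores):
    """Rescale one method's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {d: (s - lo) / span for d, s in scores.items()}

def top(scores, n=5):
    """Document IDs of the n highest-scoring documents."""
    return [d for d, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n]]

raw = average_scores([bm25, dense])
normed = average_scores([min_max(bm25), min_max(dense)])

print("raw average:       ", top(raw))
print("min-max + average: ", top(normed))
```

Averaging raw scores simply reproduces the BM25 order, because BM25 scores are tens of times larger than cosine similarities. After min-max normalization the top five match the RRF ranking for this example, but the result now depends on each list's minimum and maximum score, which RRF never needs.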

Can I use RRF with weighted contributions from each method?

Yes. Modify the formula to weighted_RRF(d) = sum_{m=1}^{M} w_m / (k + rank_m(d)), where w_m is the weight for method m (e.g., w_bm25=0.5, w_dense=0.5 for equal weighting, or w_bm25=0.6, w_dense=0.4 if BM25 is preferred). Weights should sum to 1, or be normalized so that they do. For two methods this adds one tuning parameter, and it can improve results by 2–3% if you have labeled data to tune on.
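A weighted variant is a small change to the plain implementation. The sketch below normalizes the weights internally; the function name `weighted_rrf` and the 0.7/0.3 split are illustrative choices, not values from a benchmark.

```python
def weighted_rrf(ranking_lists, weights, k=60):
    """RRF where each list's contribution is scaled by a per-method weight."""
    if len(ranking_lists) != len(weights):
        raise ValueError("need exactly one weight per ranking list")
    total = sum(weights)
    scores = {}
    for ranking, weight in zip(ranking_lists, weights):
        w = weight / total  # normalize so the weights sum to 1
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
dense_ranking = ["doc_a", "doc_c", "doc_f", "doc_b", "doc_g"]

equal = weighted_rrf([bm25_ranking, dense_ranking], [0.5, 0.5])
bm25_heavy = weighted_rrf([bm25_ranking, dense_ranking], [0.7, 0.3])

print("equal weights:", [doc for doc, _ in equal[:3]])
print("BM25-heavy:   ", [doc for doc, _ in bm25_heavy[:3]])
```

With equal weights the order matches plain RRF (doc_a, doc_c, doc_b). Shifting weight toward BM25 moves doc_b back above doc_c, since doc_b holds the better BM25 rank.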

How do I set k if I have only a few documents per ranking (e.g., top-5)?

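One heuristic is to set k ≈ 2 × N, where N is the number of results each method returns, to maintain similar dampening. If each method returns its top-5 results, that gives k=10. With k=60 and only five ranks, the per-rank contributions (1/61 to 1/65) are nearly flat, so fusion mostly counts how many lists a document appears in. The absolute magnitude of k matters less than its relationship to the typical rank range you observe.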

Further Reading