Tuning Hybrid Search Weights and Retrieval Parameters

Tuning hybrid search parameters is an empirical process: you optimize retrieval quality on your own corpus and query distribution. The key parameters are BM25's k1 (term-frequency saturation) and b (length normalization), the fusion weights (w_bm25, w_dense), and the top-k threshold at each retrieval stage (initial retrieval, fusion, reranking). Default settings are generic starting points, so an untuned hybrid system usually leaves accuracy on the table; systematic tuning can yield accuracy improvements on the order of 5–15%, depending on the corpus. The process requires a labeled evaluation set (50–200 queries with annotated relevant documents), a ranking metric (NDCG, MAP, or MRR), and a search strategy: grid search (exhaustive and easy to interpret) or Bayesian optimization (adaptive and more sample-efficient). This article shows how to build an evaluation pipeline, design parameter grids, run experiments, and interpret the results, so that the retriever returns more relevant context and the downstream generator has less room to hallucinate.

Setting Up Evaluation: Metrics and Baselines

The first step is establishing an evaluation framework. You need:

  1. Evaluation Set: 50–200 queries (ideally 100+) where relevant documents are labeled. Relevant documents are those that substantively answer the query.

  2. Relevance Judgments: Binary (relevant/irrelevant) or graded (0–3, where 0=not relevant, 3=perfectly relevant). Graded judgments are more informative but expensive.

  3. Metric: A ranking metric that measures how well your system ranks relevant documents high. The most common are:

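Metric | Formula | Interpretation
--- | --- | ---
NDCG@k | DCG@k / IDCG@k, where DCG@k = sum_{i=1}^{k} (2^rel_i - 1) / log2(i+1) and IDCG@k is the DCG@k of the ideal ranking | Ranking quality in the top k, normalized by the ideal ranking (0–1). Higher is better. Standard for retrieval.
MAP@k | mean over queries of AP@k, where AP@k = (1/R) * sum_{i=1}^{k} Precision@i * rel_i and R is the number of relevant documents | Mean average precision: penalizes relevant documents that rank late (0–1).
MRR@k | mean over queries of 1 / rank_of_first_relevant (0 if none in the top k) | Mean reciprocal rank: focuses on the first relevant document (0–1). Useful for QA.
Recall@k | relevant_in_top_k / total_relevant | Fraction of all relevant documents retrieved (0–1).
Hit Rate@k | 1 if the top k contains any relevant document, else 0 | Binary per query; averaged over queries it is the share of queries with at least one hit (0–1).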

For RAG systems, where several relevant documents are needed as context, NDCG@10 is a common default. For question answering, where one good answer suffices, MRR@5 or Hit Rate@5 is more appropriate.
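The code later in this article covers NDCG and MAP only. As a complement, here is a minimal sketch of the three remaining per-query metrics from the table. The function names and the sample document IDs are illustrative assumptions, not part of any library; averaging the per-query values over the evaluation set gives MRR@k, Recall@k, and Hit Rate@k.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0

# Hypothetical ranking and judgments for one query
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}

print(reciprocal_rank(retrieved, relevant, k=5))  # first relevant doc is at rank 2
print(recall_at_k(retrieved, relevant, k=5))      # 2 of the 3 relevant docs retrieved
print(hit_rate_at_k(retrieved, relevant, k=5))
```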

Building an Evaluation Dataset

import json

def create_evaluation_set(queries: list[str]) -> dict:
    """
    Manually annotate queries with relevant documents.
    This is the labor-intensive part; consider crowdsourcing for scale.
    """
    eval_set = {}

    for query in queries:
        # Retrieve candidates (e.g., top-100 from dense retrieval).
        # retrieve_candidates is assumed to be defined elsewhere and to
        # return (doc_id, text) pairs.
        candidates = retrieve_candidates(query, top_k=100)

        # Human annotator labels relevant documents
        print(f"\nQuery: {query}")
        relevant_docs = []
        for i, (doc_id, text) in enumerate(candidates):
            print(f"{i}. {text[:80]}...")
            is_relevant = input("Relevant? (y/n): ").strip().lower() == 'y'
            if is_relevant:
                relevant_docs.append(doc_id)

        eval_set[query] = {
            'relevant_docs': relevant_docs,
            'candidates': [doc_id for doc_id, _ in candidates]
        }

    # Save evaluation set
    with open('evaluation_set.json', 'w') as f:
        json.dump(eval_set, f, indent=2)

    return eval_set

# For a larger evaluation set, use a crowdsourcing platform
# (Amazon Mechanical Turk, Scale AI) or hire annotators.
# Typical cost: $0.10–0.50 per query annotation.

Computing NDCG and Other Metrics

import math

def dcg(rankings: list[int], k: int = 10) -> float:
    """Compute Discounted Cumulative Gain"""
    dcg_sum = 0.0
    for i in range(min(k, len(rankings))):
        rel = rankings[i]  # 0 = not relevant, 1 = relevant
        # i is 0-indexed, so the 1-indexed position is i + 1 and the
        # discount log2(position + 1) becomes log2(i + 2)
        dcg_sum += (2 ** rel - 1) / math.log2(i + 2)
    return dcg_sum

def ideal_dcg(num_relevant: int, k: int = 10) -> float:
    """Compute ideal DCG (all relevant docs ranked first)"""
    return dcg([1] * min(num_relevant, k), k)

def ndcg(retrieved_docs: list[str], relevant_docs: set[str], k: int = 10) -> float:
    """Compute NDCG@k"""
    # Create relevance judgments: 1 if doc is relevant, 0 otherwise
    rankings = [1 if doc in relevant_docs else 0 for doc in retrieved_docs[:k]]

    # Compute DCG and ideal DCG
    dcg_score = dcg(rankings, k)
    idcg = ideal_dcg(len(relevant_docs), k)

    # NDCG is DCG normalized by ideal DCG
    if idcg == 0:
        return 0.0
    return dcg_score / idcg

def mean_average_precision(retrieved_docs: list[str], relevant_docs: set[str], k: int = 10) -> float:
    """Compute AP@k for one query; evaluate_retriever averages it into MAP@k"""
    precisions = []
    num_relevant = 0

    for i in range(min(k, len(retrieved_docs))):
        if retrieved_docs[i] in relevant_docs:
            num_relevant += 1
            precision_at_i = num_relevant / (i + 1)
            precisions.append(precision_at_i)

    if not precisions:
        return 0.0
    return sum(precisions) / len(relevant_docs)

# Evaluate on entire eval set
def evaluate_retriever(retriever_func, eval_set: dict, k: int = 10) -> dict:
    """Compute metrics across all queries in evaluation set"""
    if not eval_set:
        raise ValueError("eval_set is empty")

    ndcg_scores = []
    map_scores = []

    for query, annotations in eval_set.items():
        relevant_docs = set(annotations['relevant_docs'])

        # Retrieve documents; retriever_func is expected to return
        # (doc_id, text, score) triples, best first
        retrieved = retriever_func(query, top_k=k)
        retrieved_docs = [doc_id for doc_id, _, _ in retrieved]

        # Compute metrics
        ndcg_scores.append(ndcg(retrieved_docs, relevant_docs, k))
        map_scores.append(mean_average_precision(retrieved_docs, relevant_docs, k))

    return {
        f'NDCG@{k}': sum(ndcg_scores) / len(ndcg_scores),
        f'MAP@{k}': sum(map_scores) / len(map_scores),
        'Count': len(eval_set)
    }

# Example (hybrid_retrieve and eval_set are assumed to exist already)
baseline_results = evaluate_retriever(hybrid_retrieve, eval_set, k=10)
print(f"Baseline NDCG@10: {baseline_results['NDCG@10']:.3f}")
print(f"Baseline MAP@10: {baseline_results['MAP@10']:.3f}")

Once you have a baseline metric, systematically tune parameters to improve it. Start with a grid search over BM25's k1 and b:

from itertools import product

def grid_search_bm25_params(eval_set: dict, bm25_index):
    """Grid search over BM25 k1 and b parameters"""

    param_grid = {
        'k1': [0.5, 1.0, 1.5, 2.0, 2.5],
        'b': [0.5, 0.65, 0.75, 0.85, 1.0]
    }

    best_params = None
    best_ndcg = float('-inf')  # guarantees best_params is set on the first trial
    results = []

    for k1, b in product(param_grid['k1'], param_grid['b']):
        # Create BM25 retriever with current parameters.
        # bm25_index.search accepting k1 and b per query is an assumption;
        # many BM25 implementations fix them when the index is built.
        def retrieve_with_params(query, top_k=50):
            return bm25_index.search(query, top_k=top_k, k1=k1, b=b)

        # Evaluate on eval set
        metrics = evaluate_retriever(retrieve_with_params, eval_set)
        ndcg_score = metrics['NDCG@10']

        results.append({
            'k1': k1,
            'b': b,
            'NDCG@10': ndcg_score
        })

        if ndcg_score > best_ndcg:
            best_ndcg = ndcg_score
            best_params = (k1, b)

        print(f"k1={k1}, b={b}: NDCG@10={ndcg_score:.4f}")

    return best_params, best_ndcg, results

# Example
best_params, best_ndcg, results = grid_search_bm25_params(eval_set, bm25_index)
print(f"\nBest BM25 params: k1={best_params[0]}, b={best_params[1]}")
print(f"Best NDCG@10: {best_ndcg:.4f}")

# Tabulate results as a k1-by-b grid
import pandas as pd
df = pd.DataFrame(results)
pivot = df.pivot(index='k1', columns='b', values='NDCG@10')
print("\nNDCG@10 grid (rows: k1, columns: b):")
print(pivot)
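To make the roles of k1 and b concrete, the sketch below computes the standard Okapi BM25 contribution of a single query term. The function name and the sample numbers are illustrative assumptions; real engines compute the IDF and document lengths from the index.

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Score contribution of one query term in one document (Okapi BM25)."""
    # b controls how strongly document length rescales the term frequency
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)

# Term saturation: the score approaches idf * (k1 + 1) as tf grows,
# so a larger k1 lets repeated terms keep adding score for longer.
for k1 in (0.5, 1.5, 2.5):
    scores = [bm25_term_score(tf, 100, 100, idf=1.0, k1=k1) for tf in (1, 2, 5, 10)]
    print(f"k1={k1}:", [round(s, 3) for s in scores])

# Length normalization: with b=0 length is ignored; with b=1 a document
# five times the average length is penalized the most.
for b in (0.0, 0.75, 1.0):
    short_doc = bm25_term_score(3, 50, 100, idf=1.0, b=b)
    long_doc = bm25_term_score(3, 500, 100, idf=1.0, b=b)
    print(f"b={b}: short={short_doc:.3f}, long={long_doc:.3f}")
```

Corpora with widely varying document lengths tend to be more sensitive to b, while corpora with repetitive terminology tend to be more sensitive to k1, which is why both are worth including in the grid.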

Tuning Fusion Weights (Weighted Normalization)

If you choose weighted fusion instead of RRF, optimize the weights:

from sklearn.linear_model import LogisticRegression
import numpy as np

def optimize_fusion_weights(eval_set: dict, bm25_index, dense_retriever):
    """Learn fusion weights from labeled data"""

    X = []  # Features: [norm_bm25_score, norm_dense_score]
    y = []  # Labels: 1 if relevant, 0 otherwise

    # Prepare training data
    for query, annotations in eval_set.items():
        relevant_docs = set(annotations['relevant_docs'])

        # Retrieve from both methods; each result is (doc_id, text, score)
        bm25_results = bm25_index.search(query, top_k=100)
        dense_results = dense_retriever.search(query, top_k=100)

        bm25_scores = {doc_id: score for doc_id, _, score in bm25_results}
        dense_scores = {doc_id: score for doc_id, _, score in dense_results}
        if not bm25_scores or not dense_scores:
            continue  # nothing to normalize for this query

        # Per-query min-max bounds
        bm25_min, bm25_max = min(bm25_scores.values()), max(bm25_scores.values())
        dense_min, dense_max = min(dense_scores.values()), max(dense_scores.values())

        # Create feature vectors for all docs in the union set.
        # A doc missing from one retriever gets a normalized score of 0.
        all_docs = set(bm25_scores) | set(dense_scores)
        for doc_id in all_docs:
            norm_bm25 = 0.0
            if doc_id in bm25_scores and bm25_max > bm25_min:
                norm_bm25 = (bm25_scores[doc_id] - bm25_min) / (bm25_max - bm25_min)
            norm_dense = 0.0
            if doc_id in dense_scores and dense_max > dense_min:
                norm_dense = (dense_scores[doc_id] - dense_min) / (dense_max - dense_min)

            X.append([norm_bm25, norm_dense])
            y.append(1 if doc_id in relevant_docs else 0)

    X = np.array(X)
    y = np.array(y)

    # Fit on the min-max features directly (no standardization) so the
    # coefficients apply to the same scale that fusion uses at query time.
    # class_weight='balanced' offsets the scarcity of relevant documents.
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(X, y)

    # Treat the coefficients as fusion weights. This is a heuristic:
    # negative coefficients are clipped to 0, and if both end up 0 we
    # fall back to equal weights.
    weights = np.clip(clf.coef_[0], 0.0, None)
    if weights.sum() == 0:
        weights = np.array([0.5, 0.5])
    w_bm25, w_dense = weights / weights.sum()  # Normalize to sum to 1

    print(f"Learned fusion weights: w_bm25={w_bm25:.3f}, w_dense={w_dense:.3f}")

    return w_bm25, w_dense

# Example
w_bm25, w_dense = optimize_fusion_weights(eval_set, bm25_index, dense_retriever)
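For reference, the two fusion strategies contrasted in this section can be sketched as follows. This is a minimal illustration, not a production implementation: the function names and sample documents are assumptions, k=60 is a commonly used RRF constant rather than a tuned value, and the weighted variant expects scores that are already min-max normalized per query.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """RRF: each list contributes 1 / (k + rank) for every document it contains."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def weighted_fusion(bm25_norm: dict[str, float], dense_norm: dict[str, float],
                    w_bm25: float, w_dense: float) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; a missing document counts as 0."""
    doc_ids = set(bm25_norm) | set(dense_norm)
    fused = {
        doc_id: w_bm25 * bm25_norm.get(doc_id, 0.0) + w_dense * dense_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results for one query
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
rrf_order = [doc_id for doc_id, _ in reciprocal_rank_fusion([bm25_ranking, dense_ranking])]
print("RRF order:", rrf_order)

bm25_norm = {"a": 1.0, "b": 0.5, "c": 0.0}
dense_norm = {"b": 1.0, "d": 0.6, "a": 0.0}
weighted_order = [doc_id for doc_id, _ in weighted_fusion(bm25_norm, dense_norm, 0.3, 0.7)]
print("Weighted order:", weighted_order)
```

RRF uses only ranks, so it has no weights to learn and is insensitive to score scale; weighted fusion uses the score magnitudes, which is what makes the weights worth tuning.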

Tuning top-k Thresholds

The number of candidates retrieved at each stage impacts latency and accuracy:

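def optimize_top_k_thresholds(eval_set: dict, retriever_func):
    """Measure NDCG@10 as the candidate pool size (top_k) grows"""

    results = {}

    for pool_size in [10, 20, 50, 100, 200, 500]:
        # evaluate_retriever calls retriever_func(query, top_k=k), so the
        # wrapper accepts top_k but substitutes the pool size under test.
        # NDCG@10 only changes if retriever_func fuses or reranks the pool
        # before returning its ranked list.
        def retrieve_with_k(query, top_k=10, pool_size=pool_size):
            return retriever_func(query, top_k=pool_size)

        metrics = evaluate_retriever(retrieve_with_k, eval_set, k=10)
        ndcg_score = metrics['NDCG@10']

        results[pool_size] = ndcg_score
        print(f"top_k={pool_size}: NDCG@10={ndcg_score:.4f}")

    # Analyze trade-off between k and accuracy
    # Typically, NDCG plateaus after k=50–100
    return results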

Typical findings:

  • top_k=10–20: Fast but recall may be limited (some relevant docs outside top-20).
  • top_k=50: Sweet spot for most systems; balances recall and latency.
  • top_k=100–200: Marginal accuracy gains over k=50 while processing 2–4× as many candidates, with correspondingly higher latency.
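One way to turn a sweep like this into a decision is to pick the smallest top_k whose score is within a small tolerance of the best one. The sketch below assumes a dict mapping top_k to NDCG@10; the helper name, the tolerance, and the sample numbers are made up for illustration and are not measurements.

```python
def smallest_k_within_tolerance(ndcg_by_k: dict[int, float], tolerance: float = 0.005) -> int:
    """Return the smallest top_k whose NDCG is within `tolerance` of the best."""
    if not ndcg_by_k:
        raise ValueError("ndcg_by_k is empty")
    best = max(ndcg_by_k.values())
    for k in sorted(ndcg_by_k):
        if best - ndcg_by_k[k] <= tolerance:
            return k
    return max(ndcg_by_k)  # unreachable in practice; the best k always qualifies

# Illustrative values only, shaped like a typical plateau
example_sweep = {10: 0.612, 20: 0.655, 50: 0.681, 100: 0.684, 200: 0.685}
chosen_k = smallest_k_within_tolerance(example_sweep, tolerance=0.005)
print(f"Chosen top_k: {chosen_k}")
```

The tolerance encodes how much accuracy you are willing to trade for lower latency; setting it to 0 simply returns the best-scoring top_k.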

End-to-End Optimization Example

def tune_full_hybrid_pipeline(eval_set: dict, bm25_index, dense_retriever, reranker):
    """Comprehensive tuning of entire hybrid pipeline"""
    # reranker is accepted for completeness; this example does not tune it.

    print("Step 1: Tune BM25 parameters")
    best_bm25_params, _, _ = grid_search_bm25_params(eval_set, bm25_index)
    print(f"Best k1={best_bm25_params[0]}, b={best_bm25_params[1]}\n")

    print("Step 2: Optimize fusion weights")
    w_bm25, w_dense = optimize_fusion_weights(eval_set, bm25_index, dense_retriever)
    print(f"Best w_bm25={w_bm25:.3f}, w_dense={w_dense:.3f}\n")

    print("Step 3: Find optimal top-k thresholds")
    # hybrid_retrieve is assumed to accept (query, top_k, w_bm25, w_dense)
    top_k_results = optimize_top_k_thresholds(
        eval_set,
        lambda q, top_k=50: hybrid_retrieve(q, top_k, w_bm25, w_dense))
    # Report the measured best top_k; ties go to the smaller (cheaper) value
    best_top_k = max(sorted(top_k_results), key=top_k_results.get)

    print("\nFinal tuned parameters:")
    print(f"BM25: k1={best_bm25_params[0]}, b={best_bm25_params[1]}")
    print(f"Fusion: w_bm25={w_bm25:.3f}, w_dense={w_dense:.3f}")
    print(f"Top-k: {best_top_k}")

tune_full_hybrid_pipeline(eval_set, bm25_index, dense_retriever, reranker)

Key Takeaways

  • Parameter tuning requires a labeled evaluation set (50–200 queries) and a metric (NDCG@10 for retrieval, MRR@5 for QA).
  • Grid search over BM25's k1 and b parameters typically yields 2–5% NDCG improvements; larger improvements come from architecture changes (e.g., adding reranking).
  • Fusion weights can be learned from labeled data via logistic regression, improving hybrid accuracy by 2–3% over fixed weights.
  • Top-k thresholds plateau quickly: k=50 typically captures ~95% of the gains of k=200 while scoring a quarter as many candidates.
  • Tuning is an iterative process: establish baseline, tune one component, measure improvement, move to the next.

Frequently Asked Questions

How many evaluation queries do I need for meaningful tuning?

Start with 50 queries for initial parameter exploration. For production systems, aim for 100–200 queries representing diverse intent. More queries provide better statistical confidence, especially for detecting small improvements (1–2%).
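To check whether a small improvement is distinguishable from noise at a given evaluation-set size, one option is a paired percentile bootstrap over per-query scores. The sketch below is one reasonable approach under stated assumptions, not the only one: the function name, resample count, and sample scores are all illustrative.

```python
import random

def bootstrap_mean_diff_ci(baseline: list[float], tuned: list[float],
                           n_resamples: int = 2000, alpha: float = 0.05,
                           seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean per-query difference (tuned - baseline)."""
    if len(baseline) != len(tuned) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, tuned)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and record the mean difference
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))]
    return lower, upper

# Made-up per-query NDCG@10 values, for illustration only
baseline_scores = [0.52, 0.61, 0.40, 0.75, 0.66, 0.58, 0.49, 0.70]
tuned_scores = [0.55, 0.60, 0.47, 0.78, 0.66, 0.63, 0.50, 0.74]
low, high = bootstrap_mean_diff_ci(baseline_scores, tuned_scores)
print(f"95% CI for mean NDCG@10 gain: [{low:.3f}, {high:.3f}]")
```

If the interval excludes 0, the gain is unlikely to be an artifact of which queries happened to be sampled; a wide interval that straddles 0 suggests the evaluation set is too small for the effect you are trying to detect.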

Should I use grid search or Bayesian optimization?

Grid search is simpler, fully interpretable, and sufficient for 2–3 parameters. Use it to understand parameter sensitivity (e.g., "NDCG is higher for k1=1.5 than k1=1.0"). Bayesian optimization is usually overkill at this scale; it pays off when you optimize 5+ interdependent parameters jointly or when each evaluation run is expensive.
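If the grid grows too large but a Bayesian optimizer feels heavy, plain random search is a lightweight middle ground. To be clear, the sketch below is random search, not Bayesian optimization, and it uses only the standard library. The helper name and the toy objective are assumptions; in practice the objective would run your evaluation and return NDCG@10.

```python
import random

def random_search(objective, space: dict[str, tuple[float, float]],
                  n_trials: int = 30, seed: int = 0) -> tuple[dict[str, float], float]:
    """Sample each parameter uniformly from its range and keep the best trial."""
    rng = random.Random(seed)
    best_params: dict[str, float] = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for "evaluate NDCG@10 with these parameters";
# its peak at k1=1.4, b=0.7 is arbitrary.
def toy_objective(k1: float, b: float) -> float:
    return 1.0 - (k1 - 1.4) ** 2 - (b - 0.7) ** 2

search_space = {"k1": (0.5, 2.5), "b": (0.0, 1.0)}
found_params, found_score = random_search(toy_objective, search_space, n_trials=200)
print(f"Best params: {found_params}, score: {found_score:.4f}")
```

Unlike a grid, the trial budget is independent of the number of parameters, so adding the fusion weights to the search space does not multiply the number of evaluation runs.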

How do I prevent overfitting my parameters to the evaluation set?

Hold out a separate test set (20–30% of queries) for final evaluation. Tune on the remaining queries, then measure once on the test set. If test NDCG is significantly lower than the NDCG you measured while tuning, the parameters are overfit; use a coarser grid, tune fewer parameters, or add more evaluation queries.
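A minimal sketch of such a split, assuming the evaluation set is a dict keyed by query string as in the earlier examples. The function name, the 25% default, and the fixed seed are illustrative choices.

```python
import random

def split_eval_set(eval_set: dict, test_fraction: float = 0.25,
                   seed: int = 42) -> tuple[dict, dict]:
    """Split an evaluation set into a tuning portion and a held-out test portion."""
    if len(eval_set) < 2:
        raise ValueError("need at least two queries to split")
    # Sort first so the split depends only on the seed, not on dict order
    queries = sorted(eval_set)
    random.Random(seed).shuffle(queries)
    n_test = min(len(queries) - 1, max(1, round(len(queries) * test_fraction)))
    test_queries = set(queries[:n_test])
    tune_set = {q: a for q, a in eval_set.items() if q not in test_queries}
    test_set = {q: a for q, a in eval_set.items() if q in test_queries}
    return tune_set, test_set

# Hypothetical evaluation set with 20 queries
example_eval_set = {f"query {i}": {"relevant_docs": [f"doc{i}"]} for i in range(20)}
tune_set, test_set = split_eval_set(example_eval_set, test_fraction=0.25)
print(len(tune_set), "tuning queries,", len(test_set), "held-out queries")
```

Keeping the seed fixed matters: if the split changes between experiments, test queries leak into tuning and the held-out score stops being an honest estimate.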

What if my evaluation set has only binary judgments (relevant/irrelevant)?

Binary judgments work fine for NDCG and MAP. Graded judgments (0–3) are slightly better because they capture "highly relevant" vs. "marginally relevant" distinctions, but binary is sufficient for tuning. The metric still works; results are just coarser.
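If you do collect graded judgments, the same NDCG formula applies with the grade as the gain. Below is a minimal sketch assuming grades on the 0–3 scale described earlier, stored as a dict from document ID to grade; the function names and sample data are illustrative, and unjudged documents are treated as grade 0.

```python
import math

def dcg_graded(gains: list[int], k: int = 10) -> float:
    """DCG with exponential gain: (2^grade - 1) / log2(position + 1)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_graded(retrieved: list[str], grades: dict[str, int], k: int = 10) -> float:
    """NDCG@k for graded judgments; documents without a grade count as 0."""
    gains = [grades.get(doc_id, 0) for doc_id in retrieved[:k]]
    ideal_gains = sorted(grades.values(), reverse=True)[:k]
    idcg = dcg_graded(ideal_gains, k)
    return dcg_graded(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical judgments: 3 = perfectly relevant, 1 = marginally relevant
grades = {"a": 3, "b": 1, "c": 2}
print(ndcg_graded(["a", "c", "b"], grades))  # ideal order
print(ndcg_graded(["b", "c", "a"], grades))  # best document ranked last
```

With grades restricted to 0 and 1 this reduces to the binary NDCG used elsewhere in this article, so switching to graded judgments later does not require changing the metric.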

How often should I re-tune parameters?

Re-tune every 6–12 months or when your corpus/query distribution changes significantly (e.g., new document types, seasonal query shift). Minor updates to the corpus do not require re-tuning.
