
Building a Hybrid RAG Retriever

Vector embeddings are powerful but imperfect. A query for "Python 3.11 release date" can miss a passage stating "Python 3.11 was released in October 2022", because semantic embeddings capture exact dates and version numbers poorly. Conversely, keyword search excels at exact terms but misses paraphrases and synonyms. Hybrid retrieval combines both signals: vector search for semantic intent and keyword search for exact matches. Published comparisons often report gains of 20–40% over pure vector search, though the margin depends heavily on the dataset and metric. This article covers how to implement hybrid retrieval, fuse the ranked results, and evaluate the combined system.

Why Hybrid Retrieval Wins

Vector embeddings capture intent: "How do I fetch data from an API?" retrieves content about HTTP requests, REST calls, and web services, all of which are semantically similar. But they struggle with specificity. The query "PostgreSQL 15.2" might not retrieve the documentation for that exact version if the embedding space places all PostgreSQL versions in the same region. Keyword search (typically scored with BM25, Elasticsearch's default scoring algorithm) does the opposite: it matches keywords precisely but ignores synonyms. "Fetch data" and "download data" are semantically similar, yet keyword-only retrieval treats them as unrelated.

Hybrid systems retrieve results from both signals and fuse them, giving you both semantic coverage and precision. A 2024 study (by researchers at Cohere, published on their blog) showed that hybrid retrieval achieves 89% recall on diverse QA datasets, compared to 82% for pure vector and 75% for pure keyword search.

BM25 (Best Matching 25) is the industry-standard ranking function for keyword search. It scores documents based on term frequency (how often a term appears in a document) and inverse document frequency (how rare the term is across all documents), normalized by document length. Here is how to implement hybrid retrieval with Elasticsearch and OpenAI embeddings:

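from elasticsearch import Elasticsearch
from openai import OpenAI

# Assumes a local Elasticsearch 8.x node (the knn search option used below needs 8.x)
# and an OPENAI_API_KEY set in the environment.
es_client = Elasticsearch("http://localhost:9200")
openai_client = OpenAI()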

def index_documents_hybrid(docs: list[dict]) -> None:
    """Index documents for both keyword (BM25) and vector search."""
    for i, doc in enumerate(docs):
        # Embed the document for vector search
        embedding = openai_client.embeddings.create(
            input=doc["text"],
            model="text-embedding-3-small"
        ).data[0].embedding

        # Index in Elasticsearch with both text and embedding
        es_client.index(
            index="knowledge_base",
            id=i,
            document={
                "text": doc["text"],
                "title": doc.get("title", ""),
                "source": doc.get("source", ""),
                "embedding": embedding  # Requires vector field in mapping
            }
        )
    # Make the new documents visible to search before returning
    es_client.indices.refresh(index="knowledge_base")
    print(f"Indexed {len(docs)} documents")

def hybrid_search(query: str, k: int = 10) -> list[dict]:
    """Retrieve documents using both BM25 and vector search, then fuse results."""

    # Step 1: BM25 keyword search
    keyword_results = es_client.search(
        index="knowledge_base",
        query={
            "multi_match": {
                "query": query,
                "fields": ["text", "title"]
            }
        },
        size=k
    )

    # Step 2: Vector/semantic search
    query_embedding = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    vector_results = es_client.search(
        index="knowledge_base",
        knn={
            "field": "embedding",
            "query_vector": query_embedding,
            "k": k,
            "num_candidates": max(100, k)  # must be at least k
        },
        size=k  # without this, at most 10 hits are returned even if k > 10
    )

    # Step 3: Fuse results using Reciprocal Rank Fusion (RRF)
    # Each list contributes 1 / (rank + 60) per document; contributions are summed
    fused_scores = {}
    sources = {}  # doc_id -> _source, kept so no second round trip is needed
    rrf_constant = 60

    for result_set in (keyword_results, vector_results):
        for rank, hit in enumerate(result_set["hits"]["hits"], 1):
            doc_id = hit["_id"]
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1 / (rank + rrf_constant)
            sources[doc_id] = hit["_source"]

    # Sort by fused score, highest first
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)

    # Build the final result list from the cached sources
    results = []
    for doc_id, score in sorted_docs[:k]:
        source = sources[doc_id]
        results.append({
            "doc_id": doc_id,
            "text": source["text"],
            "title": source.get("title", ""),
            "source": source.get("source", ""),
            "fused_score": score
        })

    return results

# Example usage
docs = [
    {"title": "Python 3.11 Release Notes", "text": "Python 3.11 was released in October 2022. It features improved error messages."},
    {"title": "Async Programming", "text": "Async/await in Python allows concurrent I/O operations without threading."},
    {"title": "API Design", "text": "RESTful APIs use HTTP methods: GET to fetch data, POST to create, PUT to update."}
]

index_documents_hybrid(docs)

# Retrieve results
query = "How do I fetch data with Python?"
results = hybrid_search(query, k=3)

print(f"\nHybrid search results for '{query}':")
for i, doc in enumerate(results, 1):
    print(f"{i}. [{doc['source']}] {doc['title']} (score={doc['fused_score']:.3f})")
    print(f"   {doc['text'][:60]}...\n")
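The indexing code notes that the embedding field requires a vector mapping, but the mapping itself is not shown. A minimal sketch of what it might look like follows, assuming Elasticsearch 8.x `dense_vector` fields, cosine similarity, and the default 1536-dimension output of text-embedding-3-small. The names `KNOWLEDGE_BASE_MAPPINGS` and `create_knowledge_base_index`, and the choice of `keyword` for the source field, are illustrative assumptions rather than part of the original code.

```python
# Hypothetical mapping for the "knowledge_base" index (an assumption, not from the article).
KNOWLEDGE_BASE_MAPPINGS = {
    "properties": {
        "title": {"type": "text"},       # analyzed text, scored with BM25
        "text": {"type": "text"},        # analyzed text, scored with BM25
        "source": {"type": "keyword"},   # exact-match metadata (assumed)
        "embedding": {
            "type": "dense_vector",
            "dims": 1536,                # must match the embedding model's output size
            "index": True,               # build an ANN index so knn search can use it
            "similarity": "cosine",
        },
    }
}


def create_knowledge_base_index(client, index: str = "knowledge_base") -> None:
    """Create the index with the mapping above unless it already exists.

    `client` is expected to behave like an elasticsearch-py 8.x client.
    """
    if not client.indices.exists(index=index):
        client.indices.create(index=index, mappings=KNOWLEDGE_BASE_MAPPINGS)
```

Calling `create_knowledge_base_index(es_client)` once before indexing would set this up. If you switch embedding models, `dims` has to change with it.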

Ranking and Fusion Strategies

After retrieving results from both keyword and vector systems, you must combine them. Several approaches exist:

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Reciprocal Rank Fusion (RRF) | Uses ranks only, balances both systems | Ignores confidence scores | General purpose, stable |
| Weighted Combination | Tunable: e.g., 0.6 × vector + 0.4 × keyword | Requires parameter tuning, can overfit | Domain-specific optimization |
| Learning-to-Rank | ML model learns optimal fusion from labels | Needs labeled data, complex | Large-scale production systems |
| Late Fusion (Reranking) | Reranker model scores each result | Extra inference cost (~50ms) | High quality requirements where the extra cost is affordable |
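To make the Weighted Combination strategy concrete, here is a minimal sketch. It assumes min-max normalization to bring BM25 scores and cosine similarities onto a common [0, 1] scale before weighting; that normalization choice, the function names, and the sample scores are all illustrative assumptions, not something the table prescribes.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.6,
) -> list[tuple[str, float]]:
    """Fuse two score dictionaries as w * vector + (1 - w) * keyword."""
    v = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    keyword_weight = 1.0 - vector_weight
    fused = {
        # A document missing from one list contributes 0 from that list
        doc_id: vector_weight * v.get(doc_id, 0.0) + keyword_weight * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: cosine similarities vs. raw BM25 scores on different scales
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.5}
keyword_scores = {"b": 12.0, "c": 7.0, "d": 2.0}
ranking = weighted_fusion(vector_scores, keyword_scores, vector_weight=0.6)
print([doc_id for doc_id, _ in ranking])  # ['b', 'a', 'c', 'd']
```

Unlike RRF, this approach is sensitive to how scores are distributed, which is why the weight usually has to be tuned on a labeled set for each domain.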

Reciprocal Rank Fusion is the standard: for each search system, rank results 1, 2, 3, ..., K. Then compute each result's score as 1/(rank_keyword + 60) + 1/(rank_vector + 60); a document that appears in only one list gets only that list's term. The constant 60 (chosen empirically) prevents top-1 results from dominating; vary it if needed. RRF is score-agnostic: because it uses only ranks, it works equally well whether a system returns confidence scores in [0, 1] or [0, 100].
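The formula can be checked on a small example without any search backend. The sketch below is a standalone version of the fusion step that operates on plain lists of document IDs; the function name and the sample rankings are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; each list adds 1 / (rank + k) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Here doc_b (ranks 2 and 1) edges out doc_a (ranks 1 and 3), and both rank above doc_d and doc_c, which each appear in only one list.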

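For higher quality, add a reranker (see article 5): a comparatively lightweight model that re-scores each candidate against the query and assigns it a relevance score. The reranker works on the candidates that retrieval returns, so it complements fusion rather than replacing it, and adding one is often more cost-effective than extensively tuning hybrid parameters.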

Best Practices for Hybrid Retrieval

  1. Index once, search twice: Embed documents for vector search only once during indexing. Reuse the vectors across multiple queries; it is the query embedding that changes.

  2. Tune search parameters: Keyword search thresholds (minimum match score) and vector search k can be tuned separately. For a 10-result final set, retrieve 20 from each system (keyword and vector), then rerank/fuse to 10.

  3. Monitor quality gaps: Track which queries are better served by vector vs. keyword. If many queries about specific facts (dates, versions, names) fail, keyword search is weak; add more structured metadata. If semantic queries (explain, compare, relate) fail, vector search needs improvement.

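  4. Avoid double-counting: A document that appears in both the keyword and vector lists should end up as a single fused entry whose score is the sum of its two rank contributions. That overlap bonus is intended, since agreement between the two retrievers is evidence of relevance. What inflates scores incorrectly is the same content indexed under several IDs (for example, overlapping chunks), so key the fused scores on a stable document ID.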
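The monitoring practice (item 3) can be sketched as a small evaluation helper that reports recall@k per query type and per retriever. The row format, the query-type labels, and the sample data below are hypothetical; substitute your own labeled evaluation set.

```python
from collections import defaultdict


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def recall_by_query_type(eval_rows: list[dict], k: int = 10) -> dict[str, dict[str, float]]:
    """Average recall@k for every (query type, retriever) pair."""
    per_type = defaultdict(lambda: defaultdict(list))
    for row in eval_rows:
        for retriever, retrieved in row["results"].items():
            per_type[row["query_type"]][retriever].append(
                recall_at_k(retrieved, row["relevant"], k)
            )
    return {
        query_type: {name: sum(vals) / len(vals) for name, vals in retrievers.items()}
        for query_type, retrievers in per_type.items()
    }


# Hypothetical labeled queries with the IDs each retriever returned
eval_rows = [
    {"query_type": "factual", "relevant": {"d1"},
     "results": {"keyword": ["d1", "d4"], "vector": ["d3", "d4"]}},
    {"query_type": "factual", "relevant": {"d2"},
     "results": {"keyword": ["d2"], "vector": ["d2", "d5"]}},
    {"query_type": "semantic", "relevant": {"d6", "d7"},
     "results": {"keyword": ["d6"], "vector": ["d6", "d7"]}},
]

report = recall_by_query_type(eval_rows, k=2)
for query_type, retrievers in report.items():
    print(query_type, retrievers)
```

In this toy data, keyword search wins on factual queries and vector search wins on semantic ones, which is the kind of gap the practice is meant to surface.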

Key Takeaways

  • Hybrid retrieval combines BM25 keyword search and vector embedding search and typically outperforms either alone; reported gains of 20–40% over pure vector search vary by dataset.
  • Reciprocal Rank Fusion (RRF) is the standard fusion algorithm: score = 1/(keyword_rank + 60) + 1/(vector_rank + 60).
  • Index documents once with embeddings, then query them twice (keyword and vector), then fuse and rerank.
  • Monitor query success by type (factual vs. semantic) to identify which retrieval component to optimize.
  • Reranking (article 5) is often more cost-effective than tuning hybrid parameters.

Frequently Asked Questions

Should I use BM25 or TF-IDF for keyword search?

BM25 is superior to TF-IDF for most tasks. BM25 includes document length normalization and term-frequency saturation (diminishing returns as a term repeats), making it more robust. Elasticsearch uses BM25 by default, so use it unless your tool offers only TF-IDF.
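To see saturation and length normalization at work, here is a minimal sketch of the classic Okapi BM25 formula. It assumes whitespace tokenization, the commonly used defaults k1 = 1.2 and b = 0.75, and a non-negative IDF variant; production engines add analyzers and other refinements, so treat the numbers as illustrative only.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with a simplified BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher IDF; the "1 +" keeps it non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized via b
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            # Term frequency saturates: the ratio approaches (k1 + 1) as tf grows
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = ["python release notes", "python python python python", "async io guide"]
scores = bm25_scores("python", docs)
print([round(s, 3) for s in scores])
```

The second document repeats the query term four times but scores well under four times the first document, and the third scores zero because the term is absent.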

What fusion constant should I use in RRF?

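The constant 60 is the value used in the original RRF paper and has become the common default across industry and academia. Some teams use 1 or 100; the exact value is less important than consistency. Because RRF uses only ranks, differences in score scale between the keyword and vector systems do not matter. What the constant controls is how strongly top ranks are favored: smaller values weight the first few positions more heavily, larger values flatten the differences. Experiment with 10–100 and evaluate on your test set (article 8).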
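A quick way to get a feel for the constant is to compare how much a rank-1 hit contributes relative to a rank-10 hit under different values. The helper names below are illustrative.

```python
def rrf_contribution(rank: int, k: int) -> float:
    """Contribution of a single list position to the fused RRF score."""
    return 1.0 / (rank + k)


def top_rank_advantage(k: int, lower_rank: int = 10) -> float:
    """How many times more a rank-1 hit contributes than a hit at lower_rank."""
    return rrf_contribution(1, k) / rrf_contribution(lower_rank, k)


for k in (1, 10, 60, 100):
    print(f"k={k:>3}: rank 1 counts {top_rank_advantage(k):.2f}x as much as rank 10")
```

With k = 1 the top hit counts 5.5 times as much as the tenth; with k = 60 the ratio drops to about 1.15, so agreement across lists matters more than a single first-place finish.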

Does hybrid retrieval increase latency significantly?

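Two searches instead of one add latency, but both are fast. A well-indexed BM25 system returns top-10 in 5–10ms; vector search on ANN indices returns in 10–20ms. Run one after the other, the two searches take about 15–30ms, so a total of 20–40ms including fusion overhead is still acceptable for real-time systems; running them in parallel brings the total closer to the slower of the two. These figures exclude the time needed to embed the query. Reranking adds another 50ms if used.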
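Since the two searches are independent, they can be issued concurrently. The sketch below uses a thread pool with stand-in search functions that simply sleep; the function names and the 10ms and 20ms delays are placeholders, not measurements.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_keyword_search(query: str) -> list[str]:
    time.sleep(0.01)  # stand-in for a BM25 round trip
    return ["doc_a", "doc_b"]


def fake_vector_search(query: str) -> list[str]:
    time.sleep(0.02)  # stand-in for query embedding plus ANN lookup
    return ["doc_b", "doc_c"]


def search_concurrently(query: str) -> tuple[list[str], list[str]]:
    """Run both searches at once; wall time is roughly that of the slower one."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        keyword_future = pool.submit(fake_keyword_search, query)
        vector_future = pool.submit(fake_vector_search, query)
        return keyword_future.result(), vector_future.result()


start = time.perf_counter()
keyword_hits, vector_hits = search_concurrently("fetch data")
elapsed = time.perf_counter() - start
print(f"keyword={keyword_hits} vector={vector_hits} elapsed={elapsed * 1000:.1f}ms")
```

Threads suit this case because the work is I/O-bound; an async client would be a reasonable alternative if the rest of the service is already asynchronous.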

Can I use pure vector search with a better embedding model instead of hybrid?

Possibly. Larger embedding models (OpenAI text-embedding-3-large, 3072 dims by default) sometimes approach hybrid quality on some benchmarks. Test on your evaluation set. Pure vector is simpler to operate, but hybrid's robustness usually justifies the complexity.

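Does hybrid retrieval work for multilingual content?

BM25 itself is language-agnostic (it counts term occurrences), though tokenization and stemming are language-specific, so configure an analyzer that suits your languages. For vector search, use a multilingual embedding model (for example, Cohere embed-multilingual-v3.0). RRF fusion works unchanged across languages.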

Further Reading