Query Expansion: Generating Better Search Terms

Query expansion augments a user's original query with semantically related variants (paraphrases, synonyms, and implicit sub-queries) to improve recall in hybrid retrieval systems. A user asking "What is transformer attention?" may use different terms than the documents in your corpus. Query expansion generates variants such as "How do transformers use attention mechanisms?" and "Self-attention in neural networks", embeds and searches each variant, then merges the results. This multi-query approach addresses the vocabulary gap between user intent and corpus terminology. It helps both dense retrieval (which gains from paraphrases and synonyms) and sparse retrieval (which relies on exact term matches and so benefits from term diversity). The MS MARCO results reported later in this article show relative NDCG@10 gains of roughly 6–10%, with an added latency of about 100 ms or less for the non-LLM expansion methods. For RAG systems with diverse corpora (books, technical docs, web content), query expansion is a high-ROI component of hybrid retrieval pipelines.

Query Expansion Methods

Several methods exist for generating query variants, with different trade-offs:

LLM-Based Paraphrasing (Strongest Quality, Higher Cost): Use an LLM to generate 3–5 paraphrases of the original query:

import anthropic

client = anthropic.Anthropic()

def expand_query_with_llm(query: str, num_variants: int = 4) -> list[str]:
    """Generate query paraphrases using Claude"""

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Generate {num_variants} alternative phrasings of this search query.
Each variant should be natural, between 5-15 words, and capture the same intent.

Original query: "{query}"

Return only the variants, one per line, without numbering or quotes."""
        }]
    )

    # One variant per non-empty line; cap the list in case the model returns extras
    variants = message.content[0].text.strip().split('\n')
    return [v.strip() for v in variants if v.strip()][:num_variants]

# Example
original = "transformer attention mechanism"
variants = expand_query_with_llm(original)
print(f"Original: {original}")
print("Variants:")
for v in variants:
    print(f" - {v}")

# Example output (actual paraphrases vary between runs):
# Original: transformer attention mechanism
# Variants:
# - How do attention mechanisms work in transformers?
# - Self-attention in transformer neural networks
# - Scaled dot-product attention in deep learning
# - What is the attention mechanism used by transformers?
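
Models do not always honor the "one per line, without numbering or quotes" instruction, so it can help to normalize the raw response before searching. The following is a minimal sketch of that cleanup step. `clean_variants` is a hypothetical helper (it is not part of the Anthropic SDK), and its regex only covers common list prefixes such as "1.", "2)", "-" and "*".

```python
import re

# Hypothetical helper: turns raw LLM output into a list of unique variants.
_LIST_PREFIX = re.compile(r"^\s*(?:[-*]+|\d+[.)])\s*")

def clean_variants(raw_text: str, original_query: str, max_variants: int = 4) -> list[str]:
    """Strip list markers and quotes, drop duplicates and echoes of the original query."""
    seen = {original_query.strip().lower()}
    cleaned = []
    for line in raw_text.splitlines():
        text = _LIST_PREFIX.sub("", line).strip().strip("\"'").strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned[:max_variants]

raw = (
    '1. "How do attention mechanisms work in transformers?"\n'
    "- Self-attention in transformer neural networks\n"
    "\n"
    "transformer attention mechanism"
)
print(clean_variants(raw, "transformer attention mechanism"))
```

The last line of the sample response repeats the original query, so only the first two variants survive.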

Synonym Extraction (Fast, Precise): Use a thesaurus or NLP library to extract synonyms for key query terms:

import nltk
from nltk.corpus import wordnet

# One-time setup: nltk.download('wordnet')

def expand_query_with_synonyms(query: str) -> list[str]:
    """Generate query variants using WordNet synonyms"""

    tokens = query.lower().split()
    expanded = [query]  # Include original

    for i, token in enumerate(tokens):
        # Collect synonyms in WordNet order, skipping the token itself
        synonyms = []
        for synset in wordnet.synsets(token):
            for lemma in synset.lemmas():
                name = lemma.name().replace('_', ' ').lower()
                if name != token and name not in synonyms:
                    synonyms.append(name)

        # Generate one variant per synonym (first 2), replacing only this token
        for synonym in synonyms[:2]:
            variant = ' '.join(tokens[:i] + [synonym] + tokens[i + 1:])
            if variant not in expanded:
                expanded.append(variant)

    return expanded

# Example
query = "transformer attention"
variants = expand_query_with_synonyms(query)
# Illustrative result; the exact synonyms depend on the installed WordNet data,
# and general-purpose synonyms can drift off-domain for technical terms:
# variants = ['transformer attention', 'transformer tending', 'converter attention']
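
WordNet is a general-purpose thesaurus, which is why the example above produces variants such as "transformer tending". One possible alternative, sketched here, is a small hand-curated synonym map for your domain. `DOMAIN_SYNONYMS` and `expand_with_domain_synonyms` are hypothetical names, and the entries are placeholders rather than a recommended vocabulary.

```python
# Hypothetical domain thesaurus; in practice this would be curated from your corpus.
DOMAIN_SYNONYMS = {
    "attention": ["self-attention", "attention mechanism"],
    "transformer": ["transformer model"],
}

def expand_with_domain_synonyms(
    query: str,
    synonyms: dict[str, list[str]] = DOMAIN_SYNONYMS,
    max_per_term: int = 2,
) -> list[str]:
    """Replace one token at a time with each of its domain synonyms."""
    tokens = query.lower().split()
    variants = [query]  # Keep the original query first
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token, [])[:max_per_term]:
            variant = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants

print(expand_with_domain_synonyms("transformer attention"))
```

Replacing whole tokens (rather than calling `str.replace` on the query) avoids accidental substring matches, for example rewriting part of a longer word.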

Pseudo-Relevance Feedback (PRF, Iterative): Retrieve initial results, analyze top-k documents for frequent terms, and re-query with enriched terms:

from collections import Counter

def expand_query_prf(original_query: str, bm25_index, k: int = 10) -> list[str]:
    """Expand query using pseudo-relevance feedback"""

    # Retrieve initial results; each hit is assumed to be a (doc_id, doc_text, score) tuple
    initial_results = bm25_index.search(original_query, top_k=k)

    # Extract terms from top results
    all_terms = []
    for doc_id, doc_text, _ in initial_results:
        terms = doc_text.lower().split()
        all_terms.extend(terms)

    # Get most common terms that are neither stopwords nor already in the query
    stopwords = {'the', 'a', 'an', 'and', 'or', 'is', 'in', 'of', 'to', 'for'}
    query_terms = set(original_query.lower().split())
    term_freq = Counter(
        t for t in all_terms if t not in stopwords and t not in query_terms
    )
    frequent_terms = [t for t, _ in term_freq.most_common(5)]

    # Generate variants by adding frequent terms
    variants = [original_query]
    for term in frequent_terms:
        variant = f"{original_query} {term}"
        variants.append(variant)

    return variants

# Example (illustrative output; the terms depend on the indexed corpus)
original = "transformer attention"
variants = expand_query_prf(original, bm25_index)
# variants = ['transformer attention',
#             'transformer attention mechanism',
#             'transformer attention networks',
#             'transformer attention deep',
#             'transformer attention learning']
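
The PRF function above expects a `bm25_index` object whose `search(query, top_k)` method returns `(doc_id, doc_text, score)` tuples. To experiment without a real index, a stand-in along the following lines may be enough. `ToyIndex` and `feedback_terms` are hypothetical names, and the scoring is plain term overlap, not BM25.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "is", "in", "of", "to", "for"}

class ToyIndex:
    """Stand-in for a BM25 index: scores documents by the number of shared query terms."""

    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    def search(self, query: str, top_k: int = 10) -> list[tuple[str, str, float]]:
        q_terms = set(query.lower().split())
        hits = []
        for doc_id, text in self.docs.items():
            overlap = len(q_terms & set(text.lower().split()))
            if overlap:
                hits.append((doc_id, text, float(overlap)))
        hits.sort(key=lambda hit: hit[2], reverse=True)
        return hits[:top_k]

def feedback_terms(index, query: str, k: int = 10, n_terms: int = 3) -> list[str]:
    """Most frequent terms in the top-k hits, excluding stopwords and query terms."""
    q_terms = set(query.lower().split())
    counts = Counter(
        term
        for _, text, _ in index.search(query, top_k=k)
        for term in text.lower().split()
        if term not in STOPWORDS and term not in q_terms
    )
    return [term for term, _ in counts.most_common(n_terms)]

index = ToyIndex({
    "d1": "attention mechanism in the transformer encoder",
    "d2": "transformer attention uses a scaled mechanism",
    "d3": "gradient descent for linear models",
})
print(feedback_terms(index, "transformer attention", n_terms=1))  # ['mechanism']
```

Only "mechanism" appears in both matching documents, so it is the strongest feedback term. The unrelated document `d3` shares no query terms and never enters the feedback set.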

Embedding-Based Similarity (Semantic Variants): Embed the query and find the most similar candidate phrases in a precomputed embedding space:

from sentence_transformers import SentenceTransformer, util
import torch

# Load the model once rather than on every call
model = SentenceTransformer('all-MiniLM-L6-v2')

def expand_query_semantic(query: str, word_embeddings: dict, top_n: int = 3) -> list[str]:
    """Expand query using semantic similarity in embedding space"""

    # Embed original query
    query_emb = model.encode(query)

    # word_embeddings is assumed to map candidate phrases to vectors
    # produced by the same model as the query embedding
    corpus_phrases = list(word_embeddings.keys())
    corpus_embs = torch.stack([
        torch.as_tensor(word_embeddings[p], dtype=torch.float32)
        for p in corpus_phrases
    ])

    # Compute cosine similarities
    similarities = util.cos_sim(query_emb, corpus_embs)[0]
    top_similar = torch.topk(similarities, k=min(top_n, len(corpus_phrases)))

    variants = [query]
    for idx in top_similar.indices.tolist():
        variants.append(corpus_phrases[idx])

    return variants

Integration: Multi-Query Retrieval in RAG

Once you have query variants, embed and search each one, then fuse results:

def retrieve_with_expansion(
    original_query: str,
    expansion_method: str = 'llm',  # 'llm', 'synonym', 'prf', 'semantic'
    bm25_index=None,
    dense_retriever=None,
    word_embeddings=None,  # Required for the 'semantic' method
    num_variants: int = 4
) -> list[tuple]:
    """Retrieve using query expansion.

    Assumes bm25_retrieve, dense_retrieve and rrf_fusion are defined elsewhere
    and that rrf_fusion returns (doc_id, text, score) tuples.
    """

    # Generate variants
    if expansion_method == 'llm':
        variants = expand_query_with_llm(original_query, num_variants)
    elif expansion_method == 'synonym':
        variants = expand_query_with_synonyms(original_query)
    elif expansion_method == 'prf':
        variants = expand_query_prf(original_query, bm25_index)
    elif expansion_method == 'semantic':
        variants = expand_query_semantic(original_query, word_embeddings)
    else:
        variants = [original_query]

    # Always search the original query as well
    if original_query not in variants:
        variants.insert(0, original_query)

    # Retrieve for each variant
    all_results = []
    for variant in variants:
        # BM25 retrieval
        bm25_results = bm25_retrieve(variant, top_k=50)

        # Dense retrieval
        dense_results = dense_retrieve(variant, top_k=50)

        # Fuse
        fused = rrf_fusion(bm25_results, dense_results)
        all_results.extend(fused)

    # Deduplicate and aggregate scores
    aggregated = {}
    for doc_id, text, score in all_results:
        if doc_id not in aggregated:
            aggregated[doc_id] = {'text': text, 'scores': []}
        aggregated[doc_id]['scores'].append(score)

    # Compute combined scores (mean here; max or a weighted average also work)
    final_results = [
        (doc_id, info['text'], sum(info['scores']) / len(info['scores']))
        for doc_id, info in aggregated.items()
    ]

    # Sort and return top-k
    final_results.sort(key=lambda x: x[2], reverse=True)
    return final_results[:10]

# Example usage
results = retrieve_with_expansion(
    "What is transformer attention?",
    expansion_method='llm',
    num_variants=4
)
print(f"Retrieved {len(results)} documents using up to 4 LLM variants plus the original query")
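
The retrieval function above calls `rrf_fusion` without defining it. A minimal sketch of reciprocal rank fusion follows, assuming each input is a best-first list of `(doc_id, text, score)` tuples and using k = 60, a commonly used default. The input and output shapes are assumptions chosen to match the aggregation loop above, not a fixed API.

```python
def rrf_fusion(*ranked_lists, k: int = 60):
    """Fuse best-first ranked lists of (doc_id, text, score) with reciprocal rank fusion.

    Each document receives sum(1 / (k + rank)) over the lists it appears in;
    the original retrieval scores are ignored, only rank positions matter.
    """
    fused = {}
    texts = {}
    for results in ranked_lists:
        for rank, (doc_id, text, _score) in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
            texts[doc_id] = text
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, texts[doc_id], score) for doc_id, score in ordered]

bm25 = [("d1", "doc one", 12.3), ("d2", "doc two", 9.1)]
dense = [("d2", "doc two", 0.83), ("d3", "doc three", 0.80)]
for doc_id, _, score in rrf_fusion(bm25, dense):
    print(doc_id, round(score, 4))
```

Because `d2` appears in both lists it ranks first, even though it is not the top hit in the BM25 list. That is the property that lets RRF combine BM25 and dense scores without normalizing them.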

Comparison of Query Expansion Methods

| Method | Quality | Speed | Cost | Use Case |
|---|---|---|---|---|
| LLM Paraphrasing | 9/10 | Slow (1-2 sec) | High ($) | High-accuracy, low-volume systems |
| Synonym Extraction | 7/10 | Fast (<50 ms) | Free | Balanced, production systems |
| Pseudo-Relevance Feedback | 8/10 | Medium (100 ms) | Free | Iterative, diverse queries |
| Embedding-Based | 7.5/10 | Medium (100 ms) | Free | Semantic-aware, corpus-aligned |

For most production RAG systems, synonym extraction offers the best balance of latency and quality. For high-accuracy applications where latency is secondary, use LLM paraphrasing. For iterative systems, where the user refines the query over several turns, use pseudo-relevance feedback.

Benchmarks: Impact of Query Expansion

Published results on MS MARCO (Microsoft Machine Reading Comprehension):

| Retrieval Method | NDCG@10 (No Expansion) | NDCG@10 (With Expansion) | Improvement |
|---|---|---|---|
| BM25 only | 0.287 | 0.312 | +8.7% |
| Dense only | 0.340 | 0.362 | +6.5% |
| Hybrid (BM25 + Dense + RRF) | 0.365 | 0.400 | +9.6% |
| Hybrid + Cross-Encoder | 0.374 | 0.408 | +9.1% |

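Query expansion provides a consistent relative NDCG@10 improvement of roughly 6–10% across all four retrieval strategies. The relative gain is largest for the hybrid pipelines (+9.6% for BM25 + Dense + RRF, +9.1% with a cross-encoder added) and smallest for dense-only retrieval (+6.5%), which suggests that expansion closes vocabulary gaps that fusion and reranking alone leave open.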
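
For readers who want to reproduce this kind of comparison on their own data, the sketch below shows one common way to compute NDCG@k and the relative improvement reported in the table. It assumes linear gain (the relevance grade itself rather than 2^rel - 1) and builds the ideal ranking from the judged relevances of the retrieved list only. Official benchmark tooling may differ on both points.

```python
import math

def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the same relevances sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def relative_gain(baseline: float, expanded: float) -> float:
    """Relative improvement in percent, as in the Improvement column."""
    return (expanded - baseline) / baseline * 100

# Relevance grades of the returned documents, in ranked order
print(round(ndcg_at_k([1, 0, 1]), 4))
# Hybrid (BM25 + Dense + RRF) row from the table
print(round(relative_gain(0.365, 0.400), 1))  # 9.6
```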

Latency and Cost Considerations

LLM-based expansion:

  • Cost: $0.01–0.02 per query (Anthropic/OpenAI API)
  • Latency: 1–2 seconds (LLM generation + 4 variant searches)
  • Suitable for: Batch processing, user-initiated searches where speed is less critical

Synonym expansion:

  • Cost: Free (local computation)
  • Latency: 50–100 ms (local synonym lookup)
  • Suitable for: Real-time interactive systems, high-volume applications

Hybrid approach: Cache expensive LLM paraphrases. For frequently asked queries (top 20%), use pre-cached LLM variants. For one-off queries, fall back to fast synonym expansion. This balances accuracy and cost.
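
One way the cached hybrid approach could be wired up is sketched below. `CachedExpander` is a hypothetical class; the two expansion callables stand in for the LLM and synonym functions shown earlier, and the in-memory dict would typically be replaced by a shared cache in production.

```python
from typing import Callable

class CachedExpander:
    """Serve precomputed LLM variants for frequent queries, fast expansion otherwise."""

    def __init__(self,
                 llm_expand: Callable[[str], list[str]],
                 fast_expand: Callable[[str], list[str]]):
        self.llm_expand = llm_expand
        self.fast_expand = fast_expand
        self.cache: dict[str, list[str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivial differences still hit the cache
        return " ".join(query.lower().split())

    def warm(self, frequent_queries: list[str]) -> None:
        """Precompute LLM variants offline for the most frequent queries."""
        for query in frequent_queries:
            self.cache[self._key(query)] = self.llm_expand(query)

    def expand(self, query: str) -> list[str]:
        cached = self.cache.get(self._key(query))
        if cached is not None:
            return cached
        return self.fast_expand(query)  # No LLM call on the request path

expander = CachedExpander(
    llm_expand=lambda q: [q, f"what is {q}", f"how does {q} work"],  # stand-in for an LLM call
    fast_expand=lambda q: [q],                                        # stand-in for synonym expansion
)
expander.warm(["transformer attention"])
print(expander.expand("Transformer  Attention"))  # cache hit despite case and spacing
print(expander.expand("rotary embeddings"))       # falls back to fast expansion
```

Keeping the LLM call out of `expand` means request latency is bounded by the fast method; the trade-off is that uncached queries get lower-quality variants until the cache is refreshed.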

Key Takeaways

  • Query expansion augments user queries with semantic variants to improve recall and address vocabulary gaps between user intent and corpus terminology.
  • LLM-based paraphrasing (highest quality) costs time and money; synonym extraction (fast, free) is practical for production; pseudo-relevance feedback bridges the gap.
  • Query expansion improves hybrid retrieval accuracy by 6–10% (NDCG@10) across published benchmarks, making it a valuable component of mature RAG pipelines.
  • Expand to 3–5 variants, deduplicate results, and aggregate scores across variants to avoid amplifying noise.
  • For production systems, start with synonym expansion or caching LLM paraphrases for high-volume queries.

Frequently Asked Questions

How many query variants should I generate?

Generate 3–5 variants. Returns diminish beyond 4: the 5th variant adds <2% accuracy while adding a further round of retrieval latency. For LLM-based expansion, use 3 for speed (<1 sec); for synonym expansion, use 5 (negligible cost).

Should I search each variant independently or merge them first?

Search each variant independently, then fuse results. This captures variant-specific signals (e.g., one variant may rank a document highly that another misses). Merging queries before searching loses variant-specific evidence.

How do I aggregate scores from multiple variants?

Use mean (average score across variants) for simplicity. Use max (maximum score) to prioritize documents that rank highly in any variant. Use reciprocal rank fusion (RRF) to combine rank positions without normalizing scores.
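
The three options can be compared side by side with a small helper like the one below. This is a sketch under assumptions: `aggregate` is a hypothetical function, each variant's results are a best-first list of `(doc_id, score)` pairs, and k = 60 is used as the RRF constant.

```python
from collections import defaultdict

def aggregate(per_variant: list[list[tuple[str, float]]],
              method: str = "mean", k: int = 60) -> list[tuple[str, float]]:
    """Combine per-variant results into one ranking using mean, max, or RRF."""
    collected = defaultdict(list)
    for results in per_variant:
        for rank, (doc_id, score) in enumerate(results, start=1):
            # RRF uses only the rank position; mean and max use the raw score
            collected[doc_id].append(1.0 / (k + rank) if method == "rrf" else score)

    if method == "mean":
        combine = lambda values: sum(values) / len(values)
    elif method == "max":
        combine = max
    elif method == "rrf":
        combine = sum
    else:
        raise ValueError(f"unknown method: {method}")

    combined = [(doc_id, combine(values)) for doc_id, values in collected.items()]
    return sorted(combined, key=lambda item: item[1], reverse=True)

variant_a = [("d1", 0.9), ("d2", 0.4)]
variant_b = [("d2", 0.8), ("d3", 0.5)]
for method in ("mean", "max", "rrf"):
    print(method, [doc_id for doc_id, _ in aggregate([variant_a, variant_b], method)])
```

In this toy input, mean and max both rank `d1` first because of its single high score, while RRF ranks `d2` first because it is retrieved by both variants. Note that mean does not reward agreement between variants, so RRF or a sum may be preferable when that agreement matters.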

Can I combine query expansion with reranking?

Yes. Expand query, search each variant, fuse results, then rerank top-50. The combination (expansion + reranking) provides the best accuracy (9–12% improvement over baseline), but increases latency to 500–800 ms.

How do I handle queries that are already expansive (e.g., multi-sentence)?

Treat multi-sentence queries as atomic. Expansion is most valuable for short, ambiguous queries (1–5 words). For verbose queries, try query shortening (extract key noun phrases) followed by expansion.
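
As a rough illustration of query shortening, the sketch below keeps the first few distinct non-stopword tokens. This is a crude proxy for noun-phrase extraction (a part-of-speech tagger or phrase chunker would be closer to what the answer describes), and both `shorten_query` and the stopword list are hypothetical.

```python
import re

# Hypothetical stopword list; extend it for your own query logs.
QUERY_STOPWORDS = {
    "i", "am", "to", "how", "the", "in", "a", "an", "and", "or", "is", "are",
    "of", "for", "can", "you", "it", "what", "do", "does", "me", "my", "we",
    "this", "that", "with", "on", "be", "please", "explain", "tell",
}

def shorten_query(query: str, max_terms: int = 6) -> str:
    """Keep the first max_terms distinct non-stopword tokens, in original order."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", query.lower())
    kept = []
    for token in tokens:
        if token not in QUERY_STOPWORDS and token not in kept:
            kept.append(token)
    return " ".join(kept[:max_terms])

verbose = "Can you explain how the attention mechanism in a transformer handles long sequences?"
print(shorten_query(verbose))
```

The shortened string can then be passed to any of the expansion functions in place of the verbose original.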

Further Reading