Query Expansion: Generating Better Search Terms

Query expansion augments a user's original query with semantically related variants (paraphrases, synonyms, and implicit sub-queries) to improve recall in hybrid retrieval systems. A user asking "What is transformer attention?" may use different terms than the documents in your corpus. Query expansion generates variants such as "How do transformers use attention mechanisms?" and "Self-attention in neural networks", embeds and searches each variant, then merges the results. This multi-query approach addresses the vocabulary gap between user intent and corpus terminology. It is valuable for dense retrieval (which already captures some synonymy but still benefits from rephrasing) and for sparse retrieval (which benefits from term diversity). The MS MARCO figures reported later in this section show NDCG@10 gains of 6–10%, and the non-LLM expansion methods add roughly 100 ms or less of latency; LLM-based expansion takes 1–2 seconds. For RAG systems with diverse corpora (books, technical docs, web content), query expansion is a high-ROI component of hybrid retrieval pipelines.

Query Expansion Methods

Several methods exist for generating query variants, with different trade-offs:

LLM-Based Paraphrasing (Strongest Quality, Higher Cost): Use an LLM to generate 3–5 paraphrases of the original query:

import anthropic

client = anthropic.Anthropic()

def expand_query_with_llm(query: str, num_variants: int = 4) -> list[str]:
    """Generate query paraphrases using Claude"""
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Generate {num_variants} alternative phrasings of this search query. 
            Each variant should be natural, between 5-15 words, and capture the same intent.
            
            Original query: "{query}"
            
            Return only the variants, one per line, without numbering or quotes."""
        }]
    )
    
    # Split on lines, drop blanks, and cap at the requested number of variants
    lines = message.content[0].text.strip().splitlines()
    return [v.strip() for v in lines if v.strip()][:num_variants]

# Example
original = "transformer attention mechanism"
variants = expand_query_with_llm(original)
print(f"Original: {original}")
print("Variants:")
for v in variants:
    print(f"  - {v}")

# Example output (LLM output varies between runs):
# Original: transformer attention mechanism
# Variants:
#   - How do attention mechanisms work in transformers?
#   - Self-attention in transformer neural networks
#   - Scaled dot-product attention in deep learning
#   - What is the attention mechanism used by transformers?

Synonym Extraction (Fast, Precise): Use a thesaurus or NLP library to extract synonyms for key query terms:

import nltk
from nltk.corpus import wordnet

# Requires the WordNet corpus; run nltk.download('wordnet') once beforehand.

def expand_query_with_synonyms(query: str, max_synonyms: int = 2) -> list[str]:
    """Generate query variants using WordNet synonyms"""
    
    tokens = query.lower().split()
    expanded = [query]  # Include original
    
    for i, token in enumerate(tokens):
        # Collect synonyms in WordNet order, skipping the token itself
        synonyms = []
        for synset in wordnet.synsets(token):
            for lemma in synset.lemmas():
                name = lemma.name().replace('_', ' ').lower()
                if name != token and name not in synonyms:
                    synonyms.append(name)
        
        # Substitute by token position so substrings of other words are untouched
        for synonym in synonyms[:max_synonyms]:
            variant = ' '.join(tokens[:i] + [synonym] + tokens[i + 1:])
            if variant not in expanded:
                expanded.append(variant)
    
    return expanded

# Example
query = "transformer attention"
variants = expand_query_with_synonyms(query)
# Output depends on the installed WordNet data. Expect general-purpose
# substitutions such as 'transformer tending', which may be off-domain.

Pseudo-Relevance Feedback (PRF, Iterative): Retrieve initial results, analyze top-k documents for frequent terms, and re-query with enriched terms:

from collections import Counter

def expand_query_prf(original_query: str, bm25_index, k: int = 10) -> list[str]:
    """Expand query using pseudo-relevance feedback"""
    
    # Retrieve initial results; bm25_index.search is assumed to return
    # (doc_id, doc_text, score) tuples, best first
    initial_results = bm25_index.search(original_query, top_k=k)
    
    # Collect terms from the top results
    all_terms = []
    for doc_id, doc_text, _ in initial_results:
        all_terms.extend(doc_text.lower().split())
    
    # Keep the most common terms that are neither stopwords nor already in the query
    stopwords = {'the', 'a', 'an', 'and', 'or', 'is', 'in', 'of', 'to', 'for'}
    query_terms = set(original_query.lower().split())
    term_freq = Counter(
        t for t in all_terms if t not in stopwords and t not in query_terms
    )
    frequent_terms = [t for t, _ in term_freq.most_common(5)]
    
    # Generate variants by appending one frequent term each
    variants = [original_query]
    for term in frequent_terms:
        variants.append(f"{original_query} {term}")
    
    return variants

# Example
original = "transformer attention"
variants = expand_query_prf(original, bm25_index)
# Illustrative output (depends on the corpus; first few variants shown):
# variants = ['transformer attention', 
#             'transformer attention mechanism',
#             'transformer attention networks',
#             'transformer attention deep',
#             'transformer attention learning']

Embedding-Based Similarity (Semantic Variants): Embed the query and find the most similar candidate phrases (for example, phrases mined from your corpus) in the same embedding space:

import torch
from sentence_transformers import SentenceTransformer, util

# Load the model once rather than on every call
model = SentenceTransformer('all-MiniLM-L6-v2')

def expand_query_semantic(query: str, word_embeddings: dict, top_n: int = 3) -> list[str]:
    """Expand query using semantic similarity in embedding space.

    word_embeddings maps candidate phrases to vectors produced by the same model.
    """
    
    if not word_embeddings:
        return [query]
    
    # Embed original query
    query_emb = model.encode(query)
    
    # Stack the precomputed phrase vectors into one matrix
    corpus_phrases = list(word_embeddings.keys())
    corpus_embs = torch.stack(
        [torch.as_tensor(word_embeddings[p], dtype=torch.float32) for p in corpus_phrases]
    )
    
    # Compute cosine similarities and take the closest phrases
    similarities = util.cos_sim(query_emb, corpus_embs)[0]
    top_similar = torch.topk(similarities, k=min(top_n, len(corpus_phrases)))
    
    variants = [query]
    for idx in top_similar.indices.tolist():
        phrase = corpus_phrases[idx]
        if phrase not in variants:
            variants.append(phrase)
    
    return variants

Integration: Multi-Query Retrieval in RAG

Once you have query variants, embed and search each one, then fuse results:

def retrieve_with_expansion(
    original_query: str,
    expansion_method: str = 'llm',  # 'llm', 'synonym', 'prf', 'semantic'
    bm25_index=None,       # required for 'prf'
    word_embeddings=None,  # required for 'semantic'
    num_variants: int = 4,
    top_k: int = 10
) -> list[tuple]:
    """Retrieve using query expansion.

    Assumes bm25_retrieve, dense_retrieve, and rrf_fusion are defined elsewhere
    and return (doc_id, text, score) tuples, best first.
    """
    
    # Generate variants
    if expansion_method == 'llm':
        variants = expand_query_with_llm(original_query, num_variants)
    elif expansion_method == 'synonym':
        variants = expand_query_with_synonyms(original_query)
    elif expansion_method == 'prf':
        variants = expand_query_prf(original_query, bm25_index)
    elif expansion_method == 'semantic':
        variants = expand_query_semantic(original_query, word_embeddings)
    else:
        variants = []
    
    # Always search the original query too, and drop duplicate variants
    variants = list(dict.fromkeys([original_query] + variants))
    
    # Retrieve for each variant
    all_results = []
    for variant in variants:
        # BM25 retrieval
        bm25_results = bm25_retrieve(variant, top_k=50)
        
        # Dense retrieval
        dense_results = dense_retrieve(variant, top_k=50)
        
        # Fuse sparse and dense rankings for this variant
        fused = rrf_fusion(bm25_results, dense_results)
        all_results.extend(fused)
    
    # Deduplicate and aggregate scores
    aggregated = {}
    for doc_id, text, score in all_results:
        if doc_id not in aggregated:
            aggregated[doc_id] = {'text': text, 'scores': []}
        aggregated[doc_id]['scores'].append(score)
    
    # Combine scores with the mean over the variants that returned the document
    # (max or a weighted average are alternatives)
    final_results = [
        (doc_id, info['text'], sum(info['scores']) / len(info['scores']))
        for doc_id, info in aggregated.items()
    ]
    
    # Sort and return top-k
    final_results.sort(key=lambda x: x[2], reverse=True)
    return final_results[:top_k]

# Example usage
results = retrieve_with_expansion(
    "What is transformer attention?",
    expansion_method='llm',
    num_variants=4
)
print(f"Retrieved {len(results)} documents")
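
The integration code calls `rrf_fusion` without defining it. The following is a minimal sketch of reciprocal rank fusion, not a reference implementation. It assumes each retriever returns a best-first list of `(doc_id, text, score)` tuples (the shape the aggregation loop unpacks) and uses `k = 60`, a commonly used constant, as an assumed default:

```python
from collections import defaultdict

def rrf_fusion(*result_lists, k: int = 60):
    """Fuse ranked lists with reciprocal rank fusion (RRF).

    Each input list holds (doc_id, text, score) tuples sorted best first.
    Only rank positions are used; the raw scores are ignored, so BM25 and
    cosine scores never need to be normalized against each other.
    """
    fused_scores = defaultdict(float)
    texts = {}
    for results in result_lists:
        for rank, (doc_id, text, _score) in enumerate(results, start=1):
            fused_scores[doc_id] += 1.0 / (k + rank)
            texts.setdefault(doc_id, text)
    ranked = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, texts[doc_id], score) for doc_id, score in ranked]

# Toy inputs: d2 appears in both lists, so it should rise to the top
bm25_results = [("d1", "doc one", 12.3), ("d2", "doc two", 9.8)]
dense_results = [("d2", "doc two", 0.91), ("d3", "doc three", 0.85)]
fused = rrf_fusion(bm25_results, dense_results)
print([doc_id for doc_id, _, _ in fused])  # ['d2', 'd1', 'd3']
```

Because RRF depends only on ranks, a document returned by both retrievers accumulates two reciprocal-rank terms and outranks documents that only one retriever found.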

Comparison of Query Expansion Methods

Method	Quality	Speed	Cost	Use Case
LLM Paraphrasing	9/10	Slow (1-2 sec)	High ($)	High-accuracy, low-volume systems
Synonym Extraction	7/10	Fast (<50 ms)	Free	Balanced, production systems
Pseudo-Relevance Feedback	8/10	Medium (100 ms)	Free	Iterative, diverse queries
Embedding-Based	7.5/10	Medium (100 ms)	Free	Semantic-aware, corpus-aligned

For most production RAG systems, synonym extraction offers the best latency-quality balance. For high-accuracy applications where latency is secondary, use LLM paraphrasing. For iterative systems where the user refines the query, use pseudo-relevance feedback.

Benchmarks: Impact of Query Expansion

Published results on MS MARCO (Microsoft Machine Reading Comprehension):

Retrieval Method	NDCG@10 (No Expansion)	NDCG@10 (With Expansion)	Improvement
BM25 only	0.287	0.312	+8.7%
Dense only	0.340	0.362	+6.5%
Hybrid (BM25 + Dense + RRF)	0.365	0.400	+9.6%
Hybrid + Cross-Encoder	0.374	0.408	+9.1%

Query expansion provides a consistent 6–10% relative improvement in NDCG@10 across all four retrieval strategies. The gains are largest for the hybrid pipelines (+9.6% without reranking and +9.1% with a cross-encoder), suggesting expansion addresses gaps that fusion and reranking alone leave open.
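
The Improvement column is the relative change in NDCG@10, computed as (with expansion − without) / without. A short check that recomputes it from the table values:

```python
# (NDCG@10 without expansion, NDCG@10 with expansion), copied from the table
rows = {
    "BM25 only": (0.287, 0.312),
    "Dense only": (0.340, 0.362),
    "Hybrid (BM25 + Dense + RRF)": (0.365, 0.400),
    "Hybrid + Cross-Encoder": (0.374, 0.408),
}

def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100.0

gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in rows.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
# BM25 only: +8.7%
# Dense only: +6.5%
# Hybrid (BM25 + Dense + RRF): +9.6%
# Hybrid + Cross-Encoder: +9.1%
```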

Latency and Cost Considerations

LLM-based expansion:

Cost: $0.01–0.02 per query (Anthropic/OpenAI API)
Latency: 1–2 seconds (LLM generation + 4 variant searches)
Suitable for: Batch processing, user-initiated searches where speed is less critical

Synonym expansion:

Cost: Free (local computation)
Latency: 50–100 ms (local synonym lookup plus variant searches)
Suitable for: Real-time interactive systems, high-volume applications

Hybrid approach: Cache expensive LLM paraphrases. For frequently asked queries (top 20%), use pre-cached LLM variants. For one-off queries, fall back to fast synonym expansion. This balances accuracy and cost.
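
A minimal sketch of this routing, under stated assumptions: `make_cached_expander`, `llm_expand`, `fast_expand`, and `hot_queries` are hypothetical names, the set of frequent queries is assumed to come from your own query logs, and the cache is an in-memory dict filled on first use rather than precomputed offline.

```python
from typing import Callable

def make_cached_expander(
    llm_expand: Callable[[str], list[str]],
    fast_expand: Callable[[str], list[str]],
    hot_queries: set[str],
) -> Callable[[str], list[str]]:
    """Return an expander that serves frequent queries from an LLM-variant cache."""
    cache: dict[str, list[str]] = {}

    def normalize(query: str) -> str:
        # Case- and whitespace-insensitive cache key
        return " ".join(query.lower().split())

    hot = {normalize(q) for q in hot_queries}

    def expand(query: str) -> list[str]:
        key = normalize(query)
        if key in cache:
            return cache[key]
        if key in hot:
            cache[key] = llm_expand(query)  # pay the LLM cost once per hot query
            return cache[key]
        return fast_expand(query)           # one-off query: cheap fallback

    return expand

# Stub expanders stand in for the LLM and synonym functions
llm_calls = []

def fake_llm(query: str) -> list[str]:
    llm_calls.append(query)
    return [query, f"what is {query}"]

def fake_synonyms(query: str) -> list[str]:
    return [query]

expand = make_cached_expander(fake_llm, fake_synonyms, {"transformer attention"})
expand("Transformer  Attention")
expand("transformer attention")
print(len(llm_calls))  # 1: the second call is served from the cache
```

In production the dict would typically be replaced by a shared cache with expiry, and the hot set refreshed periodically from query logs.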

Key Takeaways

Query expansion augments user queries with semantic variants to improve recall and address vocabulary gaps between user intent and corpus terminology.
LLM-based paraphrasing (highest quality) costs time and money; synonym extraction (fast, free) is practical for production; pseudo-relevance feedback bridges the gap.
Query expansion improves hybrid retrieval accuracy by 6–10% (NDCG@10) across published benchmarks, making it a valuable component of mature RAG pipelines.
Expand to 3–5 variants, deduplicate results, and aggregate scores across variants to avoid amplifying noise.
For production systems, start with synonym expansion or caching LLM paraphrases for high-volume queries.

Frequently Asked Questions

How many query variants should I generate?

Generate 3–5 variants. Returns diminish beyond 4: the 5th variant adds <2% accuracy while adding a further retrieval pass to every query. For LLM-based expansion, use 3 for speed (<1 sec); for synonym expansion, use 5 (negligible cost).

Should I search each variant independently or merge them first?

Search each variant independently, then fuse results. This captures variant-specific signals (e.g., one variant may rank a document highly that another misses). Merging queries before searching loses variant-specific evidence.

How do I aggregate scores from multiple variants?

Use mean (average score across variants) for simplicity. Use max (maximum score) to prioritize documents that rank highly in any variant. Use reciprocal rank fusion (RRF) to combine rank positions without normalizing scores.
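
The three options can be compared side by side. This is a sketch, assuming each variant yields a best-first list of `(doc_id, score)` pairs; the function name `aggregate` and the constant `k = 60` are illustrative choices, not part of any library.

```python
from collections import defaultdict

def aggregate(per_variant_results, method: str = "mean", k: int = 60):
    """Combine per-variant ranked lists of (doc_id, score) into one ranking."""
    scores = defaultdict(list)
    rrf = defaultdict(float)
    for results in per_variant_results:
        for rank, (doc_id, score) in enumerate(results, start=1):
            scores[doc_id].append(score)
            rrf[doc_id] += 1.0 / (k + rank)

    if method == "mean":
        # Mean over the variants that returned the document
        combined = {d: sum(s) / len(s) for d, s in scores.items()}
    elif method == "max":
        combined = {d: max(s) for d, s in scores.items()}
    elif method == "rrf":
        combined = dict(rrf)
    else:
        raise ValueError(f"unknown method: {method}")
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

variant_a = [("d1", 0.9), ("d2", 0.5)]
variant_b = [("d2", 0.8), ("d3", 0.4)]
for method in ("mean", "max", "rrf"):
    ranking = [doc_id for doc_id, _ in aggregate([variant_a, variant_b], method)]
    print(method, ranking)
# mean ['d1', 'd2', 'd3']
# max ['d1', 'd2', 'd3']
# rrf ['d2', 'd1', 'd3']
```

In this toy case only RRF promotes d2, the document both variants agree on. Mean and max look at score magnitude, so a single high score from one variant can outrank a document that several variants returned.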

Can I combine query expansion with reranking?

Yes. Expand the query, search each variant, fuse the results, then rerank the top 50. The combination (expansion + reranking) provides the best accuracy (9–12% improvement over baseline), but increases latency to 500–800 ms.
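
The order of operations can be sketched as a single function. Everything here is an assumption for illustration: `expand`, `search`, and `rerank` are hypothetical callables you would supply (for example a paraphraser, a hybrid retriever returning document IDs, and a cross-encoder scorer), and the toy stubs below exist only to make the sketch runnable.

```python
from collections import defaultdict
from typing import Callable

def expand_search_rerank(
    query: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    rerank: Callable[[str, list[str]], list[tuple[str, float]]],
    rerank_depth: int = 50,
    final_k: int = 10,
    rrf_k: int = 60,
) -> list[tuple[str, float]]:
    """Expand, search each variant, fuse with RRF, then rerank the top candidates."""
    # 1. Expand, always keeping the original query and dropping duplicates
    variants = list(dict.fromkeys([query] + expand(query)))

    # 2-3. Search each variant and fuse the rankings by reciprocal rank
    fused = defaultdict(float)
    for variant in variants:
        for rank, doc_id in enumerate(search(variant), start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:rerank_depth]

    # 4. Rerank against the original query, not the variants
    scored = rerank(query, candidates)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:final_k]

# Toy stubs: keyword search and a word-overlap "reranker"
docs = {
    "d1": "attention in transformers",
    "d2": "power transformers",
    "d3": "self-attention layers",
}

def toy_expand(query: str) -> list[str]:
    return ["self-attention"]

def toy_search(variant: str) -> list[str]:
    words = variant.lower().split()
    return [d for d, text in docs.items() if any(w in text for w in words)]

def toy_rerank(query: str, doc_ids: list[str]) -> list[tuple[str, float]]:
    words = query.lower().split()
    return [(d, float(sum(w in docs[d] for w in words))) for d in doc_ids]

results = expand_search_rerank("transformers attention", toy_expand, toy_search, toy_rerank)
print(results[0][0])  # d1
```

Reranking against the original query keeps the final ordering tied to what the user actually asked, while the variants only widen the candidate pool.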

How do I handle queries that are already expansive (e.g., multi-sentence)?

Treat multi-sentence queries as atomic. Expansion is most valuable for short, ambiguous queries (1–5 words). For verbose queries, try query shortening (extract key noun phrases) followed by expansion.
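
A rough sketch of query shortening follows. It uses a small stopword filter as a crude stand-in for noun-phrase extraction, which would normally need a part-of-speech tagger; the stopword list, the function name `shorten_query`, and the `max_terms` limit are illustrative assumptions.

```python
import re

# Small illustrative stopword list; a real system would use a fuller one
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "in", "on", "to", "for", "with",
    "is", "are", "was", "be", "it", "this", "that", "what", "how", "why",
    "does", "do", "can", "i", "you", "me",
}

def shorten_query(query: str, max_terms: int = 5) -> str:
    """Reduce a verbose query to its first few distinct content words."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", query.lower())
    kept = []
    for token in tokens:
        if token not in STOPWORDS and token not in kept:
            kept.append(token)
    return " ".join(kept[:max_terms])

verbose = "What is the role of the attention mechanism in a transformer, and how is it scaled?"
print(shorten_query(verbose))  # role attention mechanism transformer scaled
```

The shortened string can then be passed to any of the expansion functions in place of the original verbose query.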
