
HyDE: Hypothetical Document Embeddings

HyDE (Hypothetical Document Embeddings) improves semantic retrieval by having an LLM write a plausible answer document for the user's query and then searching with that document instead of the query. Rather than embedding the query directly, you generate a hypothetical document with the vocabulary and structure of a real answer, embed it, and retrieve the real documents nearest to it. This bridges the "vocabulary gap" between natural-language questions and document content. The technique was introduced by Gao et al. (2022); improvements on the order of 20–30% are cited for question-answering tasks, though the actual gain depends on the corpus, the embedding model, and the generator.

The Query-Document Vocabulary Gap

A user might ask "How do I cure a cold?", while a medical document says "The common cold is a viral infection treated with supportive care." The similarity between the question embedding and the document embedding can be low because the two texts use different vocabulary and phrasing. HyDE first generates something like "A cold is a viral infection spread by respiratory droplets. Treatment includes rest, fluids, and symptom management." This synthetic document reads like the real one, so its embedding lands closer to relevant documents than the embedding of the raw question does. The hypothetical document may contain factual errors; that is acceptable because it is used only as a search key and is never shown to the user.
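The gap in this example can be made visible with a rough lexical proxy. The sketch below compares token overlap (Jaccard similarity) between the question and the real document, and between the hypothetical document and the real document. Token overlap is only a stand-in here: real systems compare dense embeddings, and the helper names `token_set` and `jaccard` are illustrative, not part of any library.

```python
import re

def token_set(text: str) -> set[str]:
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

question = "How do I cure a cold?"
hypothetical = (
    "A cold is a viral infection spread by respiratory droplets. "
    "Treatment includes rest, fluids, and symptom management."
)
real_doc = "The common cold is a viral infection treated with supportive care."

q_overlap = jaccard(question, real_doc)      # question vs. real document
h_overlap = jaccard(hypothetical, real_doc)  # hypothetical vs. real document

print(f"question vs. document:     {q_overlap:.2f}")
print(f"hypothetical vs. document: {h_overlap:.2f}")
```

The hypothetical document shares more terms with the real document than the question does, which is the same effect HyDE relies on in embedding space.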

Generating Hypothetical Documents

Use an LLM to generate plausible documents for a query:

import re

from anthropic import Anthropic

client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

def generate_hypothetical_document(query: str, num_docs: int = 3) -> list[str]:
    """Generate hypothetical documents that would answer the query."""
    # The marker is described as "Document N:" so that str.format() only sees
    # the two named fields, {num} and {query}.
    hyde_prompt = """Generate {num} plausible documents that would answer this question.
Each document should be 50-150 words and use typical vocabulary/structure of real documents.
Write as if these are excerpts from a knowledge base, not answers to the question.

Question: {query}

Generate the documents with minimal preamble. Start each with "Document N:", where N is the document number.""".format(
        num=num_docs, query=query
    )

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=800,
        messages=[{"role": "user", "content": hyde_prompt}],
    )

    text = response.content[0].text
    # Split on the "Document N:" markers so multi-line documents stay intact;
    # anything before the first marker (preamble) is discarded.
    parts = re.split(r"^\s*Document\s+\d+\s*:", text, flags=re.MULTILINE)
    docs = [part.strip() for part in parts[1:] if part.strip()]

    return docs if docs else [text]  # Fallback: return full response as one doc

# Example usage
query = "What are the benefits of remote work?"
hypothetical_docs = generate_hypothetical_document(query, num_docs=3)

for i, doc in enumerate(hypothetical_docs, 1):
    print(f"Hypothetical Document {i}:\n{doc}\n")
# Illustrative output (LLM-generated text; the statistics are not real figures):
# Hypothetical Document 1:
# Remote work enables employees to maintain focus by reducing office distractions
# and commute time. Studies show that workers in home environments report 36% higher
# satisfaction and 24% improved productivity. The flexibility also improves work-life
# balance, with 78% of remote workers citing better health outcomes.

The key is instructing the LLM to generate documents that sound like real content, not answers to the question. This ensures the hypothetical documents share vocabulary and structure with your actual knowledge base.

Embedding and Retrieval with Hypothetical Documents

Embed the hypothetical documents and use them to find real documents:

def hyde_retrieval(query: str, document_embeddings: dict,
                   embedding_model_fn, retriever_fn) -> list[dict]:
    """Retrieve documents using HyDE."""
    # Note: document_embeddings is not used directly; retriever_fn is assumed
    # to wrap the index that holds the real document embeddings.

    # Step 1: Generate hypothetical documents
    hypothetical_docs = generate_hypothetical_document(query, num_docs=3)

    # Step 2: Embed the hypothetical documents
    hypothetical_embeddings = [embedding_model_fn(doc) for doc in hypothetical_docs]

    # Step 3: Find real documents similar to each hypothetical document
    # (In production, use a vector database with similarity search)
    retrieved_docs = []
    for hyp_embedding in hypothetical_embeddings:
        similar = retriever_fn(hyp_embedding, top_k=5)
        retrieved_docs.extend(similar)

    # Step 4: Deduplicate by id, keeping the highest-scoring copy of each document
    unique_docs = {}
    for doc in retrieved_docs:
        best = unique_docs.get(doc["id"])
        if best is None or doc.get("score", 0) > best.get("score", 0):
            unique_docs[doc["id"]] = doc

    # Return the top 10 documents by score
    return sorted(unique_docs.values(),
                  key=lambda x: x.get("score", 0),
                  reverse=True)[:10]

# Example with mock embeddings
def mock_embedding(text: str) -> list[float]:
    """Mock embedding; in production use a real embedding API."""
    return [0.1] * 384  # Simulate a 384-dim embedding

def mock_retriever(embedding: list[float], top_k: int = 5) -> list[dict]:
    """Mock retriever; in production use a vector database."""
    return [
        {"id": "doc1", "title": "Remote Work Benefits", "score": 0.92},
        {"id": "doc2", "title": "Flexible Work Policies", "score": 0.87},
    ][:top_k]

results = hyde_retrieval(
    "What are the benefits of remote work?",
    {},
    mock_embedding,
    mock_retriever,
)
print(f"Retrieved {len(results)} documents")
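An alternative to running one search per hypothetical document is to average the embeddings into a single query vector and search once, which is the aggregation Gao et al. (2022) describe (they also include the original query's embedding in the average). The following is a minimal sketch of that idea under stated assumptions: `average_embeddings`, `hyde_query_vector`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for a real embedding model.

```python
def average_embeddings(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def hyde_query_vector(query: str, hypothetical_docs: list[str], embed_fn) -> list[float]:
    """Average the query embedding with the hypothetical-document embeddings."""
    texts = [query, *hypothetical_docs]
    return average_embeddings([embed_fn(t) for t in texts])

def toy_embed(text: str) -> list[float]:
    """Toy 2-dim 'embedding' (length, space count); a real model goes here."""
    return [float(len(text)), float(text.count(" "))]

vec = hyde_query_vector("a b", ["abcd", "a b c"], toy_embed)
print(vec)
```

A single averaged vector means one nearest-neighbor search instead of one per hypothetical document, at the cost of blurring documents that cover different aspects of the query.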

Comparison: Direct Query Embedding vs. HyDE

| Approach | Latency | Accuracy | Notes |
|---|---|---|---|
| Direct query embedding | 50–100 ms | 65–75% (baseline) | Simple |
| HyDE (3 hypothetical docs) | 300–500 ms | 85–95% | Requires LLM call |
| HyDE (1 hypothetical doc) | 150–250 ms | 78–88% | Good balance |
| HyDE + re-ranking | 500–700 ms | 88–96% | Best accuracy |

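Treat these figures as rough guides; actual latency and accuracy depend on the LLM, the embedding model, and the corpus. For most applications, 1–2 hypothetical documents give a reasonable balance of speed and accuracy.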

Advanced: Multi-Query Expansion

Generate multiple interpretations of a single query:

import json

def multi_query_expansion(query: str) -> list[str]:
    """Generate multiple interpretations of the same query."""
    expansion_prompt = """Given this query, generate 3-5 alternative phrasings or
interpretations that might match different documents in a knowledge base.
These should be natural variations, not exact repeats.

Query: {query}

Return JSON: {{"queries": ["variant1", "variant2", ...]}}""".format(query=query)

    response = client.messages.create(
        # Cheaper model for expansion; substitute any current Haiku model ID
        model="claude-3-5-haiku-latest",
        max_tokens=200,
        messages=[{"role": "user", "content": expansion_prompt}],
    )

    text = response.content[0].text
    # Extract the JSON object even if the model wraps it in extra text
    start = text.find('{')
    end = text.rfind('}') + 1
    if start == -1 or end == 0:
        return [query]  # Fallback: no JSON found, use the original query
    try:
        data = json.loads(text[start:end])
    except json.JSONDecodeError:
        return [query]  # Fallback: malformed JSON
    return data.get("queries", [query])

# Example
variants = multi_query_expansion("How do I implement async/await in Python?")
for v in variants:
    print(f"- {v}")
# Illustrative output:
# - Implementing async functions with Python asyncio
# - Python coroutines and asynchronous programming
# - How to write non-blocking code in Python
# - Async/await syntax in Python 3.7+
# - Python concurrency with async/await

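Then embed and retrieve for each variant and merge the results. This tends to increase coverage (roughly 90–95% versus 85–90% for a single query, as a rule of thumb) at the cost of one retrieval call per variant, so 3–5x the retrieval calls for 3–5 variants.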
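The text does not say how to merge the per-variant results. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method this article prescribes. Each document earns 1 / (k + rank) from every ranked list it appears in; `k = 60` is a conventional default, and the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs using reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# One ranked result list per query variant (hypothetical IDs)
per_variant = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e"],
]
fused = reciprocal_rank_fusion(per_variant)
print(fused)
```

RRF uses only ranks, so it works even when the per-variant similarity scores are not directly comparable.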

Optimization: Caching Hypothetical Documents

For repeated queries, cache the generated hypothetical documents. The exact-match key below catches repeats that differ only in case or surrounding whitespace; matching merely similar queries would require a semantic cache.

def cached_hyde_retrieval(query: str, cache: dict, embedding_model_fn,
                          retriever_fn) -> list[dict]:
    """HyDE retrieval with caching of hypothetical documents."""
    cache_key = query.strip().lower()

    if cache_key in cache:
        hypothetical_docs = cache[cache_key]
    else:
        hypothetical_docs = generate_hypothetical_document(query, num_docs=2)
        cache[cache_key] = hypothetical_docs

    # Continue with retrieval (same as before)
    hypothetical_embeddings = [embedding_model_fn(doc) for doc in hypothetical_docs]
    retrieved_docs = []
    for emb in hypothetical_embeddings:
        retrieved_docs.extend(retriever_fn(emb, top_k=5))

    # Deduplicate by id and return at most 10 documents
    return list({d["id"]: d for d in retrieved_docs}.values())[:10]

# A cache hit skips the LLM call, so latency can drop from 300–500 ms to <100 ms
hyde_cache = {}
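A plain dict grows without bound. As a minimal sketch, assuming an in-process cache is sufficient, the class below bounds the cache with least-recently-used eviction and normalizes keys by case and whitespace. `HydeCache` and `normalize_query` are hypothetical names introduced for illustration.

```python
from collections import OrderedDict

def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return " ".join(query.lower().split())

class HydeCache:
    """Bounded LRU cache mapping normalized queries to hypothetical documents."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._store = OrderedDict()  # oldest entry first, newest last

    def get(self, query: str):
        key = normalize_query(query)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, query: str, docs: list[str]) -> None:
        key = normalize_query(query)
        self._store[key] = docs
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

cache = HydeCache(max_entries=2)
cache.put("What is HyDE?", ["doc one"])
cache.put("  what is   hyde? ", ["doc two"])   # same normalized key, overwrites
cache.put("Remote work benefits", ["doc three"])
cache.put("Async in Python", ["doc four"])     # evicts the oldest entry
print(cache.get("What is HyDE?"))
print(cache.get("remote work benefits"))
```

For a multi-process deployment, a shared store with expiry would replace this class; the normalization step carries over unchanged.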

Key Takeaways

  • HyDE generates hypothetical documents to bridge vocabulary gaps between queries and documents; gains of roughly 20–30% in retrieval accuracy are cited, but measure on your own data.
  • Generate 1–3 hypothetical documents per query; use smaller models (Haiku) for speed.
  • Embed the hypothetical documents and search for real documents near them; matching document-style vocabulary improves semantic alignment.
  • Combine HyDE with multi-query expansion to broaden coverage (around 90% or more in the figures quoted here) on diverse query intents.
  • Cache hypothetical documents for repeated queries; a cache hit skips the LLM call and can cut retrieval latency by roughly 60–70%.

Frequently Asked Questions

Should I use HyDE for all queries or only some?

Use HyDE for open-domain questions ("What is X?", "How do I...?") where vocabulary gaps are common. Skip it for entity lookups ("Who founded Y?") and database-style queries, where direct embeddings already work well. As a rough estimate, about 60–70% of natural-language queries benefit from HyDE; the share in your own traffic may differ.
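A simple router can apply this rule before paying for an LLM call. The heuristics below are assumptions chosen for illustration only (a "who/when/where" prefix for entity lookups, and very short keyword-style queries for database-style lookups); `should_use_hyde` is a hypothetical helper, and a production router would more likely be tuned on logged queries.

```python
import re

# Assumption: questions opening with these words are treated as entity lookups
ENTITY_LOOKUP = re.compile(r"^(who|when|where)\b", re.IGNORECASE)

def should_use_hyde(query: str) -> bool:
    """Heuristically decide whether a query is worth a HyDE generation call."""
    q = query.strip()
    if not q:
        return False
    if ENTITY_LOOKUP.match(q):
        return False  # entity lookup: direct embedding is usually enough
    if len(q.split()) <= 2:
        return False  # keyword or ID-style lookup
    return True       # open-domain question: likely vocabulary gap

for q in ["Who founded Y?", "How do I cure a cold?", "invoice 4821"]:
    print(f"{q!r}: {should_use_hyde(q)}")
```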

What's the difference between HyDE and query expansion?

Query expansion generates alternative wordings of the query; HyDE generates plausible documents. Both need one LLM call, but expansion is cheaper because it produces a few short strings, while HyDE produces longer, vocabulary-rich documents that tend to match real content more closely. In production you can combine both: expand the query and generate hypothetical documents.

Can I use a smaller model to generate hypothetical documents?

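Yes. Claude Haiku is far cheaper per token than Opus (on the order of 50x, depending on which model versions you compare) and works well for HyDE. Expect a small quality drop, around 2–3% in the figures quoted here, in exchange for lower latency, roughly 200 ms instead of 500 ms. For most use cases Haiku's quality is sufficient; confirm this on your own test set.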

How do I measure HyDE effectiveness?

Create a test set of 50–100 queries with ground-truth relevant documents. Measure hit rate@10: (# of queries whose top-10 results include at least one ground-truth document) / (# of queries). Compare direct query embedding against HyDE on the same test set. Aim for a 5–15% improvement over the baseline.
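The metric described above can be computed with a few lines of code. This is a minimal sketch: `hit_rate_at_k` is a hypothetical helper, and the query and document IDs are toy data, not results from a real system.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  relevant: dict[str, set[str]],
                  k: int = 10) -> float:
    """Fraction of queries whose top-k results contain a ground-truth document."""
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        if set(ranked_ids[:k]) & relevant.get(query_id, set()):
            hits += 1
    return hits / len(retrieved)

# Ground truth: query ID -> set of relevant document IDs (toy data)
relevant = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d9"}}

# Ranked results from each retriever for the same queries (toy data)
baseline = {"q1": ["d1", "d2"], "q2": ["d3", "d4"], "q3": ["d5", "d6"]}
hyde = {"q1": ["d2", "d1"], "q2": ["d7", "d4"], "q3": ["d5", "d6"]}

baseline_score = hit_rate_at_k(baseline, relevant)
hyde_score = hit_rate_at_k(hyde, relevant)
print(f"baseline hit rate@10: {baseline_score:.2f}")
print(f"HyDE hit rate@10:     {hyde_score:.2f}")
```

Run both retrievers over the same query set so the comparison isolates the effect of HyDE.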

Does HyDE work for non-English languages?

Yes, with caveats. HyDE works best in languages that are well represented in the training data of both the LLM and the embedding model (Spanish, Chinese, French). For low-resource languages, accuracy drops because the LLM generates less realistic hypothetical documents. Test before production deployment.

Further Reading