Multi-Hop Retrieval: Iterative Query Chaining

Multi-hop retrieval is the process of decomposing a complex question into multiple sequential retrievals, where each hop refines the context or answers an intermediate question. Instead of retrieving all documents in a single query, multi-hop systems ask follow-up questions, retrieve documents based on those intermediate answers, and repeat until sufficient context is gathered. Research shows this approach improves accuracy by 30–40% on multi-document reasoning tasks compared to single-query retrieval (Yang et al., 2023).

How Multi-Hop Retrieval Works

Multi-hop retrieval mimics human research: when you ask a complex question like "How did the recession of 2008 impact bank lending practices?", you don't search once. Instead, you search for "2008 financial crisis causes", read that, then search for "banking regulations post-2008", and finally integrate the findings. Each hop is a new retrieval guided by intermediate results.

A multi-hop system has these components: a decomposition module that breaks questions into steps, a retrieval module that fetches relevant documents per step, and a fusion module that combines findings. The key insight is that intermediate retrieval results guide the next query, creating a chain of reasoning.
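The three components can be sketched as plain callables wired together in a loop. This is a minimal illustration, not a reference design: the names run_pipeline, toy_decompose, toy_retrieve, and toy_fuse are hypothetical, and the toy functions stand in for a real LLM and search index so the example runs offline.

```python
from typing import Callable

Decomposer = Callable[[str], list[str]]
Retriever = Callable[[str], list[str]]
Fuser = Callable[[str, list[tuple[str, list[str]]]], str]


def run_pipeline(question: str, decompose: Decomposer,
                 retrieve: Retriever, fuse: Fuser) -> str:
    """Chain the three components: decompose, retrieve per step, then fuse."""
    findings: list[tuple[str, list[str]]] = []
    carried = ""  # short note carried from one hop into the next query
    for step in decompose(question):
        # Earlier results guide this hop's query
        query = f"{step} {carried}".strip()
        docs = retrieve(query)
        findings.append((step, docs))
        if docs:
            carried = docs[0][:80]
    return fuse(question, findings)


# Toy stand-ins (assumptions for illustration only)
def toy_decompose(question: str) -> list[str]:
    return ["2008 financial crisis causes", "banking regulations post-2008"]


def toy_retrieve(query: str) -> list[str]:
    return [f"doc for <{query}>"]


def toy_fuse(question: str, findings: list[tuple[str, list[str]]]) -> str:
    return " | ".join(docs[0] for _, docs in findings if docs)


answer = run_pipeline(
    "How did the recession of 2008 impact bank lending practices?",
    toy_decompose, toy_retrieve, toy_fuse,
)
print(answer)
```

Note how the second hop's query contains text from the first hop's result; that carry-over is what distinguishes a chain from a batch of unrelated searches.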

Decomposing Complex Queries

The first step is identifying which questions require multiple hops. Simple factual queries (e.g., "Who founded OpenAI?") need one hop; complex comparison or causal queries need multiple hops.

import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def decompose_query(user_query: str) -> list[str]:
    """Decompose a user query into multi-hop sub-questions."""
    decomposition_prompt = """Break this query into 2-4 independent sub-questions that,
when answered in sequence, will fully answer the original question.
Return a JSON list of sub-questions in logical order.
Example: For "How did AI regulations change in Europe?", return:
["What were the original EU AI regulations?",
"What new regulations were introduced in 2024?",
"How do they differ from previous rules?"]"""

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        system=decomposition_prompt,
        messages=[{"role": "user", "content": user_query}],
    )

    text = response.content[0].text
    # Extract the JSON array; the model may wrap it in surrounding prose
    start = text.find("[")
    end = text.rfind("]") + 1
    if start == -1 or end == 0:
        # No JSON list found: treat the query as a single hop
        return [user_query]
    return json.loads(text[start:end])

# Example usage
query = "Compare cloud migration costs between AWS and Azure in 2025"
sub_questions = decompose_query(query)
for i, q in enumerate(sub_questions, 1):
    print(f"{i}. {q}")
# Illustrative output (actual wording varies by model and run):
# 1. What are AWS cloud migration costs in 2025?
# 2. What are Azure cloud migration costs in 2025?
# 3. What are the key cost differences between AWS and Azure?

A good decomposition creates 2–4 self-contained questions that, when answered sequentially, provide enough context for the original question. Too many hops (>5) increase latency and can compound errors; too few miss nuance.
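Because an LLM's decomposition is not guaranteed to respect these bounds, it can help to sanitize the list before retrieval. The sketch below is one possible approach; normalize_sub_questions and its max_hops default are hypothetical, not part of any library.

```python
def normalize_sub_questions(raw: list, original_query: str, max_hops: int = 4) -> list[str]:
    """Strip, de-duplicate, and cap a decomposition; fall back to one hop if empty."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for item in raw:
        if not isinstance(item, str):
            continue  # ignore non-string entries in the parsed JSON
        question = item.strip()
        key = question.lower()
        if question and key not in seen:
            seen.add(key)
            cleaned.append(question)
    if not cleaned:
        # Nothing usable: treat the original query as a single hop
        return [original_query]
    return cleaned[:max_hops]


print(normalize_sub_questions([" A? ", "a?", 3, "", "B?", "C?", "D?", "E?"], "orig"))
# → ['A?', 'B?', 'C?', 'D?']
```

Truncating to the first max_hops entries assumes the model returned the questions in logical order, as the prompt requests.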

Iterative Retrieval with Refinement

After decomposing, you retrieve documents for each sub-question, using the results to refine the next retrieval.

def multi_hop_retrieval(user_query: str, retriever_fn) -> dict:
    """Execute multi-hop retrieval with iterative refinement."""
    # Step 1: Decompose the query
    sub_questions = decompose_query(user_query)

    # Step 2: Iteratively retrieve and refine
    context = ""
    retrieval_results = []

    for i, sub_q in enumerate(sub_questions):
        # Augment the sub-question with accumulated context (first 300 characters)
        augmented_query = sub_q
        if context:
            augmented_query = f"{sub_q}\nContext from previous hops: {context[:300]}"

        # Retrieve documents for this sub-question
        docs = retriever_fn(augmented_query)
        retrieval_results.append({
            "hop": i + 1,
            "question": sub_q,
            "documents": docs,
        })

        # Extract key facts from the top document for the next hop
        if docs:
            summary_prompt = f"""Extract 1-2 key facts from these documents that answer: {sub_q}
Documents: {docs[0][:500]}"""

            summary = client.messages.create(
                model="claude-3-5-haiku-latest",  # Use smaller model for speed
                max_tokens=100,
                messages=[{"role": "user", "content": summary_prompt}],
            )
            context += summary.content[0].text + "\n"

    return {
        "original_query": user_query,
        "sub_questions": sub_questions,
        "retrieval_hops": retrieval_results,
        "accumulated_context": context,
    }

def mock_retriever(query: str) -> list[str]:
    """Mock retriever for demonstration."""
    return [f"Document about: {query}"]

# Example
result = multi_hop_retrieval(
    "How do cloud migration costs compare between AWS and Azure?",
    mock_retriever,
)
print(f"Executed {len(result['retrieval_hops'])} retrieval hops")

Key design decisions:

  • Augment queries with context: Include summaries from previous hops so later retrievals are more precise.
  • Use smaller models for intermediate steps: Haiku-tier models are priced well below Opus per token, which suits short summarization calls (check current pricing for the exact ratio).
  • Limit hops to 3–4: Each hop adds 200–500 ms latency; beyond 4 hops, diminishing returns dominate.
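The hop limit can also be derived from a latency budget rather than fixed by hand. The helper below is a rough sketch using the per-hop figures quoted above; max_hops_for_budget and its defaults are hypothetical, and real per-hop latency should be measured on your own stack.

```python
def max_hops_for_budget(budget_ms: float, per_hop_ms: float = 500.0, hard_cap: int = 4) -> int:
    """Return how many hops fit in a latency budget, between 1 and hard_cap."""
    if per_hop_ms <= 0:
        raise ValueError("per_hop_ms must be positive")
    affordable = int(budget_ms // per_hop_ms)
    # Always allow at least one hop; never exceed the cap
    return max(1, min(hard_cap, affordable))


print(max_hops_for_budget(1200))        # pessimistic 500 ms per hop → 2
print(max_hops_for_budget(1200, 200))   # optimistic 200 ms per hop → 4 (capped)
```

Budgeting against the pessimistic per-hop figure keeps tail latency predictable at the cost of sometimes stopping a chain early.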

Comparison: Single vs. Multi-Hop Retrieval

Aspect | Single-Hop | Multi-Hop
Latency | 200–400 ms | 800–1500 ms (3–4 hops)
Accuracy (simple questions) | 92% | 91% (slight overhead)
Accuracy (complex questions) | 65–70% | 95–98%
Cost per query | $0.001–0.003 | $0.005–0.015
Best for | Factual lookup | Comparative analysis, causal reasoning

Use multi-hop for questions with keywords like "compare", "explain why", "analyze", "trace the evolution of"; use single-hop for "What is X?" or "Who founded Y?".
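A keyword check like this can be written as a small router. The sketch below is a deliberately naive heuristic: needs_multi_hop and the MULTI_HOP_MARKERS tuple are hypothetical names, and the marker list simply mirrors the keywords above.

```python
import re

# Markers taken from the guidance above; extend for your own query logs
MULTI_HOP_MARKERS = ("compare", "explain why", "analyze", "trace the evolution of")


def needs_multi_hop(query: str) -> bool:
    """Return True if the query contains a whole-word multi-hop marker."""
    lowered = query.lower()
    return any(
        re.search(rf"\b{re.escape(marker)}\b", lowered)
        for marker in MULTI_HOP_MARKERS
    )


print(needs_multi_hop("Compare cloud migration costs between AWS and Azure in 2025"))  # True
print(needs_multi_hop("Who founded OpenAI?"))  # False
```

Keyword routing will misclassify queries that imply comparison without using these words, so it works best as a cheap first pass alongside logging of single-hop failures.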

Handling Contradictions Across Hops

When multi-hop retrieval returns conflicting information, use a reconciliation step:

def reconcile_contradictions(hop_results: list[dict]) -> str:
    """Identify and reconcile conflicting information across hops."""
    # Use the first 200 characters of each hop's top document
    facts = "\n".join([
        f"Hop {h['hop']}: {h['documents'][0][:200]}"
        for h in hop_results if h['documents']
    ])

    reconciliation_prompt = f"""These statements come from different sources:
{facts}

Are there contradictions? If yes, rank by source authority and suggest resolution."""

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        messages=[{"role": "user", "content": reconciliation_prompt}],
    )
    return response.content[0].text

This step catches cases where early-hop results mislead later retrievals. In production, log contradictions: they often indicate gaps in your knowledge base or outdated information.

Key Takeaways

  • Multi-hop retrieval breaks complex questions into 2–4 sequential sub-questions, improving accuracy by 30–40% on reasoning tasks.
  • Decompose queries using an LLM; refine each sub-question with context from previous hops.
  • Use cheaper models (Haiku) for intermediate steps; reserve larger models for final reasoning.
  • Limit hops to 3–4 to control latency; monitor cost (roughly 5x single-hop, per the comparison table) for budget-conscious applications.
  • Reconcile contradictions across hops using authority-based ranking.

Frequently Asked Questions

How do I know if a query needs multi-hop retrieval?

Look for linguistic markers: "compare", "analyze", "explain why", "trace", "how did X lead to Y?" and multi-entity questions. In production, log single-hop failures and rerun those with multi-hop to measure improvement. Typically 20–30% of queries benefit from multi-hop.

What if a sub-question retrieves no documents?

Skip that hop and continue. If a later hop depends on a skipped hop, mark it and fall back to the original query with a broad retrieval. In code, use if docs: guards and graceful degradation.
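One way to sketch this degradation path is shown below. It assumes, conservatively, that any hop after a skipped one may depend on it; retrieve_with_fallback, sparse_retriever, and the status labels are hypothetical names chosen for illustration.

```python
from typing import Callable


def retrieve_with_fallback(
    sub_questions: list[str],
    original_query: str,
    retriever_fn: Callable[[str], list[str]],
) -> dict:
    """Skip empty hops, flag hops that follow a skip, and add a broad fallback."""
    hops = []
    skipped_earlier = False
    for i, sub_q in enumerate(sub_questions, 1):
        docs = retriever_fn(sub_q)
        if not docs:
            status = "skipped"
        elif skipped_earlier:
            # Assumption: a hop may depend on any earlier hop, so flag it
            status = "follows_skipped_hop"
        else:
            status = "ok"
        hops.append({"hop": i, "question": sub_q, "documents": docs, "status": status})
        skipped_earlier = skipped_earlier or not docs

    # Broad retrieval on the original query, only when a flagged hop exists
    needs_fallback = any(h["status"] == "follows_skipped_hop" for h in hops)
    fallback_docs = retriever_fn(original_query) if needs_fallback else []
    return {"hops": hops, "fallback_documents": fallback_docs}


def sparse_retriever(query: str) -> list[str]:
    """Toy retriever that finds nothing for queries starting with 'Azure'."""
    return [] if query.startswith("Azure") else [f"Document about: {query}"]


result = retrieve_with_fallback(
    ["AWS migration costs?", "Azure migration costs?", "Key cost differences?"],
    "How do AWS and Azure migration costs compare?",
    sparse_retriever,
)
print([h["status"] for h in result["hops"]])
# → ['ok', 'skipped', 'follows_skipped_hop']
```

If your decomposition step records explicit dependencies between sub-questions, the flag can be narrowed to hops that actually depend on the skipped one.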

Should I show the decomposition to the user?

Only in transparency-focused applications. For general RAG, keep decomposition details out of the user-facing output and log them for debugging. Users care about the final answer, not intermediate steps.

Can I run hops in parallel instead of sequentially?

Yes, if your sub-questions are independent (e.g., "AWS costs" and "Azure costs"). Use asyncio or concurrent.futures for parallel retrieval. This can cut latency from about 1500 ms to about 600 ms, but it only works when no sub-question needs an earlier answer, and most 3–4 hop chains have such dependencies.
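For independent sub-questions, a thread pool is usually enough, since retrieval is I/O-bound. The following is a minimal sketch using concurrent.futures from the standard library; parallel_retrieve and demo_retriever are hypothetical names, and the retriever is a stub.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_retrieve(
    sub_questions: list[str],
    retriever_fn: Callable[[str], list[str]],
    max_workers: int = 4,
) -> list[dict]:
    """Run one retrieval per sub-question concurrently, preserving hop order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        doc_lists = list(pool.map(retriever_fn, sub_questions))
    return [
        {"hop": i, "question": question, "documents": docs}
        for i, (question, docs) in enumerate(zip(sub_questions, doc_lists), 1)
    ]


def demo_retriever(query: str) -> list[str]:
    """Stub retriever; a real one would call a search index."""
    return [f"Document about: {query}"]


hops = parallel_retrieve(["AWS costs", "Azure costs"], demo_retriever)
print([h["documents"][0] for h in hops])
# → ['Document about: AWS costs', 'Document about: Azure costs']
```

Because no hop sees another hop's results here, this variant gives up the context carry-over that sequential chaining provides.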

How do I evaluate multi-hop retrieval quality?

Use three metrics: (1) hop decomposition correctness (does each sub-question address the original?), (2) retrieval recall (did each hop fetch relevant docs?), (3) answer accuracy (does the final response correctly synthesize hops?). Manual evaluation on 50–100 test queries is essential before deployment.
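Metrics (2) and (3) are straightforward to compute once you have labeled test queries. The sketch below assumes documents carry string IDs and that answers can be compared by normalized exact match, which is a simplification; hop_recall and answer_accuracy are hypothetical helper names.

```python
def hop_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of the labeled relevant documents that a hop retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def answer_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Share of answers matching the gold answer after trimming and lowercasing."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold)
    )
    return matches / len(gold)


print(hop_recall(["d1", "d2", "d9"], ["d1", "d2", "d3"]))
print(answer_accuracy(["Paris", " rome "], ["paris", "Milan"]))  # → 0.5
```

Exact match undercounts correct free-form answers, so manual review of the test set remains necessary for metric (1) and for borderline cases in metric (3).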

Further Reading