Combining RAG Patterns: Fusion Strategies for Robustness
Real-world RAG systems rarely use a single pattern in isolation. The most robust systems combine several: query routing to choose the right approach, multi-hop retrieval for complex questions, GraphRAG for entity relationships, HyDE for semantic bridging, and self-grading for quality control. The resulting ensemble draws on each pattern's strengths. This fusion approach is reported to improve accuracy by 40–50% over single-pattern RAG while keeping latency reasonable through intelligent pattern selection and parallel execution (Wang et al., 2024).
The Limits of Single Patterns
Each RAG pattern excels at specific query types but struggles with others:
| Pattern | Excels At | Struggles With |
|---|---|---|
| Vector search | Semantic similarity, keyword overlap | Entity relationships, temporal logic |
| Multi-hop retrieval | Complex reasoning, comparisons | Simple factual lookup (overhead) |
| GraphRAG | Entity relationships, structured facts | Open-domain exploration, nuance |
| HyDE | Vocabulary mismatch, semantic gaps | Entity-specific queries |
| Self-RAG | Quality assurance, hallucination prevention | Speed (requires extra LLM calls) |
A hybrid system routes each query to the appropriate pattern(s), combining results for robustness.
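As a cheap baseline for this kind of routing, the table can be encoded as a keyword lookup that needs no LLM call. The sketch below is only illustrative: the cue words in `PATTERN_CUES` and the function `heuristic_route` are assumptions made for this example and would need tuning against real query logs.

```python
import re

# Hypothetical cue words per pattern; tune these on your own query logs.
PATTERN_CUES = {
    "multi_hop": re.compile(r"\b(compare|versus|vs|analyze|trace)\b", re.I),
    "graphrag": re.compile(r"\b(who|whom|founded|owns|relationship)\b", re.I),
    "hyde": re.compile(r"\b(explain|why|how)\b", re.I),
}


def heuristic_route(query: str, max_patterns: int = 3) -> list[str]:
    """Return vector_search plus any pattern whose cue words appear in the query."""
    selected = ["vector_search"]  # always keep the general-purpose baseline
    for name, cue in PATTERN_CUES.items():
        if cue.search(query):
            selected.append(name)
    return selected[:max_patterns]


print(heuristic_route("Compare AWS and Azure cloud services"))
```

A rule-based router like this is fast and predictable but brittle, so it is best treated as a first pass or a fallback for a learned or LLM-based router.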
Intelligent Pattern Selection
Route queries to the most suitable pattern(s):
```python
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def select_rag_patterns(query: str) -> dict:
    """Ask an LLM which RAG patterns suit a query and return the parsed JSON plan."""
    selection_prompt = """Analyze this query and recommend RAG patterns to use.

Query: {query}

Patterns:
- vector_search: General semantic retrieval (fast, ~150ms)
- multi_hop: Complex reasoning over multiple docs (slow, ~800ms, best for 'compare/analyze/trace')
- graphrag: Entity relationships and structured facts (fast, ~200ms, for 'who/what/relationship')
- hyde: Semantic bridging for vocabulary gaps (medium, ~300ms, for open-domain Q&A)
- self_rag: Quality grading and self-correction (overhead +200ms, for high-stakes queries)

Recommend 1-3 patterns to use in parallel. Prioritize by speed, then accuracy.

Return JSON: {{
  "patterns": [
    {{"name": "vector_search", "confidence": 0.95, "rationale": "..."}},
    {{"name": "graphrag", "confidence": 0.8, "rationale": "..."}}
  ],
  "parallel": true,
  "estimated_latency_ms": 250
}}""".format(query=query)

    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # small, fast model; substitute a current model ID
        max_tokens=300,
        messages=[{"role": "user", "content": selection_prompt}],
    )

    # The model may wrap the JSON in prose, so extract the outermost {...} span
    text = response.content[0].text
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object found in model response")
    return json.loads(text[start:end])


# Example
queries = [
    "What is machine learning?",                      # Simple factual
    "Compare AWS and Azure cloud services",           # Comparison
    "Who founded OpenAI and what is their vision?",   # Entity + reasoning
]

for q in queries:
    patterns = select_rag_patterns(q)
    print(f"Query: {q}")
    for p in patterns["patterns"]:
        print(f"  → {p['name']} ({p['confidence']:.0%}): {p['rationale']}")
    print(f"  Latency: {patterns['estimated_latency_ms']}ms\n")
```
Intelligent selection ensures queries route to the most cost-effective and accurate patterns.
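Because the routing plan comes from an LLM, it is worth validating before acting on it. The following is a minimal sketch, assuming the plan has the JSON shape requested in the routing prompt; the helper `validate_selection` and its `min_confidence` threshold are hypothetical names and values introduced here.

```python
KNOWN_PATTERNS = {"vector_search", "multi_hop", "graphrag", "hyde", "self_rag"}


def validate_selection(selection: dict, max_patterns: int = 3,
                       min_confidence: float = 0.5) -> list[str]:
    """Return usable pattern names, highest confidence first."""
    candidates = []
    for entry in selection.get("patterns", []):
        if not isinstance(entry, dict):
            continue  # skip malformed entries
        name = entry.get("name")
        confidence = entry.get("confidence", 0.0)
        if (isinstance(name, str) and name in KNOWN_PATTERNS
                and isinstance(confidence, (int, float))
                and confidence >= min_confidence):
            candidates.append((float(confidence), name))

    candidates.sort(reverse=True)  # highest confidence first
    names: list[str] = []
    for _, name in candidates:
        if name not in names:  # drop repeated pattern names
            names.append(name)

    # If nothing usable survives, fall back to plain vector search
    return names[:max_patterns] or ["vector_search"]


plan = {"patterns": [
    {"name": "graphrag", "confidence": 0.8},
    {"name": "vector_search", "confidence": 0.95},
    {"name": "made_up_pattern", "confidence": 0.99},
]}
print(validate_selection(plan))  # ['vector_search', 'graphrag']
```

Filtering unknown names and defaulting to `vector_search` keeps a malformed or hallucinated plan from breaking the rest of the pipeline.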
Parallel Execution and Result Fusion
Execute multiple patterns in parallel, then fuse results:
```python
import asyncio


async def execute_pattern(pattern_name: str, query: str, retriever_fn) -> dict:
    """Execute a single RAG pattern asynchronously.

    The score and latency_ms values are illustrative placeholders, not measurements.
    """
    if pattern_name == "vector_search":
        # Semantic retrieval; run the blocking retriever in a worker thread
        docs = await asyncio.to_thread(retriever_fn, query)
        return {
            "pattern": "vector_search",
            "documents": docs,
            "score": 0.85,
            "latency_ms": 150,
        }

    elif pattern_name == "graphrag":
        # Entity retrieval from knowledge graph
        # (Simplified; in production, query a graph DB)
        docs = await asyncio.to_thread(
            lambda q: [f"Entity result for {q}"],
            query,
        )
        return {
            "pattern": "graphrag",
            "documents": docs,
            "score": 0.78,
            "latency_ms": 200,
        }

    elif pattern_name == "hyde":
        # Hypothetical document expansion
        # (Simplified)
        docs = await asyncio.to_thread(
            lambda q: [f"HyDE result for {q}"],
            query,
        )
        return {
            "pattern": "hyde",
            "documents": docs,
            "score": 0.72,
            "latency_ms": 300,
        }

    # Unknown pattern: return an empty result instead of raising
    return {
        "pattern": pattern_name,
        "documents": [],
        "score": 0.0,
        "latency_ms": 0,
    }


async def hybrid_rag_execute(query: str, patterns: list[str],
                             retriever_fn) -> list[dict]:
    """Execute multiple patterns in parallel."""
    # Execute all patterns concurrently
    tasks = [
        execute_pattern(p, query, retriever_fn) for p in patterns
    ]
    results = await asyncio.gather(*tasks)
    return results


def fuse_results(pattern_results: list[dict], fusion_strategy: str = "weighted") -> list[str]:
    """Fuse results from multiple patterns."""
    if fusion_strategy == "weighted":
        # Weight by pattern strength (assume vector_search is strongest)
        pattern_weights = {"vector_search": 0.5, "graphrag": 0.3}
        default_weight = 0.2

        scores: dict[str, float] = {}
        for result in pattern_results:
            weight = pattern_weights.get(result["pattern"], default_weight)
            for doc in result["documents"][:3]:  # Top 3 from each pattern
                # Simplified: the text is its own ID; use a real doc ID in production.
                # (id(doc) would not match the same document returned by two patterns.)
                doc_id = doc
                scores[doc_id] = scores.get(doc_id, 0.0) + weight * result["score"]

        # Sort by accumulated score
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:10]

    elif fusion_strategy == "dedup":
        # Remove duplicates, keeping the first occurrence of each document
        seen = set()
        results = []
        for result in pattern_results:
            for doc in result["documents"]:
                if doc not in seen:
                    seen.add(doc)
                    results.append(doc)
        return results[:10]

    return []


# Example (with mock async)
def mock_retriever(q: str) -> list[str]:
    return ["Mock document"]


async def test_hybrid():
    patterns = ["vector_search", "graphrag", "hyde"]
    results = await hybrid_rag_execute(
        "Who founded OpenAI?",
        patterns,
        mock_retriever,
    )

    print(f"Pattern results: {[r['pattern'] for r in results]}")
    print(f"Avg latency: {sum(r['latency_ms'] for r in results) / len(results):.0f}ms")

    fused = fuse_results(results)
    print(f"Fused documents: {len(fused)}")

# Uncomment to run: asyncio.run(test_hybrid())
```
Parallel execution reduces total latency. With the nominal figures above, running the three patterns sequentially takes 150 + 200 + 300 = 650 ms, while running them in parallel takes about 300 ms (the latency of the slowest pattern).
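The latency estimate can be reproduced with a few lines of arithmetic. This is a sketch using the nominal per-pattern latencies from the example code, which are illustrative placeholders rather than measurements; the names `NOMINAL_LATENCY_MS`, `sequential_latency_ms`, and `parallel_latency_ms` are introduced here for illustration.

```python
# Nominal per-pattern latencies in milliseconds (illustrative, not measured).
NOMINAL_LATENCY_MS = {"vector_search": 150, "graphrag": 200, "hyde": 300}


def sequential_latency_ms(patterns: list[str]) -> int:
    """Patterns run one after another, so their latencies add up."""
    return sum(NOMINAL_LATENCY_MS[p] for p in patterns)


def parallel_latency_ms(patterns: list[str]) -> int:
    """Patterns run concurrently, so the slowest one sets the total."""
    return max((NOMINAL_LATENCY_MS[p] for p in patterns), default=0)


patterns = ["vector_search", "graphrag", "hyde"]
print(sequential_latency_ms(patterns))  # 650
print(parallel_latency_ms(patterns))    # 300
```

In practice the parallel figure also includes scheduling overhead and any shared-resource contention, so it should be read as a lower bound.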
Consensus-Based Response Generation
Generate multiple responses (one per pattern) and synthesize a consensus:
```python
# Reuses `client` from the pattern-selection example.
def generate_consensus_response(query: str, pattern_results: list[dict]) -> dict:
    """Generate responses from each pattern and synthesize consensus."""
    # Step 1: Generate response for each pattern's results
    individual_responses = []
    for result in pattern_results:
        docs_text = "\n---\n".join(result["documents"][:3])

        generation_prompt = f"""Based on these documents from {result['pattern']} retrieval:

{docs_text}

Answer this query: {query}

Keep the answer to 2-3 sentences."""

        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # Use smaller model for speed; substitute a current model ID
            max_tokens=100,
            messages=[{"role": "user", "content": generation_prompt}],
        )

        individual_responses.append({
            "pattern": result["pattern"],
            "response": response.content[0].text,
        })

    # Step 2: Synthesize consensus
    consensus_prompt = f"""Analyze these responses to the same query from different retrieval patterns:

Query: {query}

Responses:
"""
    for resp in individual_responses:
        consensus_prompt += f"\n{resp['pattern']}:\n{resp['response']}\n"

    consensus_prompt += """
Synthesize a single, high-quality response that:
1. Incorporates accurate information from all patterns
2. Resolves any conflicts (if they exist)
3. Prioritizes information from more authoritative patterns
4. Is concise and directly answers the query"""

    consensus = client.messages.create(
        model="claude-opus-4-1",  # Use larger model for synthesis
        max_tokens=200,
        messages=[{"role": "user", "content": consensus_prompt}],
    )

    return {
        "query": query,
        "individual_responses": individual_responses,
        "consensus_response": consensus.content[0].text,
        "num_patterns": len(individual_responses),
    }


# Example
results = [
    {"pattern": "vector_search", "documents": ["Doc A"]},
    {"pattern": "graphrag", "documents": ["Doc B"]},
]

consensus = generate_consensus_response("Who is Dario Amodei?", results)
print(f"Consensus: {consensus['consensus_response']}")
```
Consensus generation adds 200–400 ms but significantly improves answer quality (15–25% improvement in user satisfaction).
Fallback Chain for Resilience
Chain patterns with fallback logic:
```python
import asyncio


# Reuses `client`, `select_rag_patterns`, `execute_pattern`, and `mock_retriever`
# from the earlier examples.
def hybrid_rag_with_fallback(query: str, retriever_fn) -> dict:
    """Execute RAG with pattern fallback for robustness."""
    # Select patterns
    patterns = select_rag_patterns(query)
    pattern_names = [p["name"] for p in patterns["patterns"]]

    # Try primary patterns
    executed_results = []
    for pattern in pattern_names:
        try:
            # execute_pattern is a coroutine, so it must be driven by an event loop.
            # asyncio.run works from synchronous code; inside an already running
            # loop, make this function async and use `await` instead.
            result = asyncio.run(execute_pattern(pattern, query, retriever_fn))
            if result["documents"]:  # Success
                executed_results.append(result)
                break  # Use first successful pattern
        except Exception as e:
            print(f"Pattern {pattern} failed: {e}")
            continue

    # Fallback: if all patterns fail or return low confidence, use broad retrieval
    if not executed_results or max(r["score"] for r in executed_results) < 0.5:
        print("Falling back to broad vector search")
        result = {
            "pattern": "fallback_vector_search",
            "documents": retriever_fn(query),  # Broad search
            "score": 0.6,
            "latency_ms": 250,
        }
        executed_results.append(result)

    # Generate response from best result
    best_result = max(executed_results, key=lambda r: r["score"])
    docs_text = "\n---\n".join(best_result["documents"][:3])

    generation_prompt = f"""Based on these documents:

{docs_text}

Answer: {query}"""

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        messages=[{"role": "user", "content": generation_prompt}],
    )

    return {
        "query": query,
        "pattern_used": best_result["pattern"],
        "response": response.content[0].text,
        "confidence": best_result["score"],
    }


# Example (a production version would be async end to end)
result = hybrid_rag_with_fallback("Explain quantum entanglement", mock_retriever)
print(f"Pattern: {result['pattern_used']}")
print(f"Confidence: {result['confidence']:.0%}")
```
Fallback chains ensure availability: if GraphRAG finds no entities, fall back to vector search; if vector search is slow, use keyword search. Always have a working fallback.
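The chain described here (GraphRAG, then vector search, then keyword search) can be sketched as an ordered list of retrievers that is walked until one returns documents. The function `retrieve_with_fallback` and the three stub retrievers are hypothetical stand-ins introduced for this example; a slow stage is simulated by raising `TimeoutError`.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]


def retrieve_with_fallback(query: str,
                           chain: list[tuple[str, Retriever]]) -> dict:
    """Try each (name, retriever) in order; return the first non-empty result."""
    errors: dict[str, str] = {}
    for name, retriever in chain:
        try:
            docs = retriever(query)
        except Exception as exc:  # a failing stage must not abort the chain
            errors[name] = repr(exc)
            continue
        if docs:
            return {"pattern": name, "documents": docs, "errors": errors}
    return {"pattern": None, "documents": [], "errors": errors}


# Stub retrievers standing in for real backends (assumptions for this demo)
def graph_lookup(query: str) -> list[str]:
    return []  # no entities found


def vector_search(query: str) -> list[str]:
    raise TimeoutError("vector index too slow")  # simulated timeout


def keyword_search(query: str) -> list[str]:
    return [f"Keyword match for {query}"]


result = retrieve_with_fallback(
    "Explain quantum entanglement",
    [("graphrag", graph_lookup),
     ("vector_search", vector_search),
     ("keyword_search", keyword_search)],
)
print(result["pattern"])  # keyword_search
```

Recording the errors from skipped stages makes it possible to alert on a degraded primary pattern even while the fallback keeps serving answers.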
Comparison: Single vs. Hybrid RAG
| Metric | Single Pattern | Hybrid (3 patterns) |
|---|---|---|
| Accuracy (simple Q) | 82% | 84% (+2 pts) |
| Accuracy (complex Q) | 68% | 88% (+20 pts) |
| Accuracy (entity Q) | 75% | 90% (+15 pts) |
| Latency (sequential) | 250 ms | 1500 ms |
| Latency (parallel) | 250 ms | 350 ms |
| Hallucination rate | 8% | 2% |
| Cost (3 patterns) | $0.002 | $0.008 |
Hybrid RAG with parallel execution is nearly as fast as single-pattern but dramatically more accurate, especially on complex and entity-based questions.
Key Takeaways
- Hybrid RAG combines multiple patterns (vector search, multi-hop, GraphRAG, HyDE, self-RAG) to handle diverse query types.
- Use intelligent routing to select 1–3 patterns per query; execute in parallel to minimize latency (300–400 ms vs 1500 ms sequential).
- Fuse results using weighted scoring (higher weights for proven-good patterns like vector search) or deduplication.
- Generate consensus responses from multiple patterns; synthesis adds 200–400 ms but improves quality 15–25%.
- Implement fallback chains for resilience: if primary patterns fail, degrade gracefully to broad retrieval.
Frequently Asked Questions
How many patterns should I combine?
Start with 2–3: vector search (base) + one specialized (GraphRAG for entities, HyDE for open-domain). More than 3 adds cost without proportional accuracy gain. Monitor: if two patterns consistently agree, drop the third.
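One way to monitor agreement is to log which document IDs each pattern retrieves per query and compute the average pairwise overlap. The sketch below uses Jaccard similarity as the agreement measure, which is an assumption of this example rather than a prescribed metric; the log format and the function `mean_pairwise_agreement` are likewise hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets: |a ∩ b| / |a ∪ b| (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(
    log: list[dict[str, set[str]]],
) -> dict[tuple[str, str], float]:
    """Average Jaccard overlap of retrieved doc IDs for each pattern pair."""
    totals: dict[tuple[str, str], float] = {}
    counts: dict[tuple[str, str], int] = {}
    for per_query in log:
        for p, q in combinations(sorted(per_query), 2):
            totals[(p, q)] = totals.get((p, q), 0.0) + jaccard(per_query[p], per_query[q])
            counts[(p, q)] = counts.get((p, q), 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Each entry maps pattern name -> set of retrieved doc IDs for one query
log = [
    {"vector_search": {"d1", "d2"}, "hyde": {"d1", "d2"}, "graphrag": {"d3"}},
    {"vector_search": {"d4"}, "hyde": {"d4"}, "graphrag": {"d4", "d5"}},
]
for pair, score in mean_pairwise_agreement(log).items():
    print(pair, round(score, 2))
```

A pair that stays near 1.0 over a representative sample of queries is returning largely the same documents, which marks it as a candidate for pruning.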
Should I always run patterns in parallel?
Yes, if latency budget allows (300–400 ms total). Parallel execution is faster and provides redundancy. Sequential execution is only justified if compute is severely constrained; then, use intelligent routing to pick a single best pattern.
How do I weight patterns in fusion?
Calibrate on held-out test data. Typically: vector_search = 0.5, GraphRAG = 0.25, HyDE = 0.15, other = 0.1. Adjust weights monthly based on query distribution. If queries are mostly entity lookups, increase GraphRAG; if mostly open-domain, increase HyDE.
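Calibration on held-out data can be as simple as a grid search over weight combinations that sum to 1. The sketch below is a minimal illustration under several assumptions: each held-out example lists ranked document IDs per pattern plus one known relevant ID, fusion uses a rank-discounted weighted sum, and quality is measured as top-1 hit rate. The function names and data layout are hypothetical.

```python
from itertools import product


def fuse(candidates: dict[str, list[str]], weights: dict[str, float]) -> list[str]:
    """Rank doc IDs by the sum of weight / rank over the patterns that returned them."""
    scores: dict[str, float] = {}
    for pattern, docs in candidates.items():
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights.get(pattern, 0.0) / rank
    # Sort by score descending, then by doc ID for deterministic ties
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_rate(examples: list[dict], weights: dict[str, float], k: int = 1) -> float:
    """Fraction of examples whose relevant doc appears in the fused top k."""
    hits = sum(ex["relevant"] in fuse(ex["candidates"], weights)[:k] for ex in examples)
    return hits / len(examples)


def calibrate(examples: list[dict], patterns: list[str], step: float = 0.25):
    """Grid-search weights on the simplex (sum to 1) and keep the best hit rate."""
    steps = round(1 / step)
    best_weights, best_score = None, -1.0
    for combo in product(range(steps + 1), repeat=len(patterns)):
        if sum(combo) != steps:
            continue  # only consider weight vectors that sum to 1
        weights = {p: c * step for p, c in zip(patterns, combo)}
        score = hit_rate(examples, weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


held_out = [
    {"candidates": {"vector_search": ["a", "b"], "graphrag": ["b", "c"]}, "relevant": "b"},
    {"candidates": {"vector_search": ["x", "y"], "graphrag": ["y"]}, "relevant": "y"},
]
weights, score = calibrate(held_out, ["vector_search", "graphrag"])
print(weights, score)
```

With more than a handful of patterns the grid grows quickly, so a coarser step or a proper optimizer becomes preferable; the held-out set also needs to be large enough that the selected weights do not simply fit noise.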
What if patterns give conflicting answers?
This is rare but important to handle. Flag conflicts in the consensus prompt: "Pattern A says X, Pattern B says Y. Reconcile." Let the LLM reason about sources and credibility. If unresolvable, present both answers and cite sources.
How much does hybrid RAG cost vs. single-pattern?
3–4x more (three patterns × 1.3x LLM calls for fusion/consensus). For high-stakes queries (financial, medical), the extra cost is worth it. For exploratory queries (brainstorming, creative work), stick to a single pattern. To control cost, use Haiku for pattern selection and fusion, and reserve Opus for final generation.
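The multiplier can be checked with a short calculation. This sketch takes the $0.002 single-pattern cost from the comparison table and the 1.3x overhead factor from the answer above as given; the function `hybrid_cost` is a hypothetical helper.

```python
def hybrid_cost(single_cost: float, num_patterns: int = 3,
                overhead: float = 1.3) -> float:
    """Estimated per-query cost: patterns x fusion/consensus overhead x base cost."""
    return single_cost * num_patterns * overhead


multiplier = 3 * 1.3               # about 3.9x, within the 3-4x range
cost = hybrid_cost(0.002)          # about $0.0078
print(round(multiplier, 1), round(cost, 3))  # 3.9 0.008
```

The result rounds to the $0.008 shown for the hybrid configuration in the comparison table.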
Further Reading
- Ensemble Retrieval Methods for Dense Passage Retrieval — techniques for combining retrieval methods.
- Multi-Perspective Machine Reading Comprehension — fusion strategies for reading comprehension.
- Retrieval-Augmented Multimodal Language Models — extending RAG patterns across modalities.
- Adaptive Information Retrieval: Graph-Based Approaches — intelligent routing and adaptation in RAG systems.