Corrective RAG (CRAG): Adaptive Fallback Retrieval
Corrective RAG (CRAG), introduced by Yan et al. (2024), is a framework that grades the quality of retrieved documents and escalates to alternative retrieval strategies when that quality is low. If the initial retrieval returns low-relevance documents, CRAG triggers fallback mechanisms such as broader search, web search, or knowledge distillation before generating a response. This approach improves accuracy by 25–35% on questions where traditional retrieval fails, and it preserves speed by skipping escalation when the initial retrieval succeeds.
The CRAG Pipeline
CRAG operates in three stages:
- Retrieval quality assessment: Score retrieved documents for relevance and coverage.
- Fallback decision: If quality is low, escalate to alternative retrieval methods.
- Knowledge distillation: Generate synthetic documents from lower-quality sources if alternatives also fail.
This tiered approach ensures answers are generated from high-quality sources when available, but gracefully degrades to lower-quality sources or synthesis rather than hallucinating.
Document Quality Assessment
Assess retrieved documents before generation:
from anthropic import Anthropic
import json

client = Anthropic()

def assess_retrieval_quality(query: str, documents: list[str]) -> dict:
    """Assess the quality of retrieved documents."""
    assessment_prompt = """Evaluate the quality of these retrieved documents for answering the query.

Query: {query}

Documents:
{docs_text}

For each document, score on:
- Relevance (0-100): Does it directly address the query?
- Comprehensiveness (0-100): Does it cover the main aspects?
- Authority (0-100): Does it come from a credible source?
- Recency (0-100): Is it current? (100 if N/A)

Return JSON: {{
  "overall_quality": 0-100,
  "quality_level": "HIGH|MEDIUM|LOW",
  "per_document": [
    {{"relevance": X, "comprehensiveness": Y, "authority": Z, "recency": W}}
  ],
  "missing_aspects": ["aspect1", "aspect2"],
  "recommendation": "USE|REFINE|ESCALATE"
}}""".format(
        query=query,
        docs_text="\n---\n".join([f"Doc {i}: {d[:300]}" for i, d in enumerate(documents)])
    )

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=400,
        messages=[{"role": "user", "content": assessment_prompt}]
    )

    # Extract the JSON object from the reply, ignoring any surrounding text
    text = response.content[0].text
    start = text.find('{')
    end = text.rfind('}') + 1
    return json.loads(text[start:end])

# Example
docs = [
    "Q3 2025 Microsoft revenue was $67.2B (source: investor.microsoft.com, June 2025)",
    "Cloud computing growth accelerated in 2024..."
]

quality = assess_retrieval_quality("Microsoft Q3 2025 earnings", docs)
print(f"Quality Level: {quality['quality_level']}")
print(f"Recommendation: {quality['recommendation']}")
if quality['recommendation'] == "ESCALATE":
    print(f"Missing aspects: {quality['missing_aspects']}")
Define quality thresholds: HIGH (score >75), MEDIUM (50–75), LOW (<50). CRAG escalates only if quality is LOW or MEDIUM with missing critical aspects.
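The thresholds and the escalation rule above can be sketched as two small functions. This is a minimal illustration, not part of the original pipeline: the names `quality_level` and `should_escalate`, and the caller-supplied `critical_aspects` set, are assumptions introduced here to make "missing critical aspects" concrete.

```python
def quality_level(score: float) -> str:
    """Map a 0-100 overall score to the HIGH / MEDIUM / LOW bands."""
    if score > 75:
        return "HIGH"
    if score >= 50:
        return "MEDIUM"
    return "LOW"


def should_escalate(score: float, missing_aspects: list[str],
                    critical_aspects: set[str]) -> bool:
    """Escalate on LOW, or on MEDIUM when a critical aspect is missing."""
    level = quality_level(score)
    if level == "LOW":
        return True
    if level == "MEDIUM":
        # `critical_aspects` is a hypothetical set the caller defines per query type.
        return any(aspect in critical_aspects for aspect in missing_aspects)
    return False


# A MEDIUM score escalates only when the gap matters for the query.
print(should_escalate(62, ["fiscal year"], {"fiscal year"}))  # True
print(should_escalate(62, ["tone"], {"fiscal year"}))         # False
```

Keeping the banding in code rather than trusting the grader's own quality_level field makes the decision deterministic for a given score.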
Fallback Retrieval Strategies
If initial retrieval quality is low, escalate to alternatives:
def fallback_retrieval(query: str, initial_docs: list[str],
                       retriever_fn, fallback_level: int = 0) -> tuple[list[str], str]:
    """Escalate retrieval through fallback strategies."""
    fallback_strategies = [
        {
            "name": "Broader search",
            "description": "Remove query filters; search broader knowledge base",
            "params": {"filter_strict": False, "top_k": 20}
        },
        {
            "name": "Web search",
            "description": "Search the web for current information",
            "params": {"search_type": "web", "top_k": 10}
        },
        {
            "name": "Knowledge distillation",
            "description": "Synthesize answer from lower-quality sources",
            "params": {"synthesis": True}
        }
    ]

    if fallback_level >= len(fallback_strategies):
        # Exhausted all strategies; return the last documents with a warning
        return initial_docs, "LOW_QUALITY_DOCUMENTS"

    strategy = fallback_strategies[fallback_level]
    print(f"Escalating to: {strategy['name']}")
    print(f"  {strategy['description']}")

    # Execute fallback retrieval
    if strategy['params'].get('search_type') == 'web':
        # In production, use a real web search API (Bing, Google)
        fallback_docs = retriever_fn(query, **strategy['params'])
    elif strategy['params'].get('synthesis'):
        # Knowledge distillation: synthesize from heterogeneous sources
        fallback_docs = synthesize_documents(query, initial_docs)
    else:
        fallback_docs = retriever_fn(query, **strategy['params'])

    # Re-assess quality
    quality = assess_retrieval_quality(query, fallback_docs)
    if quality['quality_level'] in ['HIGH', 'MEDIUM']:
        label = strategy['name'].upper().replace(' ', '_')
        return fallback_docs, f"ESCALATED_TO_{label}"
    else:
        # Try next strategy
        return fallback_retrieval(query, fallback_docs, retriever_fn, fallback_level + 1)

def synthesize_documents(query: str, documents: list[str]) -> list[str]:
    """Generate synthetic documents using knowledge distillation."""
    synthesis_prompt = """Using these documents and general knowledge,
synthesize a comprehensive answer to the query.

Query: {query}

Documents:
{docs_text}

Generate a synthetic document (200-300 words) that synthesizes
the key information and fills gaps from your training knowledge.""".format(
        query=query,
        docs_text="\n---\n".join(documents[:3])
    )

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    return [response.content[0].text]

def mock_retriever(query: str, **kwargs) -> list[str]:
    """Mock retriever for demonstration."""
    if kwargs.get('search_type') == 'web':
        return ["Web search result about " + query]
    else:
        return ["Broader search result about " + query]

# Example (argument order: query, initial documents, retriever)
initial = ["Document A (low relevance)"]
corrected, strategy = fallback_retrieval("Q3 2025 earnings by industry", initial, mock_retriever)
print(f"Strategy used: {strategy}")
Comparison: Retrieval with and without CRAG
| Scenario | Standard RAG | CRAG |
|---|---|---|
| High-quality initial retrieval | Answer correct (1 hop, 200 ms) | Answer correct (1 hop, 250 ms) |
| Medium-quality retrieval | Answer incomplete (60–70% accuracy) | Escalate + correct answer (85–90% accuracy) |
| Low-quality retrieval | Hallucination (30–40% accuracy) | Escalate → synthesis → correct answer (80–85% accuracy) |
| Average latency (all queries) | 250 ms | 300–400 ms (20–60% slower) |
| Hallucination rate | 8–12% | 2–4% |
CRAG adds 50–150 ms of latency on average (300–400 ms versus 250 ms) but reduces hallucination and improves accuracy on hard questions by 30–50%.
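The average latency in the table depends on how often queries escalate. A rough way to reason about it is an expected-value calculation, sketched below. The function name and all input values are illustrative assumptions, not measurements from the article; only the 250 ms baseline is taken from the table.

```python
def expected_latency_ms(base_ms: float, grading_ms: float,
                        escalation_rate: float, fallback_ms: float) -> float:
    """Average latency = baseline + grading + P(escalate) * fallback cost."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return base_ms + grading_ms + escalation_rate * fallback_ms


# Assumed inputs: 250 ms baseline, 50 ms grading overhead on every query,
# 25% of queries escalating, 200 ms average cost per escalation.
avg = expected_latency_ms(base_ms=250, grading_ms=50,
                          escalation_rate=0.25, fallback_ms=200)
print(avg)  # 350.0
```

Under these assumptions the grading step is paid on every query, while the fallback cost is paid only by the fraction that escalates, which is why the average stays well below the worst case.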
Knowledge Distillation for Synthetic Documents
When retrieval fails completely, generate synthetic documents:
def knowledge_distillation(query: str, documents: list[str],
                           model_type: str = "distillation") -> str:
    """Distill knowledge from mixed sources into a synthetic document."""
    docs_text = "\n---\n".join(documents[:2])

    if model_type == "distillation":
        prompt = """Generate a synthetic reference document that would answer this query.

Query: {query}

Supporting sources (may be partial or low-quality):
{docs}

Synthesize a complete, coherent document that:
1. Answers the query directly
2. Incorporates accurate information from the sources
3. Fills gaps using your training knowledge
4. Acknowledges any uncertainty or missing information

Format as if from an authoritative source (e.g., official docs, research paper).""".format(
            query=query,
            docs=docs_text
        )
    else:
        # Alternative: summarization-based distillation.
        # The documents are included so the model has something to summarize.
        prompt = f"""Summarize and synthesize these documents into a unified answer for: {query}

Documents:
{docs_text}"""

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Synthetic documents are lower quality than real sources but prevent complete failure. Use them as a last resort.
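One way to keep synthetic documents in their last-resort role is to carry provenance alongside each document, so that the generation prompt lists real sources first and marks synthetic text explicitly. The sketch below is an assumption layered on top of the article's code: the `SourcedDocument` class, the `origin` labels, and `build_context` are hypothetical names, not part of CRAG itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcedDocument:
    text: str
    origin: str  # "retrieved" or "synthetic" (labels assumed for this sketch)


def build_context(docs: list[SourcedDocument], max_docs: int = 3) -> str:
    """Put retrieved documents first and label synthetic ones explicitly."""
    # sorted() is stable, so documents keep their relative order within each group.
    ordered = sorted(docs, key=lambda d: d.origin == "synthetic")
    parts = []
    for d in ordered[:max_docs]:
        label = "[SYNTHETIC, unverified]" if d.origin == "synthetic" else "[RETRIEVED]"
        parts.append(f"{label} {d.text}")
    return "\n---\n".join(parts)


context = build_context([
    SourcedDocument("Distilled summary of partial sources.", "synthetic"),
    SourcedDocument("Official filing excerpt.", "retrieved"),
])
print(context)
```

Because the context is truncated after sorting, a synthetic document is dropped first whenever enough retrieved documents are available.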
Integration: Complete CRAG Pipeline
def corrective_rag(query: str, retriever_fn, threshold: float = 0.6) -> dict:
    """Full CRAG pipeline."""
    # Step 1: Initial retrieval
    documents = retriever_fn(query)

    # Step 2: Assess quality
    quality = assess_retrieval_quality(query, documents)

    # Step 3: Fallback if necessary (overall_quality is 0-100, threshold is 0-1)
    if quality['overall_quality'] / 100 < threshold:
        documents, fallback_used = fallback_retrieval(
            query, documents, retriever_fn
        )
    else:
        fallback_used = "NONE"

    # Step 4: Generate response
    context = "\n---\n".join(documents[:3])
    generation_prompt = f"""Based on these documents:

{context}

Answer: {query}"""

    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=300,
        messages=[{"role": "user", "content": generation_prompt}]
    ).content[0].text

    return {
        "query": query,
        "initial_quality": quality['overall_quality'],
        "fallback_used": fallback_used,
        "final_documents": len(documents),
        "response": response
    }

# Test
result = corrective_rag("Explain quantum computing", mock_retriever)
print(f"Quality: {result['initial_quality']}/100")
print(f"Fallback: {result['fallback_used']}")
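The pipeline above calls the grading model on every run, which makes the escalation logic awkward to test. A possible alternative, sketched here under stated assumptions, is an iterative loop that takes the grader and the retrieval tiers as arguments so that both can be stubbed. The names `escalate`, `stub_grade`, and `tiers` are hypothetical, and the stub grader is a stand-in for an LLM or classifier, not a real relevance measure.

```python
from typing import Callable

Grader = Callable[[str, list[str]], float]   # returns a 0-100 quality score
Retriever = Callable[[str], list[str]]


def escalate(query: str, tiers: list[tuple[str, Retriever]],
             grade: Grader, threshold: float = 60.0) -> tuple[list[str], str]:
    """Try each tier in order; stop at the first whose documents meet the threshold."""
    docs: list[str] = []
    for name, retrieve in tiers:
        docs = retrieve(query)
        if docs and grade(query, docs) >= threshold:
            return docs, name
    # No tier was good enough: return the last result with an explicit marker.
    return docs, "EXHAUSTED"


def stub_grade(query: str, docs: list[str]) -> float:
    """Stub grader for tests: only web results count as high quality."""
    return 90.0 if any(d.startswith("web hit") for d in docs) else 20.0


tiers = [
    ("initial", lambda q: ["vector store hit for " + q]),
    ("broader", lambda q: ["broader hit for " + q]),
    ("web", lambda q: ["web hit for " + q]),
]

docs, used = escalate("quantum computing", tiers, stub_grade)
print(used)  # web
```

The loop form also avoids recursion and makes the order of tiers a piece of configuration rather than something fixed inside the function.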
Key Takeaways
- CRAG assesses retrieval quality and escalates to fallback strategies (broader search, web search, synthesis) when initial retrieval is insufficient.
- Use a three-level fallback: broader search, web search, knowledge distillation. Most queries succeed at level 1–2.
- Knowledge distillation synthesizes documents when real retrieval fails, avoiding a complete failure at the cost of lower confidence.
- CRAG adds 50–150 ms latency but reduces hallucination from 8–12% to 2–4% and improves hard-question accuracy by 30–50%.
Frequently Asked Questions
What quality threshold should I use?
Start with 60–70%. Monitor user feedback: if 10% or more of answers are still unsatisfactory, raise the threshold to 75–80% so that more queries escalate. If 90% or more are excellent but latency is high, lower it to 50% so that fewer queries escalate. Calibrate quality scores on your domain: a score of 50 on medical papers means something very different from a score of 50 on social media posts.
How expensive is CRAG compared to standard RAG?
Cost scales with fallback usage. If 70% of queries succeed at level 1 (standard retrieval), 25% at level 2 (web search), and 5% at level 3 (synthesis), average cost increases ~1.5–2x. Web search costs $0.001–0.002 per call; synthesis costs $0.003–0.005.
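The added cost per query can be estimated from that mix with a short expected-value calculation. The sketch below uses the midpoints of the price ranges quoted above and makes one labeled assumption: a query that reaches synthesis has already paid for a web search. Grading cost is excluded because the answer does not quote one.

```python
def expected_extra_cost(p_web: float, p_synth: float,
                        web_cost: float, synth_cost: float) -> float:
    """Expected added cost per query from fallbacks (grading cost excluded)."""
    # Assumption: queries that end in synthesis also ran a web search first.
    return (p_web + p_synth) * web_cost + p_synth * synth_cost


# 25% of queries stop at web search, 5% continue to synthesis.
# Midpoints of the quoted ranges: $0.0015 per web call, $0.004 per synthesis.
extra = expected_extra_cost(p_web=0.25, p_synth=0.05,
                            web_cost=0.0015, synth_cost=0.004)
print(f"${extra:.5f} per query")
```

With these inputs the fallbacks add about $0.00065 per query on average; whether that amounts to 1.5x or 2x depends on the baseline cost of a standard RAG call.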
Can I use CRAG without a grading LLM?
Yes. Train a small classifier on 100–200 labeled examples (relevant/irrelevant) and use it in place of LLM grading. Classifier inference is roughly 100x faster (5 ms vs 500 ms) but adapts less well to domain shift. In production, consider a hybrid: the classifier for speed, the LLM for edge cases.
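The hybrid approach can be sketched as a router: accept the fast score when it is confidently high or low, and call the LLM only for the uncertain band in between. Everything here is an assumption for illustration: the `low` and `high` cutoffs, the function names, and especially `overlap_score`, which is a token-overlap stand-in and not a trained classifier.

```python
from typing import Callable


def hybrid_grade(query: str, doc: str,
                 fast_score: Callable[[str, str], float],
                 llm_grade: Callable[[str, str], bool],
                 low: float = 0.3, high: float = 0.7) -> tuple[bool, str]:
    """Return (is_relevant, grader_used), calling the LLM only when unsure."""
    p = fast_score(query, doc)  # estimated probability of relevance, 0-1
    if p >= high:
        return True, "classifier"
    if p <= low:
        return False, "classifier"
    return llm_grade(query, doc), "llm"


def overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def stub_llm(query: str, doc: str) -> bool:
    """Placeholder for a real LLM grading call."""
    return True


query = "microsoft q3 revenue"
print(hybrid_grade(query, "microsoft revenue grew in q3", overlap_score, stub_llm))
print(hybrid_grade(query, "weather today", overlap_score, stub_llm))
print(hybrid_grade(query, "microsoft hires staff", overlap_score, stub_llm))
```

Widening the band between `low` and `high` sends more documents to the LLM, trading speed for accuracy.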
What if multiple fallback strategies return poor results?
This indicates your knowledge base is insufficient for the query. Options: (1) Log the query and prioritize content acquisition for that topic; (2) Return an honest response: "I don't have enough information to answer this. Here's what I found..."; (3) For web-searchable queries, always offer a web search link.
How do I prevent knowledge distillation from hallucinating?
Knowledge distillation works best when you have at least one relevant document. If retrieval returns nothing at all, do not synthesize; return an error instead. In code, check whether the document list is empty and return "No information available" rather than synthesizing from nothing.
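A minimal sketch of that guard follows. The wrapper name `safe_synthesize` and the `synthesize` callable it accepts are assumptions for illustration; in the article's code the callable would be a function such as knowledge_distillation. Treating whitespace-only documents as empty is an extra precaution added here.

```python
from typing import Callable

NO_INFO = "No information available"


def safe_synthesize(query: str, documents: list[str],
                    synthesize: Callable[[str, list[str]], str]) -> str:
    """Synthesize only when at least one non-empty document exists."""
    usable = [d for d in documents if d and d.strip()]
    if not usable:
        # Nothing to ground the synthesis in: refuse instead of inventing content.
        return NO_INFO
    return synthesize(query, usable)


def fake_synthesize(query: str, docs: list[str]) -> str:
    """Placeholder for a real synthesis call."""
    return f"Synthesized from {len(docs)} document(s)"


print(safe_synthesize("Q3 earnings", [], fake_synthesize))
print(safe_synthesize("Q3 earnings", ["Partial filing excerpt"], fake_synthesize))
```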
Further Reading
- Corrective Retrieval Augmented Generation (CRAG) — Yan et al., the original CRAG paper.
- Knowledge Distillation for Dense Retrieval — techniques for generating synthetic documents.
- Evaluating Hallucinations in LLMs — methods for detecting and preventing hallucination.
- Web Search Integration for RAG — combining web search with traditional RAG.