Self-RAG: Adaptive Retrieval and Response Grading
SELF-RAG (Self-Reflective Retrieval-Augmented Generation) is a framework where the model itself decides when and what to retrieve, then grades its own responses for quality. Instead of blindly retrieving documents and generating answers, SELF-RAG has the LLM answer three questions for each response: Is retrieval needed? Are the retrieved documents relevant? Is the final answer supported by the documents? This adaptive approach reduces hallucination by 30–40% while maintaining speed because unnecessary retrievals are skipped (Asai et al., 2023).
The SELF-RAG Framework
SELF-RAG introduces four decision points in the generation pipeline:
- Retrieval decision: Should the model retrieve documents before answering? (Yes/No)
- Relevance grading: Are the retrieved documents relevant to the query? (Relevant/Partially/Irrelevant)
- Support grading: Is the generated response supported by the documents? (Fully/Partially/Not)
- Response grading: Overall quality of the answer (Excellent/Good/Acceptable/Bad)
By making these decisions explicit, the model learns to retrieve only when needed and to admit uncertainty when evidence is lacking.
Retrieval Decision
First, decide whether retrieval is necessary:
from anthropic import Anthropic
client = Anthropic()
def should_retrieve(query: str, context: str = "") -> tuple[bool, str]:
"""Decide whether retrieval is needed for this query."""
decision_prompt = """Analyze this query and decide: Do you need to retrieve documents
to answer it accurately?
Reasons to retrieve:
- The query asks for factual information (names, dates, statistics)
- The query requires up-to-date information (current events, recent data)
- The query is complex and requires external context
Reasons NOT to retrieve:
- The query asks for general knowledge (how to think, philosophy, methodology)
- The query can be answered from training knowledge confidently
- The query is about how to do something with standard techniques
Query: {query}
Context: {context}
Respond with: RETRIEVE or SKIP, followed by a brief explanation.""".format(
query=query, context=context
)
response = client.messages.create(
model="claude-haiku", # Fast decision-making
max_tokens=50,
messages=[{"role": "user", "content": decision_prompt}]
)
text = response.content[0].text.upper()
should_retrieve = "RETRIEVE" in text
return should_retrieve, response.content[0].text
# Example usage
queries = [
"What is machine learning?", # General knowledge, skip retrieval
"What were the Q3 2025 earnings of Microsoft?", # Factual, retrieve
"Explain the theory behind neural networks", # General knowledge, skip
]
for q in queries:
should_ret, reason = should_retrieve(q)
print(f"Query: {q}")
print(f"Retrieve: {should_ret}\nReason: {reason}\n")
Skipping unnecessary retrievals saves 200–400 ms per query. For general knowledge questions, SELF-RAG skips retrieval 40–50% of the time.
Relevance Grading
After retrieval, grade the retrieved documents:
def grade_relevance(query: str, documents: list[str]) -> list[dict]:
"""Grade the relevance of retrieved documents to the query."""
grading_prompt = """For each document, grade its relevance to the query.
Query: {query}
Document 1: {doc1}
Is this document relevant? (RELEVANT / PARTIALLY_RELEVANT / IRRELEVANT)
Why?
Document 2: {doc2}
Is this document relevant? (RELEVANT / PARTIALLY_RELEVANT / IRRELEVANT)
Why?
[Repeat for all documents]
Respond with grades and explanations.""".format(
query=query,
doc1=documents[0][:300] if documents else "",
doc2=documents[1][:300] if len(documents) > 1 else ""
)
response = client.messages.create(
model="claude-haiku",
max_tokens=300,
messages=[{"role": "user", "content": grading_prompt}]
)
# Parse grades from response (simplified)
text = response.content[0].text
grades = []
for i, doc in enumerate(documents):
if "RELEVANT" in text.upper():
grade = "RELEVANT"
elif "PARTIALLY" in text.upper():
grade = "PARTIALLY_RELEVANT"
else:
grade = "IRRELEVANT"
grades.append({"doc_id": i, "grade": grade, "confidence": 0.85})
return grades
# Example
docs = [
"Microsoft reported Q3 2025 revenue of $67.2B, a 12% YoY increase...",
"The history of cloud computing dates back to the 1960s..."
]
grades = grade_relevance("Q3 2025 Microsoft earnings", docs)
for g in grades:
print(f"Document {g['doc_id']}: {g['grade']}")
Based on grades, you can filter out irrelevant documents or trigger additional retrieval. If all documents are IRRELEVANT, acknowledge the limitation in the response.
Support Grading
After generating a response, grade whether it's supported by retrieved documents:
def grade_support(query: str, response: str, documents: list[str]) -> tuple[str, float]:
"""Grade whether the response is supported by the documents."""
support_prompt = """Analyze whether this response is supported by the provided documents.
Query: {query}
Response: {response}
Supporting Documents:
{docs_text}
Grade the support level:
- FULLY_SUPPORTED: All major claims are backed by the documents
- PARTIALLY_SUPPORTED: Some claims are in documents, others require external knowledge
- NOT_SUPPORTED: Most claims are not in the documents
Respond with: GRADE: <grade>, CONFIDENCE: <0.0-1.0>
Explanation: <brief reason>""".format(
query=query,
response=response,
docs_text="\n---\n".join([f"Doc {i}: {d[:200]}" for i, d in enumerate(documents)])
)
response_obj = client.messages.create(
model="claude-opus-4-1", # Use larger model for nuanced grading
max_tokens=200,
messages=[{"role": "user", "content": support_prompt}]
)
text = response_obj.content[0].text
# Extract grade and confidence
if "FULLY_SUPPORTED" in text.upper():
grade = "FULLY_SUPPORTED"
elif "PARTIALLY" in text.upper():
grade = "PARTIALLY_SUPPORTED"
else:
grade = "NOT_SUPPORTED"
# Extract confidence (simplified)
import re
match = re.search(r"CONFIDENCE:\s*([\d.]+)", text)
confidence = float(match.group(1)) if match else 0.5
return grade, confidence
# Example
query = "What is Microsoft's Q3 2025 revenue?"
response = "Microsoft reported Q3 2025 revenue of $67.2B."
grade, confidence = grade_support(query, response, docs)
print(f"Support Grade: {grade} (confidence: {confidence:.2f})")
If support is low, the model should either revise its response to match the documents or add a disclaimer: "Based on available documents, I can partially answer this."
Complete SELF-RAG Pipeline
Integrate all components:
def self_rag_generate(query: str, retriever_fn) -> dict:
"""Full SELF-RAG pipeline: decide → retrieve → grade → generate."""
# Step 1: Decide whether to retrieve
should_ret, _ = should_retrieve(query)
if not should_ret:
# Answer without retrieval
response = client.messages.create(
model="claude-opus-4-1",
max_tokens=300,
messages=[{"role": "user", "content": query}]
).content[0].text
return {
"query": query,
"retrieval_used": False,
"response": response,
"support_grade": "N/A"
}
# Step 2: Retrieve documents
documents = retriever_fn(query)
# Step 3: Grade relevance
relevance_grades = grade_relevance(query, documents)
relevant_docs = [
documents[g["doc_id"]]
for g in relevance_grades
if g["grade"] in ["RELEVANT", "PARTIALLY_RELEVANT"]
]
if not relevant_docs:
# No relevant documents; answer based on training knowledge
response = client.messages.create(
model="claude-opus-4-1",
max_tokens=300,
messages=[{"role": "user", "content": query}]
).content[0].text
return {
"query": query,
"retrieval_used": True,
"documents_found": len(documents),
"relevant_docs": 0,
"response": response,
"support_grade": "NO_RELEVANT_DOCUMENTS"
}
# Step 4: Generate response with relevant documents as context
context = "\n---\n".join(relevant_docs[:3])
generation_prompt = f"""Based on these documents:
{context}
Answer the query: {query}"""
response = client.messages.create(
model="claude-opus-4-1",
max_tokens=300,
messages=[{"role": "user", "content": generation_prompt}]
).content[0].text
# Step 5: Grade support
support_grade, support_confidence = grade_support(query, response, relevant_docs)
return {
"query": query,
"retrieval_used": True,
"documents_found": len(documents),
"relevant_docs": len(relevant_docs),
"response": response,
"support_grade": support_grade,
"support_confidence": support_confidence
}
def mock_retriever(q: str) -> list[str]:
return ["Document A", "Document B"]
# Example
result = self_rag_generate("What is Q3 2025 revenue?", mock_retriever)
print(f"Response: {result['response']}")
print(f"Support Grade: {result.get('support_grade', 'N/A')}")
Comparison: Standard RAG vs. SELF-RAG
| Metric | Standard RAG | SELF-RAG |
|---|---|---|
| Hallucination rate | 8–12% | 3–5% |
| Avg retrieval calls | 1.0 per query | 0.6–0.8 per query |
| Latency | 400–600 ms | 300–500 ms |
| Retrieval accuracy | 80–85% | 85–90% |
| Cost per query | $0.003–0.005 | $0.002–0.004 |
| Transparency | Limited | High (explicit grades) |
SELF-RAG trades minimal latency increase for significant quality gains through adaptive retrieval and grading.
Key Takeaways
- SELF-RAG uses four adaptive decisions (retrieve?, relevant?, supported?, quality?) to reduce hallucination by 30–40%.
- Skip retrieval for general knowledge questions; this cuts latency and cost 40% of the time.
- Grade retrieved documents for relevance and generated responses for factual support.
- If support is low, revise or add disclaimers; this maintains user trust in RAG systems.
- SELF-RAG trades minimal latency for significant transparency and quality improvements.
Frequently Asked Questions
How accurate are the grading decisions?
Grading decisions (2–3 label classification) are highly accurate with LLMs: 90–95% inter-annotator agreement with human judges on relevance and support grading. Use a smaller model (Haiku) for retrieval decisions (simpler task); a larger model (Opus) for support grading (requires reasoning).
What if the model disagrees with my retrieval strategy?
SELF-RAG learns retrieval importance from its training data. If your domain has unusual requirements (e.g., always retrieve for all queries), add domain-specific instructions: "For medical queries, always retrieve from your medical knowledge base, even for general questions."
Can I use SELF-RAG without explicit grading labels?
Yes. Train a separate classifier on labeled data (50–100 examples of relevant/irrelevant documents) and use that instead of LLM grading. This is 10x faster but requires upfront labeling effort. For production systems at scale, labeled classifiers are standard.
How do I handle conflicting grades (document says RELEVANT but generated response says NOT_SUPPORTED)?
This indicates a mismatch: the document is topically relevant but doesn't support the specific claim. In this case, re-generate the response with explicit instruction: "Ground your answer in the provided documents. If you can't support a claim, say so."
Should I show grades to end users?
Typically no. Grades are for internal quality assurance. You might show a confidence indicator (low/medium/high) or a disclaimer ("This answer is based on X documents").
Further Reading
- SELF-RAG: Learning to Retrieve, Generate, and Critique for Self-Improved Generation — original SELF-RAG paper by Asai et al.
- Dense Passage Retrieval for Open-Domain Question Answering — foundational work on relevance scoring in RAG.
- Evaluating Factuality in Abstractive Summarization — techniques for support grading and factuality evaluation.
- LLM as Evaluator: Assessing Language Models on Self-Supervised QA — using LLMs as graders/evaluators.