RAG Hallucination Detection: How to Identify False Content

Hallucination in RAG occurs when a generated answer contains information that the retrieved passages do not support. Unlike standalone LLM hallucinations, RAG hallucinations are measurable: you can check whether each claim in the answer is supported by the sources. Detecting them is critical for production safety: users rely on RAG systems because they expect grounded answers, so a hallucinated claim is a broken promise.

Hallucinations manifest in several forms. Intrinsic hallucinations contradict the retrieved passages (the answer says X, a passage says not-X). Extrinsic hallucinations add information the passages do not cover (the claim may even be factually correct, but it is unsupported). Factual errors against world knowledge are a separate issue: they concern whether the answer is true, not whether it is faithful to what was retrieved. For RAG evaluation, you focus on intrinsic and extrinsic hallucinations, not world-truth factuality.
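One way to make this taxonomy operational is to judge each answer claim against every passage and map the verdicts to a category. The sketch below assumes a hypothetical upstream judge that emits one of three labels per passage (`entailment`, `contradiction`, `neutral`). The label names, `ClaimStatus`, and `classify_claim` are illustrative and not taken from any particular library.

```python
from enum import Enum
from typing import List

class ClaimStatus(Enum):
    SUPPORTED = "supported"
    INTRINSIC = "intrinsic"   # contradicted by at least one passage
    EXTRINSIC = "extrinsic"   # not covered by any passage

def classify_claim(verdicts: List[str]) -> ClaimStatus:
    """Map per-passage verdicts for ONE claim to a hallucination category.

    Each verdict is assumed to be "entailment", "contradiction", or
    "neutral" (output of a hypothetical judge, one verdict per passage).
    """
    # Assumption: a contradiction outranks support from another passage,
    # because conflicting evidence deserves attention either way.
    if "contradiction" in verdicts:
        return ClaimStatus.INTRINSIC
    if "entailment" in verdicts:
        return ClaimStatus.SUPPORTED
    return ClaimStatus.EXTRINSIC

print(classify_claim(["neutral", "entailment"]))
print(classify_claim(["neutral", "neutral"]))
```

Letting a contradiction take precedence is a design choice, not a rule; a system that trusts some sources more than others might weight the verdicts instead.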

Automatic Hallucination Detection

Simple automatic methods compare passages and answer at the token or semantic level. Intrinsic hallucinations (contradictions) are easier to detect than extrinsic ones (novel facts).

from typing import List, Tuple, Set

def detect_intrinsic_hallucinations_lexical(answer: str, 
                                           passages: List[str]) -> List[Tuple[str, str]]:
    """
    Detect intrinsic hallucinations via a negation mismatch.
    
    Illustrative heuristic: flags the answer when the passages say
    "does not cause" but the answer says "causes". It checks only this one
    hard-coded phrase pair; production systems use semantic entailment.
    
    Args:
        answer: Generated answer.
        passages: Retrieved passages.
    
    Returns:
        List of (claim, explanation) tuples for flagged hallucinations.
    """
    hallucinations = []
    
    answer_lower = answer.lower()
    passages_text = " ".join(passages).lower()
    
    # Naive heuristic: if answer mentions X but passages say "not X"
    # This is very basic and misses synonyms, passive voice, etc.
    
    # Example: check for contradiction about a specific entity
    if "does not cause" in passages_text and "causes" in answer_lower:
        # Potential contradiction detected
        hallucinations.append((
            answer, 
            "Answer may contradict passage (says 'causes', passage says 'does not cause')"
        ))
    
    return hallucinations

import re

def detect_extrinsic_hallucinations_semantic(answer: str,
                                            passages: List[str],
                                            coverage_threshold: float = 0.7) -> Tuple[float, List[str]]:
    """
    Detect extrinsic hallucinations: claims not covered by passages.
    Uses token-level coverage as a lexical proxy (despite the function
    name, no semantic model is involved).
    
    Args:
        answer: Generated answer.
        passages: Retrieved passages.
        coverage_threshold: Fraction of answer words that should appear in passages.
    
    Returns:
        (coverage_score, unsupported_sentences) tuple.
    """
    def tokenize(text: str) -> Set[str]:
        # Lowercased word tokens; punctuation is dropped so "risk." matches "risk"
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    
    passage_words = tokenize(" ".join(passages))
    
    # Filter out stop words and very short tokens
    stop_words = {"the", "a", "is", "are", "was", "were", "be", "at", "to", "of", "in", "on"}
    content_words = {w for w in tokenize(answer) if w not in stop_words and len(w) > 3}
    
    if not content_words:
        return 1.0, []
    
    # Fraction of content words that appear as whole words in the passages
    covered_words = sum(1 for w in content_words if w in passage_words)
    coverage_score = covered_words / len(content_words)
    
    # If coverage is low, answer may contain extrinsic hallucinations
    unsupported_sentences = []
    if coverage_score < coverage_threshold:
        # Simple heuristic: sentences with low word overlap are unsupported
        for sentence in re.split(r"(?<=[.!?])\s+", answer):
            sentence_words = tokenize(sentence)
            if not sentence_words:
                continue  # skip empty fragments instead of flagging them
            sentence_coverage = (
                sum(1 for w in sentence_words if w in passage_words)
                / len(sentence_words)
            )
            if sentence_coverage < 0.3:
                unsupported_sentences.append(sentence.strip())
    
    return coverage_score, unsupported_sentences

# Example
answer = "Metformin is used for Type 2 diabetes and reduces cancer risk significantly."
passages = [
    "Metformin is the first-line medication for Type 2 diabetes.",
    "Studies show metformin may have cancer-protective effects, though more research is needed."
]

coverage, unsupported = detect_extrinsic_hallucinations_semantic(answer, passages)
print(f"Coverage: {coverage:.2f}")
print(f"Potentially unsupported: {unsupported}")

LLM-Based Hallucination Detection

For more reliable detection, use an LLM to verify claims. Provide the answer and passages, then ask the model to identify unsupported or contradicted claims.

import json
from typing import Dict, List

def llm_hallucination_detection(answer: str,
                               passages: List[str],
                               model_api_call) -> Dict:
    """
    Use an LLM to detect intrinsic and extrinsic hallucinations.
    
    Args:
        answer: Generated answer to check.
        passages: Retrieved passages.
        model_api_call: Function calling your LLM.
    
    Returns:
        Dict with 'hallucination_score' (0–1, higher = more hallucinated),
        'hallucinations' (list of detected issues), 'explanation' (str).
    """
    
    prompt = f"""
Analyze the following answer for hallucinations (claims unsupported or contradicted by passages).

Passages:
{chr(10).join(f"- {p}" for p in passages)}

Answer to verify:
{answer}

Identify all hallucinations. A hallucination is a claim that:
1. Contradicts the passages (intrinsic)
2. Is not mentioned or implied by the passages (extrinsic)

Respond with JSON:
{{
  "hallucination_score": <0.0–1.0, fraction of answer that is hallucinated>,
  "hallucinations": [
    {{"claim": "...", "type": "intrinsic|extrinsic", "explanation": "..."}}
  ],
  "explanation": "Overall assessment"
}}
"""
    
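    response = model_api_call(prompt)
    try:
        result = json.loads(response)
    except json.JSONDecodeError as exc:
        # Models sometimes wrap the JSON in prose or code fences; fail loudly
        # rather than silently treating the answer as clean.
        raise ValueError(f"Model did not return valid JSON: {response!r}") from exc
    
    return result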

# Example (pseudo-code)
# result = llm_hallucination_detection(answer, passages, my_llm)
# print(f"Hallucination score: {result['hallucination_score']:.2f}")
# for h in result['hallucinations']:
#     print(f"  - {h['claim']} ({h['type']})")
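Model output is not always a bare JSON object, so a tolerant parser is often placed between the model call and the rest of the pipeline. The sketch below is one possible approach, not a definitive implementation: `parse_judge_response` is a hypothetical helper that falls back to the outermost `{...}` span, requires the `hallucination_score` field from the prompt above, and clamps it to the 0–1 range.

```python
import json

def parse_judge_response(response: str) -> dict:
    """Parse the judge's JSON object, tolerating surrounding prose."""
    try:
        result = json.loads(response)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the raw text
        start, end = response.find("{"), response.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found in model response")
        result = json.loads(response[start:end + 1])

    if not isinstance(result, dict) or "hallucination_score" not in result:
        raise ValueError("Response lacks a 'hallucination_score' field")

    # Clamp the score into [0, 1] and default the optional fields
    score = float(result["hallucination_score"])
    result["hallucination_score"] = min(1.0, max(0.0, score))
    result.setdefault("hallucinations", [])
    result.setdefault("explanation", "")
    return result

raw = 'Sure, here it is:\n{"hallucination_score": 1.4}\nDone.'
print(parse_judge_response(raw))
```

A missing score raises an error instead of defaulting to 0.0, so a malformed response is never mistaken for a clean answer.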

Hallucination Monitoring in Production

Deploy hallucination detection as a post-generation step. Flag high-hallucination answers and route them to human review or regeneration. Use aggregate hallucination metrics to detect degradation over time.

from dataclasses import dataclass
import statistics

@dataclass
class HallucinationMetrics:
    """Aggregate hallucination statistics."""
    mean_score: float  # Average hallucination score across examples
    pct_flagged: float  # Fraction (0–1) of answers whose score exceeds the threshold
    max_score: float   # Worst hallucination in batch
    trend: str         # "stable", "worsening", "improving"

def compute_hallucination_metrics(hallucination_scores: List[float],
                                 threshold: float = 0.3) -> HallucinationMetrics:
    """
    Compute aggregate metrics from per-query hallucination scores.
    
    Args:
        hallucination_scores: List of hallucination scores (0–1) per answer.
        threshold: Score above which an answer is flagged.
    
    Returns:
        HallucinationMetrics object.
    """
    
    if not hallucination_scores:
        return HallucinationMetrics(0.0, 0.0, 0.0, "stable")
    
    mean_score = statistics.mean(hallucination_scores)
    pct_flagged = sum(1 for s in hallucination_scores if s > threshold) / len(hallucination_scores)
    max_score = max(hallucination_scores)
    
    # Trend detection (would compare to historical baseline)
    trend = "stable"  # Placeholder
    
    return HallucinationMetrics(mean_score, pct_flagged, max_score, trend)

# Example
scores = [0.1, 0.05, 0.3, 0.4, 0.15, 0.8]  # Hallucination scores for 6 answers
metrics = compute_hallucination_metrics(scores, threshold=0.3)
print(f"Mean hallucination: {metrics.mean_score:.2f}")
print(f"Flagged: {metrics.pct_flagged:.1%}")
print(f"Max: {metrics.max_score:.2f}")
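The `trend` field above is left as a placeholder. A minimal sketch of how it could be filled in, assuming you keep a window of historical scores as a baseline: compare the mean of the current window with the baseline mean and report a change only when the difference exceeds a tolerance. The `detect_trend` name, the `tolerance` default of 0.05, and the baseline numbers are illustrative assumptions.

```python
import statistics
from typing import List

def detect_trend(current_scores: List[float],
                 baseline_scores: List[float],
                 tolerance: float = 0.05) -> str:
    """Label the current window relative to a historical baseline."""
    if not current_scores or not baseline_scores:
        return "stable"  # not enough data to claim a change

    delta = statistics.mean(current_scores) - statistics.mean(baseline_scores)
    if delta > tolerance:
        return "worsening"
    if delta < -tolerance:
        return "improving"
    return "stable"

baseline = [0.10, 0.12, 0.08, 0.15]          # made-up historical scores
current = [0.1, 0.05, 0.3, 0.4, 0.15, 0.8]   # scores from the example above
print(detect_trend(current, baseline))
```

Comparing means is sensitive to outliers such as the single 0.8 score; a median or a flagged-rate comparison would be a reasonable alternative.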

Combining Hallucination Detection with Grounding

Use hallucination detection in concert with citation enforcement: if you require answers to cite sources and check those citations against retrieved passages, you catch hallucinations before they reach users.

def hallucination_check_with_citations(answer: str,
                                       citations: List[str],
                                       passages: List[str]) -> Tuple[bool, str]:
    """
    Check that each citation the answer claims is a genuine excerpt of the retrieved passages.
    
    Args:
        answer: Generated answer.
        citations: List of cited passage excerpts.
        passages: Full retrieved passages.
    
    Returns:
        (is_grounded, reason) tuple.
    """
    
    # Verify each citation appears in passages
    for citation in citations:
        found = any(citation.lower() in passage.lower() for passage in passages)
        if not found:
            return False, f"Citation not found in passages: {citation}"
    
    # NOTE: this sketch only verifies that the citations are genuine excerpts.
    # It does not check that the citations support the answer's claims;
    # production systems add an entailment check at this point.
    
    return True, "All citations found in retrieved passages"

# Example
answer = "Metformin is used for Type 2 diabetes and reduces cancer risk."
citations = [
    "Metformin is the first-line medication for Type 2 diabetes.",
    "Studies suggest metformin may have cancer-protective effects."
]
passages = citations  # Simplified; real system retrieves full passages

grounded, reason = hallucination_check_with_citations(answer, citations, passages)
print(f"Grounded: {grounded}, Reason: {reason}")
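Confirming that citations exist in the passages is only half of grounding; the other half is checking that the cited text supports each claim. The sketch below shows one way to structure that step, assuming a pluggable `supports(claim, evidence)` callable. In practice that callable would be an entailment model or an LLM judge; the `overlap_supports` default is a crude content-word overlap included only so the example runs, and all names here are hypothetical.

```python
import re
from typing import Callable, List, Set

def _content_words(text: str) -> Set[str]:
    # Lowercased word tokens longer than three characters
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def overlap_supports(claim: str, evidence: str, min_overlap: float = 0.5) -> bool:
    """Stand-in support check: share of claim words found in the evidence."""
    claim_words = _content_words(claim)
    if not claim_words:
        return True  # nothing substantive to verify
    shared = claim_words & _content_words(evidence)
    return len(shared) / len(claim_words) >= min_overlap

def unsupported_claims(answer: str,
                       citations: List[str],
                       supports: Callable[[str, str], bool] = overlap_supports) -> List[str]:
    """Return the answer sentences that the cited text does not support."""
    evidence = " ".join(citations)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not supports(s, evidence)]

demo_answer = "Metformin is used for Type 2 diabetes. It also cures baldness overnight."
demo_citations = ["Metformin is the first-line medication for Type 2 diabetes."]
print(unsupported_claims(demo_answer, demo_citations))
```

Word overlap cannot tell "may have protective effects" from "reduces risk significantly", which is why the support check is a parameter rather than hard-coded.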

Key Takeaways

Hallucinations in RAG are measurable: intrinsic (contradictions) and extrinsic (unsupported claims).
Automatic detection via token overlap or coverage is fast but misses subtle hallucinations; semantic and LLM-based methods are more accurate.
Combine hallucination detection with citation enforcement to catch errors early.
Monitor hallucination metrics in production; alert when hallucination rate or maximum score exceeds thresholds.
For critical applications, route high-hallucination answers to human review before serving.

Frequently Asked Questions

How do I distinguish between a hallucination and a knowledge gap?

Both arise when the retrieved passages do not contain what the question needs, but the outcomes differ. A knowledge gap handled well is an honest "I don't know" (the answer should say "I cannot find information about X"). A hallucination is an unsupported claim presented as fact. Good RAG systems should defer when context relevance is low rather than hallucinate.

Can I fully eliminate hallucinations with RAG?

No. RAG reduces hallucinations about facts covered by the corpus, but it cannot prevent them for knowledge outside the corpus: if the corpus lacks the information, the generator may fabricate it. Even so, RAG typically produces far fewer hallucinations than a closed-book LLM answering without retrieval.

What is a reasonable hallucination detection threshold in production?

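There is no universal threshold; calibrate against human-labeled examples from your own traffic. As a starting point, flag and review any answer with a hallucination score above 0.3 (roughly 30% of the answer estimated as hallucinated) and treat scores below 0.1 as safe. Between 0.1 and 0.3, monitor trends; if the mean score rises above 0.2, investigate your retriever or generator.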
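The three bands described here can be expressed as a small routing function. This is a sketch using the thresholds quoted above as defaults; the `triage` name and the returned labels are illustrative assumptions.

```python
def triage(score: float,
           review_threshold: float = 0.3,
           safe_threshold: float = 0.1) -> str:
    """Map a hallucination score (0-1) to a handling decision."""
    if score > review_threshold:
        return "review"             # flag for human review or regeneration
    if score < safe_threshold:
        return "serve"              # treated as safe
    return "serve_and_monitor"      # middle band: serve, but track the trend

for s in (0.05, 0.2, 0.3, 0.45):
    print(s, triage(s))
```

A score of exactly 0.3 lands in the middle band because the flagging rule is "above 0.3", matching the strict comparison in `compute_hallucination_metrics`.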

Should I retry generation if hallucination is detected?

Yes, for production systems. If detection flags an answer, retry generation with the same query and retrieved passages, or trigger reranking to surface more relevant passages. This often resolves extrinsic hallucinations without a full retrieval re-run. Cap the number of retries, and route the answer to human review if it is still flagged.
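A bounded retry loop is one way to implement this. The sketch below assumes two caller-supplied callables, `generate(query, passages)` and `score_answer(answer, passages)`; both are hypothetical placeholders for your generator and your hallucination detector, and the stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple

def generate_with_retry(query: str,
                        passages: List[str],
                        generate: Callable[[str, List[str]], str],
                        score_answer: Callable[[str, List[str]], float],
                        threshold: float = 0.3,
                        max_attempts: int = 3) -> Tuple[Optional[str], float, int]:
    """Regenerate until the score is at or below threshold or attempts run out.

    Returns (answer, score, attempts_used). If every attempt stays above the
    threshold, the lowest-scoring answer is returned so the caller can
    escalate it to human review.
    """
    best_answer, best_score = None, float("inf")
    for attempt in range(1, max_attempts + 1):
        answer = generate(query, passages)
        score = score_answer(answer, passages)
        if score < best_score:
            best_answer, best_score = answer, score
        if score <= threshold:
            return answer, score, attempt
    return best_answer, best_score, max_attempts

# Stubs for illustration only: the first draft hallucinates, the second does not.
drafts = iter(["Metformin cures cancer.",
               "Metformin is the first-line medication for Type 2 diabetes."])
fake_generate = lambda query, passages: next(drafts)
fake_score = lambda answer, passages: 0.9 if "cures" in answer else 0.0

print(generate_with_retry("What is metformin used for?", [], fake_generate, fake_score))
```

Retrying with identical inputs only helps when generation is non-deterministic (for example, sampling with a non-zero temperature); otherwise change the prompt or rerank the passages between attempts.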
