What Is Faithfulness Scoring in RAG?
Faithfulness scoring measures whether a RAG-generated answer is grounded in the retrieved documents. An answer is faithful if every factual claim in it is supported by at least one retrieved passage. Faithfulness is the core property RAG systems aim for: grounding answers in retrieved text reduces hallucinations, though it does not eliminate them. Measuring faithfulness is non-trivial, however: a claim might be factually correct yet unsupported by the retrieved documents (and so count as a hallucination with respect to the sources), or it might be grounded but phrased so loosely that its support is hard to verify.
Faithfulness differs from correctness. An answer can be faithful to the retrieved documents but incorrect if the documents themselves are wrong. Conversely, an answer can be correct but unfaithful if it states facts that the retrieved passages do not support, or even contradict. For production systems, faithfulness is the primary concern: if you ground answers in sources, you delegate correctness to those sources.
The Faithfulness Problem in RAG
The core challenge is decomposing an answer into atomic claims and checking each against retrieved passages. Consider this answer: "Aspirin is a nonsteroidal anti-inflammatory drug used to treat headaches and reduce cardiovascular risk." This contains three claims: (1) aspirin is an NSAID, (2) it treats headaches, and (3) it reduces cardiovascular risk. If retrieved documents support (1) and (2) but only vaguely mention (3), the answer is partially faithful.
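The per-claim view can be made concrete with a small sketch. It assumes the decomposition and the support judgments have already been made by hand, and only shows how the three aspirin claims roll up into one score. The `Claim` dataclass and `faithfulness_score` helper are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    """One atomic claim plus a hand-assigned support judgment (hypothetical type)."""
    text: str
    supported: bool


def faithfulness_score(claims: List[Claim]) -> float:
    """Fraction of claims supported by the retrieved passages."""
    if not claims:
        return 1.0  # no claims means nothing can be unsupported
    return sum(c.supported for c in claims) / len(claims)


# Hand-labelled decomposition of the aspirin answer. Claim 3 is only
# vaguely mentioned in the passages, so it is treated as unsupported here.
claims = [
    Claim("Aspirin is an NSAID", supported=True),
    Claim("Aspirin treats headaches", supported=True),
    Claim("Aspirin reduces cardiovascular risk", supported=False),
]

score = faithfulness_score(claims)
unsupported = [c.text for c in claims if not c.supported]
print(f"Faithfulness: {score:.2f}")  # 2 of 3 claims supported -> 0.67
print("Unsupported:", unsupported)
```

Whether a vaguely supported claim counts as supported is a policy decision; a stricter or looser rule changes the score without changing the arithmetic.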
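Manual evaluation is expensive. Automatic evaluation typically uses one of two approaches: token overlap (do the passages contain words from the answer?) or semantic matching (are the answer's claims semantically similar to, or entailed by, the passages?). Both are imperfect. Token overlap wrongly flags supported claims when a passage paraphrases them in different words, and it passes unsupported claims that happen to reuse passage vocabulary. Semantic matching requires careful model choice and threshold tuning, and similarity-based models in particular can be fooled by negation (a passage saying "aspirin does NOT reduce risk" is still highly similar to "aspirin reduces risk").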
Automatic Faithfulness: Token Overlap and Similarity
The simplest approach is token overlap. Extract content words from the answer, then check whether the passages contain those words. This is fast but catches only obvious hallucinations.
```python
import re
from typing import List, Set


def extract_claims_simple(answer: str) -> Set[str]:
    """
    Extract content words from the answer as a crude proxy for claims.
    Simple heuristic; production systems use NLP parsers.
    """
    # Tokenize on letters so words with attached punctuation
    # ("headaches.", "drug,") are kept rather than silently dropped
    words = re.findall(r"[a-z]+", answer.lower())
    # Keep words longer than 4 chars (rough filter for content words)
    return {w for w in words if len(w) > 4}


def token_overlap_faithfulness(answer: str,
                               passages: List[str]) -> float:
    """
    Compute faithfulness as fraction of answer claims in passages.

    Args:
        answer: Generated answer text.
        passages: List of retrieved passages (full text).

    Returns:
        Overlap score 0.0–1.0. Higher = more faithful.
    """
    claims = extract_claims_simple(answer)
    if not claims:
        return 1.0  # No claims = trivially faithful
    # Match whole tokens, not substrings of the concatenated text
    passage_tokens = set(re.findall(r"[a-z]+", " ".join(passages).lower()))
    supported_claims = sum(1 for claim in claims
                           if claim in passage_tokens)
    return supported_claims / len(claims)


# Example
answer = "Aspirin is a nonsteroidal anti-inflammatory drug used for headaches."
passages = [
    "Aspirin, a nonsteroidal anti-inflammatory drug, is commonly used to treat pain and reduce fever.",
    "Doctors recommend aspirin for patients with cardiovascular disease."
]
score = token_overlap_faithfulness(answer, passages)
# Prints 0.75: "headaches" is the only content word missing from the passages
print(f"Token overlap faithfulness: {score:.2f}")
```
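The two failure modes of token overlap mentioned earlier can be seen in a small self-contained sketch. The `content_words` and `overlap` helpers are illustrative stand-ins written for this example, and the sentences are made up to trigger each failure.

```python
import re
from typing import List, Set


def content_words(text: str) -> Set[str]:
    """Lowercased alphabetic tokens longer than 4 characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 4}


def overlap(answer: str, passages: List[str]) -> float:
    """Fraction of the answer's content words that appear in the passages."""
    words = content_words(answer)
    if not words:
        return 1.0
    passage_words = content_words(" ".join(passages))
    return len(words & passage_words) / len(words)


# Failure 1: a supported paraphrase scores low, because "relieves" and
# "headaches" never appear verbatim in the passage.
paraphrase_score = overlap(
    "Aspirin relieves headaches.",
    ["Aspirin is effective at easing head pain."],
)

# Failure 2: a contradicted answer scores perfectly, because the word
# "not" is too short to count and every content word is shared.
negation_score = overlap(
    "Aspirin can increase cancer risk.",
    ["Aspirin does not increase cancer risk."],
)

print(f"Paraphrase (supported):   {paraphrase_score:.2f}")  # 0.33
print(f"Negation (contradicted):  {negation_score:.2f}")    # 1.00
```

The supported answer scores lower than the contradicted one, which is why overlap is best treated as a cheap first-pass signal rather than a verdict.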
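Better approaches check entailment rather than word overlap. Ideally, you decompose the answer into atomic claims (for example, subject-predicate-object triplets) and check whether each one is entailed by the retrieved documents, using a cross-encoder trained on natural language inference (NLI). The function below takes a shortcut and scores the whole answer as a single claim.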
```python
from typing import List

from sentence_transformers import CrossEncoder


def semantic_faithfulness(answer: str,
                          passages: List[str],
                          model_name: str = "cross-encoder/nli-deberta-v3-base") -> float:
    """
    Compute faithfulness using semantic entailment scoring.

    Args:
        answer: Generated answer.
        passages: Retrieved passages.
        model_name: NLI cross-encoder model for entailment.

    Returns:
        Faithfulness score 0.0–1.0 (probability of entailment).
    """
    # An NLI-trained model is needed here. Rerankers trained on MS MARCO
    # (such as the mmarco models) score relevance, not entailment.
    model = CrossEncoder(model_name)
    # Treat the answer as a hypothesis to be entailed by the passages
    passage_text = " ".join(passages)
    # Score how strongly the passage text entails the answer
    # (This is a simplified proxy; better implementations decompose
    # the answer into sub-claims first.)
    probs = model.predict([(passage_text, answer)], apply_softmax=True)[0]
    # Assumed label order, taken from the model card:
    # contradiction, entailment, neutral. Verify it for any other model.
    entailment_index = 1
    return float(probs[entailment_index])


# Example usage (requires HuggingFace model download)
# score = semantic_faithfulness(answer, passages)
# print(f"Semantic faithfulness: {score:.3f}")
```
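The sub-claim decomposition mentioned in the code comments can be sketched as follows. This is a minimal illustration under two loud assumptions: claims are approximated by sentences, and the entailment scorer is passed in as a plain function (`entail_fn`, a hypothetical parameter) so that a toy stand-in can replace a real NLI model. None of these names come from a library.

```python
import re
from typing import Callable, List

# (premise, hypothesis) -> probability that the premise entails the hypothesis
EntailFn = Callable[[str, str], float]


def split_into_claims(answer: str) -> List[str]:
    """Naive decomposition: one claim per sentence (an assumption of this sketch)."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def claim_level_faithfulness(answer: str,
                             passages: List[str],
                             entail_fn: EntailFn,
                             threshold: float = 0.5) -> float:
    """Fraction of claims entailed by at least one passage."""
    claims = split_into_claims(answer)
    if not claims:
        return 1.0  # nothing to support
    if not passages:
        return 0.0  # claims exist but there is nothing to support them
    supported = 0
    for claim in claims:
        best = max(entail_fn(passage, claim) for passage in passages)
        if best >= threshold:
            supported += 1
    return supported / len(claims)


def toy_entail(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: exact containment only."""
    return 1.0 if hypothesis.lower().rstrip(".") in premise.lower() else 0.0


example_passages = ["Aspirin is an NSAID. It is commonly used to treat pain."]
example_answer = "Aspirin is an NSAID. Aspirin cures cancer."
claim_score = claim_level_faithfulness(example_answer, example_passages, toy_entail)
print(f"Claim-level faithfulness: {claim_score:.2f}")  # 1 of 2 claims -> 0.50
```

In practice `toy_entail` would be replaced by a wrapper around an NLI cross-encoder, and the 0.5 threshold would be tuned on labelled examples.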
LLM-Based Faithfulness Scoring
Modern approaches use a language model as the judge. You provide the answer and passages to a strong LLM (Claude, GPT-4) and ask it to judge whether each claim is supported. This is generally more accurate than the heuristic metrics above but requires API calls, which makes it slower and more expensive.
```python
import json
from typing import Callable, Dict, List


def llm_faithfulness_evaluation(answer: str,
                                passages: List[str],
                                model_api_call: Callable[[str], str]) -> Dict:
    """
    Use an LLM to evaluate faithfulness.

    Args:
        answer: Generated answer to evaluate.
        passages: Retrieved passages.
        model_api_call: Function that sends a prompt to your LLM
            (e.g., OpenAI API or Anthropic) and returns its text reply.

    Returns:
        Dict with 'faithful' (boolean), 'explanation' (str),
        'unsupported_claims' (list of str).
    """
    # Build the passage list outside the f-string for readability
    passages_block = "\n".join(f"- {p}" for p in passages)
    prompt = f"""
You are an expert fact-checker. Evaluate whether the following answer is
faithful to the provided passages. An answer is faithful if every factual
claim is explicitly supported by at least one passage.

Passages:
{passages_block}

Answer:
{answer}

Respond with only a JSON object containing:
- "faithful" (boolean): Is the answer faithful?
- "explanation" (string): Brief explanation of your judgment.
- "unsupported_claims" (list of strings): Factual claims not supported by passages.

Example:
{{"faithful": false, "explanation": "Claim 3 is unsupported.", "unsupported_claims": ["aspirin reduces cancer risk"]}}
"""
    # model_api_call is supplied by the caller; no specific SDK is assumed
    response = model_api_call(prompt)
    # Raises json.JSONDecodeError if the reply is not pure JSON
    result = json.loads(response)
    return result


# Example (requires actual LLM setup; my_llm_api is a placeholder)
# result = llm_faithfulness_evaluation(answer, passages, my_llm_api)
# print(f"Faithful: {result['faithful']}")
# print(f"Unsupported: {result['unsupported_claims']}")
```
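Judge models do not always return clean JSON, so it may help to validate the reply before trusting it. The sketch below is one possible approach, not a standard: `parse_judge_response` is a hypothetical helper that pulls the JSON object out of surrounding prose, checks the three fields requested in the prompt, and resolves one obvious inconsistency.

```python
import json
from typing import Any, Dict


def parse_judge_response(raw: str) -> Dict[str, Any]:
    """Extract and validate the JSON verdict from a judge model's reply."""
    # Keep only the outermost {...}; models often add prose around it
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge response")
    verdict = json.loads(raw[start:end + 1])

    if not isinstance(verdict.get("faithful"), bool):
        raise ValueError("'faithful' must be a boolean")
    if not isinstance(verdict.get("explanation"), str):
        raise ValueError("'explanation' must be a string")
    claims = verdict.get("unsupported_claims")
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("'unsupported_claims' must be a list of strings")

    # A verdict that lists unsupported claims cannot also be faithful
    if claims:
        verdict["faithful"] = False
    return verdict


# A made-up reply with chatter around the JSON and an inconsistent flag
raw_reply = (
    'Sure, here is the evaluation: {"faithful": true, '
    '"explanation": "The headache claim has no support.", '
    '"unsupported_claims": ["aspirin treats headaches"]} Hope this helps.'
)
verdict = parse_judge_response(raw_reply)
print(verdict["faithful"])            # False (overridden by the consistency check)
print(verdict["unsupported_claims"])  # ['aspirin treats headaches']
```

Overriding `faithful` when unsupported claims are listed is a design choice made for this sketch; rejecting the inconsistent reply and re-asking the model is an equally reasonable policy.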
Choosing Automatic vs. LLM-Based Scoring
Use automatic metrics (token overlap, semantic similarity) for fast offline evaluation and continuous monitoring. They scale to large datasets and return quickly. However, they miss subtle hallucinations such as negation errors, where the passage says "X does not increase Y" but the answer says "X increases Y".
Use LLM-based scoring for spot-checks (10–20% of examples) before releases and for detailed analysis of failures. LLM scoring is slower (on the order of 10 examples/min with sequential API calls, depending on the model and latency) but more accurate. Combine both: run automatic metrics for regression testing, then use LLM evaluation on flagged examples.
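One way to wire the two tiers together is sketched below. The function name `select_for_llm_review`, the 0.7 flagging threshold, and the example IDs are all assumptions made for illustration; the 10% spot-check rate is the low end of the range suggested above.

```python
import math
import random
from typing import Dict, List


def select_for_llm_review(auto_scores: Dict[str, float],
                          flag_below: float = 0.7,
                          spot_check_rate: float = 0.1,
                          seed: int = 0) -> List[str]:
    """Pick example IDs to send to the LLM judge.

    Every example whose automatic score falls below `flag_below` is
    selected, plus a random spot-check sample of the remaining ones.
    """
    flagged = [ex_id for ex_id, s in auto_scores.items() if s < flag_below]
    flagged_set = set(flagged)
    rest = [ex_id for ex_id in auto_scores if ex_id not in flagged_set]
    # Seeded RNG so the same examples are picked on every run
    rng = random.Random(seed)
    n_sample = min(len(rest), math.ceil(len(rest) * spot_check_rate))
    sampled = rng.sample(rest, n_sample)
    return sorted(flagged + sampled)


# Made-up automatic scores for twelve examples; two fall below the threshold
auto_scores = {f"ex{i:02d}": 0.9 for i in range(10)}
auto_scores["ex10"] = 0.40
auto_scores["ex11"] = 0.65

to_review = select_for_llm_review(auto_scores)
print(f"{len(to_review)} of {len(auto_scores)} examples go to the LLM judge")
print(to_review)
```

The random sample matters because automatic metrics can score a contradicted answer highly, so reviewing only flagged examples would never surface those cases.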
Key Takeaways
- Faithfulness measures whether answer claims are supported by retrieved passages, not whether they are absolutely correct.
- Token overlap and semantic similarity are fast but imperfect; they catch obvious hallucinations but miss subtle unsupported claims.
- LLM-based evaluation is slower but more accurate and can identify unsupported claims with explanations.
- For production systems, use a hybrid approach: automatic metrics for monitoring, LLM evaluation for spot-checks.
- Faithfulness is binary per claim but is typically reported as an aggregate score (fraction of claims supported).
Frequently Asked Questions
What is the difference between faithfulness and accuracy?
Faithfulness means the answer is grounded in retrieved passages; accuracy means the answer is factually correct in the world. A faithful answer might be inaccurate if the source documents are wrong. For RAG systems, ensuring faithfulness is more important—you cannot guarantee the truth of sources, but you can ensure answers are grounded in them.
How do I handle negations in faithfulness scoring?
Negations are a major failure mode for automatic metrics. "Aspirin does not increase cancer risk" is the negation of "Aspirin increases cancer risk," yet the two sentences share almost all of their words, so token overlap scores them as a near-perfect match. Use entailment models (cross-encoders trained on NLI) or LLM-based evaluation, both of which are designed to distinguish entailment from contradiction.
Can I measure faithfulness without the retrieved passages?
Not reliably. Faithfulness is inherently relative to sources: to measure whether an answer is grounded in retrieval, you must compare it to the passages. For general factual correctness (independent of the retrieved sources), use a different approach, such as fact verification against a knowledge graph, for which FactKG is a benchmark dataset.
Should I penalize shorter answers in faithfulness scoring?
Not directly. Faithfulness is a per-claim metric. A short answer with three claims all supported is more faithful than a long answer with ten claims and two unsupported. However, you may want to measure coverage separately: does the answer address all aspects of the question?
Further Reading
- On Faithfulness and Factuality in Abstractive Summarization (Maynez et al., 2020): Deep analysis of faithfulness in neural text generation.
- Evaluating Factuality in Generation with Dependency-level Entailment (Goyal & Durrett, 2020): Fine-grained faithfulness via entailment.
- Asking and Answering Questions to Evaluate the Factual Consistency of Summaries (Wang et al., 2020): Introduces QAGS, a question-based faithfulness metric for summarization.