RAG Evaluation Metrics: Step-by-Step Guide

RAG evaluation metrics quantify how well a retrieval-augmented generation system combines search results with generation to answer user queries correctly and without hallucination. A complete RAG evaluation framework measures retrieval quality (did the system find relevant documents?), generation quality (is the answer faithful to retrieved content?), and grounding quality (are claims attributed to sources?). Unlike evaluating a standalone LLM, RAG adds complexity: you must verify both that the retriever fetched the right passages and that the generator faithfully used them.

I built my first RAG system in 2024 without systematic evaluation. After shipping, users reported answers that sounded confident but contradicted the source documents. That incident taught me that visual spot-checks are insufficient. You need quantitative metrics tied to golden datasets, automated scorers, and regression tests.

Why RAG Evaluation Is Different

Evaluating RAG differs from evaluating a standalone language model in three ways. First, retrieval quality directly affects generation quality: a perfect generator cannot recover from a failed retrieval. Second, groundedness becomes measurable, because you can check whether the generated answer is supported by the retrieved passages. Third, source attribution is now a first-class feature, not an afterthought.

Consider a medical RAG system. A standard LLM evaluation might ask: "Is the answer medically accurate?" But a RAG evaluation must also verify: "Did the system retrieve the relevant clinical guidelines?" and "Did the generated answer cite those guidelines?"

Core RAG Evaluation Metrics

RAG evaluation combines three metric families: retrieval metrics, generation metrics, and grounding metrics. Retrieval metrics like precision, recall, and nDCG measure whether the retriever found relevant documents. Generation metrics like faithfulness and relevance measure whether the answer accurately reflects retrieved content. Grounding metrics measure citation coverage and attribution accuracy.

Retrieval metrics (precision, recall, nDCG) assume you have a gold-standard list of relevant documents per query. Precision measures the proportion of retrieved documents that are actually relevant. Recall measures the proportion of all relevant documents that the system retrieved. Normalized Discounted Cumulative Gain (nDCG) rewards placing highly relevant documents earlier in the ranking.
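The three retrieval metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the document IDs, the graded relevance values in `gains`, and the cutoff `k` are all assumptions made up for the example.

```python
import math
from typing import Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: List[str], gains: Dict[str, float], k: int) -> float:
    """nDCG with graded relevance; IDs missing from `gains` count as 0."""
    def dcg(values: List[float]) -> float:
        # Rank 1 is discounted by log2(2), rank 2 by log2(3), and so on.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical ranking and gold labels (higher gain = more relevant).
retrieved = ["d3", "d1", "d7", "d2"]
gains = {"d1": 3.0, "d2": 2.0, "d5": 1.0}
relevant = set(gains)

print(precision_at_k(retrieved, relevant, k=4))  # 2 of the 4 retrieved are relevant
print(recall_at_k(retrieved, relevant, k=4))     # 2 of the 3 relevant were found
print(round(ndcg_at_k(retrieved, gains, k=4), 3))
```

Precision and recall only need a set of relevant IDs, while nDCG needs graded labels. If your golden dataset only records relevant versus not relevant, a gain of 1.0 for every relevant document is a reasonable starting point.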

Generation metrics (faithfulness, answer relevance) measure the quality of the final generated answer. Faithfulness evaluates whether the answer is grounded in the retrieved documents—does it avoid fabricating facts? Answer relevance evaluates whether the answer actually addresses the user's question.
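Faithfulness is often scored with an entailment model or an LLM judge, which is beyond a short snippet. As a rough stand-in, the sketch below counts the share of answer sentences whose words mostly appear in the retrieved documents. The function name, the tokenizer, and the 0.6 threshold are assumptions chosen for illustration; a lexical check like this misses paraphrases and cannot detect contradictions, so treat it as a smoke test only.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supported_sentence_ratio(answer: str, retrieved_docs: List[str],
                             threshold: float = 0.6) -> float:
    """Share of answer sentences whose tokens mostly occur in the retrieved docs."""
    context: Set[str] = set()
    for doc in retrieved_docs:
        context |= _tokens(doc)

    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0

    supported = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if toks and len(toks & context) / len(toks) >= threshold:
            supported += 1
    return supported / len(sentences)


docs = ["France is a country in Western Europe. Its capital is Paris."]
grounded = supported_sentence_ratio("Paris is the capital of France.", docs)
mixed = supported_sentence_ratio(
    "Paris is the capital. It has 40 million residents.", docs
)
print(grounded, mixed)
```

The second answer contains a sentence with no support in the retrieved text, so its ratio drops. That is the kind of signal a faithfulness metric is meant to surface.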

Grounding metrics (citation precision, citation recall, answer recall) measure whether claims in the answer are traceable to source documents and whether citations cover all key facts.
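Citation precision and recall can be computed the same way as their retrieval counterparts once citations are reduced to passage IDs. The sketch below assumes you have already extracted the set of passage IDs the answer cites and that the golden dataset lists the IDs that should be cited; the IDs and the function name are hypothetical.

```python
from typing import Dict, Set


def citation_scores(cited: Set[str], expected: Set[str]) -> Dict[str, float]:
    """Compare the passages an answer cites with the passages it should cite."""
    correct = cited & expected
    # Precision: how many of the citations given point at an expected passage.
    precision = len(correct) / len(cited) if cited else 0.0
    # Recall: how many of the expected passages were actually cited.
    recall = len(correct) / len(expected) if expected else 1.0
    return {"citation_precision": precision, "citation_recall": recall}


scores = citation_scores(cited={"p1", "p4"}, expected={"p1", "p2"})
print(scores)
```

Here one of the two citations is correct and one of the two expected passages is covered, so both scores come out at 0.5.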

Setting Up a Minimal RAG Evaluation Pipeline

A production RAG evaluation setup requires: (1) a golden dataset (query-document pairs with ground-truth answers), (2) metric scorers (functions that compute metrics), (3) a baseline to compare against, and (4) a regression test harness.

import json
from typing import List, Dict
from dataclasses import dataclass

@dataclass
class RAGExample:
    """A single query with reference documents and expected answer."""
    query: str
    reference_docs: List[str]      # Gold standard relevant documents
    ground_truth_answer: str       # Expected correct answer
    expected_citations: List[str]  # Which passages should be cited

class RAGEvaluator:
    """Minimal RAG evaluation harness."""

    def __init__(self, metric_names: List[str]):
        self.metric_names = metric_names
        self.results = []  # Per-example scores, kept for later inspection

    def evaluate_example(self, example: RAGExample,
                         retrieved_docs: List[str],
                         generated_answer: str) -> Dict:
        """Score one example across all metrics."""
        scores = {}

        # Retrieval metric: precision, using exact string match
        # against the gold-standard documents
        relevant_retrieved = sum(
            1 for doc in retrieved_docs
            if doc in example.reference_docs
        )
        scores['retrieval_precision'] = (
            relevant_retrieved / len(retrieved_docs)
            if retrieved_docs else 0.0
        )

        # Grounding metric: citation presence (basic check).
        # A passage counts as cited only if it appears verbatim
        # (case-insensitive) in the answer; paraphrases are missed.
        answer_lower = generated_answer.lower()
        cited_count = sum(
            1 for passage in example.expected_citations
            if passage.lower() in answer_lower
        )
        scores['citation_recall'] = (
            cited_count / len(example.expected_citations)
            if example.expected_citations else 1.0
        )

        return scores

    def evaluate_dataset(self, examples: List[RAGExample],
                         predictions: List[Dict]) -> Dict:
        """Compute aggregate metrics over a dataset."""
        all_scores = {metric: [] for metric in self.metric_names}

        for ex, pred in zip(examples, predictions):
            scores = self.evaluate_example(
                ex,
                pred['retrieved_docs'],
                pred['answer']
            )
            self.results.append(scores)
            for metric, score in scores.items():
                # Aggregate only the requested metrics; this also avoids
                # a KeyError when metric_names omits a computed metric
                if metric in all_scores:
                    all_scores[metric].append(score)

        # Compute averages
        summary = {}
        for metric, values in all_scores.items():
            summary[f'{metric}_mean'] = (
                sum(values) / len(values) if values else 0.0
            )

        return summary

# Example usage
examples = [
    RAGExample(
        query="What is the capital of France?",
        reference_docs=["France is a country in Western Europe. Its capital is Paris."],
        ground_truth_answer="Paris is the capital of France.",
        expected_citations=["capital is Paris"]
    )
]

evaluator = RAGEvaluator(['retrieval_precision', 'citation_recall'])
predictions = [
    {
        'retrieved_docs': ["France is a country in Western Europe. Its capital is Paris."],
        'answer': 'Paris is the capital of France.'
    }
]

# Note: citation_recall_mean is 0.0 for this example, because the answer
# paraphrases the expected passage and the basic check matches verbatim text only.
results = evaluator.evaluate_dataset(examples, predictions)
print(json.dumps(results, indent=2))
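
The harness above produces a summary dictionary, but the regression test step is left open. One possible shape for it is sketched below: compare the current summary with a stored baseline and report every metric that dropped by more than a tolerance. The function name, the 0.02 tolerance, and the metric values are hypothetical.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return one message per metric that dropped by more than `tolerance`."""
    failures = []
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None:
            failures.append(f"{metric}: missing from current run")
        elif base_value - new_value > tolerance:
            failures.append(f"{metric}: {base_value:.3f} -> {new_value:.3f}")
    return failures


# Hypothetical summaries from a previous release and the current candidate.
baseline = {"retrieval_precision_mean": 0.82, "citation_recall_mean": 0.74}
current = {"retrieval_precision_mean": 0.83, "citation_recall_mean": 0.65}

failures = find_regressions(baseline, current)
for line in failures:
    print("REGRESSION", line)
```

In a CI job you would typically load the baseline from a version-controlled JSON file and fail the build when the list is non-empty.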

Organizing Metrics by RAG Stage

Organize your evaluation around the RAG pipeline stages. At the retrieval stage, measure precision and recall. At the reranking stage (if used), measure nDCG and Mean Reciprocal Rank. At the generation stage, measure faithfulness, relevance, and hallucination rate. At the post-generation stage, measure citation coverage and source attribution accuracy.
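
Mean Reciprocal Rank, mentioned for the reranking stage, is short enough to show in full. The sketch below assumes each query comes with a ranked list of document IDs and a set of relevant IDs; the IDs are made up for the example.

```python
from typing import List, Set, Tuple


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Three hypothetical queries: hit at rank 2, hit at rank 1, and a miss.
runs = [
    (["d3", "d1"], {"d1"}),
    (["d2"], {"d2"}),
    (["d9"], {"d4"}),
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

MRR only looks at the first relevant hit, so it suits questions with a single right passage; nDCG is the better fit when several passages matter.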

A complete evaluation report should break down performance by stage, so you can pinpoint failures. If retrieval precision or recall drops, the issue most likely lies in retrieval or ranking. If faithfulness drops while retrieval quality stays constant, the issue most likely lies in generation.
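
The rule of thumb above can be written down as a small triage function. This is only a sketch: the metric names, the 0.02 tolerance, and the order of the checks are assumptions, and a real report would look at more than three metrics.

```python
from typing import Dict


def diagnose(baseline: Dict[str, float], current: Dict[str, float],
             tolerance: float = 0.02) -> str:
    """Point at the pipeline stage most likely responsible for a regression."""
    def dropped(metric: str) -> bool:
        return baseline[metric] - current[metric] > tolerance

    # Check retrieval first: generation cannot recover from a failed retrieval.
    if dropped("retrieval_precision") or dropped("retrieval_recall"):
        return "retrieval or ranking"
    if dropped("faithfulness"):
        return "generation"
    return "no stage-level regression detected"


# Hypothetical per-stage scores before and after a change.
baseline = {"retrieval_precision": 0.80, "retrieval_recall": 0.70, "faithfulness": 0.90}
current = {"retrieval_precision": 0.80, "retrieval_recall": 0.71, "faithfulness": 0.80}

verdict = diagnose(baseline, current)
print(verdict)
```

Retrieval is checked before generation on purpose, since a retrieval drop usually drags faithfulness down with it.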

Key Takeaways

  • RAG evaluation requires three metric families: retrieval metrics (precision, recall, nDCG), generation metrics (faithfulness, relevance), and grounding metrics (citation coverage, attribution).
  • Golden datasets are foundational—build them with domain experts and version-control them alongside your code.
  • A minimal evaluation harness computes aggregate metrics and enables regression testing against baselines.
  • Organize evaluation by pipeline stage (retrieval → reranking → generation → grounding) to isolate failure modes.
  • Baseline every RAG change against your previous best metrics; never ship blind.

Frequently Asked Questions

What is a golden dataset in RAG evaluation?

A golden dataset is a curated collection of query-document-answer triples where domain experts have manually verified that the reference documents are relevant and the ground-truth answer is correct. It serves as the benchmark against which all RAG versions are compared. Golden datasets should be diverse (cover edge cases, domain breadth, query types) and version-controlled.

How many examples do I need in a golden dataset?

For initial baseline and regression testing, aim for 50–100 examples per domain or use case. This is enough to detect major regressions (drops of several points in average metrics), but differences of 1–2% usually fall within sampling noise at this size. For critical systems (medical, legal), target 200–500 examples and stratify by query difficulty.
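
One way to check how small a change a dataset of a given size can resolve is a bootstrap confidence interval on the mean score. The sketch below uses only the standard library; the function name, the resample count, and the 50 pass/fail scores are assumptions for illustration. If the interval is wider than the drop you care about, the dataset is too small to detect that drop reliably.

```python
import random
from typing import List, Tuple


def bootstrap_mean_ci(scores: List[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-example pass/fail scores: 40 passes out of 50 examples.
scores = [1.0] * 40 + [0.0] * 10
mean = sum(scores) / len(scores)
low, high = bootstrap_mean_ci(scores)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

When comparing two system versions on the same examples, bootstrapping the per-example score differences instead of each mean separately gives a tighter and more relevant interval.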

Can I use automatic metrics alone, or do I need human evaluation?

Automatic metrics are fast and reproducible, making them essential for regression testing and daily monitoring. However, automatic metrics (especially faithfulness) have known blind spots. For releases, supplement automatic metrics with targeted human review (10–20% of examples) to catch failure modes the metrics miss.

What if my golden dataset is small?

Start with what you have. Stratified k-fold cross-validation (e.g., 5 folds, stratified by query type or difficulty) helps you get the most signal from limited examples, for instance by showing how much each metric varies from fold to fold. As you grow, collect more examples incrementally, prioritizing hard cases (queries that fail your current system) and domain edge cases.
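
A stratified split can be built without extra dependencies. The sketch below deals example indices into k folds so that each fold keeps roughly the same mix of strata; the "easy" and "hard" labels and the function name are hypothetical, and a library such as scikit-learn offers a more complete version of the same idea.

```python
import random
from collections import defaultdict
from typing import List


def stratified_folds(labels: List[str], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split example indices into k folds with a similar label mix in each."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Deal this stratum round-robin, continuing where the last one stopped.
        for i, idx in enumerate(idxs):
            folds[(offset + i) % k].append(idx)
        offset += len(idxs)
    return folds


# Hypothetical difficulty labels for a 15-example golden dataset.
labels = ["easy"] * 10 + ["hard"] * 5
folds = stratified_folds(labels, k=5)
for fold in folds:
    print(sorted(fold))
```

Scoring each fold separately and reporting the spread gives a quick sense of how stable a metric is on a small dataset.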

Further Reading