Evaluating RAG Systems: Metrics That Matter
You cannot improve what you do not measure. Without metrics, you cannot detect when your RAG system degrades, optimize components, or compare different architectures. RAG evaluation is complex because you must measure two things: whether retrieval found the right documents (retrieval quality) and whether the LLM generated correct answers given those documents (generation quality). This article covers the metrics, tooling, and best practices for comprehensive RAG evaluation.
The Two-Stage Evaluation Pipeline
RAG evaluation breaks into two stages, each with different metrics:
Stage 1: Retrieval Evaluation — Did we fetch the relevant documents?
- Precision@K: Of the top-K retrieved documents, how many are relevant?
- Recall: Of all relevant documents in the index, what fraction did we retrieve?
- NDCG (Normalized Discounted Cumulative Gain): How well-ranked are relevant documents?
- MRR (Mean Reciprocal Rank): How high does the first relevant document rank? Take 1/rank of the first relevant result for each query, then average across queries.
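For reference, the retrieval metrics above can be written out as follows. The notation is an assumption introduced here for illustration: $\mathrm{rel}_i \in \{0, 1\}$ marks whether the document at rank $i$ is relevant (binary relevance), $R$ is the number of relevant documents for the query, $r_q$ is the rank of the first relevant document for query $q$, and $Q$ is the set of evaluation queries.

```latex
\begin{align*}
\mathrm{Precision@}K &= \frac{1}{K}\sum_{i=1}^{K}\mathrm{rel}_i \\
\mathrm{Recall@}K &= \frac{1}{R}\sum_{i=1}^{K}\mathrm{rel}_i \\
\mathrm{DCG@}K &= \sum_{i=1}^{K}\frac{\mathrm{rel}_i}{\log_2(i+1)} \\
\mathrm{NDCG@}K &= \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \qquad
\mathrm{IDCG@}K = \sum_{i=1}^{\min(R,K)}\frac{1}{\log_2(i+1)} \\
\mathrm{MRR} &= \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{r_q}
\end{align*}
```

When a query retrieves no relevant document at all, its reciprocal rank $1/r_q$ is conventionally taken to be 0.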
Stage 2: Generation Evaluation — Did the LLM generate a correct answer?
- Exact Match (EM): Does the answer exactly match the ground truth?
- F1 Score: Harmonic mean of token-level precision (how much of the answer appears in the ground truth) and recall (how much of the ground truth appears in the answer).
- ROUGE: Overlap between generated and reference answers.
- BLEU: Precision of n-grams in generated vs. reference answer.
- Semantic Similarity (BERTScore): Do the answer and reference have similar meaning?
Let us focus on practical metrics for 2026 RAG systems.
Building an Evaluation Dataset
Before running metrics, you need ground truth: a set of (query, expected_answer, expected_sources) tuples. This is manual work, but essential.
```python
import json

# Manually curate a small but diverse test set
evaluation_dataset = [
    {
        "query_id": "q1",
        "query": "What are the benefits of async/await in Python?",
        "expected_answer_snippets": [
            "concurrent I/O operations",
            "without threading overhead",
            "cooperative multitasking"
        ],
        "expected_sources": ["async-guide.md", "python-concurrency.pdf"],
        "difficulty": "easy",
        "category": "language-features"
    },
    {
        "query_id": "q2",
        "query": "How do I configure Pinecone for billion-scale vector search?",
        "expected_answer_snippets": [
            "index sharding",
            "replicas",
            "hybrid storage"
        ],
        "expected_sources": ["pinecone-docs.html"],
        "difficulty": "hard",
        "category": "infrastructure"
    },
    # ... 100+ more curated questions
]


def save_evaluation_dataset(data: list[dict], filepath: str) -> None:
    """Save evaluation dataset as JSON."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


save_evaluation_dataset(evaluation_dataset, "rag_evaluation_set.json")
```
For a production RAG system, curate 100–500 diverse questions covering your domain. Recruit human annotators to agree on expected answers (inter-rater agreement should exceed 80%).
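One way to check the agreement threshold is sketched below. It computes raw percent agreement alongside Cohen's kappa, which corrects for the agreement two annotators would reach by chance. The function names and the sample labels are illustrative assumptions, not part of any particular annotation tool.

```python
from collections import Counter


def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for ten answers, for illustration only
annotator_a = ["correct", "correct", "incorrect", "correct", "incorrect",
               "correct", "correct", "incorrect", "correct", "correct"]
annotator_b = ["correct", "correct", "incorrect", "incorrect", "incorrect",
               "correct", "correct", "correct", "correct", "correct"]

print(f"Agreement: {percent_agreement(annotator_a, annotator_b):.0%}")
print(f"Cohen's kappa: {cohens_kappa(annotator_a, annotator_b):.3f}")
```

Raw agreement can look healthy while kappa stays modest, which happens when one label dominates. Reporting both gives a more honest picture of annotation quality.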
Evaluating Retrieval Quality
```python
from typing import NamedTuple

import numpy as np


class RetrievalMetrics(NamedTuple):
    precision_at_5: float
    precision_at_10: float
    recall_at_10: float
    ndcg_at_10: float
    mrr: float


def compute_retrieval_metrics(
    retrieved_docs: list[str],  # Retrieved document IDs, in rank order
    relevant_docs: list[str],   # Ground truth relevant document IDs
    k: int = 10,
) -> RetrievalMetrics:
    """Compute standard retrieval metrics for one query (binary relevance)."""
    relevant_set = set(relevant_docs)

    def hits_in_top(n: int) -> int:
        """Count relevant documents among the top-n results."""
        return len(set(retrieved_docs[:n]) & relevant_set)

    # Precision@K: fraction of top-K results that are relevant
    num_relevant_in_top_k = hits_in_top(k)
    precision_at_k = num_relevant_in_top_k / k

    # Recall@K: fraction of all relevant docs that appear in top-K
    recall_at_k = num_relevant_in_top_k / len(relevant_set) if relevant_set else 0.0

    # Reciprocal rank of the first relevant document.
    # MRR is the mean of this value across all queries.
    mrr = 0.0
    for rank, doc in enumerate(retrieved_docs, 1):
        if doc in relevant_set:
            mrr = 1 / rank
            break

    # NDCG: discounted cumulative gain, normalized by the ideal ranking
    def dcg(relevances: list[int], k: int) -> float:
        """Compute discounted cumulative gain over the top-k positions."""
        dcg_score = 0.0
        for i, relevance in enumerate(relevances[:k], 1):
            dcg_score += relevance / np.log2(i + 1)
        return dcg_score

    # Relevance: 1 if retrieved doc is relevant, 0 otherwise
    relevance_scores = [1 if doc in relevant_set else 0 for doc in retrieved_docs]
    actual_dcg = dcg(relevance_scores, k)

    # Ideal DCG: as many relevant docs as fit in the top-K, all ranked first
    ideal_relevance = [1] * min(len(relevant_set), k)
    ideal_dcg = dcg(ideal_relevance, k)
    ndcg = actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0

    return RetrievalMetrics(
        precision_at_5=hits_in_top(5) / 5,  # Counted over the top 5 only
        precision_at_10=precision_at_k,     # Field names assume the default k=10
        recall_at_10=recall_at_k,
        ndcg_at_10=ndcg,
        mrr=mrr,
    )


# Example
retrieved = ["doc_1", "doc_2", "doc_5", "doc_3", "doc_7", "doc_8", "doc_4"]
relevant = ["doc_1", "doc_2", "doc_3", "doc_4"]
metrics = compute_retrieval_metrics(retrieved, relevant)
print(f"Precision@10: {metrics.precision_at_10:.3f}")
print(f"Recall@10: {metrics.recall_at_10:.3f}")
print(f"NDCG@10: {metrics.ndcg_at_10:.3f}")
print(f"MRR: {metrics.mrr:.3f}")
```
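The example above scores a single query, whereas reported numbers are averages over the whole evaluation set. The sketch below shows that aggregation step for reciprocal rank and Recall@K. It is self-contained, so it redefines two small helpers; the `runs` structure and its document IDs are assumptions made up for illustration.

```python
from statistics import mean


def reciprocal_rank(retrieved: list[str], relevant: list[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def aggregate(runs: list[dict], k: int = 10) -> dict[str, float]:
    """Average per-query scores over the whole evaluation set."""
    return {
        "mrr": mean(reciprocal_rank(r["retrieved"], r["relevant"]) for r in runs),
        "recall_at_k": mean(recall_at_k(r["retrieved"], r["relevant"], k) for r in runs),
    }


# Hypothetical per-query retrieval results
runs = [
    {"query_id": "q1", "retrieved": ["doc_1", "doc_2", "doc_5"], "relevant": ["doc_1", "doc_2"]},
    {"query_id": "q2", "retrieved": ["doc_9", "doc_3", "doc_4"], "relevant": ["doc_3", "doc_7"]},
    {"query_id": "q3", "retrieved": ["doc_8", "doc_6"], "relevant": ["doc_2"]},
]

summary = aggregate(runs, k=10)
print(f"MRR: {summary['mrr']:.3f}")
print(f"Recall@10: {summary['recall_at_k']:.3f}")
```

Averaging per query (macro-averaging) gives every query equal weight regardless of how many relevant documents it has, which is the usual convention for these metrics.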
Evaluating Generation Quality
```python
from collections import Counter

from bert_score import score as bert_score
from rouge_score import rouge_scorer


def evaluate_answer_quality(
    generated_answer: str,
    reference_answer: str,
    expected_snippets: list[str],
) -> dict:
    """Evaluate LLM-generated answer against reference."""
    # Exact Match: does the answer exactly match reference?
    exact_match = generated_answer.strip() == reference_answer.strip()

    # F1 Score: token-level overlap (whitespace tokens, counted with multiplicity)
    generated_tokens = generated_answer.lower().split()
    reference_tokens = reference_answer.lower().split()
    overlap = sum((Counter(generated_tokens) & Counter(reference_tokens)).values())
    precision = overlap / len(generated_tokens) if generated_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    # ROUGE: word overlap (common in summarization)
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'])
    rouge = scorer.score(reference_answer, generated_answer)  # (target, prediction)

    # BERTScore: semantic similarity; returns (precision, recall, F1) tensors
    _, _, f1_bert = bert_score(
        [generated_answer],
        [reference_answer],
        lang="en"
    )

    # Snippet Coverage: what fraction of expected snippets appear in answer?
    snippet_matches = sum(
        1 for snippet in expected_snippets
        if snippet.lower() in generated_answer.lower()
    )
    snippet_coverage = snippet_matches / len(expected_snippets) if expected_snippets else 1.0

    return {
        "exact_match": exact_match,
        "f1": f1,  # Plain Python float; only the BERTScore tensor needs .item()
        "rouge1": rouge['rouge1'].fmeasure,
        "rougeL": rouge['rougeL'].fmeasure,
        "bertscore_f1": f1_bert.item(),
        "snippet_coverage": snippet_coverage
    }


# Example
generated = "Async/await allows concurrent I/O without threading overhead."
reference = "Async/await enables concurrent I/O operations without the overhead of OS threads."
expected = ["concurrent I/O", "threading overhead"]
scores = evaluate_answer_quality(generated, reference, expected)
print(f"F1: {scores['f1']:.3f}")
print(f"BERTScore F1: {scores['bertscore_f1']:.3f}")
print(f"Snippet Coverage: {scores['snippet_coverage']:.1%}")
```
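Exact Match and token F1 are sensitive to casing, punctuation, and articles, so a trailing period alone can turn a correct answer into a miss. A common remedy is to normalize both strings first, in the style of the SQuAD evaluation script. The sketch below is a minimal version of that idea under stated assumptions: the function names are illustrative, and the normalization rules are English-specific.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_exact_match(prediction: str, reference: str) -> bool:
    """Exact match after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def normalized_token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 after normalization, counting repeated tokens."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(normalized_exact_match("The Eiffel Tower.", "eiffel tower"))
print(f"{normalized_token_f1('Paris, France', 'Paris'):.3f}")
```

Stripping punctuation also merges tokens such as "I/O" into "io". That is harmless as long as the prediction and the reference go through the same function.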
| Metric | Range | Interpretation | Best For |
|---|---|---|---|
| Precision@K | [0, 1] | Fraction of top-K results that are relevant | Strictness of ranking |
| Recall | [0, 1] | Fraction of all relevant docs found | Coverage/completeness |
| NDCG | [0, 1] | Quality of ranked list | Overall retrieval ranking |
| F1 Score | [0, 1] | Token-level answer overlap | Word-level correctness |
| BERTScore | [-1, 1] in theory, mostly positive in practice | Semantic similarity of answers | Meaning-level correctness |
| Snippet Coverage | [0, 1] | Fraction of expected facts mentioned | Factual completeness |
Evaluation Best Practices
- Separate tuning and test splits: Never evaluate on data you used to tune parameters. Use about 70% of the curated data for development and tuning, and reserve the remaining 30% as a held-out set for final testing.
- Segment by difficulty and category: Evaluate separately on easy/medium/hard queries and by domain (language, infrastructure, etc.). This reveals weak spots.
- Measure end-to-end: Retrieval quality alone is not sufficient. Measure both retrieval and generation on the same queries to understand where the system fails.
- Use multiple metrics: No single metric is perfect. F1 and BERTScore both measure correctness but capture different aspects. Report all; watch for disagreement.
- Track over time: Store evaluation results with dates. Plot metrics weekly or monthly to detect regressions (e.g., retrieval quality drops after an index update).
- Involve humans: For high-stakes systems, have humans rate answers on a scale (1–5, poor to excellent). Correlation with automated metrics reveals how well your metrics reflect real quality.
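Segmenting is straightforward once each per-query result carries the `difficulty` and `category` fields from the evaluation dataset. The sketch below groups results by one field and averages one metric per group. The helper name and the scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_segment(results: list[dict], segment_key: str, metric: str) -> dict[str, float]:
    """Average one metric within each value of a segment field."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in results:
        buckets[row[segment_key]].append(row[metric])
    return {segment: mean(values) for segment, values in buckets.items()}


# Hypothetical per-query results (scores are illustrative, not measured)
results = [
    {"query_id": "q1", "difficulty": "easy", "category": "language-features", "recall_at_10": 1.0},
    {"query_id": "q2", "difficulty": "hard", "category": "infrastructure", "recall_at_10": 0.25},
    {"query_id": "q3", "difficulty": "easy", "category": "infrastructure", "recall_at_10": 0.75},
    {"query_id": "q4", "difficulty": "hard", "category": "infrastructure", "recall_at_10": 0.5},
]

by_difficulty = mean_by_segment(results, "difficulty", "recall_at_10")
by_category = mean_by_segment(results, "category", "recall_at_10")
print(by_difficulty)
print(by_category)
```

A segment with only a handful of queries produces a noisy average, so it helps to report the query count next to each segment score.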
Key Takeaways
- Retrieval and generation are distinct; evaluate both separately using different metrics.
- Precision@K, Recall, and NDCG measure retrieval quality; F1 and BERTScore measure answer quality.
- Curate 100–500 diverse questions with ground-truth answers for evaluation.
- Track metrics over time to detect system regressions and validate improvements.
- Segment evaluation by query difficulty and domain to identify weak spots.
Frequently Asked Questions
What is a good Precision@K value for RAG?
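As a rough rule of thumb, Precision@10 of 0.7 or higher is good and 0.5 or higher is acceptable; 0.7 means that 7 of your top-10 retrieved results are relevant. These thresholds only apply when each query has at least 10 relevant documents. A query with one or two expected sources, as in the sample dataset above, can never score higher than 0.1 or 0.2, so judge precision against that ceiling. Where the ceiling is not the limiting factor, precision below 0.3 indicates serious retrieval problems.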
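Precision@K thresholds are easier to interpret next to the best score a query could possibly achieve, because a query with fewer than K relevant documents cannot reach 1.0. The sketch below computes that ceiling and a precision value scaled against it. Both helper names are illustrative, and the scaled value is a convenience for reading results, not a standard named metric.

```python
def max_precision_at_k(num_relevant: int, k: int) -> float:
    """Highest Precision@K a query can reach given its number of relevant docs."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(num_relevant, k) / k


def precision_vs_ceiling(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Relevant hits in the top-k, divided by the most hits that were possible."""
    relevant_set = set(relevant)
    possible_hits = min(len(relevant_set), k)
    if possible_hits == 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / possible_hits


# A query with only 2 relevant documents tops out at Precision@10 = 0.2
print(max_precision_at_k(num_relevant=2, k=10))
print(precision_vs_ceiling(["doc_1", "doc_9", "doc_2"], ["doc_1", "doc_2"], k=10))
```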
Which metric should I optimize for: Precision or Recall?
It depends on your domain. For customer support, high recall (find all relevant docs) is critical to avoid missing solutions. For search, high precision (top results are relevant) matters most. Typically, optimize for a balanced F1 (harmonic mean of both).
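If one side matters more than the other, the F-beta score generalizes F1 by weighting recall `beta` times as heavily as precision. The sketch below is a plain implementation of the standard formula; the input values are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Same retriever, three weightings
precision, recall = 0.5, 1.0
print(f"F1:   {f_beta(precision, recall, beta=1.0):.3f}")
print(f"F2:   {f_beta(precision, recall, beta=2.0):.3f}")   # recall-heavy, e.g. support
print(f"F0.5: {f_beta(precision, recall, beta=0.5):.3f}")   # precision-heavy, e.g. search
```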
Can I use LLM-based evaluation instead of human labels?
Partially. GPT-4 can rate answer quality on a rubric, often correlating with human judgment (0.8–0.9 agreement). Use LLM evaluation for fast iteration, but validate periodically on human-labeled data. LLMs themselves may hallucinate or have biases.
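One way to run that periodic validation is to score the same answers with both the LLM judge and human raters, then compute a rank correlation between the two. The sketch below implements Spearman's correlation in plain Python, using average ranks for ties. The ratings are made up for illustration, and `scipy.stats.spearmanr` is an alternative if SciPy is already a dependency.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # Mean of 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = tied_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 1-5 ratings for the same six answers
human_ratings = [5, 4, 4, 2, 1, 3]
llm_ratings = [5, 5, 4, 2, 2, 3]
rho = spearman(human_ratings, llm_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

A rank correlation is a reasonable fit for 1–5 ratings because it only asks whether the judge orders answers the way humans do, not whether the two scales line up numerically.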
How often should I re-evaluate my RAG system?
Re-evaluate after any significant change (new embedding model, retriever, LLM, documents). In a stable production system, monthly evaluation is sufficient. In active development, weekly is better.
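Re-evaluation is most useful when it is compared automatically against a stored baseline. The sketch below flags any metric that dropped by more than a tolerance. The function name, the 0.02 tolerance, and the metric values are assumptions for illustration; the right tolerance depends on how noisy your evaluation set is.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the drop."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # Metric not measured in this run; nothing to compare
        drop = base_value - current[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical scores before and after an index update
baseline_scores = {"recall_at_10": 0.82, "ndcg_at_10": 0.74, "bertscore_f1": 0.91}
current_scores = {"recall_at_10": 0.76, "ndcg_at_10": 0.735, "bertscore_f1": 0.92}

regressed = find_regressions(baseline_scores, current_scores)
print(regressed if regressed else "No regressions beyond tolerance")
```

Running a check like this in CI after each retriever or model change turns the evaluation set into a regression test.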
What if my test set is small (only 10 queries)?
Evaluation on tiny sets is unreliable. Aim for at least 100 questions. If manual curation is expensive, use bootstrapping: semi-automatically generate synthetic questions from your documents and validate a subset manually.
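To see how unreliable a small set is, you can put a confidence interval around the average score. The sketch below uses a percentile bootstrap, which resamples the per-query scores with replacement. Note that this is the statistical sense of "bootstrap", distinct from the synthetic-question approach described above. The scores are made up, and the function name and defaults are illustrative.

```python
import random
from statistics import mean


def bootstrap_ci(
    scores: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # Fixed seed so the interval is reproducible
    n = len(scores)
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = resampled_means[int(alpha * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


# Same 70% accuracy, measured on 10 queries versus 200 queries
small_set = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
large_set = small_set * 20

small_low, small_high = bootstrap_ci(small_set)
large_low, large_high = bootstrap_ci(large_set)
print(f"10 queries:  [{small_low:.2f}, {small_high:.2f}]")
print(f"200 queries: [{large_low:.2f}, {large_high:.2f}]")
```

The interval for the 10-query set is several times wider than the one for 200 queries, so on a tiny set a modest change in the average score is hard to tell apart from sampling noise.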
Further Reading
- RAGAS: Automatic Evaluation for RAG — open-source Python library for RAG evaluation.
- Information Retrieval Evaluation Metrics — comprehensive overview of standard IR metrics.
- BERTScore: Evaluating Text Generation with BERT — semantic similarity metric paper.
- Benchmark datasets for QA and Retrieval — Papers with Code QA benchmarks.