Retrieval Metrics Explained: Precision and Recall

Retrieval metrics are the foundation of RAG evaluation. They measure how well your retriever selects relevant documents from a corpus in response to a query. Precision answers: "Of the documents you returned, how many were actually relevant?" Recall answers: "Of all relevant documents in the corpus, how many did you find?" Normalized Discounted Cumulative Gain (nDCG) further rewards placing highly relevant documents earlier in the ranking, penalizing bad orderings.

These three metrics form a complete picture: high precision means few false positives (fewer irrelevant documents), high recall means few false negatives (you found most relevant documents), and high nDCG means the ranking puts the best documents first. I discovered this distinction when debugging a failed legal RAG system that achieved 95% recall but only 60% precision—it was retrieving everything related to the query, including marginally relevant documents that confused the generator.

Precision: What Fraction of Retrieved Documents Matter?

Precision is the ratio of relevant retrieved documents to total retrieved documents. If you retrieve 10 documents and 7 are relevant, precision is 0.7. Precision directly reflects your retrieval quality from the generator's perspective: high precision means the language model sees mostly useful context, not noise.

Precision depends on the retrieval cutoff (how many documents you ask the retriever to return). At cutoff 5 you might have 4 relevant documents (precision 0.8); at cutoff 10 you might have 6 (precision drops to 0.6, because only 2 of the 5 extra documents are relevant). Precision at 5 (P@5) is a standard metric for measuring shallow retrieval quality, often used in production where compute budgets limit context length.

from typing import List


def precision_at_k(retrieved_doc_ids: List[str],
                   relevant_doc_ids: set) -> float:
    """
    Compute precision at k (k = len(retrieved_doc_ids)).

    Args:
        retrieved_doc_ids: Ordered list of retrieved document IDs.
        relevant_doc_ids: Set of ground-truth relevant document IDs.

    Returns:
        Precision at k: fraction of retrieved docs that are relevant.
    """
    # Nothing retrieved: define precision as 0.0 to avoid dividing by zero
    if not retrieved_doc_ids:
        return 0.0

    num_relevant = sum(1 for doc_id in retrieved_doc_ids
                       if doc_id in relevant_doc_ids)
    return num_relevant / len(retrieved_doc_ids)


# Example: retrieved 5 documents, 3 are relevant
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc1', 'doc3', 'doc5'}  # Ground truth
p_at_5 = precision_at_k(retrieved, relevant)
print(f"Precision@5 = {p_at_5}")  # Output: 0.6
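To make the cutoff dependence concrete, here is a minimal sketch that truncates one ranked list at several values of k. The helper name `precision_at_cutoff` and the document IDs are illustrative assumptions, chosen to reproduce the 4-of-5 and 6-of-10 scenario described earlier.

```python
from typing import List, Set


def precision_at_cutoff(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Precision over the top-k results only (hypothetical helper)."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Assumption: divide by the number of documents actually returned.
    # Some evaluations divide by k even when fewer than k docs come back.
    return hits / len(top_k)


# Illustrative ranking: 4 of the top 5 are relevant, 6 of the top 10.
ranked = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd10']
relevant_set = {'d1', 'd2', 'd3', 'd5', 'd7', 'd9'}

p_at_5 = precision_at_cutoff(ranked, relevant_set, k=5)
p_at_10 = precision_at_cutoff(ranked, relevant_set, k=10)
print(f"P@5 = {p_at_5}, P@10 = {p_at_10}")  # P@5 = 0.8, P@10 = 0.6
```

The same retriever output scores differently at each cutoff, which is why a precision figure is hard to interpret unless the k is reported with it.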

Recall: Did You Find the Documents That Exist?

Recall is the ratio of relevant retrieved documents to all relevant documents in the ground truth. If 10 relevant documents exist in the corpus and you retrieve 6 of them, recall is 0.6. Recall measures coverage: are you missing important documents that should influence the answer?

Recall is harder to measure in production because computing it requires knowing all relevant documents in the corpus (often intractable). For golden datasets, you annotate relevant documents per query, making recall computable. Recall is critical for high-stakes domains: a legal search that misses a critical precedent is worse than returning a few irrelevant documents.

from typing import List


def recall_at_k(retrieved_doc_ids: List[str],
                relevant_doc_ids: set) -> float:
    """
    Compute recall at k (k = len(retrieved_doc_ids)).

    Args:
        retrieved_doc_ids: Ordered list of retrieved document IDs.
        relevant_doc_ids: Set of ground-truth relevant document IDs.

    Returns:
        Recall at k: fraction of all relevant docs that were retrieved.
    """
    if not relevant_doc_ids:
        return 1.0  # No relevant docs = no recall penalty

    num_relevant_retrieved = sum(1 for doc_id in retrieved_doc_ids
                                 if doc_id in relevant_doc_ids)
    return num_relevant_retrieved / len(relevant_doc_ids)


# Example: corpus has 10 relevant docs total, we retrieved 6 of them
retrieved = ['doc1', 'doc3', 'doc5', 'doc7', 'doc9', 'doc11']
relevant = {'doc1', 'doc3', 'doc5', 'doc7', 'doc9', 'doc11',
            'doc2', 'doc4', 'doc6', 'doc8'}  # 10 total
recall = recall_at_k(retrieved, relevant)
print(f"Recall = {recall}")  # Output: 0.6
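With a golden dataset, recall is usually reported as an average over queries rather than for a single query. The sketch below assumes two hypothetical dictionaries keyed by query ID (`runs` for retriever output, `golden` for annotated relevant IDs). It also assumes that queries with no labeled relevant documents are skipped, which is one possible convention and differs from returning 1.0 for them.

```python
from typing import Dict, List, Set


def mean_recall(runs: Dict[str, List[str]], golden: Dict[str, Set[str]]) -> float:
    """Average per-query recall over queries with at least one labeled relevant doc."""
    scores = []
    for query_id, relevant_ids in golden.items():
        if not relevant_ids:
            continue  # Assumption: skip queries with no labeled relevant docs
        retrieved_ids = set(runs.get(query_id, []))
        scores.append(len(retrieved_ids & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


# Toy golden dataset: q1 has 2 relevant docs, q2 has 4.
golden = {'q1': {'a', 'b'}, 'q2': {'c', 'd', 'e', 'f'}}
runs = {'q1': ['a', 'x', 'b'], 'q2': ['c', 'y', 'z']}

avg_recall = mean_recall(runs, golden)
print(f"Mean recall = {avg_recall}")  # (1.0 + 0.25) / 2 = 0.625
```

Averaging per query (rather than pooling all documents) keeps a query with many relevant documents from dominating the score.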

nDCG: Ranking Quality Matters

Normalized Discounted Cumulative Gain (nDCG) is a ranking-aware metric. It rewards placing highly relevant documents early and penalizes putting them deep in the list. This reflects real user behavior: users rarely scroll past the first 3–5 results. nDCG combines precision and ranking into one score.

The formula is: DCG@k = sum over ranks i = 1..k of (rel_i / log2(i + 1)), where i is the 1-indexed rank and rel_i is the relevance score at that rank (e.g., 1 for relevant, 0 for irrelevant). nDCG@k then divides DCG@k by the ideal DCG@k (IDCG@k), the score of the best possible ranking, so the result falls between 0 and 1.

import math
from typing import List


def ndcg_at_k(retrieved_doc_ids: List[str],
              relevant_doc_ids: set, k: int = 10) -> float:
    """
    Compute normalized discounted cumulative gain at k.

    Args:
        retrieved_doc_ids: Ordered list of retrieved document IDs.
        relevant_doc_ids: Set of ground-truth relevant document IDs.
        k: Compute nDCG@k (default 10).

    Returns:
        nDCG@k: normalized ranking quality (0.0 to 1.0).
    """
    # Compute DCG: sum of (relevance / log2(rank + 1))
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_doc_ids[:k]):
        rel = 1 if doc_id in relevant_doc_ids else 0
        # i is 0-indexed, so rank = i + 1 and the discount is log2(i + 2)
        dcg += rel / math.log2(i + 2)

    # Compute ideal DCG: all relevant docs ranked first
    idcg = 0.0
    num_relevant = min(len(relevant_doc_ids), k)
    for i in range(num_relevant):
        idcg += 1.0 / math.log2(i + 2)

    # No relevant docs (or k == 0): nothing to normalize against
    if idcg == 0:
        return 0.0

    return dcg / idcg


# Example: retrieved 5, with the 3 relevant docs at ranks 1, 3, and 5
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc1', 'doc3', 'doc5'}
ndcg = ndcg_at_k(retrieved, relevant, k=5)
print(f"nDCG@5 = {ndcg:.3f}")  # Output: 0.885
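A quick way to see what nDCG adds over precision is to score two orderings of the same five documents. The sketch below re-implements the binary formula in a compact, self-contained helper (`binary_ndcg` is an illustrative name, not a library function) and compares a ranking with the relevant documents first against one with them last.

```python
import math
from typing import List, Set


def binary_ndcg(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k using the rel / log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


relevant_ids = {'doc1', 'doc3', 'doc5'}
relevant_first = ['doc1', 'doc3', 'doc5', 'doc2', 'doc4']
relevant_last = ['doc2', 'doc4', 'doc1', 'doc3', 'doc5']

# Both lists contain the same 3 relevant documents, so P@5 is 0.6 for each.
score_first = binary_ndcg(relevant_first, relevant_ids, k=5)
score_last = binary_ndcg(relevant_last, relevant_ids, k=5)
print(f"relevant first: {score_first:.3f}")
print(f"relevant last:  {score_last:.3f}")
```

Precision cannot tell these two rankings apart; nDCG gives the first a perfect score and the second a noticeably lower one.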

Comparing Metrics: When to Use Each

Precision is best for production systems where context budget is tight (you can only afford 5–10 documents in the prompt). Recall is critical for exhaustive search where missing a relevant document is costly. nDCG is the most realistic metric—it reflects actual user satisfaction, as users prefer high-quality results ranked first.

In practice, use all three. A retriever with high precision but low recall is useless for comprehensive tasks. A retriever with high recall but low precision wastes compute and confuses the generator. A system with high nDCG but low precision at shallow cutoffs will fail in production.

Metric | Use Case | Key Property
Precision@k | Limited context budget | High relevance per document
Recall | Exhaustive search | Coverage of all relevant docs
nDCG@k | Ranking quality | Rewards placing best results first
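Since the advice is to use all three metrics together, a small report function can compute them side by side at several cutoffs. This is a sketch under stated assumptions: `evaluate` and its report keys are hypothetical names, relevance is binary, and an empty relevant set yields recall 1.0, mirroring the convention used in the recall function earlier.

```python
import math
from typing import Dict, List, Sequence, Set


def evaluate(ranked_ids: List[str], relevant_ids: Set[str],
             cutoffs: Sequence[int] = (5, 10)) -> Dict[str, float]:
    """Return P@k, Recall@k, and binary nDCG@k for each cutoff (hypothetical helper)."""
    report: Dict[str, float] = {}
    for k in cutoffs:
        top_k = ranked_ids[:k]
        hits = [1 if doc_id in relevant_ids else 0 for doc_id in top_k]
        # DCG discounts each hit by log2(rank + 1); i is 0-indexed here.
        dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
        idcg = sum(1.0 / math.log2(i + 2)
                   for i in range(min(len(relevant_ids), k)))
        report[f"P@{k}"] = sum(hits) / len(top_k) if top_k else 0.0
        report[f"Recall@{k}"] = sum(hits) / len(relevant_ids) if relevant_ids else 1.0
        report[f"nDCG@{k}"] = dcg / idcg if idcg > 0 else 0.0
    return report


# Toy example: 5 relevant docs exist, one of which ('d12') is never retrieved.
ranked_docs = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd10']
relevant_docs = {'d1', 'd3', 'd5', 'd8', 'd12'}

report = evaluate(ranked_docs, relevant_docs)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy run precision falls from 0.6 to 0.4 between k=5 and k=10 while recall rises from 0.6 to 0.8, the trade-off that a single number would hide.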

Key Takeaways

  • Precision measures false positives (irrelevant retrieved documents); high precision ensures the generator sees mostly signal.
  • Recall measures false negatives (missed relevant documents); high recall ensures comprehensive coverage.
  • nDCG combines precision and ranking, rewarding systems that put high-quality results early.
  • Use precision for shallow retrieval (P@5 or P@10), recall for exhaustive search, and nDCG for realistic ranking quality.
  • Report metrics at multiple cutoffs (P@5, P@10, Recall, nDCG@10) to get a complete picture.

Frequently Asked Questions

What is a "relevant" document in retrieval evaluation?

A relevant document is one that contains information needed to answer the query accurately. Relevance can be binary (relevant or not) or graded (highly relevant, somewhat relevant, not relevant). For binary evaluation, you typically annotate ground truth as a set of relevant document IDs per query.
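When relevance is graded, nDCG can use the grade directly as the gain instead of 0 or 1. The following sketch assumes a hypothetical `grades` mapping from document ID to an integer grade (2 = highly relevant, 1 = somewhat relevant, 0 = not relevant) and uses the linear gain rel / log2(rank + 1); some evaluations use an exponential gain (2^rel - 1) instead.

```python
import math
from typing import Dict, List


def graded_ndcg(ranked_ids: List[str], grades: Dict[str, int], k: int) -> float:
    """nDCG@k with graded relevance and linear gain (illustrative helper)."""
    # Unlabeled documents are assumed to have grade 0.
    gains = [grades.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal ranking: highest grades first.
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


grades = {'doc1': 2, 'doc2': 1, 'doc3': 0}
best_first = graded_ndcg(['doc1', 'doc2', 'doc3'], grades, k=3)
best_second = graded_ndcg(['doc2', 'doc1', 'doc3'], grades, k=3)
print(f"best first:  {best_first:.3f}")
print(f"best second: {best_second:.3f}")
```

Swapping the highly relevant and somewhat relevant documents lowers the score even though both rankings contain the same documents, which binary labels could not capture.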

How do I compute precision, recall, and nDCG if I don't have annotated ground truth?

Ground truth annotation is mandatory for reliable evaluation. Small datasets (50–200 examples) can be annotated by domain experts. Larger datasets often rely on crowdsourcing platforms such as Amazon Mechanical Turk or on dedicated labeling services. For initial prototyping, you can auto-annotate based on keyword overlap or embedding similarity, but always hand-verify a sample.
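As a rough illustration of keyword-overlap auto-annotation, the sketch below marks a document as provisionally relevant when it contains at least half of the query's terms. The function names, the 0.5 threshold, and the toy documents are all assumptions for illustration; labels produced this way are noisy and still need hand verification.

```python
import re
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def weak_labels(query: str, docs: Dict[str, str], min_overlap: float = 0.5) -> Set[str]:
    """Mark a doc as provisionally relevant if it covers enough query terms."""
    query_terms = tokenize(query)
    if not query_terms:
        return set()
    labeled = set()
    for doc_id, text in docs.items():
        # Fraction of query terms that also appear in the document
        coverage = len(query_terms & tokenize(text)) / len(query_terms)
        if coverage >= min_overlap:  # Hypothetical threshold; tune on a verified sample
            labeled.add(doc_id)
    return labeled


docs = {
    'doc1': 'Statute of limitations for contract claims',
    'doc2': 'Office parking policy',
}
provisional = weak_labels('contract claims limitations', docs)
print(provisional)  # {'doc1'}
```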

Should I use precision@5 or precision@10?

The right k depends on your context window and user expectations. If your RAG context can hold 5 documents, use P@5. If you can fit 10, use P@10. For nDCG, a common convention is nDCG@10 (matching search engine conventions), but report multiple cutoffs for transparency.

Can I use mean average precision (MAP) instead of precision and recall separately?

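Average precision (AP) for a single query averages the precision values at each rank where a relevant document appears; Mean Average Precision (MAP) is the mean of AP across queries. It folds precision, recall, and ranking into one number, which is useful for summarizing overall retrieval quality but hides precision-recall trade-offs. Best practice: report both individual metrics and MAP.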
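For reference, here is a minimal sketch of MAP with binary relevance. It assumes the common definition in which a query's average precision is the sum of precision values at each rank holding a relevant document, divided by the total number of relevant documents; the function names and toy data are illustrative.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Precision at each relevant rank, summed and divided by the relevant count."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # Precision at this rank
    # Relevant docs that were never retrieved contribute 0 to the sum.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: Dict[str, List[str]],
                           golden: Dict[str, Set[str]]) -> float:
    """Mean of per-query average precision over all annotated queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in golden.items()]
    return sum(scores) / len(scores) if scores else 0.0


golden = {'q1': {'doc1', 'doc3', 'doc5'}, 'q2': {'b'}}
runs = {'q1': ['doc1', 'doc2', 'doc3', 'doc4', 'doc5'], 'q2': ['a', 'b']}

ap_q1 = average_precision(runs['q1'], golden['q1'])
map_score = mean_average_precision(runs, golden)
print(f"AP(q1) = {ap_q1:.3f}")   # (1/1 + 2/3 + 3/5) / 3 = 0.756
print(f"MAP = {map_score:.3f}")  # (0.756 + 0.5) / 2 = 0.628
```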

Further Reading