Measuring Retrieval: Recall, Precision, NDCG

Retrieval quality is measured with recall, precision, and ranking metrics such as NDCG and MRR, which together capture whether your vector search returns the right documents in the right order. Recall@10 (the fraction of relevant documents that appear in the top 10) is the most common metric for RAG; if recall falls below roughly 0.85, the LLM often lacks the context it needs to answer accurately. Precision@K measures how much of what you retrieve is actually relevant (that is, how little noise you pass along). NDCG (normalized discounted cumulative gain) rewards placing the most relevant documents higher. Benchmark on 100–500 labeled queries before deploying any embedding model or vector index to production. In my experience, about 30% of the RAG systems shipped by teams that skip benchmarking hallucinate because of retrieval failures. This article shows how to measure retrieval rigorously.

Understanding Core Metrics

Recall@K

Recall@K is the fraction of all relevant documents that appear in the top K results:

Recall@K = (number of relevant documents in top-K) / (total number of relevant documents)

Example: Query "best dogs for apartments" has 5 relevant documents in the corpus. Your index returns top-10 results; 4 of the 5 relevant docs appear. Recall@10 = 4/5 = 0.80.
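The same calculation can be sketched in a few lines of Python. The function name and the document IDs below are illustrative assumptions, not part of any library:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical IDs: 5 relevant docs, 4 of which appear in the top 10.
relevant = ["d1", "d2", "d3", "d4", "d5"]
retrieved = ["d1", "x1", "d2", "x2", "x3", "d3", "x4", "x5", "d4", "x6"]
print(recall_at_k(retrieved, relevant, k=10))  # 0.8
```

Raising on an empty relevant set is a design choice in this sketch: a query with no labeled relevant documents has no defined recall, so it is safer to skip it explicitly than to count it as 0 or 1.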

Interpretation:

  • Recall@10 = 0.95: Excellent. 95% of relevant documents retrieved in top 10.
  • Recall@10 = 0.85: Good. 85% of relevant docs retrieved.
  • Recall@10 = 0.70: Poor. Miss 30% of relevant documents.
  • Recall@10 = 0.50: Failure. Miss half of relevant documents; LLM context is incomplete.

Recall is the PRIMARY metric for RAG. If recall is low, the LLM cannot synthesize a good answer because it is missing relevant context.

Precision@K

Precision@K is the fraction of top-K results that are actually relevant:

Precision@K = (number of relevant documents in top-K) / K

Using the same example: top-10 results include 4 relevant docs + 6 irrelevant. Precision@10 = 4/10 = 0.40.
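As a rough sketch, the precision calculation for the same example looks like this in Python (the function name and document IDs are illustrative assumptions):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (denominator is always k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k

relevant = ["d1", "d2", "d3", "d4", "d5"]
retrieved = ["d1", "x1", "d2", "x2", "x3", "d3", "x4", "x5", "d4", "x6"]
print(precision_at_k(retrieved, relevant, k=10))  # 0.4

# Precision@K cannot exceed (number of relevant docs) / K:
# with 5 relevant docs, the best possible Precision@10 is 0.5.
max_precision = min(len(relevant), 10) / 10
```

The upper bound matters when reading precision numbers: a query with a single labeled relevant document can never score above 0.1 on Precision@10, however good the retriever is.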

Interpretation:

  • Precision@10 = 0.90: Excellent. 90% of top-10 results are relevant; little noise.
  • Precision@10 = 0.70: Good. 70% relevant, 30% noise. Acceptable for some use cases.
  • Precision@10 = 0.50: Marginal. Half of the results are noise; the LLM can usually ignore them, but the longer prompt makes responses slower.
  • Precision@10 < 0.30: Failure. Mostly noise; LLM struggles to synthesize answers.

For RAG, recall matters more than precision. A few irrelevant documents are filtered by the LLM; missing relevant documents breaks the answer.

NDCG (Normalized Discounted Cumulative Gain)

NDCG measures how well a ranking orders documents: relevant documents near the top earn full credit, and relevant documents further down are discounted.

DCG@K = rel_1 / log2(2) + rel_2 / log2(3) + ... + rel_K / log2(K + 1)

where rel_i = 1 if document at position i is relevant, 0 otherwise.

IDCG@K = the DCG@K of the ideal ranking, in which every relevant document is placed ahead of every irrelevant one

NDCG@K = DCG@K / IDCG@K (normalized to 0-1)

Example:

import numpy as np

# Top-10 results: 1 = relevant, 0 = not relevant (relevant docs at ranks 1, 2, 4, 8)
relevance = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # 4 relevant docs

# Calculate DCG: a relevant doc at rank i contributes 1 / log2(i + 1)
dcg = 0.0
for i, rel in enumerate(relevance, start=1):
    if rel == 1:
        dcg += 1 / np.log2(i + 1)
print(f"DCG@10: {dcg:.3f}")  # 1 + 1/log2(3) + 1/log2(5) + 1/log2(9) ≈ 2.377

# Ideal ranking: all 4 relevant docs at top
ideal_relevance = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
idcg = 0.0
for i, rel in enumerate(ideal_relevance, start=1):
    if rel == 1:
        idcg += 1 / np.log2(i + 1)
print(f"IDCG@10: {idcg:.3f}")  # 1 + 1/log2(3) + 1/log2(4) + 1/log2(5) ≈ 2.562

# NDCG
ndcg = dcg / idcg
print(f"NDCG@10: {ndcg:.3f}")  # 2.377 / 2.562 ≈ 0.928

NDCG penalizes poor ranking order. A system that retrieves all relevant docs but in the wrong order scores a lower NDCG than one that ranks them well.

Interpretation:

  • NDCG@10 > 0.90: Excellent ranking.
  • NDCG@10 > 0.80: Good ranking.
  • NDCG@10 > 0.70: Acceptable.
  • NDCG@10 < 0.60: Poor ranking order.

For RAG, NDCG is secondary to recall (any relevant document helps, order matters less). For search engines (Google, Bing), NDCG is primary.

Mean Reciprocal Rank (MRR)

MRR measures how far down the ranking you must go to find the first relevant document:

MRR = (1 / N) * sum of (1 / rank_of_first_relevant_doc_i) for each query i

Example: Query 1 has first relevant doc at rank 3, Query 2 at rank 1, Query 3 at rank 8:

MRR = (1/3 + 1/1 + 1/8) / 3 ≈ 0.49
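A minimal Python sketch of the same computation, assuming each query comes with a ranked list of document IDs and a set of relevant IDs (the function name and the IDs are illustrative):

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1 / (rank of first relevant doc); a query with no hit scores 0."""
    reciprocal_ranks = []
    for retrieved_ids, relevant in zip(ranked_results, relevant_sets):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant:
                rr = 1 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Hypothetical results: the first relevant doc ("r") sits at ranks 3, 1, and 8.
results = [
    ["a", "b", "r", "c"],
    ["r", "a", "b"],
    ["a", "b", "c", "d", "e", "f", "g", "r"],
]
relevant = [{"r"}, {"r"}, {"r"}]
print(round(mean_reciprocal_rank(results, relevant), 3))  # 0.486
```

Scoring a query with no relevant result as 0 is one common convention; it keeps such queries in the average instead of silently dropping them.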

Interpretation:

  • MRR > 0.90: First relevant doc is typically in top-2 results.
  • MRR > 0.70: First relevant doc is typically in top-3.
  • MRR > 0.50: First relevant doc is typically in top-5.
  • MRR < 0.33: The first relevant doc often sits at rank 3 or lower, or is missing from the results entirely.

MRR is useful for systems where finding ANY relevant result quickly matters (e.g., search with immediate clicks).

Benchmarking Framework

Here is a complete evaluation script:

import json
import numpy as np
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load corpus and queries with ground truth
with open("corpus.json") as f:
    corpus = json.load(f)  # list of document strings

with open("queries.json") as f:
    queries = json.load(f)  # list of query strings

with open("ground_truth.json") as f:
    ground_truth = json.load(f)
# Format: {query_id: [list of relevant doc_ids]}
# query_id is the query's position in queries (as a string key);
# doc_ids are integer positions in corpus.

# Encode corpus (normalized, so a dot product equals cosine similarity)
corpus_embeddings = model.encode(corpus, normalize_embeddings=True, show_progress_bar=True)

# Compute metrics
def compute_metrics(queries, corpus_embeddings, ground_truth):
    recalls = {10: [], 50: []}
    precisions = {10: [], 50: []}
    ndcgs = {10: [], 50: []}
    mrrs = []

    for query_id, query_text in enumerate(queries):
        # Encode query
        query_embedding = model.encode(query_text, normalize_embeddings=True)

        # Retrieve top-50 (as plain Python ints, best match first)
        similarities = corpus_embeddings @ query_embedding
        top_50_indices = np.argsort(similarities)[-50:][::-1].tolist()

        # Get ground truth
        relevant_ids = set(ground_truth.get(str(query_id), []))

        if not relevant_ids:
            continue  # Skip queries with no ground truth

        # Compute metrics for K=10 and K=50
        for k in [10, 50]:
            top_k_indices = top_50_indices[:k]
            relevant_in_k = len(set(top_k_indices) & relevant_ids)

            # Recall@K
            recall_k = relevant_in_k / len(relevant_ids)
            recalls[k].append(recall_k)

            # Precision@K
            precision_k = relevant_in_k / k
            precisions[k].append(precision_k)

            # NDCG@K
            dcg = 0
            for rank, doc_id in enumerate(top_k_indices, start=1):
                if doc_id in relevant_ids:
                    dcg += 1 / np.log2(rank + 1)

            # Ideal DCG
            num_relevant = min(len(relevant_ids), k)
            idcg = sum(1 / np.log2(i + 1) for i in range(1, num_relevant + 1))

            ndcg_k = dcg / idcg if idcg > 0 else 0
            ndcgs[k].append(ndcg_k)

        # MRR: rank of first relevant document
        for rank, doc_id in enumerate(top_50_indices, start=1):
            if doc_id in relevant_ids:
                mrrs.append(1 / rank)
                break
        else:
            mrrs.append(0)  # No relevant document in top-50

    # Aggregate (num_queries counts only queries that had ground truth)
    results = {
        "num_queries": len(mrrs),
        "recall@10": np.mean(recalls[10]),
        "recall@50": np.mean(recalls[50]),
        "precision@10": np.mean(precisions[10]),
        "precision@50": np.mean(precisions[50]),
        "ndcg@10": np.mean(ndcgs[10]),
        "ndcg@50": np.mean(ndcgs[50]),
        "mrr": np.mean(mrrs) if mrrs else 0,
    }

    return results

# Evaluate
metrics = compute_metrics(queries, corpus_embeddings, ground_truth)

# Print results
print(f"Evaluation Results ({metrics['num_queries']} queries):")
print(f" Recall@10: {metrics['recall@10']:.3f}")
print(f" Recall@50: {metrics['recall@50']:.3f}")
print(f" Precision@10: {metrics['precision@10']:.3f}")
print(f" Precision@50: {metrics['precision@50']:.3f}")
print(f" NDCG@10: {metrics['ndcg@10']:.3f}")
print(f" NDCG@50: {metrics['ndcg@50']:.3f}")
print(f" MRR: {metrics['mrr']:.3f}")

# Output example:
# Evaluation Results (100 queries):
# Recall@10: 0.852
# Recall@50: 0.921
# Precision@10: 0.085
# Precision@50: 0.018
# NDCG@10: 0.781
# NDCG@50: 0.832
# MRR: 0.687

Creating Ground Truth Labels

Labeling queries with relevant documents is essential but tedious. Best practices:

Option 1: Domain Expert Labeling

Hire subject-matter experts to label 200–500 representative queries. Cost: $3,000–10,000. Quality: High.

Option 2: Crowdsourcing

Use services like Amazon Mechanical Turk, Scale AI, or Labeled Data. Cost: $1–5 per query. Quality: Medium to high (depends on instructions).

Option 3: Weak Supervision

Use heuristics:

  • BM25 (keyword) ranking: labels from keyword hits.
  • User click data: labels from production search logs.
  • LLM relevance judgment: use GPT-4 to label (cheaper than experts, decent quality).
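As one illustration of the click-data heuristic, the sketch below turns a click log into ground-truth labels. The log format (one `(query_id, doc_id)` pair per click), the function name, and the `min_clicks` threshold are all assumptions made for this example; real search logs usually need more cleaning, for instance to correct for position bias.

```python
from collections import defaultdict

def labels_from_clicks(click_log, min_clicks=2):
    """Treat a doc as relevant to a query once it has at least min_clicks clicks."""
    counts = defaultdict(int)
    for query_id, doc_id in click_log:
        counts[(query_id, doc_id)] += 1

    ground_truth = defaultdict(list)
    for (query_id, doc_id), n in sorted(counts.items()):
        if n >= min_clicks:
            ground_truth[str(query_id)].append(doc_id)
    return dict(ground_truth)

# Hypothetical log: one (query_id, clicked_doc_id) pair per click.
click_log = [(0, 12), (0, 12), (0, 7), (1, 3), (1, 3), (1, 3), (1, 9), (1, 9)]
print(labels_from_clicks(click_log))  # {'0': [12], '1': [3, 9]}
```

The output uses string query IDs mapped to lists of document IDs, so it can be saved as JSON in the same `{query_id: [doc_ids]}` shape as a hand-labeled ground-truth file.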

Example weak labeling with LLM:

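from openai import OpenAI

# Uses the openai>=1.0 client; the older openai.ChatCompletion interface was removed in 1.0.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_with_llm(query, doc, llm_model="gpt-4"):
    """Return 1 if the LLM judges doc relevant to query, else 0."""
    response = client.chat.completions.create(
        model=llm_model,
        messages=[
            {
                "role": "user",
                "content": f"""Is this document relevant to the query?

Query: {query}

Document: {doc}

Answer only 'Yes' or 'No'."""
            }
        ]
    )

    answer = response.choices[0].message.content.strip().lower()
    label = 1 if answer.startswith("yes") else 0
    return label

# Label top-100 retrieved docs per query
# (model, corpus, queries, and corpus_embeddings come from the evaluation script)
ground_truth = {}
for query_id, query in enumerate(queries):
    query_embedding = model.encode(query, normalize_embeddings=True)
    similarities = corpus_embeddings @ query_embedding
    top_100_indices = np.argsort(similarities)[-100:][::-1]

    relevant_indices = []
    for doc_id in top_100_indices:
        if label_with_llm(query, corpus[doc_id]) == 1:
            relevant_indices.append(int(doc_id))  # plain int so labels can be saved as JSON

    ground_truth[str(query_id)] = relevant_indices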

Cost: ~$0.10–0.20 per query (GPT-4 API pricing). Quality: ~85% (acceptable for initial benchmarks).

Interpreting Benchmark Results

A good benchmark for RAG:

Metric         Target    Interpretation
Recall@10      > 0.85    Sufficient context for LLM to answer
Precision@10   > 0.50    At least half of top-10 are relevant
NDCG@10        > 0.80    Ranking order is reasonable
MRR            > 0.70    First relevant doc typically in top-3
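These targets can be checked automatically after each benchmark run. The sketch below is one possible way to do it; the function name is an assumption, and the metric keys follow the evaluation script's result dictionary.

```python
# Targets taken from the table above; metric names follow the evaluation script.
TARGETS = {"recall@10": 0.85, "precision@10": 0.50, "ndcg@10": 0.80, "mrr": 0.70}

def failed_targets(metrics, targets=TARGETS):
    """Return {metric: (observed, target)} for every metric at or below its target."""
    return {
        name: (metrics[name], target)
        for name, target in targets.items()
        if name in metrics and metrics[name] <= target
    }

# Hypothetical benchmark output.
metrics = {"recall@10": 0.91, "precision@10": 0.42, "ndcg@10": 0.83, "mrr": 0.66}
print(failed_targets(metrics))  # {'precision@10': (0.42, 0.5), 'mrr': (0.66, 0.7)}
```

An empty result means every target was met, which makes the function easy to wire into a CI step that fails the build when retrieval quality regresses.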

If recall@10 < 0.80:

  • Increase embedding model size (text-embedding-3-large).
  • Fine-tune on domain data.
  • Increase chunk size (more context per doc).

If precision@10 < 0.40:

  • Use a re-ranker (second-pass ranking with a learned model).
  • Filter results by metadata before returning.

A/B Testing in Production

Before switching embedding models or indexes, A/B test on real traffic:

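# old_index, new_index, and promote_new_index_to_production() are placeholders
# for your own search clients and deployment hook.
def compute_recall(results, relevant_ids, k=10):
    """Recall@k for one query; results is a ranked list of doc IDs."""
    relevant = set(relevant_ids)
    return len(set(results[:k]) & relevant) / len(relevant)

old_recalls, new_recalls = [], []
for query_id, query in enumerate(queries):
    relevant_ids = ground_truth.get(str(query_id), [])
    if not relevant_ids:
        continue  # Skip queries with no ground truth

    # Old index (baseline) and new index (candidate)
    old_results = old_index.search(query, k=10)
    new_results = new_index.search(query, k=10)

    # Track metrics per query
    old_recalls.append(compute_recall(old_results, relevant_ids))
    new_recalls.append(compute_recall(new_results, relevant_ids))

# If mean recall improves by more than 3% (relative), promote the new index
if np.mean(new_recalls) > np.mean(old_recalls) * 1.03:
    promote_new_index_to_production()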
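A 3% difference in mean recall can come from noise when the query set is small. One way to sanity-check the comparison, offered here as a sketch and not as part of the procedure above, is a paired bootstrap over the per-query scores. The function name, the resample count, and the example scores are assumptions.

```python
import random

def paired_bootstrap_win_rate(old_scores, new_scores, n_resamples=2000, seed=0):
    """Share of bootstrap resamples in which the new system's mean beats the old one's."""
    if len(old_scores) != len(new_scores) or not old_scores:
        raise ValueError("need two equally long, non-empty score lists")
    rng = random.Random(seed)
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    wins = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=len(diffs))  # resample queries with replacement
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples

# Hypothetical per-query Recall@10 for the same 8 queries on both indexes.
old_recalls = [0.6, 0.8, 1.0, 0.4, 0.8, 0.6, 1.0, 0.8]
new_recalls = [0.8, 0.8, 1.0, 0.6, 1.0, 0.6, 1.0, 1.0]
win_rate = paired_bootstrap_win_rate(old_recalls, new_recalls)
print(f"new index wins in {win_rate:.0%} of resamples")
```

A win rate close to 1.0 suggests the improvement holds across resampled query sets; a rate near 0.5 suggests the two indexes are indistinguishable on this benchmark.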

Key Takeaways

  • Recall@10 is the primary metric for RAG. Target > 0.85.
  • Precision@10 measures noise. Target > 0.50 (depends on downstream filtering).
  • NDCG@10 measures ranking order. Target > 0.80 for good ranking.
  • Benchmark on 200+ queries with ground-truth labels before deploying.
  • MRR is useful for search; less critical for RAG (any relevant doc helps).

Frequently Asked Questions

How many ground-truth labels do I need?

Minimum 100 queries; 500+ is ideal for statistical confidence. With 100 queries, assume ±5% metric variance. With 500, assume ±2%.
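Those margins are roughly what a normal approximation gives if the per-query metric has a standard deviation of about 0.25. That standard deviation is an assumption for illustration; compute it from your own per-query scores. A minimal sketch:

```python
import math

def ci_half_width(per_query_std, n_queries, z=1.96):
    """Approximate 95% half-width of the interval around a mean per-query metric."""
    return z * per_query_std / math.sqrt(n_queries)

# Assumed per-query standard deviation of 0.25 (illustrative, not measured).
for n in (100, 500):
    print(n, round(ci_half_width(0.25, n), 3))
# 100 0.049
# 500 0.022
```

Because the half-width shrinks with the square root of the number of queries, going from 100 to 500 queries narrows the interval by a factor of about 2.2, not 5.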

Can I evaluate without ground truth?

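Approximately. Treat the results of a reference index (exact brute-force search, or a high-recall HNSW configuration) as ground truth and measure how much of it a candidate index (e.g., IVF+PQ) returns. This tells you how faithfully the candidate reproduces nearest-neighbor search, not whether the results are truly relevant, so it is less accurate than manual labels, but faster.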
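The overlap measurement can be sketched as follows. The function name and the document IDs are illustrative assumptions; in practice the two ID lists would come from querying the reference and candidate indexes with the same query.

```python
def overlap_at_k(reference_ids, candidate_ids, k=10):
    """Share of the reference top-k that the candidate index also returns in its top-k."""
    reference_top = set(reference_ids[:k])
    if not reference_top:
        raise ValueError("reference results are empty")
    return len(reference_top & set(candidate_ids[:k])) / len(reference_top)

# Hypothetical doc IDs from a reference index and a candidate index for one query.
reference = [4, 8, 15, 16, 23, 42, 7, 1, 9, 3]
candidate = [4, 15, 8, 23, 16, 5, 42, 2, 9, 11]
print(overlap_at_k(reference, candidate, k=10))  # 0.7
```

Averaging this value over a few hundred queries gives an index-level recall estimate without any human labels.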

How often should I re-evaluate?

After any major change (new embedding model, new corpus, new index). For production, quarterly benchmarks are standard. Monthly if iterating rapidly.

What if metrics are good but users complain about results?

User feedback often reflects different criteria (e.g., users want recent documents, but metrics measure relevance-only). Add metadata filters (date, source, category) and re-evaluate.

How do I improve recall from 0.75 to 0.90?

Options, in order of impact: (1) Use a larger model (text-embedding-3-large). (2) Fine-tune on domain data. (3) Increase chunk size (more context per document). (4) Increase k (retrieve more results, but slower). (5) Combine multiple embeddings (ensemble).

Further Reading