Context Relevance Metrics: Measure Retrieval Quality

Context relevance is a generation-aware retrieval metric that measures whether retrieved passages are helpful for answering the query. Unlike precision (which compares retrieved documents to a fixed ground-truth set), context relevance directly evaluates: given this query and these passages, how much do the passages contribute to a good answer?

Context relevance is critical because not all retrieved documents, even if relevant by traditional metrics, are equally helpful for generation. A document might discuss the topic but lack the specific information needed to answer the question. For example, a document about "Rust programming language history" is relevant to the query "How does Rust prevent data races?" but lacks the specific technical details needed for a good answer.

Why Context Relevance Differs from Retrieval Precision

Precision measures documents against a curated ground-truth list, which is expensive to create and may be incomplete. Context relevance skips the ground-truth comparison and directly evaluates utility: does this passage help answer this question? This is more flexible and reflects real-world deployment where ground truth is not available.

Consider a query: "What are the side effects of metformin?" If the ground-truth labels are topical, precision counts any document about metformin as relevant. Context relevance instead gives higher scores to documents that explicitly list side effects than to documents that only mention metformin in passing. Context relevance is thus more nuanced and generation-aware.
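The contrast can be sketched with toy numbers. This is a minimal illustration, not a reference implementation: the document IDs, the topic-level ground-truth set, and the graded utility scores are all hypothetical, and `precision_at_k` and `mean_utility` are names invented for this example.

```python
from typing import Dict, List, Set

def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str]) -> float:
    """Fraction of retrieved documents that appear in the ground-truth set."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def mean_utility(retrieved_ids: List[str], utility: Dict[str, float]) -> float:
    """Average graded usefulness (0.0-1.0) of the retrieved documents."""
    if not retrieved_ids:
        return 0.0
    total = sum(utility.get(doc_id, 0.0) for doc_id in retrieved_ids)
    return total / len(retrieved_ids)

# Hypothetical retrieval result for "What are the side effects of metformin?"
retrieved = ["side_effects_list", "drug_overview", "passing_mention"]

# Topic-level labels: every document about metformin counts as relevant.
ground_truth = {"side_effects_list", "drug_overview", "passing_mention"}

# Hypothetical per-passage context relevance scores.
utility = {"side_effects_list": 1.0, "drug_overview": 0.6, "passing_mention": 0.2}

precision = precision_at_k(retrieved, ground_truth)
relevance = mean_utility(retrieved, utility)
print(f"Precision: {precision:.2f}, context relevance: {relevance:.2f}")
```

With these made-up values precision is perfect even though only one of the three passages actually lists side effects, which is the gap context relevance is meant to expose.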

LLM-Based Context Relevance Scoring

The most direct approach is to ask a language model to score each passage's relevance to the query. Provide the query and passage, then ask the model to rate relevance on a fixed scale (e.g., 0–5) against a clear rubric.

from typing import Callable, Dict, List

# Prompt template; {query} and {passage} are filled in for each passage.
RELEVANCE_PROMPT = """Rate the relevance of the following passage to the query on a scale of 0–5:

Query: {query}

Passage: {passage}

Rubric:
5 = Directly answers the query with specific, relevant details
4 = Provides relevant information closely related to the query
3 = Tangentially relevant; provides context but lacks specifics
2 = Mentions the topic but provides limited useful information
1 = Barely related; mentions a keyword but misses the point
0 = Completely irrelevant

Respond with just a single integer (0–5).
"""

def llm_context_relevance(query: str,
                          passages: List[str],
                          model_api_call: Callable[[str], str]) -> Dict[int, float]:
    """
    Score each passage's relevance to the query using an LLM.

    Args:
        query: User query.
        passages: Retrieved passages to evaluate.
        model_api_call: Function that sends a prompt to your LLM and returns its text reply.

    Returns:
        Dict mapping passage index to relevance score (0.0–1.0).
    """
    scores = {}

    for idx, passage in enumerate(passages):
        prompt = RELEVANCE_PROMPT.format(query=query, passage=passage)
        response = model_api_call(prompt).strip()
        try:
            rating = int(response)
        except ValueError as exc:
            raise ValueError(f"Unparseable rating for passage {idx}: {response!r}") from exc
        # Clamp to the rubric range, then normalize to 0–1
        scores[idx] = min(max(rating, 0), 5) / 5.0

    return scores

# Pseudo-code example
# query = "How do you handle errors in async Rust?"
# passages = [retrieved_passage_1, retrieved_passage_2, ...]
# scores = llm_context_relevance(query, passages, my_llm)
# print(f"Passage 0 relevance: {scores[0]:.2f}")
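Models do not always reply with a bare integer, so a more forgiving parser can help. The following is a minimal sketch under stated assumptions: `parse_rubric_score` is a hypothetical helper (not part of the code above), it takes the first integer it finds in the reply, and it returns `None` rather than guessing when no usable rating is present.

```python
import re
from typing import Optional

def parse_rubric_score(response: str, max_score: int = 5) -> Optional[float]:
    """
    Extract the first integer found in a model reply and normalize it to 0.0-1.0.

    Returns None when the reply contains no digits or the integer exceeds max_score.
    """
    match = re.search(r"\d+", response)
    if match is None:
        return None
    value = int(match.group())
    if value > max_score:
        return None
    return value / max_score

print(parse_rubric_score("4"))                  # bare integer
print(parse_rubric_score("I'd rate this 5."))   # integer embedded in a sentence
print(parse_rubric_score("no idea"))            # no rating present
```

Taking the first integer is a simplification: a reply such as "On the 0–5 scale, I choose 4" would be read as 0, so keeping the prompt's "single integer" instruction still matters.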

Semantic-Based Context Relevance

For speed and cost efficiency, compute context relevance using semantic similarity. Generate embeddings for the query and each passage, then measure cosine similarity. Higher similarity indicates higher relevance.

from typing import Dict, List

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_context_relevance(query: str,
                               passages: List[str],
                               model_name: str = "all-MiniLM-L6-v2") -> Dict[int, float]:
    """
    Score passages using semantic similarity to the query.

    Args:
        query: User query.
        passages: Retrieved passages.
        model_name: SentenceTransformer model to use.

    Returns:
        Dict mapping passage index to relevance score (0.0–1.0).
    """
    model = SentenceTransformer(model_name)

    # Generate embeddings as NumPy arrays
    query_embedding = model.encode(query, convert_to_tensor=False)
    passage_embeddings = model.encode(passages, convert_to_tensor=False)

    # Compute cosine similarities between the query and every passage
    similarities = cosine_similarity([query_embedding], passage_embeddings)[0]

    # Cosine similarity ranges from -1 to 1; clip negatives to 0 so scores fall in 0–1
    scores = {i: max(0.0, float(sim)) for i, sim in enumerate(similarities)}

    return scores

# Example
query = "How do you handle errors in async Rust?"
passages = [
    "The Result type in Rust allows error handling. In async code, use try blocks.",
    "Rust's async/await syntax is similar to JavaScript.",
    "Tokio is an async runtime for Rust."
]

scores = semantic_context_relevance(query, passages)
for i, score in scores.items():
    print(f"Passage {i}: {score:.3f}")

Aggregate Context Relevance Scoring

To produce a single context relevance score for a query-passage set, aggregate individual passage scores. Use mean relevance, but penalize low-relevance passages that might confuse the generator.

def aggregate_context_relevance(passage_scores: Dict[int, float],
                                aggregation: str = "mean") -> float:
    """
    Aggregate individual passage scores into a single score.

    Args:
        passage_scores: Dict of passage_index -> relevance_score.
        aggregation: "mean" (average), "min" (worst passage),
            or "weighted" (penalize low scores).

    Returns:
        Single aggregated score (0.0–1.0).
    """
    if not passage_scores:
        return 0.0

    scores = list(passage_scores.values())

    if aggregation == "mean":
        return sum(scores) / len(scores)

    elif aggregation == "min":
        # Worst passage controls overall quality
        return min(scores)

    elif aggregation == "weighted":
        # Penalize having any irrelevant passages
        # Formula: mean * (1 - penalty_for_low_scores)
        mean_score = sum(scores) / len(scores)
        num_low = sum(1 for s in scores if s < 0.3)
        penalty = (num_low / len(scores)) * 0.5  # Up to 50% penalty
        return max(0.0, mean_score * (1 - penalty))

    # Unrecognized aggregation names fall back to the mean
    return sum(scores) / len(scores)

# Example
passage_scores = {0: 0.9, 1: 0.6, 2: 0.2}
mean = aggregate_context_relevance(passage_scores, "mean")
worst = aggregate_context_relevance(passage_scores, "min")
weighted = aggregate_context_relevance(passage_scores, "weighted")

print(f"Mean: {mean:.3f}, Min: {worst:.3f}, Weighted: {weighted:.3f}")
# Mean: 0.567, Min: 0.200, Weighted: 0.472

Interpreting Context Relevance Scores

A high context relevance score (0.8 or above) indicates the retrieved passages directly answer the query. A moderate score (roughly 0.5–0.8) indicates partial relevance: passages provide context but may lack specifics. A low score (below 0.5) indicates the retriever returned mostly tangential documents, which raises the risk of hallucinated or off-topic answers.

Use context relevance to debug retrieval failures. If a query returns low context relevance, the issue is retriever quality, not generation. If context relevance is high but faithfulness is low, the issue is generation quality.
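This triage rule can be written as a small function. The sketch below is illustrative only: `diagnose_rag_failure` is a hypothetical name, the single shared threshold of 0.7 is an assumption, and the faithfulness score is assumed to come from a separate evaluator on the same 0–1 scale.

```python
def diagnose_rag_failure(context_relevance: float,
                         faithfulness: float,
                         threshold: float = 0.7) -> str:
    """
    Classify where a RAG response most likely went wrong.

    Returns "retrieval" when context relevance is below the threshold,
    "generation" when context is adequate but faithfulness is low,
    and "ok" when both scores meet the threshold.
    """
    # Check retrieval first: poor context makes the faithfulness score hard to interpret.
    if context_relevance < threshold:
        return "retrieval"
    if faithfulness < threshold:
        return "generation"
    return "ok"

print(diagnose_rag_failure(context_relevance=0.3, faithfulness=0.9))  # retrieval
print(diagnose_rag_failure(context_relevance=0.9, faithfulness=0.4))  # generation
print(diagnose_rag_failure(context_relevance=0.9, faithfulness=0.9))  # ok
```

Checking retrieval before generation is a design choice: when the context is poor, a low faithfulness score may simply reflect that the generator had nothing useful to ground its answer in.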

Key Takeaways

  • Context relevance measures whether retrieved passages help answer the query, independent of ground truth.
  • LLM-based scoring is accurate but slow; semantic similarity is fast but less nuanced.
  • Aggregate passage scores using mean, min (worst passage), or weighted (penalizing irrelevance).
  • Context relevance combined with faithfulness isolates failures: low context relevance = retrieval issue; high context relevance with low faithfulness = generation issue.
  • Use context relevance for continuous monitoring; alert when it drops below a threshold.

Frequently Asked Questions

Should I use LLM-based or semantic-based context relevance scoring?

LLM-based scoring is more accurate but slower and more expensive: it takes one API call per passage, so a query that retrieves 10–20 passages costs 10–20 calls. Semantic similarity is fast (typically milliseconds once the embedding model is loaded) and cheap. For production systems, use semantic scoring for continuous monitoring and LLM scoring for spot-checks on low-relevance queries.
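One way to combine the two is to escalate to the LLM only when the cheap score looks poor. The sketch below assumes both scorers are callables taking `(query, passages)` and returning a dict of passage index to score; `score_with_spot_check`, the stub scorers, and the 0.5 cutoff are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, List[str]], Dict[int, float]]

def score_with_spot_check(query: str,
                          passages: List[str],
                          cheap_scorer: Scorer,
                          llm_scorer: Scorer,
                          spot_check_below: float = 0.5) -> Tuple[Dict[int, float], str]:
    """
    Score with the cheap scorer; re-score with the LLM scorer only when the
    cheap mean falls below spot_check_below.

    Returns the scores and the name of the scorer that produced them.
    """
    if not passages:
        return {}, "semantic"
    cheap = cheap_scorer(query, passages)
    cheap_mean = sum(cheap.values()) / len(cheap)
    if cheap_mean < spot_check_below:
        return llm_scorer(query, passages), "llm"
    return cheap, "semantic"

# Stub scorers standing in for real semantic and LLM scoring.
def stub_low(query: str, passages: List[str]) -> Dict[int, float]:
    return {i: 0.2 for i in range(len(passages))}

def stub_high(query: str, passages: List[str]) -> Dict[int, float]:
    return {i: 0.8 for i in range(len(passages))}

docs = ["passage one", "passage two"]
escalated = score_with_spot_check("q", docs, cheap_scorer=stub_low, llm_scorer=stub_high)
kept = score_with_spot_check("q", docs, cheap_scorer=stub_high, llm_scorer=stub_low)
print(escalated[1], kept[1])
```

Returning the scorer name alongside the scores makes it easy to track how often escalation happens, which is itself a rough signal of retrieval health.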

What threshold of context relevance should I target?

Aim for 0.7+ (on a 0–1 scale) for production. Below 0.5, the retriever is likely missing key information. Context relevance naturally varies by query difficulty; easy, straightforward queries often score higher.
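A threshold like this is easiest to act on when tracked as a rolling average rather than per query, since individual queries vary in difficulty. The following is a minimal sketch: `RelevanceMonitor` is a hypothetical class, and the window size and threshold are assumptions to tune for your traffic.

```python
from collections import deque
from typing import Deque

class RelevanceMonitor:
    """Tracks a rolling mean of per-query context relevance scores."""

    def __init__(self, window: int = 100, threshold: float = 0.7) -> None:
        # deque with maxlen drops the oldest score once the window is full
        self.scores: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def rolling_mean(self) -> float:
        if not self.scores:
            return 0.0
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score; return True when the rolling mean is below the threshold."""
        self.scores.append(score)
        return self.rolling_mean() < self.threshold

monitor = RelevanceMonitor(window=3, threshold=0.7)
alerts = [monitor.record(s) for s in [0.9, 0.8, 0.5, 0.3]]
print(alerts)  # [False, False, False, True]
```

In this toy run the alert fires only on the fourth query, once the window (0.8, 0.5, 0.3) averages about 0.53.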

How does context relevance differ from faithfulness?

Context relevance measures whether passages are helpful for the query (passage utility). Faithfulness measures whether the answer is grounded in passages (answer correctness relative to sources). Both are important: high context relevance + high faithfulness = high-quality RAG.

Can context relevance scores guide reranking?

Yes. If individual passage scores vary widely (e.g., 0.9, 0.6, 0.2), rerank passages by context relevance before passing them to the generator. This improves both quality and efficiency by concentrating signal.
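A reranking step along these lines might look like the sketch below. `rerank_by_relevance` is a hypothetical helper; the optional `min_score` filter and `top_k` cut are assumptions added for illustration, not something the scoring functions require.

```python
from typing import Dict, List, Optional, Tuple

def rerank_by_relevance(passages: List[str],
                        scores: Dict[int, float],
                        min_score: float = 0.0,
                        top_k: Optional[int] = None) -> List[Tuple[str, float]]:
    """
    Sort passages by descending relevance score.

    Passages scoring below min_score are dropped; top_k, if given,
    keeps only the highest-scoring passages.
    """
    ranked = sorted(
        ((passages[i], score) for i, score in scores.items() if score >= min_score),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]

docs = ["A", "B", "C"]
doc_scores = {0: 0.6, 1: 0.2, 2: 0.9}
reranked = rerank_by_relevance(docs, doc_scores, min_score=0.3)
print(reranked)  # [('C', 0.9), ('A', 0.6)]
```

Dropping the lowest-scoring passage shortens the prompt and places the most useful passage first, which is one way to concentrate signal for the generator.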
