Skip to main content

Evaluating chunk quality: Benchmark and optimize RAG performance

Chunking is not art—it's engineering. Yet most RAG systems are deployed with chunking strategies chosen by intuition, not measurement. This article covers quantifying chunk quality, designing realistic benchmarks, and iteratively optimizing your chunking strategy using data. Key metrics: retrieval precision/recall, embedding coherence, and end-to-end RAG accuracy on held-out test sets.

Proper evaluation prevents costly mistakes. A poorly tuned chunking strategy can reduce RAG accuracy by 20–40%, and that cost compounds as your knowledge base grows. Establishing a benchmark early saves months of troubleshooting later. For 2025 production RAG systems, benchmarking chunking strategies is non-negotiable.

Retrieval-Level Metrics

Retrieval quality is measured by whether the right chunks are returned for a query.

Precision and Recall

from typing import List, Set

def retrieval_precision(retrieved_chunk_ids: List[str],
relevant_chunk_ids: Set[str]) -> float:
"""
Precision: what fraction of retrieved chunks are relevant?
TP / (TP + FP) = relevant_retrieved / total_retrieved
"""
if not retrieved_chunk_ids:
return 0.0

relevant_retrieved = sum(1 for cid in retrieved_chunk_ids if cid in relevant_chunk_ids)
return relevant_retrieved / len(retrieved_chunk_ids)

def retrieval_recall(retrieved_chunk_ids: List[str],
relevant_chunk_ids: Set[str]) -> float:
"""
Recall: what fraction of relevant chunks were retrieved?
TP / (TP + FN) = relevant_retrieved / total_relevant
"""
if not relevant_chunk_ids:
return 1.0 # No relevant chunks means perfect recall

relevant_retrieved = sum(1 for cid in retrieved_chunk_ids if cid in relevant_chunk_ids)
return relevant_retrieved / len(relevant_chunk_ids)

def f1_score(precision: float, recall: float) -> float:
"""Harmonic mean of precision and recall."""
if precision + recall == 0:
return 0.0
return 2 * (precision * recall) / (precision + recall)

# Example
retrieved = ["chunk_1", "chunk_2", "chunk_5"]
relevant = {"chunk_1", "chunk_3"}

p = retrieval_precision(retrieved, relevant) # 1/3 ≈ 0.33
r = retrieval_recall(retrieved, relevant) # 1/2 = 0.50
f1 = f1_score(p, r) # ≈ 0.40

print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1: {f1:.2f}")

Mean Reciprocal Rank (MRR)

MRR measures how early the first relevant chunk appears in results.

def mean_reciprocal_rank(ranking_results: List[dict]) -> float:
"""
MRR: average of 1 / (rank of first relevant result)
Used in information retrieval; high MRR means relevant results rank high.
"""
ranks = []

for result in ranking_results:
retrieved = result["retrieved_chunk_ids"]
relevant = set(result["relevant_chunk_ids"])

# Find rank of first relevant chunk (1-indexed)
first_relevant_rank = None
for i, chunk_id in enumerate(retrieved, start=1):
if chunk_id in relevant:
first_relevant_rank = i
break

if first_relevant_rank:
ranks.append(1.0 / first_relevant_rank)
else:
ranks.append(0.0) # No relevant chunk found

return sum(ranks) / len(ranks) if ranks else 0.0

# Example: 3 queries
results = [
{"query": "Q1", "retrieved_chunk_ids": ["a", "b", "c"], "relevant_chunk_ids": {"b"}},
{"query": "Q2", "retrieved_chunk_ids": ["x", "y"], "relevant_chunk_ids": {"x"}},
{"query": "Q3", "retrieved_chunk_ids": ["p", "q"], "relevant_chunk_ids": {"r"}}
]

mrr = mean_reciprocal_rank(results) # (1/2 + 1/1 + 0) / 3 ≈ 0.50
print(f"MRR: {mrr:.3f}")

Normalized Discounted Cumulative Gain (NDCG)

NDCG measures ranking quality when relevance is graded (not just binary relevant/irrelevant).

import math

def ndcg(retrieved_relevances: List[float], ideal_relevances: List[float], k: int = 10) -> float:
"""
NDCG@k: measure ranking quality for graded relevance (0=irrelevant, 1=relevant, 2=highly relevant).

Args:
retrieved_relevances: Relevance scores (0–2) of top-k retrieved chunks
ideal_relevances: Ideal relevance scores (sorted descending)
k: Cutoff (evaluate only top k results)

Returns:
NDCG score (0–1, higher is better)
"""

# Discount cumulative gain (DCG)
dcg = 0.0
for i, rel in enumerate(retrieved_relevances[:k]):
dcg += rel / math.log2(i + 2) # log2(rank + 1)

# Ideal DCG (best possible ranking)
ideal_dcg = 0.0
for i, rel in enumerate(ideal_relevances[:k]):
ideal_dcg += rel / math.log2(i + 2)

# Normalized
if ideal_dcg == 0:
return 0.0
return dcg / ideal_dcg

# Example: graded relevance
retrieved = [2, 1, 0, 2, 1] # Relevance scores of top-5 results
ideal = [2, 2, 1, 1, 0] # Ideal: all 2s first, then 1s

ndcg_score = ndcg(retrieved, ideal, k=5)
print(f"NDCG@5: {ndcg_score:.3f}")

Chunk-Level Metrics

Metrics for individual chunk quality.

Semantic Coherence

A good chunk should be semantically coherent (idea is unified, not scattered).

import numpy as np
from typing import Callable

def semantic_coherence(text: str, embed_fn: Callable, sentence_splitter: Callable) -> float:
"""
Semantic coherence: average similarity between consecutive sentences.
High coherence means sentences flow together; low means disparate ideas.

Args:
text: Chunk text
embed_fn: Function to embed text
sentence_splitter: Function to split text into sentences

Returns:
Coherence score (0–1)
"""
sentences = sentence_splitter(text)

if len(sentences) < 2:
return 1.0 # Single sentence is trivially coherent

# Embed sentences
embeddings = [embed_fn(s) for s in sentences]

# Compute similarities between consecutive sentences
similarities = []
for i in range(len(embeddings) - 1):
sim = np.dot(embeddings[i], embeddings[i + 1]) / \
(np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1]))
similarities.append(sim)

# Average similarity
coherence = np.mean(similarities) if similarities else 1.0

return max(0.0, min(1.0, coherence)) # Clip to [0, 1]

# Example
text = "Machine learning is powerful. It learns from data. Neural networks are inspired by brains."
coherence = semantic_coherence(text, embed_fn, sentence_splitter)
print(f"Coherence: {coherence:.3f}")

Completeness

A complete chunk should be self-contained and answerable without external context.

def completeness_heuristic(text: str, min_token_count: int = 50) -> float:
"""
Heuristic for completeness:
- Contains multiple sentences (not fragment)
- Has intro, body, or conclusion structure
- Doesn't rely heavily on pronouns without antecedents

Returns completeness score (0–1).
"""
sentences = text.split('.')
sentence_count = len([s for s in sentences if s.strip()])

# Multiple sentences is good
if sentence_count < 2:
return 0.3
elif sentence_count < 4:
return 0.6
else:
return 0.9

# Check for orphan pronouns (This, These, That, It, They without context)
import re
orphan_pronouns = len(re.findall(r'^\s*(This|These|That|It|They|Its|Their)', text, re.MULTILINE))

if orphan_pronouns > 2:
return max(0.3, 0.9 - (orphan_pronouns * 0.1))

return 0.9

# Example
chunk = "Machine learning is a subset of AI. It enables systems to learn from data without explicit programming."
completeness = completeness_heuristic(chunk)
print(f"Completeness: {completeness:.2f}")

End-to-End RAG Benchmarking

The ultimate test: does a chunking strategy improve LLM accuracy on questions?

from typing import Callable
from dataclasses import dataclass

@dataclass
class RAGBenchmarkResult:
chunking_strategy: str
precision: float
recall: float
f1: float
avg_latency_ms: float
qa_accuracy: float # Exact match on question-answer pairs
qa_f1: float # Token F1 on generated answers

def benchmark_chunking_strategy(strategy_name: str,
documents: List[str],
chunk_fn: Callable,
embed_fn: Callable,
retrieval_fn: Callable,
qa_fn: Callable, # LLM question-answering
eval_set: List[dict]) -> RAGBenchmarkResult:
"""
Complete benchmark: chunk, embed, retrieve, answer, evaluate.

Args:
strategy_name: Name of chunking strategy (e.g., "fixed-512", "semantic-0.5")
documents: List of source documents
chunk_fn: Chunking function
embed_fn: Embedding function
retrieval_fn: Retrieval function
qa_fn: QA function (takes query + context, returns answer)
eval_set: List of {query, expected_answer} dicts

Returns:
Benchmark results
"""
import time

# Step 1: Chunk documents
chunks = []
for doc in documents:
doc_chunks = chunk_fn(doc)
chunks.extend(doc_chunks)

# Step 2: Embed chunks
chunk_embeddings = {}
for chunk in chunks:
chunk_embeddings[chunk["chunk_id"]] = embed_fn(chunk["text"])

# Step 3: Evaluate on test set
retrieval_scores = {"precision": [], "recall": [], "f1": []}
qa_scores = {"exact_match": [], "f1": []}
retrieval_times = []

for eval_item in eval_set:
query = eval_item["query"]
expected_answer = eval_item["expected_answer"]
relevant_chunk_ids = set(eval_item.get("relevant_chunk_ids", []))

# Retrieve
start_time = time.time()
retrieved = retrieval_fn(query, chunk_embeddings, top_k=5)
elapsed_ms = (time.time() - start_time) * 1000
retrieval_times.append(elapsed_ms)

retrieved_ids = [chunk["chunk_id"] for chunk in retrieved]

# Compute retrieval metrics
p = retrieval_precision(retrieved_ids, relevant_chunk_ids)
r = retrieval_recall(retrieved_ids, relevant_chunk_ids)
f1 = f1_score(p, r)

retrieval_scores["precision"].append(p)
retrieval_scores["recall"].append(r)
retrieval_scores["f1"].append(f1)

# Generate answer using retrieved context
context = " ".join([chunk["text"] for chunk in retrieved])
generated_answer = qa_fn(query, context)

# Evaluate answer quality (token F1)
token_f1 = compute_token_f1(expected_answer, generated_answer)
exact_match = 1.0 if expected_answer.lower() == generated_answer.lower() else 0.0

qa_scores["exact_match"].append(exact_match)
qa_scores["f1"].append(token_f1)

# Aggregate results
return RAGBenchmarkResult(
chunking_strategy=strategy_name,
precision=np.mean(retrieval_scores["precision"]),
recall=np.mean(retrieval_scores["recall"]),
f1=np.mean(retrieval_scores["f1"]),
avg_latency_ms=np.mean(retrieval_times),
qa_accuracy=np.mean(qa_scores["exact_match"]),
qa_f1=np.mean(qa_scores["f1"])
)

def compute_token_f1(reference: str, hypothesis: str) -> float:
"""F1 score based on token overlap (simple BLEU-like metric)."""
ref_tokens = set(reference.lower().split())
hyp_tokens = set(hypothesis.lower().split())

if not ref_tokens and not hyp_tokens:
return 1.0
if not ref_tokens or not hyp_tokens:
return 0.0

overlap = len(ref_tokens & hyp_tokens)
precision = overlap / len(hyp_tokens)
recall = overlap / len(ref_tokens)

if precision + recall == 0:
return 0.0
return 2 * (precision * recall) / (precision + recall)

# Example: compare strategies
strategies = [
{"name": "fixed-512", "chunk_fn": lambda doc: split_fixed_size(doc, 512)},
{"name": "recursive", "chunk_fn": lambda doc: split_recursive(doc)},
{"name": "semantic", "chunk_fn": lambda doc: semantic_chunking(doc, embed_fn)}
]

results = []
for strategy in strategies:
result = benchmark_chunking_strategy(
strategy_name=strategy["name"],
documents=documents,
chunk_fn=strategy["chunk_fn"],
embed_fn=embed_fn,
retrieval_fn=retrieve_chunks,
qa_fn=qa_model,
eval_set=eval_set
)
results.append(result)
print(f"\n{result.chunking_strategy}:")
print(f" Retrieval F1: {result.f1:.3f}")
print(f" QA Accuracy: {result.qa_accuracy:.3f}")
print(f" Latency: {result.avg_latency_ms:.1f}ms")

Building a Benchmark Dataset

A good evaluation set is essential. Representative, diverse, with ground truth annotations.

def create_benchmark_set(documents: List[str],
annotation_fn: Callable, # Human annotator or expert
num_queries: int = 50) -> List[dict]:
"""
Create a benchmark evaluation set:
- Sample queries from documents
- Annotate with ground-truth answers and relevant chunks

Args:
documents: Source documents
annotation_fn: Function to annotate queries
num_queries: Number of queries to generate

Returns:
List of evaluation items
"""

eval_set = []

for doc_idx, doc in enumerate(documents[:len(documents) // 10]): # Sample docs
# Extract potential queries from section headers, summaries, etc.
# This is heuristic; better: hire annotators

potential_queries = extract_queries_heuristic(doc)

for query in potential_queries[:num_queries // len(documents)]:
# Annotate
annotation = annotation_fn(query, doc)

eval_set.append({
"query": query,
"expected_answer": annotation.get("answer"),
"relevant_chunk_ids": annotation.get("chunk_ids"),
"source_doc": doc_idx,
"difficulty": annotation.get("difficulty", "medium")
})

return eval_set

def extract_queries_heuristic(text: str) -> List[str]:
"""Extract potential questions from document (heuristic)."""
import re

queries = []

# Find section headers and turn into questions
headers = re.findall(r'^#{1,3}\s+(.+?)$', text, re.MULTILINE)
for header in headers:
queries.append(f"What is {header.lower()}?")
queries.append(f"How does {header.lower()} work?")

return queries[:10]

Key Takeaways

  • Measure chunking quality with retrieval metrics: precision, recall, F1, MRR, NDCG.
  • Evaluate individual chunks using semantic coherence and completeness scores.
  • Run end-to-end benchmarks on realistic QA eval sets to measure impact on RAG accuracy.
  • Different chunking strategies have different latency/quality tradeoffs; benchmark all candidates.
  • Invest in a small, high-quality labeled eval set (50–100 Q&A pairs); it pays dividends.

Frequently Asked Questions

How many eval examples do I need to benchmark reliably?

Minimum 50 queries for rough benchmarking. For production decisions, 200–500 queries. For academic papers, 1,000+. Each additional 100 queries reduces noise and increases confidence in results.

Should I benchmark on documents in my training set?

No. Eval set must be held-out (documents unseen during development). Otherwise, you optimize for memorization, not generalization. Split documents: 80% dev, 20% held-out eval.

How do I handle eval sets with multiple correct answers?

Use token F1 instead of exact match. For BLEU/ROUGE, compare against multiple reference answers. For graded relevance, have annotators assign 0–2 relevance scores per chunk, then use NDCG.

Is latency important for chunking evaluation?

Yes, but secondary. Prioritize precision/recall first, then optimize latency. A faster chunking method that reduces accuracy by 10% is a bad trade. Measure both; pareto-optimize.

Can I use synthetic eval sets (LLM-generated Q&A)?

Synthetic sets are useful for iteration but unreliable for final decisions. They typically inflate scores by 10–20%. Use synthetic for fast iteration; human annotation for final benchmarks.

Further Reading