Chunk overlap and metadata: Design retrieval-friendly document segments
A chunk is not just text; it is text plus metadata. Metadata enriches each chunk with context: the document it came from, its section, the entities it mentions, and its semantic tags. This enables post-retrieval ranking, cross-document filtering, and source attribution. Overlap (repeating the end of chunk K at the start of chunk K+1) prevents context loss at chunk boundaries, which is critical for complex reasoning tasks.
Design decisions about overlap and metadata can account for roughly 8–15% of RAG precision gains, though the exact figure depends on the corpus and the evaluation setup. This article covers designing metadata schemas, implementing overlap correctly, and practical strategies for filtering and ranking retrieved chunks to maximize relevance.
Understanding Overlap and Why It Matters
Overlap solves a fundamental problem: when documents are split into independent chunks, a concept that spans a boundary gets split, and neither chunk alone contains enough context. Overlap reintroduces boundary context.
# Example: no overlap (concept split)
chunk_0: "The customer complained about billing errors. "
chunk_1: "These errors appeared in Q3 and Q4. The support team..."
# The statement "errors appeared in Q3 and Q4" lacks context about what errors
# (billing errors). Queries about Q3 errors might retrieve chunk_1 but miss intent.
# With overlap (the end of chunk_0 is repeated at the start of chunk_1):
chunk_0: "The customer complained about billing errors. "
chunk_1: "The customer complained about billing errors. These errors appeared in Q3 and Q4..."
# Now chunk_1 retains the context "billing errors", improving retrieval.
Overlap is typically 10–50% of chunk size. A 512-token chunk with 25% overlap shares 128 tokens with each adjacent chunk.
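The arithmetic above can be sketched as a small helper. This is an illustration only: the name `overlap_budget` is hypothetical, and it assumes a sliding-window split in which each chunk starts `chunk_size - overlap_tokens` tokens after the previous one.

```python
import math


def overlap_budget(total_tokens: int,
                   chunk_size: int = 512,
                   overlap_percent: float = 0.25) -> dict:
    """Estimate overlap size, stride, and chunk count for a sliding-window split."""
    if not 0.0 <= overlap_percent < 1.0:
        raise ValueError("overlap_percent must be in [0.0, 1.0)")
    overlap_tokens = int(chunk_size * overlap_percent)
    stride = chunk_size - overlap_tokens  # New tokens contributed by each chunk
    if total_tokens <= chunk_size:
        num_chunks = 1
    else:
        num_chunks = 1 + math.ceil((total_tokens - chunk_size) / stride)
    return {
        "overlap_tokens": overlap_tokens,
        "stride": stride,
        "num_chunks": num_chunks,
    }


budget = overlap_budget(total_tokens=10_000)
no_overlap = overlap_budget(total_tokens=10_000, overlap_percent=0.0)
print(budget)
print(no_overlap)
```

Under these assumptions, a 10,000-token document yields 20 chunks without overlap and 26 chunks at 25% overlap. That difference is the indexing cost paid for the extra boundary context.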
import tiktoken


def apply_overlap(chunks: list[str],
                  chunk_size: int = 512,
                  overlap_percent: float = 0.25,
                  encoding_name: str = "cl100k_base") -> list[dict]:
    """
    Apply overlap to chunks: the end of chunk N appears at the start of chunk N+1.

    Note that each returned text is longer than its input chunk by the overlap.

    Args:
        chunks: List of text chunks (already split)
        chunk_size: Nominal chunk size in tokens (used to size the overlap)
        overlap_percent: Fraction of chunk_size to overlap (0.0–1.0)
        encoding_name: tiktoken encoding name (cl100k_base is the GPT-4 encoding)

    Returns:
        List of dicts with overlapped text and metadata
    """
    if not 0.0 <= overlap_percent <= 1.0:
        raise ValueError("overlap_percent must be between 0.0 and 1.0")

    encoding = tiktoken.get_encoding(encoding_name)
    overlap_tokens = int(chunk_size * overlap_percent)
    result = []

    for i, chunk in enumerate(chunks):
        overlap_text = ""
        overlap_count = 0
        if i > 0 and overlap_tokens > 0:
            # Take the last `overlap_tokens` tokens of the previous chunk.
            # Slicing at a token boundary can split a multi-byte character.
            prev_tokens = encoding.encode(chunks[i - 1])
            tail = prev_tokens[-overlap_tokens:]
            overlap_text = encoding.decode(tail)
            overlap_count = len(tail)

        # Every chunk gets the same keys; the first chunk has an empty prefix
        result.append({
            "text": overlap_text + chunk,
            "chunk_idx": i,
            "overlap_prefix": overlap_text,
            "overlap_token_count": overlap_count,
            "is_first": i == 0,
            "is_last": i == len(chunks) - 1,
        })

    return result
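When two adjacent overlapped chunks are both retrieved, the shared text appears twice in the prompt. A minimal sketch of stitching them back together follows. It assumes all chunks come from the same document and carry the `text`, `chunk_idx`, and `overlap_prefix` keys produced by `apply_overlap` above; `merge_adjacent_chunks` is a hypothetical helper name.

```python
def merge_adjacent_chunks(chunks: list[dict]) -> list[str]:
    """Merge runs of consecutive chunks, dropping the duplicated overlap prefix."""
    merged: list[str] = []
    prev_idx = None
    for chunk in sorted(chunks, key=lambda c: c["chunk_idx"]):
        text = chunk["text"]
        prefix = chunk.get("overlap_prefix", "")
        if prev_idx is not None and chunk["chunk_idx"] == prev_idx + 1:
            # Consecutive chunk: its prefix already ends the current passage
            if prefix and text.startswith(prefix):
                text = text[len(prefix):]
            merged[-1] += text
        else:
            # Gap in indices: start a new passage and keep the prefix
            merged.append(text)
        prev_idx = chunk["chunk_idx"]
    return merged


retrieved = [
    {"chunk_idx": 1, "overlap_prefix": "billing errors. ",
     "text": "billing errors. These errors appeared in Q3 and Q4."},
    {"chunk_idx": 0, "overlap_prefix": "",
     "text": "The customer complained about billing errors. "},
    {"chunk_idx": 5, "overlap_prefix": "was issued. ",
     "text": "was issued. The case was closed."},
]
passages = merge_adjacent_chunks(retrieved)
for passage in passages:
    print(passage)
```

Chunks 0 and 1 collapse into one passage with the shared sentence fragment kept once, while chunk 5 stays separate and keeps its prefix because its predecessor was not retrieved.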
Designing a Metadata Schema
Metadata should support: filtering, ranking, source attribution, and hierarchical navigation.
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime


@dataclass
class ChunkMetadata:
    """Rich metadata schema for RAG chunks.

    No field has a default, so every field (including Optional ones) must be
    passed explicitly when constructing an instance.
    """
    # Source information
    source_doc_id: str                      # Unique document ID
    source_filename: str                    # Original filename
    source_format: str                      # PDF, HTML, Markdown, etc.
    source_url: Optional[str]               # Original URL if applicable
    source_created: Optional[datetime]      # Document creation date

    # Hierarchical structure
    section_title: Optional[str]            # Current section (H2 or higher)
    subsection_title: Optional[str]         # Subsection (H3, H4, etc.)
    section_level: int                      # Nesting depth

    # Content attributes
    content_type: str                       # "text", "code", "table", "image", "mixed"
    is_summary: bool                        # Is this a summary/abstract chunk?
    contains_code: bool                     # Does chunk include code blocks?
    contains_math: bool                     # Does chunk include equations?

    # Semantic attributes
    entities: List[str]                     # Named entities (people, orgs, locations)
    keywords: List[str]                     # Key terms for the chunk
    domain: Optional[str]                   # Domain (finance, healthcare, tech, etc.)
    language: str                           # Language code (en, fr, zh, etc.)

    # Quality and context
    is_complete: bool                       # Is chunk semantically complete?
    chunk_coherence_score: Optional[float]  # Semantic coherence (0–1)
    word_count: int
    token_count: int

    # Retrieval hints
    summary: Optional[str]                  # One-line summary for LLM prompt
    embeddings_model: str                   # Which embedding model was used
# Example: building a chunk with rich metadata
from dataclasses import asdict


def build_chunk_with_metadata(text: str,
                              source_doc_id: str,
                              section_title: str,
                              entities: List[str],
                              embedding: List[float]) -> dict:
    """
    Create a chunk dict with full metadata for indexing.
    Hard-coded values (filename, keywords, domain, summary) are placeholders.
    """
    metadata = ChunkMetadata(
        source_doc_id=source_doc_id,
        source_filename="whitepaper_v2.pdf",
        source_format="pdf",
        source_url=None,              # Required field: pass None when unknown
        source_created=None,          # Required field: pass None when unknown
        section_title=section_title,
        subsection_title=None,
        section_level=2,
        content_type="text",
        is_summary=False,
        contains_code=False,
        contains_math=False,
        entities=entities,
        keywords=["RAG", "chunking", "embeddings"],
        domain="ai",
        language="en",
        is_complete=True,
        chunk_coherence_score=None,   # Required field: pass None when not computed
        word_count=len(text.split()),
        token_count=len(text) // 4,   # Rough estimate (about 4 characters per token)
        summary="Discusses RAG chunking strategies and their impact on retrieval quality.",
        embeddings_model="text-embedding-3-large"
    )
    return {
        "text": text,
        "metadata": asdict(metadata),
        "embedding": embedding,  # Store for vector search
        # Not unique when a section yields several chunks; append a chunk index in that case
        "chunk_id": f"{source_doc_id}#{section_title}"
    }
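Source attribution is one of the uses listed at the start of this section. A minimal sketch of turning a chunk's metadata into a citation string, assuming the dict layout returned by `build_chunk_with_metadata`; the helper name `format_citation` and the citation format are illustrative choices, not a standard.

```python
def format_citation(chunk: dict) -> str:
    """Build a human-readable source line from chunk metadata."""
    meta = chunk.get("metadata", {})
    # Prefer the filename, fall back to the document ID
    parts = [meta.get("source_filename") or meta.get("source_doc_id") or "unknown source"]
    if meta.get("section_title"):
        parts.append(f'section "{meta["section_title"]}"')
    if meta.get("subsection_title"):
        parts.append(f'subsection "{meta["subsection_title"]}"')
    if meta.get("source_url"):
        parts.append(meta["source_url"])
    return ", ".join(parts)


example_chunk = {
    "text": "...",
    "metadata": {
        "source_doc_id": "doc-001",
        "source_filename": "whitepaper_v2.pdf",
        "section_title": "Chunking Strategies",
        "subsection_title": None,
        "source_url": None,
    },
}
citation = format_citation(example_chunk)
print(citation)
```

Missing or `None` fields are skipped, so the same helper works for chunks with sparse metadata.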
Filtering and Ranking with Metadata
Metadata enables intelligent retrieval: filter to relevant documents, then rank by quality signals.
from typing import Callable, List, Optional
import numpy as np

# Metadata keys that may be used as exact-match filters
FILTERABLE_KEYS = ("domain", "language", "content_type", "source_doc_id")


def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0


def retrieve_with_metadata_filtering(query: str,
                                     query_embedding: np.ndarray,
                                     all_chunks: List[dict],
                                     similarity_fn: Optional[Callable] = None,
                                     filters: Optional[dict] = None,
                                     top_k: int = 10) -> List[dict]:
    """
    Retrieve chunks, applying metadata filters and ranking.

    Args:
        query: User query (for logging)
        query_embedding: Embedding of the query
        all_chunks: All indexed chunks with metadata
        similarity_fn: Function to compute similarity (default: cosine)
        filters: Dict of metadata filters, e.g. {"domain": "finance", "language": "en"}
        top_k: Number of chunks to return

    Returns:
        Top-k chunks, ranked by relevance
    """
    if similarity_fn is None:
        similarity_fn = cosine_similarity
    filters = filters or {}

    # Step 1: Apply metadata filters (exact match on supported keys)
    filtered_chunks = all_chunks
    if filters:
        for key, value in filters.items():
            if key in FILTERABLE_KEYS:
                filtered_chunks = [c for c in filtered_chunks
                                   if c.get("metadata", {}).get(key) == value]
        if not filtered_chunks:
            print(f"⚠ No chunks match filter criteria: {filters}")
            filtered_chunks = all_chunks  # Fallback: search everything

    # Step 2: Score by similarity
    scored_chunks = []
    for chunk in filtered_chunks:
        chunk_embedding = np.array(chunk.get("embedding", []))
        if len(chunk_embedding) == 0:
            continue
        meta = chunk.get("metadata", {})
        similarity = similarity_fn(query_embedding, chunk_embedding)

        # Boost score based on metadata signals
        boost = 1.0
        if meta.get("is_summary"):
            boost *= 1.2  # Summaries are often more relevant
        if "code" in (meta.get("content_type") or ""):
            boost *= 0.9  # Rank code blocks slightly lower (less readable)
        # Penalize domain mismatches (only reachable after the fallback above)
        if filters.get("domain") and meta.get("domain") != filters["domain"]:
            boost *= 0.8

        scored_chunks.append({
            "chunk": chunk,
            "similarity": similarity,
            "boost": boost,
            "final_score": similarity * boost
        })

    # Step 3: Rank and return top-k (as copies, so indexed chunks are not mutated)
    ranked = sorted(scored_chunks, key=lambda x: x["final_score"], reverse=True)
    return [{**item["chunk"], "retrieval_score": item["final_score"]}
            for item in ranked[:top_k]]
# Usage example (assumes `query_embedding` and `all_chunks` were built during indexing)
query = "How does semantic chunking improve RAG quality?"
filters = {"domain": "ai", "language": "en"}
top_chunks = retrieve_with_metadata_filtering(query, query_embedding, all_chunks, filters=filters, top_k=5)
for chunk in top_chunks:
    print(f"Score: {chunk['retrieval_score']:.3f}")
    print(f"Section: {chunk['metadata'].get('section_title')}")
    print(f"Entities: {chunk['metadata'].get('entities')}")
    print(chunk["text"][:200] + "...\n")
Entity Extraction for Semantic Metadata
Identify and tag named entities to enable entity-based retrieval.
import re


def extract_entities_simple(text: str) -> List[str]:
    """
    Simple regex-based entity extraction.
    For production, use spaCy or transformers-based NER.
    """
    entities = []
    # Capitalized multi-word phrases (heuristic for named entities).
    # The group is non-capturing so findall returns the whole phrase,
    # not just the last repeated word.
    capitalized = re.findall(r'\b(?:[A-Z][a-z]+ )+[A-Z][a-z]*\b', text)
    entities.extend(capitalized)
    # All-caps acronyms without periods (ML, NLP, RAG, etc.)
    acronyms = re.findall(r'\b[A-Z]{2,}\b', text)
    entities.extend(acronyms)
    # Remove duplicates and very common words
    common_words = {"The", "This", "That", "What", "How", "Why"}
    entities = [e for e in set(entities) if e not in common_words]
    return sorted(entities)  # Sorted for deterministic output
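# Production: use spaCy for better extraction
from functools import lru_cache


@lru_cache(maxsize=1)
def _load_spacy_model(name: str = "en_core_web_sm"):
    """Load the spaCy pipeline once; reloading it on every call is slow."""
    import spacy
    return spacy.load(name)


def extract_entities_spacy(text: str) -> List[str]:
    """Entity extraction using spaCy NER."""
    doc = _load_spacy_model()(text)
    wanted_labels = {"PERSON", "ORG", "GPE", "PRODUCT", "DATE"}
    return [ent.text for ent in doc.ents if ent.label_ in wanted_labels]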
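Once chunks carry an `entities` list, entity-based retrieval can be as simple as boosting chunks that mention the entities found in the query. The following is a sketch under that assumption; `entity_overlap_boost` and the default `max_boost` of 1.3 are hypothetical and would need tuning on real data.

```python
def entity_overlap_boost(query_entities: list[str],
                         chunk_entities: list[str],
                         max_boost: float = 1.3) -> float:
    """Return a multiplier in [1.0, max_boost] based on the share of
    query entities that also appear in the chunk (case-insensitive)."""
    query_set = {e.lower() for e in query_entities}
    chunk_set = {e.lower() for e in chunk_entities}
    if not query_set:
        return 1.0  # Nothing to match on: leave the score unchanged
    coverage = len(query_set & chunk_set) / len(query_set)
    return 1.0 + (max_boost - 1.0) * coverage


boost_full = entity_overlap_boost(["Acme Corp", "RAG"], ["RAG", "acme corp", "NLP"])
boost_half = entity_overlap_boost(["Acme Corp", "RAG"], ["RAG"])
boost_none = entity_overlap_boost(["Acme Corp"], ["Globex"])
print(boost_full, boost_half, boost_none)
```

The returned multiplier can be folded into the `boost` value computed during ranking, alongside the summary and content-type signals.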
Designing Overlap for Different Chunk Strategies
| Strategy | Recommended overlap (% of chunk size) | Rationale |
|---|---|---|
| Fixed-size | 25–30% (128–154 tokens for a 512-token chunk) | Reduces the chance that a sentence is cut off at a boundary |
| Recursive | 15–20% | Boundaries already aligned to structure |
| Semantic | 10–15% | Boundaries already align with meaning |
| Code | 5–10% | Function/class boundaries are natural |
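The table can be encoded as a small lookup so that the overlap follows the chunking strategy automatically. This is a sketch: the dictionary keys and the name `overlap_token_range` are assumptions made for illustration, while the fractions come from the table above.

```python
# Overlap ranges from the table above, as fractions of chunk size
OVERLAP_BY_STRATEGY = {
    "fixed": (0.25, 0.30),
    "recursive": (0.15, 0.20),
    "semantic": (0.10, 0.15),
    "code": (0.05, 0.10),
}


def overlap_token_range(strategy: str, chunk_size: int = 512) -> tuple[int, int]:
    """Return the (low, high) recommended overlap in tokens for a strategy."""
    if strategy not in OVERLAP_BY_STRATEGY:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    low, high = OVERLAP_BY_STRATEGY[strategy]
    return round(chunk_size * low), round(chunk_size * high)


fixed_range = overlap_token_range("fixed")
code_range = overlap_token_range("code")
print(fixed_range, code_range)
```

For a 512-token chunk this reproduces the 128–154 token range shown for fixed-size chunking and gives 26–51 tokens for code.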
Key Takeaways
- Overlap (10–50% of chunk size) prevents context loss at boundaries, improving retrieval precision by 3–8%.
- Rich metadata (source, section, entities, content type) enables filtering, ranking, and source attribution.
- Metadata boosts allow post-retrieval ranking: prioritize summaries, domain-relevant chunks, and complete segments.
- Entity extraction (spaCy, regex) adds semantic granularity to metadata.
- Test metadata schema on your specific use case; every domain has different priorities.
Frequently Asked Questions
How much overlap is ideal?
25–30% is a common starting point for fixed-size chunking; structure-aware strategies usually need less (see the table above). Measure retrieval precision at several overlap values on your own corpus. Typically, precision improves 3–5% going from 0% to 30% overlap, then plateaus.
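Measuring this requires a set of queries with known relevant passages. Below is a minimal sketch of precision@k, compared across two overlap settings. The passage IDs and the retrieved lists are made-up placeholders rather than measurements, and relevance is assumed to be labeled at the section level so that labels stay valid when chunk boundaries change.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


# Placeholder data: section-level IDs retrieved for one query at two overlap settings
relevant = {"doc1#billing", "doc1#refunds"}
runs = {
    0.0: ["doc1#intro", "doc1#billing", "doc2#faq", "doc1#pricing"],
    0.25: ["doc1#billing", "doc1#refunds", "doc2#faq", "doc1#intro"],
}
scores = {overlap: precision_at_k(ids, relevant, k=4) for overlap, ids in runs.items()}
for overlap, score in scores.items():
    print(f"overlap={overlap:.2f} precision@4={score:.2f}")
```

In practice the score would be averaged over many queries, and the index rebuilt for each overlap value being compared.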
Should I include metadata in the embedding?
No. Embed only the text content. Metadata is used for filtering and ranking post-retrieval, not for vector search. Mixing metadata into embeddings dilutes their semantic signal.
How do I extract entities for low-resource languages?
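For non-English text, use a spaCy pipeline trained for that language (spaCy publishes pretrained pipelines for many languages) or a multilingual transformer (mBERT, XLM-RoBERTa). For very low-resource languages, fall back to regex heuristics such as capitalized phrases and acronyms, keeping in mind that these only work for scripts that use capitalization.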
Can I use metadata to avoid irrelevant documents?
Yes. Apply hard filters in retrieval: if document_domain != "finance": skip. Soft filters (boosts) are safer because they re-rank but do not exclude. Hard filters risk missing relevant results if categorization is imprecise.
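The difference between the two filter styles can be sketched as follows. The function names and the sample chunks are hypothetical; the 0.8 penalty mirrors the domain penalty used in the ranking code earlier in this article.

```python
def hard_filter(scored: list[tuple[dict, float]], key: str, value: str) -> list[tuple[dict, float]]:
    """Drop every chunk whose metadata[key] does not equal value."""
    return [(c, s) for c, s in scored if c.get("metadata", {}).get(key) == value]


def soft_filter(scored: list[tuple[dict, float]], key: str, value: str,
                penalty: float = 0.8) -> list[tuple[dict, float]]:
    """Keep every chunk, but scale down the score of non-matching ones."""
    rescored = [
        (c, s if c.get("metadata", {}).get(key) == value else s * penalty)
        for c, s in scored
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Chunk "b" is highly similar to the query but was tagged with the wrong domain
scored_chunks = [
    ({"id": "a", "metadata": {"domain": "finance"}}, 0.70),
    ({"id": "b", "metadata": {"domain": "general"}}, 0.90),
]
hard_ids = [c["id"] for c, _ in hard_filter(scored_chunks, "domain", "finance")]
soft_ids = [c["id"] for c, _ in soft_filter(scored_chunks, "domain", "finance")]
print(hard_ids, soft_ids)
```

The hard filter loses chunk "b" entirely, while the soft filter keeps it and, because its similarity is high enough to survive the penalty, still ranks it first.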
What metadata is essential vs. nice-to-have?
Essential: source_doc_id, section_title, token_count. Nice-to-have: entities, keywords, domain, chunk_coherence_score. Start minimal; add metadata fields only if they improve your ranking metrics.