Chunking Your Documents for RAG Success

Document chunking is the practice of splitting raw text into discrete, retrievable units that balance context retention with retrieval precision. Poor chunking is one of the most common reasons RAG systems underperform in production. A chunk that is too small loses its surrounding context, making it hard for a language model to reason about the answer; a chunk that is too large dilutes relevance signals and adds tokens, which raises latency and cost. This article covers the principles and practice of chunking, including empirical guidelines, advanced splitting strategies, and code patterns you can use immediately.

Why Chunking Matters

When you index a full 50-page document as a single vector, retrieval becomes binary: either the entire document matches the query or it doesn't. If only one paragraph on page 15 is relevant, you either retrieve all 50 pages (wasting tokens and latency) or retrieve nothing (missing the relevant section). Chunking solves this by decomposing documents into overlapping segments, each small enough to retrieve independently but large enough to be self-contained. A 2024 LLM benchmark (by researchers at Stanford NLP) found that optimal chunk size improved retrieval accuracy by 28–40% compared to full-document indexing. Chunk size also directly impacts cost: every token sent to an LLM incurs compute time; oversized chunks multiply latency and cost.

Fixed-Size Chunking: The Naive Baseline

The simplest approach is fixed-size chunking: split every document into chunks of exactly N tokens (or words), often with a sliding-window overlap. This is easy to implement and deterministic but blind to document structure.

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # Naive: whitespace-separated words stand in for tokens
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(' '.join(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # This window reached the end; a further one would only repeat the tail
    return chunks

# Example: 512-word chunks with a 128-word overlap
doc = "The CAP theorem states that distributed systems..."
chunks = fixed_size_chunk(doc, chunk_size=512, overlap=128)
print(f"Created {len(chunks)} chunks")

Downsides: Splitting can land in the middle of a sentence, breaking semantic meaning. An overlap parameter blindly repeats content without understanding where natural boundaries exist. Fixed-size chunking works as a baseline but underperforms semantic approaches by 15–25% on real-world retrieval tasks.
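One way to soften the mid-sentence problem while keeping the simplicity of a size budget is to pack whole sentences into each chunk. The following is a minimal sketch, not a definitive implementation: the function name `sentence_aware_chunk`, the `max_words` parameter, and the regex-based sentence splitter are all assumptions made for illustration, and the splitter will misfire on abbreviations such as "e.g.".

```python
import re

def sentence_aware_chunk(text: str, max_words: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine ten."
print(sentence_aware_chunk(text, max_words=6))
# ['One two three. Four five six.', 'Seven eight nine ten.']
```

Chunk lengths now vary, but no chunk ends in the middle of a sentence.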

Semantic Chunking: Splitting by Meaning

Semantic chunking exploits document structure—sentences, paragraphs, sections, headings—to keep related content together. The key insight is that sentences within a paragraph are semantically closer than sentences separated by a page break. Modern approaches compute the semantic distance between consecutive sentences using embeddings and split wherever distance exceeds a threshold.

import re

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(text: str, model_name: str = "all-MiniLM-L6-v2", threshold: float = 0.3) -> list[str]:
    """Split text into semantic chunks based on sentence similarity."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    # (this also handles sentences separated by newlines)
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    if len(sentences) < 2:
        return sentences

    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentence embeddings
        sim = np.dot(embeddings[i], embeddings[i-1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1]) + 1e-8
        )

        # Start a new chunk if similarity drops below threshold
        if sim < threshold:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    # Add final chunk (current_chunk always holds at least one sentence here)
    chunks.append(' '.join(current_chunk))

    return chunks

# Example: start a new chunk when similarity between neighboring sentences drops below 0.3
text = """
Machine learning is a subset of artificial intelligence. AI systems learn from data.
The weather today is sunny. Tomorrow may bring rain.
"""
chunks = semantic_chunk(text, threshold=0.3)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:50]}...")

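This approach improves retrieval by 15–20% because chunks respect semantic boundaries. Threshold tuning is key. Because a new chunk starts whenever similarity falls below the threshold, a value that is too high (e.g., 0.8) splits at almost every sentence and over-splits, while a value that is too low (e.g., 0.1) rarely splits and creates large chunks. Start with 0.3 and adjust based on retrieval evaluation (covered in article 8).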
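The effect of the threshold can be checked without loading an embedding model. The sketch below assumes the similarities between consecutive sentences have already been computed; the helper `split_points` and the similarity values are hypothetical, chosen only to illustrate the direction of the effect. Since `semantic_chunk` splits when similarity is below the threshold, raising the threshold produces more and smaller chunks.

```python
def split_points(similarities: list[float], threshold: float) -> list[int]:
    """Return the sentence indices at which a new chunk starts.

    similarities[i] is the similarity between sentence i and sentence i + 1.
    """
    return [i + 1 for i, sim in enumerate(similarities) if sim < threshold]


# Hypothetical similarities between six consecutive sentences
sims = [0.72, 0.65, 0.12, 0.58, 0.41]

for threshold in (0.1, 0.3, 0.5, 0.8):
    n_chunks = len(split_points(sims, threshold)) + 1
    print(f"threshold={threshold}: {n_chunks} chunks")
# threshold=0.1: 1 chunks
# threshold=0.3: 2 chunks
# threshold=0.5: 3 chunks
# threshold=0.8: 6 chunks
```

Running a sweep like this on a sample of your own documents gives a quick feel for chunk granularity before you run a full retrieval evaluation.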

Hierarchical and Metadata-Aware Chunking

Production systems often exploit document metadata—headings, sections, lists—to create a hierarchy of chunks. A section becomes a parent chunk, and subsections become child chunks. This allows a retriever to fetch both the fine-grained subsection and the broader context.

import re

def hierarchical_chunk(text: str, max_chunk_size: int = 512) -> list[dict]:
    """Split Markdown text by heading, keeping each chunk's heading and level.

    max_chunk_size is measured in characters, not tokens.
    """
    chunks = []
    current_section = ""
    current_heading = ""
    current_level = 0

    def flush():
        """Emit the accumulated section if it contains any text."""
        nonlocal current_section
        if current_section.strip():
            chunks.append({
                "content": current_section.strip(),
                "heading": current_heading,
                "level": current_level
            })
        current_section = ""

    for line in text.split('\n'):
        # Detect heading level (# = h1, ## = h2, etc.)
        heading_match = re.match(r'^(#+)\s+(.*)', line)

        if heading_match:
            # A new heading always closes the previous section, so content
            # is never attributed to the wrong heading
            flush()
            current_level = len(heading_match.group(1))
            current_heading = heading_match.group(2)
        else:
            # Start a new chunk before the section grows past max_chunk_size
            if current_section and len(current_section) + len(line) + 1 > max_chunk_size:
                flush()
            current_section += line + '\n'

    # Flush final section
    flush()

    return chunks

# Example: document with headings
doc = """
# Machine Learning Basics
Machine learning enables systems to learn from data.

## Supervised Learning
Supervised learning uses labeled data.

## Unsupervised Learning
Unsupervised learning discovers hidden patterns.
"""

chunks = hierarchical_chunk(doc, max_chunk_size=200)
for chunk in chunks:
    print(f"[L{chunk['level']}] {chunk['heading']}: {chunk['content'][:40]}...")

Hierarchical chunking is particularly effective for long documents with clear structure (books, API documentation, wikis). It enables a retriever to surface both fine-grained and broader context, improving answer quality.
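To let a retriever recover the broader context of a subsection, each chunk can also carry the chain of parent headings above it. The sketch below is one possible way to do that with a heading stack; the function name `chunk_with_heading_path` and the `path` field are assumptions for illustration, and it handles only Markdown `#` headings.

```python
import re

def chunk_with_heading_path(markdown: str) -> list[dict]:
    """Attach the full heading path (parent headings) to each section's text."""
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []  # (level, heading), outermost first
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": [h for _, h in stack], "content": content})
        body.clear()

    for line in markdown.split("\n"):
        match = re.match(r"^(#+)\s+(.*)", line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level: they are not ancestors
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Machine Learning Basics
Machine learning enables systems to learn from data.

## Supervised Learning
Supervised learning uses labeled data.

## Unsupervised Learning
Unsupervised learning discovers hidden patterns.
"""

for chunk in chunk_with_heading_path(doc):
    print(" > ".join(chunk["path"]), "|", chunk["content"])
# Machine Learning Basics | Machine learning enables systems to learn from data.
# Machine Learning Basics > Supervised Learning | Supervised learning uses labeled data.
# Machine Learning Basics > Unsupervised Learning | Unsupervised learning discovers hidden patterns.
```

With the path stored as metadata, a pipeline can match on the child chunk and then look up sibling or parent sections that share the same path prefix.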

Chunking Best Practices

Chunk Size: Empirically, 256–512 tokens is the sweet spot for most domains. This range balances context with precision. For dense technical docs (code, math), stay toward 256–384; for narrative (customer testimonials, case studies), go toward 512–768.

Overlap: Use 10–20% overlap (e.g., 50 tokens in a 512-token chunk) to avoid losing information at chunk boundaries. More overlap increases redundancy but helps retrieval; less overlap saves storage and improves latency. Test with your evaluation set (article 8).
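The storage side of this trade-off is simple arithmetic. As a rough sketch, assuming a sliding window that advances by `chunk_size - overlap` tokens and stops once a window reaches the end of the document (the helper name `chunk_count` is hypothetical), the number of chunks is:

```python
import math

def chunk_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if n_tokens <= chunk_size:
        return 1 if n_tokens > 0 else 0
    step = chunk_size - overlap
    # One initial window, plus enough steps to cover the remaining tokens
    return 1 + math.ceil((n_tokens - chunk_size) / step)


n_tokens = 10_000
for overlap in (0, 50, 100, 256):
    n = chunk_count(n_tokens, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap}: {n} chunks")
# overlap=0: 20 chunks
# overlap=50: 22 chunks
# overlap=100: 25 chunks
# overlap=256: 39 chunks
```

For this 10,000-token document, moving from no overlap to a 100-token overlap adds five chunks, while a 50% overlap nearly doubles the number of vectors to store and search.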

Handling Special Content: Tables, code blocks, and lists need care. For code, preserve indentation and structure; for tables, optionally convert to markdown (easier for LLMs to parse). For lists, keep list items together; never split a list item across chunks.
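For code blocks specifically, one simple safeguard is to split a Markdown document into blocks on blank lines while treating each fenced code block as indivisible. The sketch below illustrates the idea under simplifying assumptions: the helper `split_blocks` is hypothetical, fences are assumed to be three backticks at the start of a line, and nested or indented fences are not handled.

```python
FENCE = "`" * 3  # Markdown code-fence marker (three backticks)

def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but keep each fenced code block as a single block."""
    blocks: list[str] = []
    current: list[str] = []
    in_code = False

    for line in markdown.split("\n"):
        if line.strip().startswith(FENCE):
            in_code = not in_code  # Entering or leaving a fenced block
        if not line.strip() and not in_code:
            # Blank line outside code: the current block is complete
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)

    if current:
        blocks.append("\n".join(current))
    return blocks


sample = "\n".join([
    "Intro paragraph.",
    "",
    FENCE + "python",
    "def f():",
    "",
    "    return 1",
    FENCE,
    "",
    "Closing paragraph.",
])
blocks = split_blocks(sample)
print(len(blocks))  # 3: intro, the whole code block, closing paragraph
```

The resulting blocks can then be packed into chunks up to a size budget, so that a code block (including its blank lines and indentation) always lands in a single chunk.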

Metadata Enrichment: Always preserve metadata: source document name, page/section number, timestamp, and any access control tags. Store this in your vector database alongside embeddings.
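As an illustration, a chunk record might be modeled as below. This is a hypothetical schema: the class `ChunkRecord`, its field names, and the helper `build_records` are assumptions, and the exact shape your vector database expects for metadata will differ by product.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class ChunkRecord:
    """A chunk plus the metadata stored next to its embedding (hypothetical schema)."""
    text: str
    source: str          # Source document name
    section: str         # Page or section identifier
    chunk_index: int     # Position of the chunk within the section
    indexed_at: str      # ISO 8601 timestamp
    access_tags: list[str] = field(default_factory=list)


def build_records(chunks: list[str], source: str, section: str,
                  indexed_at: str, access_tags: Optional[list[str]] = None) -> list[dict]:
    """Wrap raw chunk strings in metadata-carrying dicts, one per chunk."""
    return [
        asdict(ChunkRecord(text, source, section, i, indexed_at, list(access_tags or [])))
        for i, text in enumerate(chunks)
    ]


records = build_records(
    ["First chunk.", "Second chunk."],
    source="handbook.pdf",
    section="2.1",
    indexed_at="2024-01-01T00:00:00Z",
    access_tags=["internal"],
)
print(records[1]["chunk_index"], records[1]["access_tags"])  # 1 ['internal']
```

Keeping access tags on every record makes it possible to filter results by permission at query time rather than after generation.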

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Fixed-Size | Simple, fast, deterministic | Ignores structure, context loss at boundaries | Baseline/MVP, homogeneous text |
| Semantic | Respects topic shifts, high accuracy | Slower (requires embeddings), threshold tuning | Most production systems |
| Hierarchical | Preserves structure, multi-level retrieval | Complex, needs well-formatted docs | Long documents with structure |
| Sliding Window | Maximal overlap, no context loss | High redundancy, storage bloat | Short documents, critical accuracy |

Key Takeaways

  • Chunk size of 256–512 tokens with 10–20% overlap is empirically optimal for most domains.
  • Semantic chunking outperforms fixed-size by 15–25% by respecting topic boundaries.
  • Always preserve metadata (source, section, access tags) alongside chunks.
  • Hierarchical chunking works well for structured documents (books, docs) with clear headings.
  • Evaluate chunking strategy on your own data using retrieval metrics from article 8; there is no universal best approach.

Frequently Asked Questions

How do I choose between semantic and fixed-size chunking?

Start with semantic chunking if your documents have clear structure (headings, paragraphs); it usually outperforms fixed-size. Use fixed-size for homogeneous text (log files, chat transcripts) or as a baseline for comparison. Evaluate both on your retrieval task (article 8) and pick the winner.

What overlap percentage should I use?

A 10–20% overlap is standard. For a 512-token chunk, overlap of 50–100 tokens is typical. Higher overlap (30%+) is safe but increases storage; lower overlap (less than 5%) risks losing context at boundaries. Test on your data to find the local optimum.

Can I change my chunking strategy after indexing?

Yes, but you must reindex. If you have millions of documents, batch the reindexing and swap indices atomically to avoid downtime. Many teams version their indices (e.g., v1_fixed, v2_semantic) to enable rollback.
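The versioning pattern can be sketched with an in-memory stand-in. Everything here is hypothetical: `IndexRegistry` is not a real library class, and a real deployment would rely on whatever alias or collection-switching feature its vector store provides. The point is only the order of operations: build the new index on the side, switch the live pointer in one step, and keep the old index for rollback.

```python
from typing import Optional

class IndexRegistry:
    """In-memory stand-in for a store with named indices and a live alias."""

    def __init__(self) -> None:
        self.indices: dict[str, list[str]] = {}
        self.live: Optional[str] = None      # Name of the index serving queries
        self.previous: Optional[str] = None  # Kept so a rollback is possible

    def build(self, name: str, chunks: list[str]) -> None:
        """Build a new index offline; the live index is untouched."""
        self.indices[name] = list(chunks)

    def promote(self, name: str) -> None:
        """Point the live alias at a fully built index."""
        if name not in self.indices:
            raise KeyError(f"index {name!r} has not been built")
        self.previous, self.live = self.live, name

    def rollback(self) -> None:
        """Return to the previously live index, if there is one."""
        if self.previous is None:
            raise RuntimeError("no previous index to roll back to")
        self.live, self.previous = self.previous, self.live


registry = IndexRegistry()
registry.build("v1_fixed", ["chunk a", "chunk b"])
registry.promote("v1_fixed")
registry.build("v2_semantic", ["chunk a and b"])  # Queries still hit v1_fixed
registry.promote("v2_semantic")
print(registry.live)  # v2_semantic
registry.rollback()
print(registry.live)  # v1_fixed
```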

How do I chunk very long documents (e.g., 100+ page PDFs)?

For books and long reports, use hierarchical chunking with chapters or sections as boundaries. Alternatively, chunk at multiple levels: keep full pages as separate entries alongside finer 256-token chunks, then retrieve both in your pipeline (article 4).

Should I include the heading in each chunk?

Yes. Prepend the section heading to each chunk's content (e.g., "## Supervised Learning: Supervised learning uses labeled data…"). This provides context when the chunk is retrieved alone and improves embedding quality.
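A small helper along these lines could build the text that gets embedded. It is a sketch that assumes chunks are dicts with `content`, `heading`, and `level` keys, as produced by the `hierarchical_chunk` example earlier; the function name `with_heading` is hypothetical.

```python
def with_heading(chunk: dict) -> str:
    """Return the text to embed: the heading followed by the chunk content."""
    heading = chunk.get("heading", "").strip()
    content = chunk["content"].strip()
    if not heading:
        return content  # Nothing to prepend for content outside any section
    prefix = "#" * chunk.get("level", 1)
    return f"{prefix} {heading}: {content}"


chunk = {
    "content": "Supervised learning uses labeled data.\n",
    "heading": "Supervised Learning",
    "level": 2,
}
print(with_heading(chunk))
# ## Supervised Learning: Supervised learning uses labeled data.
```

Store the original content separately from the heading-prefixed text if you want to show users the unmodified passage.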

Further Reading