RAG chunking strategies: Understanding document parsing
RAG (Retrieval-Augmented Generation) systems depend heavily on the quality of the document chunks they retrieve. A chunk is a contiguous segment of text extracted from a source document, annotated with metadata, and indexed for vector search. Chunking quality directly determines whether your LLM receives the right context to answer questions correctly.
The RAG chunking pipeline has five core stages: document ingestion (extracting raw text), normalization (cleaning and standardizing content), chunking (segmenting into retrievable units), enrichment (adding metadata and embeddings), and indexing (storing for fast retrieval). This article establishes the conceptual foundation; subsequent articles dive into each stage with code and benchmarks.
Why Document Parsing and Chunking Matter for RAG
A RAG system's retrieval quality often depends more on chunk quality than on model capability. Consider a customer support bot built on a 500-page knowledge base. If you chunk naively (e.g., every 512 tokens, regardless of meaning), you'll split complex procedures in the middle, lose context boundaries, and force the LLM to reason across fragmented sentences. According to research by Gao et al. (2023), poorly chunked documents reduce RAG accuracy by 18–35% compared to semantically aware chunking, even with the same embedding model.
Chunks serve two simultaneous purposes: they are the atoms of vector search (the retriever needs chunks small enough to index and retrieve quickly) and the atoms of context (the LLM needs chunks large enough to be self-contained). This tension is central to chunking design.
The Five-Stage RAG Ingestion Pipeline
Stage 1: Document Ingestion
Extract raw text from diverse source formats (PDF, HTML, Word, Markdown, images with OCR). This stage should preserve document structure metadata: headings, tables, lists, page boundaries. Many RAG failures trace back to poor ingestion: text extracted incorrectly, tables flattened to gibberish, or metadata discarded.
```python
# Minimal ingestion example: extract from a PDF and track source
import PyPDF2


def ingest_pdf(file_path: str) -> list[dict]:
    """Extract text from each page of a PDF, preserving page numbers."""
    chunks_with_metadata = []
    with open(file_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        for page_num, page in enumerate(reader.pages, start=1):
            # Image-only pages may yield no text; store an empty string.
            text = page.extract_text() or ""
            chunks_with_metadata.append({
                "text": text,
                "source": file_path,
                "page": page_num,
                "format": "pdf",
            })
    return chunks_with_metadata
```
Stage 2: Text Normalization
Clean extracted text: remove boilerplate (headers, footers), normalize whitespace, decode entities, handle encoding errors. Normalization is invisible but critical: stray control characters, mis-decoded bytes, or leftover markup rarely raise an error, but they can quietly degrade the embeddings computed from the text.
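A minimal sketch of what such a normalization pass might look like, using only the Python standard library. The function name `normalize_text` and the specific cleanup steps are illustrative assumptions, not a prescribed recipe; boilerplate removal (headers, footers) is document-specific and is left out here.

```python
import html
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Apply a few conservative cleanup steps to extracted text."""
    # Decode HTML entities such as &amp; left over from web sources.
    text = html.unescape(raw)
    # NFKC folds compatibility characters (ligatures, non-breaking spaces,
    # full-width forms) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (Unicode categories Cc and Cf),
    # keeping newlines and tabs for the whitespace pass below.
    # Caveat: this also removes zero-width joiners, which some scripts need.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse runs of spaces and tabs, trim spaces around newlines,
    # then limit consecutive blank lines to one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Each step is deliberately lossy in a small way, so it is worth spot-checking the output on a sample of your own corpus before applying it everywhere.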
Stage 3: Chunking
Decide on a chunking strategy (fixed-size, recursive, semantic, etc.) and split the normalized text into segments. This is the focus of articles 6–8. Each chunking strategy trades processing cost and latency against semantic coherence.
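As a baseline for comparison, fixed-size chunking with overlap can be sketched in a few lines. This version assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the tokenizer of its embedding model), and the name `chunk_fixed_size` is hypothetical.

```python
def chunk_fixed_size(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Assumption: one whitespace-separated word approximates one token.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, so no trailing chunk is
        # made up entirely of words the previous chunk already covers.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

For example, ten words with `chunk_size=4` and `overlap=1` produce three chunks, each starting on the last word of the previous one.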
Stage 4: Metadata Enrichment
Attach rich metadata to each chunk: hierarchical section titles, semantic tags, entity names, similarity scores to related chunks. Metadata enables filtered retrieval and post-retrieval ranking.
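One way to picture enrichment is a small record type that carries the chunk text alongside its metadata. The sketch below is an assumption about shape, not a standard schema: `Chunk`, `enrich`, and `filter_by_tag` are hypothetical names, and tagging by keyword match is a simple placeholder for a real entity or topic tagger.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str
    section_path: list[str] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)


def enrich(chunk: Chunk, section_path: list[str], keyword_tags: dict[str, str]) -> Chunk:
    """Attach the heading trail and every tag whose keyword appears in the text."""
    lowered = chunk.text.lower()
    chunk.section_path = list(section_path)
    chunk.tags = {tag for keyword, tag in keyword_tags.items() if keyword in lowered}
    return chunk


def filter_by_tag(chunks: list[Chunk], tag: str) -> list[Chunk]:
    """Metadata filter that can narrow candidates before or after vector search."""
    return [c for c in chunks if tag in c.tags]
```

A chunk about password resets under the headings "Account" and "Security" would then carry `section_path == ["Account", "Security"]` and could be selected with `filter_by_tag(chunks, "auth")`, given a keyword map such as `{"password": "auth"}`.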
Stage 5: Indexing and Retrieval
Embed chunks (using an embedding model like text-embedding-3-large), store the vectors in a vector database, and set up retrieval pipelines that rank results by similarity and recency and apply metadata filters.
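To make the embed, store, and rank loop concrete, here is a toy in-memory version. It is a sketch under strong assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and `InMemoryIndex` does a brute-force scan where a vector database would use approximate nearest-neighbor search. Both names are hypothetical.

```python
import math
import zlib


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dims
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryIndex:
    """Brute-force cosine-similarity index over chunk embeddings."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], dict]] = []

    def add(self, text: str, metadata: dict) -> None:
        self._rows.append((toy_embed(text), {"text": text, **metadata}))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, dict]]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, vec)), meta) for vec, meta in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

Swapping `toy_embed` for calls to a real embedding model and `InMemoryIndex` for a vector store keeps the same interface: add chunks with metadata, then search by query text.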
Common Chunking Strategies at a Glance
| Strategy | Chunk Size | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size (overlap) | 512–1024 tokens | Simple, uniform, fast | Ignores semantics, may split paragraphs | Quick prototypes, uniform documents |
| Recursive (hierarchical) | Variable (e.g., 1024, then 512) | Preserves structure, handles edge cases | More complex to implement | Code, structured documents |
| Semantic (embedding-aware) | Variable (e.g., 200–800 tokens) | Respects meaning boundaries | Higher latency, requires embeddings | Production RAG, complex documents |
| Layout-aware (vision-based) | Variable (sections, tables, headers) | Preserves 2D structure (for PDFs) | Requires layout-parsing library | Scanned docs, complex layouts |
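The recursive row in the table can be illustrated with a short sketch: split on the coarsest separator first, and only descend to finer separators for pieces that are still too long. The function name, the separator list, and the word-based size limit are assumptions for illustration. Production splitters often also merge small neighboring pieces back together, which this version omits.

```python
def recursive_split(
    text: str,
    max_words: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". "),
) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces still too long."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    if not separators:
        # No structural separators left: fall back to hard word windows.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    head, rest = separators[0], separators[1:]
    pieces = []
    # Note: str.split drops the separator itself, so a sentence split on ". "
    # loses its trailing period.
    for part in text.split(head):
        pieces.extend(recursive_split(part, max_words, rest))
    return pieces
```

With `max_words=3`, the text `"a b c\n\nd e f g h\n\ni"` keeps the first and last paragraphs whole and only breaks the oversized middle paragraph.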
Defining the Ideal Chunk
Your ideal chunk should:
- Be self-contained: a reader without the rest of the document should understand the chunk's core claim.
- Preserve context boundaries: avoid splitting a single idea, code block, or table across chunks.
- Fit in an LLM window: for Claude 3.5 Sonnet (200K tokens), a single 1,000-token chunk is easily accommodated, but a retrieved set of 5–10 chunks should total roughly 3,000–5,000 tokens (an average of about 500 tokens per chunk) so the prompt and the model's reasoning have space.
- Be retrievable: the chunk's embedding should be similar to the embeddings of queries that would benefit from its content.
These goals often conflict. A fixed-size 512-token chunk is uniform but may split a sentence. A semantic chunk respects boundaries but requires expensive embedding computations during indexing.
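The context-window constraint can be enforced at query time by packing ranked chunks into a token budget. The sketch below assumes the chunks arrive sorted best-first and again approximates tokens with whitespace-separated words; `pack_context` and the default budget of 4,000 are illustrative choices.

```python
def pack_context(ranked_chunks: list[str], budget_tokens: int = 4000) -> list[str]:
    """Keep the highest-ranked chunks whose combined size fits the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        # Assumption: one whitespace-separated word counts as one token.
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            # Skip chunks that would overflow; a smaller one may still fit.
            continue
        selected.append(chunk)
        used += cost
    return selected
```

Skipping rather than stopping at the first overflow is a design choice: it fills the budget more fully, at the cost of sometimes including a lower-ranked chunk ahead of a higher-ranked one that was too large.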
Key Takeaways
- RAG chunking quality strongly influences retrieval precision; Gao et al. (2023) report that poor chunking reduces accuracy by 18–35% even with the same embedding model.
- The ingestion pipeline has five stages: ingestion, normalization, chunking, enrichment, and indexing.
- Common strategies include fixed-size, recursive, semantic, and layout-aware chunking, each with different latency/quality tradeoffs.
- An ideal chunk is self-contained, respects context boundaries, fits in an LLM's context window, and is semantically retrievable.
- Metadata (source, section, entity tags) attached during enrichment enables advanced retrieval techniques.
Frequently Asked Questions
What is the difference between a "chunk" and a "token" in RAG?
A token is the basic unit of text a model processes (roughly a word or subword), produced by the model's tokenizer. A chunk is a contiguous sequence of tokens (typically 256–2,048 tokens) that forms a retrievable unit. The LLM's context window is measured in tokens; the vector database's retrieval unit is the chunk.
Can I use the same chunking strategy for all document types?
No. PDFs with complex layouts require layout-aware chunking; code repositories benefit from syntax-aware recursive splitting; plain-text articles work well with semantic chunking. Test multiple strategies and benchmark retrieval quality for your specific corpus.
Should I chunk before or after embedding?
Always chunk before embedding. Chunks are the storage unit; embeddings are computed per-chunk. Re-chunking an embedded dataset requires re-embedding, which is expensive.
How do I know if my chunks are too small or too large?
If chunks are too small (< 256 tokens), they lack context and the LLM struggles to reason; retrieval becomes noisy. If chunks are too large (> 2,048 tokens), vector search loses precision and the LLM's context window fills quickly. For most use cases, 512–1,024 tokens is a safe range; benchmark on your specific corpus.
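A quick size report over an existing chunk set makes these thresholds easy to check. This sketch uses the 256 and 2,048 bounds from the answer above as defaults and, as an assumption, counts whitespace-separated words instead of real tokens; `size_report` is a hypothetical helper.

```python
import statistics


def size_report(chunks: list[str], min_tokens: int = 256, max_tokens: int = 2048) -> dict:
    """Count chunks outside the size bounds and report the median length."""
    # Assumption: one whitespace-separated word approximates one token.
    counts = [len(chunk.split()) for chunk in chunks]
    return {
        "total": len(counts),
        "too_small": sum(n < min_tokens for n in counts),
        "too_large": sum(n > max_tokens for n in counts),
        "median_tokens": statistics.median(counts) if counts else 0,
    }
```

A large `too_small` count often points to over-eager splitting on headings or list items, while `too_large` usually means a separator was missing from the source text.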
What role does chunk overlap play in retrieval quality?
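Overlap (e.g., a 50-token overlap between consecutive chunks) reduces the chance that a sentence or idea straddling a chunk boundary is cut off in both chunks, which improves recall. It does not guarantee clean boundaries, and it increases storage and indexing costs because the overlapping tokens are embedded twice. The 50-token example is about 10% of a 512-token chunk; overlaps of roughly 10–30% are common for fixed-size chunking, while semantic chunking usually needs less because its boundaries already follow meaning.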
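The cost side of overlap is simple arithmetic. For a sliding window of size s with overlap o over N tokens, the number of chunks is ceil((N - o) / (s - o)), and the stored token volume grows by roughly s / (s - o). The helper names below are hypothetical.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover `total_tokens`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    # Each chunk after the first contributes (chunk_size - overlap) new tokens.
    return max(1, math.ceil((total_tokens - overlap) / (chunk_size - overlap)))


def storage_overhead(chunk_size: int, overlap: int) -> float:
    """Approximate ratio of stored tokens to source tokens for long documents."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    return chunk_size / (chunk_size - overlap)
```

For a 100,000-token corpus and 512-token chunks, this gives 196 chunks with no overlap, 217 with a 50-token overlap, and 261 with a 128-token (25%) overlap, where the last case stores about 1.33 times the source token volume.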