Recursive chunking strategies: Intelligent hierarchical document division
Recursive chunking improves on fixed-size splitting by respecting document structure. Instead of blindly splitting every N tokens, it splits recursively on natural boundaries: paragraphs first, then lines, sentences, and words, with individual characters as a last resort. If a paragraph fits within chunk_size, it is kept whole; if it exceeds chunk_size, it is subdivided at the next finer boundary. This preserves semantic coherence while keeping chunk sizes bounded and predictable.
Recursive chunking strikes a balance between simplicity and structure-awareness. It outperforms fixed-size chunking on complex documents (code, legal text, mixed layouts) by 3–12% on retrieval metrics, according to 2025 benchmarks, while remaining fast and simple to implement. It is the strategy implemented by LangChain's RecursiveCharacterTextSplitter, which LangChain recommends for generic text.
The Recursive Splitting Algorithm
The core idea: try splitting on progressively finer boundaries until chunks fit.
from typing import Optional

def split_recursive(text: str,
                    chunk_size: int = 512,
                    overlap: int = 128,
                    separators: Optional[list[str]] = None) -> list[dict]:
    """Recursively split text on boundaries, preserving structure."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if separators is None:
        # Default: paragraph, line, sentence, word, character
        separators = ["\n\n", "\n", ". ", " ", ""]
    # Reserve room for the overlap prefix so final chunks stay within chunk_size
    budget = chunk_size - overlap

    def _estimate_tokens(text: str) -> int:
        """Rough token estimate: ~4 chars per token for English."""
        return len(text) // 4

    def _split_on_separator(text: str, separator: str) -> list[str]:
        """Split text on separator without dropping any characters."""
        if not separator:
            return list(text)  # Split into characters
        if separator not in text:
            return [text]
        # Separators like "\n\ndef " or "\nSection " open a new unit: the leading
        # newlines stay with the previous piece, the keyword starts the next one.
        # All other separators ("\n\n", ". ", " ") stay at the end of the piece.
        if separator.startswith("\n") and not separator.endswith("\n"):
            tail = separator.lstrip("\n")
            head = separator[:len(separator) - len(tail)]
        else:
            head, tail = separator, ""
        parts = text.split(separator)
        result = []
        for i, part in enumerate(parts):
            piece = (tail if i > 0 else "") + part
            if i < len(parts) - 1:
                piece += head
            result.append(piece)
        return [p for p in result if p]

    def _recursive_split(text: str, separators: list, depth: int = 0) -> list[str]:
        """Recursively split on decreasing granularity until chunks fit."""
        if _estimate_tokens(text) <= budget:
            return [text]
        if depth >= len(separators):
            # Last resort: hard cut by character count (should rarely happen)
            step = budget * 4
            return [text[i:i + step] for i in range(0, len(text), step)]
        parts = _split_on_separator(text, separators[depth])
        # If the separator does not occur, try the next (finer) one
        if len(parts) == 1:
            return _recursive_split(text, separators, depth + 1)
        # Recursively process only the parts that are still too large
        all_chunks = []
        for part in parts:
            if _estimate_tokens(part) > budget:
                all_chunks.extend(_recursive_split(part, separators, depth + 1))
            else:
                all_chunks.append(part)
        return all_chunks

    # Get recursive splits
    text_chunks = _recursive_split(text, separators)

    # Merge adjacent small pieces back together up to the budget
    merged_chunks = []
    current = ""
    for chunk in text_chunks:
        if _estimate_tokens(current + chunk) <= budget:
            current += chunk
        else:
            if current:
                merged_chunks.append(current)
            current = chunk
    if current:
        merged_chunks.append(current)

    # Add overlap
    final_chunks = []
    for i, chunk in enumerate(merged_chunks):
        # Guard overlap > 0: a slice of [-0:] would copy the whole previous chunk
        if i > 0 and overlap > 0:
            # Prepend roughly the last `overlap` tokens (overlap * 4 chars)
            chunk = merged_chunks[i - 1][-overlap * 4:] + chunk
        final_chunks.append({
            "text": chunk,
            "chunk_idx": i,
            "token_count": _estimate_tokens(chunk)
        })
    return final_chunks
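One way to sanity-check a splitter's output is to summarize the chunk dictionaries it returns. The sketch below assumes chunks shaped like the ones split_recursive produces above (`text`, `chunk_idx`, `token_count`). `summarize_chunks` is a hypothetical helper, not part of any library, and the sample input is hand-built rather than real splitter output.

```python
def summarize_chunks(chunks: list[dict], chunk_size: int = 512) -> dict:
    """Report basic size statistics for a list of chunk dicts."""
    counts = [c["token_count"] for c in chunks]
    if not counts:
        return {"n_chunks": 0, "min_tokens": 0, "max_tokens": 0,
                "mean_tokens": 0.0, "oversized": []}
    return {
        "n_chunks": len(counts),
        "min_tokens": min(counts),
        "max_tokens": max(counts),
        "mean_tokens": sum(counts) / len(counts),
        # Indices of chunks whose estimate exceeds the target size
        "oversized": [c["chunk_idx"] for c in chunks
                      if c["token_count"] > chunk_size],
    }


# Hand-built chunks standing in for real splitter output
example = [
    {"text": "a" * 400, "chunk_idx": 0, "token_count": 100},
    {"text": "b" * 2400, "chunk_idx": 1, "token_count": 600},
]
report = summarize_chunks(example, chunk_size=512)
print(report["n_chunks"], report["max_tokens"], report["oversized"])
```

A non-empty `oversized` list usually points to an overlap setting that pushes chunks past the limit, or to a separator list with no fine-grained fallback.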
Language-Specific Separators
Different document types have different natural boundaries.
def get_separators_for_type(doc_type: str) -> list[str]:
    """Get recommended separators for different document types."""
    # Each list runs from the coarsest boundary to the finest
    separators_map = {
        "markdown": ["\n# ", "\n## ", "\n\n", "\n", ". ", " ", ""],
        "python": ["\nclass ", "\n\ndef ", "\n\n", "\n", " ", ""],
        "javascript": ["\nclass ", "\n\nfunction ", "\n\n", "\n", ";", " ", ""],
        "html": ["\n</section>\n", "\n</div>\n", "\n</p>\n", "\n", " ", ""],
        "legal": ["\nArticle ", "\nSection ", "\n\n", ". ", " ", ""],
        "default": ["\n\n", "\n", ". ", " ", ""]
    }
    return separators_map.get(doc_type, separators_map["default"])

# Usage
separators = get_separators_for_type("python")
chunks = split_recursive("import os\n\ndef my_function():\n    pass", separators=separators)
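When documents arrive as files, the doc_type key can often be guessed from the file extension. This is a minimal sketch: `guess_doc_type` and the `EXTENSION_TO_TYPE` mapping are hypothetical and only cover the keys used in get_separators_for_type above. A type such as "legal" cannot be inferred from an extension and would need to be set explicitly.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the doc_type keys used above
EXTENSION_TO_TYPE = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".py": "python",
    ".js": "javascript",
    ".mjs": "javascript",
    ".html": "html",
    ".htm": "html",
}


def guess_doc_type(filename: str) -> str:
    """Guess a doc_type key from the extension, falling back to 'default'."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix, "default")


for name in ["README.md", "app/main.PY", "contract.txt"]:
    print(name, "->", guess_doc_type(name))
```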
Specialized Splitting for Code
Code has structure (functions, classes, imports). Preserve it.
def split_code(code: str, chunk_size: int = 512, language: str = "python") -> list[dict]:
    """Split code while preserving function and class boundaries."""
    if language == "python":
        # Split on: class definitions, function definitions, blank lines, lines, statements
        separators = [
            "\n\nclass ",      # Class definitions
            "\n\ndef ",        # Function definitions
            "\n\nasync def ",  # Async functions
            "\n\n",            # Blank lines
            "\n",              # Line breaks
            ";"                # Statement end (for single-line code)
        ]
    elif language == "javascript":
        separators = [
            "\nclass ",
            "\nfunction ",
            "\nconst ",
            "\nlet ",
            "\n\n",
            "\n",
            ";"
        ]
    elif language == "rust":
        separators = [
            "\n\nimpl ",
            "\n\nfn ",
            "\n\npub fn ",
            "\n\npub struct ",
            "\n\n",
            "\n",
            ";"
        ]
    else:
        separators = ["\n\n", "\n", " ", ""]
    return split_recursive(code, chunk_size=chunk_size, separators=separators)

# Example: split Python code
code = '''
def hello():
    """A simple function."""
    print("Hello, world!")

class MyClass:
    def __init__(self):
        self.value = 42
'''
chunks = split_code(code, language="python")
for chunk in chunks:
    print(f"--- Chunk {chunk['chunk_idx']} ---")
    print(chunk["text"][:100])
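String separators only approximate code structure. For Python specifically, a different technique is to parse the source with the standard-library `ast` module and cut at top-level definitions. The sketch below is an alternative to separator-based splitting, not the method used above. `split_python_by_definition` is a hypothetical name, it needs Python 3.8+ for `end_lineno`, and it drops blank lines and comments that sit between top-level statements.

```python
import ast


def split_python_by_definition(source: str) -> list[dict]:
    """Return one chunk per top-level statement, function, or class."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start at the first one
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        chunks.append({
            "text": "".join(lines[start - 1:node.end_lineno]),
            "kind": type(node).__name__,  # e.g. "FunctionDef", "ClassDef"
            "name": getattr(node, "name", None),
        })
    return chunks


sample = '''import os

@some_decorator
def hello():
    """A simple function."""
    print("Hello, world!")

class MyClass:
    def __init__(self):
        self.value = 42
'''
code_chunks = split_python_by_definition(sample)
for c in code_chunks:
    print(c["kind"], c["name"])
```

Chunks produced this way never cut through a function body, but a single large class can still exceed chunk_size and would need a further pass.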
Handling Mixed Content (Text + Code)
Documents with embedded code blocks need special handling.
import re

def split_mixed_content(text: str, chunk_size: int = 512) -> list[dict]:
    """Split text while preserving code blocks as atomic units."""
    # Find fenced code blocks (```language ... ```)
    code_pattern = r'```[\w]*\n(.*?)\n```'
    code_blocks = list(re.finditer(code_pattern, text, re.DOTALL))
    chunks = []
    last_end = 0
    chunk_idx = 0
    for code_match in code_blocks:
        # Process text before code block
        text_before = text[last_end:code_match.start()]
        if text_before.strip():
            text_chunks = split_recursive(text_before, chunk_size=chunk_size)
            for tc in text_chunks:
                tc["chunk_idx"] = chunk_idx
                tc["type"] = "text"
                chunks.append(tc)
                chunk_idx += 1
        # Add code block as an atomic unit if it fits
        code_text = code_match.group(0)
        code_size = len(code_text) // 4  # Rough token estimate
        if code_size <= chunk_size:
            chunks.append({
                "text": code_text,
                "chunk_idx": chunk_idx,
                "token_count": code_size,
                "type": "code"
            })
            chunk_idx += 1
        else:
            # Code block is too large to keep whole; split its body recursively
            code_inner = code_match.group(1)
            code_chunks = split_recursive(code_inner, chunk_size=chunk_size)
            for cc in code_chunks:
                cc["chunk_idx"] = chunk_idx
                cc["type"] = "code"
                chunks.append(cc)
                chunk_idx += 1
        last_end = code_match.end()
    # Process remaining text after last code block
    remaining = text[last_end:]
    if remaining.strip():
        remaining_chunks = split_recursive(remaining, chunk_size=chunk_size)
        for rc in remaining_chunks:
            rc["chunk_idx"] = chunk_idx
            rc["type"] = "text"
            chunks.append(rc)
            chunk_idx += 1
    return chunks
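Mixed documents often contain tables as well as code. The sketch below shows one possible detection pass that labels runs of lines as code, table, or text before any splitting happens. It assumes Markdown-style fenced code and pipe tables; `segment_by_type` is a hypothetical helper, and the fence marker is built from single backticks only to keep the example easy to embed.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def segment_by_type(text: str) -> list[dict]:
    """Group consecutive lines into 'code', 'table', or 'text' segments."""
    segments: list[dict] = []
    in_code = False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        is_fence = stripped.startswith(FENCE)
        opening = is_fence and not in_code
        if in_code or is_fence:
            kind = "code"
        elif len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            kind = "table"
        else:
            kind = "text"
        # Extend the current segment unless the type changes or a new fence opens
        if segments and segments[-1]["type"] == kind and not opening:
            segments[-1]["text"] += line
        else:
            segments.append({"type": kind, "text": line})
        if is_fence:
            in_code = not in_code
    return segments


doc = "\n".join([
    "Intro paragraph.",
    FENCE + "python",
    "print('hi')",
    FENCE,
    "| a | b |",
    "|---|---|",
    "| 1 | 2 |",
    "Closing text.",
])
segments = segment_by_type(doc)
print([s["type"] for s in segments])
```

Each segment can then be routed to a suitable splitter: prose to the recursive splitter, code and tables kept whole where they fit.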
Comparison: Fixed vs Recursive Chunking
| Aspect | Fixed-Size | Recursive |
|---|---|---|
| Speed | Very fast (O(n)) | Fast (about O(n·k) for k separator levels) |
| Structure Preservation | None | Good |
| Code/Lists | May split mid-block | Preserves blocks |
| Complexity | Simple | Moderate |
| Token Predictability | Perfect | Good |
| Retrieval Precision | 82–88% | 85–92% |
Recursive chunking is best for mixed or structured documents; fixed-size is better for uniform text and latency-critical systems.
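The difference in structure preservation is easy to see on a toy input. The sketch below compares a fixed-size cut with a deliberately simplified structure-aware splitter that shows only the paragraph level of the recursion. Both functions are illustrative, and sizes are measured in characters rather than tokens to keep the example small.

```python
def split_fixed(text: str, max_chars: int) -> list[str]:
    """Fixed-size baseline: cut every max_chars characters, ignoring structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def split_on_paragraphs(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars falls back to the fixed-size cut.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(para) <= max_chars:
            current = para
        else:
            chunks.extend(split_fixed(para, max_chars))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Cats sleep a lot.\n\nDogs bark at night.\n\nBirds sing at dawn."
print(split_fixed(text, 25))          # first chunk ends mid-word, inside "bark"
print(split_on_paragraphs(text, 25))  # each paragraph stays intact
```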
Key Takeaways
- Recursive chunking respects document structure by splitting on progressively finer boundaries.
- Use language/format-specific separators: code uses function/class boundaries; legal uses section boundaries.
- Preserve code blocks as atomic units when they fit within chunk_size; avoid splitting inside functions or class definitions.
- Recursive chunking outperforms fixed-size by 3–12% on complex documents while remaining fast.
- For mixed content (text + code + tables), detect and handle each type separately.
Frequently Asked Questions
Should I use recursive chunking or fixed-size?
Use recursive for structured documents (code, documentation, legal). Use fixed-size for speed-critical systems and uniform text. For most production RAG, recursive with good separators is worth the small latency cost.
What if my document type isn't in the separator list?
Observe the document's natural structure and add separators accordingly. For a scientific paper in Markdown: split on "## " (sections), then "\n\n" (paragraphs), then ". " (sentences). Test and iterate.
Can I use regex for more complex boundaries?
Yes, but regex matching adds latency. For complex boundaries (HTML tags, custom markers), use a regex to locate the boundaries first, then split at those positions. The trade-off is more accurate boundaries at the cost of slower processing.
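A minimal sketch of that two-step approach, assuming the boundary of interest is a Markdown heading at the start of a line. `split_at_boundaries` and the `HEADING` pattern are illustrative names; any compiled pattern can be passed in.

```python
import re

# Assumed boundary pattern: a Markdown heading at the start of a line
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_at_boundaries(text: str, pattern: re.Pattern = HEADING) -> list[str]:
    """Cut text so that every regex match starts a new piece."""
    # Step 1: locate the boundaries
    starts = [m.start() for m in pattern.finditer(text)]
    # Step 2: slice between them; 0 is included so any preamble is kept
    cut_points = sorted(set([0] + starts + [len(text)]))
    return [text[a:b] for a, b in zip(cut_points, cut_points[1:])]


doc = "Preamble.\n# Title\nBody text.\n## Section\nMore text.\n"
for piece in split_at_boundaries(doc):
    print(repr(piece))
```

Pieces that are still larger than chunk_size can be passed on to the recursive splitter.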
What happens if a single sentence exceeds chunk_size?
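The splitter falls through to the next finer separator. With the default list that is " ", so an overlong sentence is cut at word boundaries. Only an unbroken run longer than chunk_size (a long URL, a base64 blob) reaches character-level splitting, which can cut mid-word. To reduce mid-sentence cuts, raise chunk_size or add clause-level separators such as "; " and ", " ahead of " ".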
How do I handle multilingual documents?
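Sentence boundaries differ across languages. Use a language-aware sentence splitter (NLTK, spaCy) instead of simple ". " splitting. Chinese and Japanese do not put spaces between words and end sentences with full-width marks such as "。", so ". " and " " never match; add those marks to the separator list, or move character-based splitting earlier in the recursion.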
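Where a full NLP library is not available, a regex that also recognizes full-width punctuation is a rough stand-in. The sketch below is not a substitute for a language-aware splitter: the set of end marks is an assumption and far from exhaustive, abbreviations such as "Dr. " will be cut incorrectly, and `split_sentences` is a hypothetical name.

```python
import re

# A sentence ends at a full-width mark, at an ASCII mark followed by
# whitespace, or at the end of the text. Trailing whitespace stays attached.
_SENTENCE = re.compile(r".*?(?:[。！？]|[.!?](?=\s)|\Z)\s*", re.DOTALL)


def split_sentences(text: str) -> list[str]:
    """Split text into sentences without dropping any characters."""
    return [s for s in _SENTENCE.findall(text) if s]


mixed = "今日は晴れです。明日は雨でしょう。Pi is 3.14 today. Done!"
for sentence in split_sentences(mixed):
    print(repr(sentence))
```

The resulting sentences can be merged back up to chunk_size in the same way the recursive splitter merges small pieces.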