Semantic chunking for RAG: Split by meaning, not just token count
Semantic chunking splits documents where meaning changes, not at fixed token boundaries. It uses embedding models to compute semantic similarity between consecutive sentences or paragraphs, then places chunk boundaries where similarity drops. The result: chunks that contain a single coherent idea, even if that idea spans 200 or 2,000 tokens.
Semantic chunking requires an embedding pass over every sentence at indexing time, which makes indexing slower and more expensive, but it improves retrieval precision by 8–15% compared to fixed-size chunking on complex documents (Gao et al., 2025). For production RAG systems that prioritize retrieval quality over indexing speed, it is often the better choice. This article covers implementation, tradeoffs, and practical tips for tuning the threshold and choosing a model.
The Core Algorithm
Semantic chunking works in five steps:
- Break text into candidate chunks (sentences or paragraphs).
- Embed each candidate using an embedding model.
- Compute similarity between consecutive chunks.
- Identify boundaries where similarity is low (meaning has shifted).
- Merge/split to reach target chunk size.
import re
from typing import Callable, Optional

import numpy as np


def semantic_chunking(text: str,
                      embed_fn: Callable[[str], np.ndarray],
                      similarity_threshold: float = 0.5,
                      chunk_size_target: int = 512,
                      sentence_splitter: Optional[Callable[[str], list[str]]] = None) -> list[dict]:
    """
    Split text semantically using embeddings.

    Args:
        text: Full document text
        embed_fn: Function that takes text and returns an embedding vector (1D numpy array)
        similarity_threshold: Cosine similarity below this triggers a chunk boundary
        chunk_size_target: Target token count per chunk (not enforced in this basic
            version; see semantic_chunking_with_size for size limits)
        sentence_splitter: Function to split text into sentences (default: simple regex)

    Returns:
        List of semantic chunks with metadata
    """
    # Default sentence splitter: split on whitespace that follows ".", "!" or "?",
    # keeping the punctuation attached to the sentence
    if sentence_splitter is None:
        def sentence_splitter(text):
            sentences = re.split(r'(?<=[.!?])\s+', text)
            return [s.strip() for s in sentences if s.strip()]

    # Split into sentences
    sentences = sentence_splitter(text)
    if not sentences:
        return []
    if len(sentences) == 1:
        return [{"text": sentences[0], "chunk_idx": 0, "sentence_count": 1,
                 "similarity_boundary": False}]

    # Embed each sentence; None marks a failed embedding
    embeddings = []
    for sentence in sentences:
        try:
            embeddings.append(np.asarray(embed_fn(sentence), dtype=float))
        except Exception as e:
            print(f"⚠ Embedding failed for sentence: {sentence[:50]}... Error: {e}")
            embeddings.append(None)

    # Compute similarity between consecutive sentences.
    # A failed embedding counts as similarity 0.0, which forces a boundary
    # without assuming a fixed embedding dimension.
    similarities = []
    for prev, curr in zip(embeddings, embeddings[1:]):
        if prev is None or curr is None:
            similarities.append(0.0)
        else:
            similarities.append(cosine_similarity(prev, curr))

    # Identify boundaries: where similarity is below threshold
    boundaries = [0]  # Start of text
    for i, sim in enumerate(similarities):
        if sim < similarity_threshold:
            boundaries.append(i + 1)  # Boundary after sentence i
    boundaries.append(len(sentences))  # End of text

    # Build chunks from boundary indices
    chunks = []
    for i in range(len(boundaries) - 1):
        start_idx = boundaries[i]
        end_idx = boundaries[i + 1]
        chunks.append({
            "text": ' '.join(sentences[start_idx:end_idx]),
            "chunk_idx": i,
            "sentence_count": end_idx - start_idx,
            "similarity_boundary": i > 0,  # False only for the first chunk
        })

    # Optional: merge very small chunks back together
    chunks = merge_small_chunks(chunks, min_token_count=50)
    return chunks


def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    if len(vec1) == 0 or len(vec2) == 0:
        return 0.0
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)
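

def merge_small_chunks(chunks: list[dict], min_token_count: int = 50) -> list[dict]:
    """Merge chunks smaller than min_token_count (whitespace words) with neighbors."""
    def join(a: dict, b: dict) -> dict:
        # Concatenate two adjacent chunks, keeping the first chunk's metadata
        out = dict(a, text=a["text"] + " " + b["text"])
        if "sentence_count" in a and "sentence_count" in b:
            out["sentence_count"] = a["sentence_count"] + b["sentence_count"]
        return out

    merged = []
    pending = None  # Undersized chunk waiting to absorb its next neighbor
    for chunk in chunks:
        # Copy or join so the caller's dicts are never mutated
        current = dict(chunk) if pending is None else join(pending, chunk)
        pending = None
        if len(current["text"].split()) < min_token_count:
            pending = current  # Still too small: keep merging forward
        else:
            merged.append(current)

    # A small trailing chunk has no next neighbor: fold it into the previous chunk
    if pending is not None:
        if merged:
            merged[-1] = join(merged[-1], pending)
        else:
            merged.append(pending)

    # Re-index chunk_idx
    for idx, chunk in enumerate(merged):
        chunk["chunk_idx"] = idx
    return merged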
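To see the boundary rule in isolation, here is a minimal, self-contained sketch. The `toy_embed` function and its three-word vocabulary are hypothetical stand-ins for a real embedding model, chosen only so the example runs without any downloads; `find_boundaries` is likewise an illustrative name, not part of the code above.

```python
import numpy as np

# Hypothetical vocabulary: each sentence becomes a vector of keyword counts.
VOCAB = ["cat", "dog", "stock"]


def toy_embed(sentence: str) -> np.ndarray:
    """Stand-in for a real embedding model: count vocabulary words."""
    words = sentence.lower().replace(".", "").split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def find_boundaries(sentences: list[str], threshold: float) -> list[int]:
    """Return the indices of sentences that start a new chunk."""
    vectors = [toy_embed(s) for s in sentences]
    return [
        i + 1
        for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold
    ]


sentences = [
    "The cat chased the dog.",
    "The dog ignored the cat.",
    "The stock fell sharply.",
    "The stock recovered later.",
]
boundaries = find_boundaries(sentences, threshold=0.5)
print(boundaries)  # [2]: the topic shifts before the third sentence
```

The first two sentences share the same keyword vector, as do the last two, so the only similarity drop is between sentences two and three. A real embedding model produces less clear-cut values, which is why the threshold needs tuning.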
Embedding Models for Semantic Chunking
Choose an embedding model based on latency and quality needs.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Option 1: OpenAI text-embedding-3-large (high quality, cloud-based)
_openai_client = None  # Created lazily and reused across calls


def embed_openai(text: str, model: str = "text-embedding-3-large") -> np.ndarray:
    """Embed text using the OpenAI API."""
    global _openai_client
    if _openai_client is None:
        _openai_client = OpenAI()  # Reads the OPENAI_API_KEY environment variable
    response = _openai_client.embeddings.create(input=text, model=model)
    return np.array(response.data[0].embedding)


# Option 2: Hugging Face sentence-transformers (local, fast)
_local_models: dict[str, SentenceTransformer] = {}  # Cache so each model loads only once


def embed_local(text: str, model: SentenceTransformer = None) -> np.ndarray:
    """Embed text using a local sentence-transformer model."""
    if model is None:
        name = "all-MiniLM-L6-v2"  # Fast, 384-dim
        if name not in _local_models:
            _local_models[name] = SentenceTransformer(name)
        model = _local_models[name]
    return model.encode(text, convert_to_numpy=True)


# Option 3: Anthropic (no embeddings endpoint)
def embed_anthropic(text: str, model: str = "claude-3-5-sonnet") -> np.ndarray:
    """Placeholder: Anthropic does not offer a standalone embeddings API."""
    # Note: As of 2026, Anthropic does not offer a standalone embeddings API.
    # Use a dedicated embedding model (Option 1 or 2) instead.
    raise NotImplementedError("Anthropic does not provide an embeddings API")


# Recommendation matrix
print("""
Model Choice for Semantic Chunking:
| Model | Speed | Quality | Cost | Best For |
|-------|-------|---------|------|----------|
| all-MiniLM-L6-v2 | Fast | Good (83%) | Free | Prototypes, local |
| text-embedding-3-small | Fast | Very Good (88%) | $ | Production, balanced |
| text-embedding-3-large | Moderate | Excellent (92%) | $$$ | Mission-critical |
| BAAI/bge-large-en | Very Fast | Good (85%) | Free | Large-scale indexing |
""")
Tuning the Similarity Threshold
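The threshold is critical. A boundary is placed wherever similarity falls below it, so a high threshold (> 0.7) cuts often and produces many small, fine-grained chunks, while a low threshold (< 0.3) cuts rarely and produces a few large chunks that mix topics.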
def find_optimal_threshold(text: str, embed_fn,
                           target_chunk_count: int = 10) -> float:
    """
    Experiment with thresholds to find one that produces desired chunk count.
    """
    import re
    # Drop empty strings so they are never sent to the embedding model
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # Embed all sentences
    embeddings = [embed_fn(s) for s in sentences]

    # Compute all similarities
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = cosine_similarity(embeddings[i], embeddings[i + 1])
        similarities.append(sim)

    # Try different thresholds
    thresholds = np.arange(0.1, 1.0, 0.1)
    results = []
    for threshold in thresholds:
        boundary_count = sum(1 for sim in similarities if sim < threshold)
        chunk_count = boundary_count + 1
        results.append((float(threshold), chunk_count))

    # Find threshold closest to target
    best_threshold, best_count = min(results, key=lambda x: abs(x[1] - target_chunk_count))
    print(f"Threshold {best_threshold:.2f} produces {best_count} chunks (target: {target_chunk_count})")
    return best_threshold
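The direction of the threshold is easy to get backwards, so a small worked example may help. The similarity values below are made up for illustration, and `chunk_count` is a hypothetical helper. Because a boundary is placed wherever similarity falls below the threshold, raising the threshold can only add boundaries.

```python
# Hypothetical cosine similarities between six consecutive sentence pairs.
similarities = [0.9, 0.8, 0.2, 0.85, 0.6, 0.4]


def chunk_count(similarities: list[float], threshold: float) -> int:
    """Each similarity below the threshold adds one boundary, hence one more chunk."""
    return sum(1 for sim in similarities if sim < threshold) + 1


counts = {t: chunk_count(similarities, t) for t in (0.3, 0.5, 0.7)}
for threshold, count in counts.items():
    print(f"threshold={threshold}: {count} chunks")
# threshold=0.3: 2 chunks
# threshold=0.5: 3 chunks
# threshold=0.7: 4 chunks
```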
Adaptive Chunking: Combine Semantic + Size Constraints
Semantic boundaries alone may create very large or very small chunks. Combine with size constraints.
def semantic_chunking_with_size(text: str,
                                embed_fn: Callable,
                                similarity_threshold: float = 0.5,
                                min_chunk_size: int = 256,
                                max_chunk_size: int = 1024) -> list[dict]:
    """
    Semantic chunking with min/max token size enforcement.

    Token counts are approximated by whitespace-separated word counts.
    """
    import re
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return []

    # Embed and get similarity-based boundaries
    embeddings = [embed_fn(s) for s in sentences]
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = cosine_similarity(embeddings[i], embeddings[i + 1])
        similarities.append(sim)

    # Identify natural boundaries (low similarity)
    boundaries = [0]
    for i, sim in enumerate(similarities):
        if sim < similarity_threshold:
            boundaries.append(i + 1)
    boundaries.append(len(sentences))

    def count_tokens(start: int, end: int) -> int:
        return sum(len(s.split()) for s in sentences[start:end])

    # Enforce the minimum: drop a boundary when the segment before it is too small,
    # which merges that segment into the next one
    kept = [0]
    for b in boundaries[1:-1]:
        if count_tokens(kept[-1], b) >= min_chunk_size:
            kept.append(b)
    kept.append(len(sentences))
    # A small final segment has no successor, so merge it into the previous one
    if len(kept) > 2 and count_tokens(kept[-2], kept[-1]) < min_chunk_size:
        del kept[-2]

    # Enforce the maximum: pack sentences greedily into pieces of at most
    # max_chunk_size (a single longer sentence still becomes its own piece)
    chunks = []
    for start_idx, end_idx in zip(kept, kept[1:]):
        piece, piece_tokens, was_split = [], 0, False
        for sentence in sentences[start_idx:end_idx]:
            n = len(sentence.split())
            if piece and piece_tokens + n > max_chunk_size:
                chunks.append({"text": ' '.join(piece), "chunk_idx": len(chunks),
                               "token_count": piece_tokens,
                               "within_semantic_boundary": True})
                piece, piece_tokens, was_split = [], 0, True
            piece.append(sentence)
            piece_tokens += n
        # within_semantic_boundary is True when this chunk is part of a split segment
        chunks.append({"text": ' '.join(piece), "chunk_idx": len(chunks),
                       "token_count": piece_tokens,
                       "within_semantic_boundary": was_split})
    return chunks
Performance Tradeoffs
Semantic chunking is slower than fixed-size due to embedding computations.
| Stage | Time (per 100K tokens) | Notes |
|---|---|---|
| Text to sentences | 0.5s | Regex-based |
| Embedding all sentences | 30–300s | Depends on model (local vs cloud) |
| Similarity computation | 0.1s | Fast dot products |
| Boundary detection & merge | 0.2s | O(n) |
| Total | 31–301s | 2–3 orders of magnitude slower than fixed-size chunking |
For large documents, consider:
- Batching embeddings (the OpenAI embeddings endpoint accepts an array of inputs in a single request, up to 2,048 per its API reference)
- Using a faster embedding model (MiniLM vs. text-embedding-3-large)
- Pre-computing and caching embeddings
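Batching and caching can be combined in one small wrapper. The sketch below is an illustration under stated assumptions: `embed_with_cache` and `fake_embed_batch` are hypothetical names, the cache is a plain in-memory dict keyed by a SHA-256 hash of the text, and the batch embedder is a stub standing in for a real API or local model call.

```python
import hashlib
from typing import Callable

import numpy as np


def embed_with_cache(
    texts: list[str],
    embed_batch_fn: Callable[[list[str]], list[np.ndarray]],
    cache: dict[str, np.ndarray],
    batch_size: int = 64,
) -> list[np.ndarray]:
    """Embed texts in batches, skipping any text already in the cache."""
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Collect each uncached text once, preserving first-seen order
    missing = list(dict.fromkeys(t for t in texts if key(t) not in cache))
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        for text, vector in zip(batch, embed_batch_fn(batch)):
            cache[key(text)] = np.asarray(vector)
    return [cache[key(t)] for t in texts]


# Hypothetical batch embedder that records the size of each call it receives
calls = []


def fake_embed_batch(batch: list[str]) -> list[np.ndarray]:
    calls.append(len(batch))
    return [np.array([float(len(t))]) for t in batch]


cache: dict[str, np.ndarray] = {}
first = embed_with_cache(["a", "bb", "a", "ccc"], fake_embed_batch, cache, batch_size=2)
second = embed_with_cache(["bb", "dddd"], fake_embed_batch, cache, batch_size=2)
print(calls)  # [2, 1, 1]: duplicates and cached texts are never re-embedded
```

In a real pipeline the dict would likely be replaced by a persistent store so that re-indexing an edited document only embeds the sentences that changed.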
Key Takeaways
- Semantic chunking improves retrieval precision by 8–15% over fixed-size chunking on complex documents (Gao et al., 2025) by splitting where meaning changes, not at fixed token boundaries.
- Similarity threshold (0.3–0.7) controls chunk granularity; tune based on your documents.
- Embedding models (sentence-transformers vs. OpenAI) trade off speed and quality.
- Combine semantic boundaries with size constraints (min/max tokens) for robust chunks.
- Semantic chunking is slower than fixed-size but worth it for quality-critical RAG.
Frequently Asked Questions
Should I always use semantic chunking?
Not necessarily. Fixed-size chunking is faster and sufficient for simple documents. Use semantic chunking for complex, mixed-domain documents where meaning boundaries matter: legal contracts, technical books, research papers.
What embedding model should I use?
For prototypes: all-MiniLM-L6-v2 (free, local, 384-dim). For production: text-embedding-3-small (fast, good quality). For maximum quality: text-embedding-3-large. Match your latency/quality needs.
Can I use LLM embeddings (Claude) for chunking?
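Not directly. Anthropic does not offer a standalone embeddings API, and using a full LLM to compare sentences would be slower and more expensive than a dedicated embedding model. Use a dedicated embedding model for chunking, and use Claude for post-retrieval ranking instead.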
How do I handle very long sentences?
If a single sentence exceeds max_chunk_size, split it on word or character boundaries as a fallback. A more robust option is a dedicated sentence splitter (spaCy, nltk), which handles long, complex sentences better than a simple regex.
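A word-boundary fallback can be as simple as the sketch below. `split_long_sentence` is a hypothetical helper, and it counts whitespace-separated words as an approximation of tokens, the same approximation the chunking code in this article uses.

```python
def split_long_sentence(sentence: str, max_tokens: int) -> list[str]:
    """Split a sentence into pieces of at most max_tokens whitespace-separated words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = sentence.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


pieces = split_long_sentence("one two three four five six seven", max_tokens=3)
print(pieces)  # ['one two three', 'four five six', 'seven']
```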
What if similarity values are all very high/low?
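High similarity throughout means your document is homogeneous, so semantic chunking won't help much. Low similarity throughout means almost every sentence pair falls below the threshold and boundaries appear everywhere; lower your threshold or fall back to fixed-size chunking. In both cases, check your embedding model and text quality first.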
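Before changing anything, it can help to look at how the similarity values are actually distributed. The sketch below is one possible diagnostic, not something prescribed by this article: `summarize_similarities` is a hypothetical helper, and the percentile-based threshold it suggests is an assumption (a relative cutoff in place of the absolute one used above).

```python
import numpy as np


def summarize_similarities(similarities: list[float], percentile: float = 20.0) -> dict:
    """Describe the similarity distribution and suggest a relative threshold."""
    sims = np.asarray(similarities, dtype=float)
    return {
        "min": float(sims.min()),
        "max": float(sims.max()),
        "mean": float(sims.mean()),
        "std": float(sims.std()),
        # Assumption: place the cutoff at the given percentile of the observed values
        "suggested_threshold": float(np.percentile(sims, percentile)),
    }


# Hypothetical similarities from a homogeneous document: every value is high
stats = summarize_similarities([0.82, 0.85, 0.80, 0.90, 0.84, 0.70])
print(stats)
```

A narrow spread with a high mean, as in this made-up input, suggests a fixed threshold such as 0.5 would never fire, while a relative cutoff would still separate the weakest transitions from the rest.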