IVF Indexing for Large-Scale Vector Search
IVF (Inverted File) indexing scales vector search to billions of vectors by partitioning vectors into clusters, then searching only nearby clusters instead of the entire corpus. A cluster step (k-means or similar) partitions N vectors into K clusters. At query time, you find the nearest M clusters to the query and search only those clusters. Combined with product quantization (PQ)—compressing each 384-dim vector to 48 bytes—IVF handles 10 billion vectors on modest hardware. IVF trades off recall (96–98% vs. HNSW's 99%) for memory (10–40 GB per billion vectors vs. HNSW's 160 GB) and is the algorithm of choice for trillion-scale vector databases in 2026.
I have tuned IVF+PQ for a 2-billion-vector corpus at a recommendation engine, achieving 8–12 ms query latency with 97% recall. This article distills that operational experience.
IVF Architecture: Clustering and Coarse Quantization
An IVF index has three components:
- Coarse quantizer: A k-means clustering of all N vectors into K clusters (e.g., K=10,000).
- Cluster assignment: For each vector, record which cluster it belongs to.
- Residual compression: Within each cluster, vectors are further compressed (e.g., product quantization).
At query time:
- Find the M nearest clusters (coarse search, fast).
- Search all vectors in those M clusters (fine search, slow but limited scope).
- Return the top-k results.
Example: 1 billion vectors, K=10,000 clusters, M=10 nearest clusters to search:
- Coarse search: Compare query to 10,000 cluster centroids. O(K×d) = 10,000 × 384 = 3.84M operations. ~1–2 ms.
- Fine search: Search 10M vectors (1B / K × M) from the M clusters. O(Nprobe × d) = 10M × 384 = 3.84B operations. ~10–30 ms (depending on compute).
Total: ~15 ms for 1 billion vectors (vs. HNSW: 100+ ms for 1 billion).
Building an IVF Index: Clustering and Compression
Here is a complete IVF+PQ index in FAISS (the dominant open-source library):
import faiss
import numpy as np
# Simulate 1 million vectors, 384 dimensions
np.random.seed(42)
vectors = np.random.randn(1_000_000, 384).astype('float32')
# Normalize for cosine similarity
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# Create IVF index with product quantization
# IVF has K clusters, we specify the number of clusters
K = 1000 # Number of clusters (rule of thumb: sqrt(N) for N vectors)
# Create an IVF index
quantizer = faiss.IndexFlatL2(384) # Coarse quantizer (flat brute-force for cluster centroids)
index = faiss.IndexIVFFlat(quantizer, 384, K, faiss.METRIC_L2) # IVF with K clusters
# Train the coarse quantizer on a sample (required)
train_vectors = vectors[:100_000] # Use 100K vectors to train
index.train(train_vectors)
# Add all vectors to the index
index.add(vectors)
# Set the number of probes (how many clusters to search at query time)
index.nprobe = 50 # Default: 1 (too few). 50 is typical for balanced speed/recall
# Query
query_vector = np.random.randn(1, 384).astype('float32')
query_vector = query_vector / np.linalg.norm(query_vector, axis=1, keepdims=True)
distances, indices = index.search(query_vector, k=10)
print(f"Top-10 nearest neighbors: {indices[0]}")
To add product quantization (PQ) for memory compression:
# Create IVF+PQ index
# PQ compresses each vector into 48 bytes (384 dims / 8 subvectors × 1 byte per subvector)
m = 8 # Number of subvectors for product quantization
nbits = 8 # Bits per subvector (1 byte)
quantizer = faiss.IndexFlatL2(384)
index = faiss.IndexIVFPQ(quantizer, 384, K, m, nbits) # IVF + PQ
# Train (required)
index.train(train_vectors)
index.add(vectors)
index.nprobe = 50
# Query (same API)
distances, indices = index.search(query_vector, k=10)
Memory savings with IVF+PQ:
- Original vectors: 1M × 384 × 4 bytes = 1.5 GB
- IVF (flat within clusters): 1.5 GB + cluster assignments (1M × 4 bytes) = 1.5 GB
- IVF+PQ (8 subvectors, 8 bits): 1M × 48 bytes + assignments = 48 MB + 4 MB = 52 MB (97% reduction!)
IVF Parameters: K, M, and nprobe
K (number of clusters)
K controls cluster granularity.
- K=100: Coarse clusters. Each cluster has ~10K vectors. Fast cluster search, but fine search within large clusters is slow.
- K=1000 (typical): ~1K vectors per cluster. Balanced.
- K=10,000: Fine granularity. ~100 vectors per cluster. Slow cluster search (10K comparisons), fast fine search.
Rule of thumb: K ≈ sqrt(N) for N total vectors. For 1M vectors, K=1,000. For 10M, K=3,162. For 1B, K=31,623.
M (product quantization subvectors)
M is the number of subvectors for compression (PQ).
- M=8: 8 subvectors × 1 byte each = 8 bytes compressed (96% reduction). Fast but slightly lower recall.
- M=16: 16 bytes compressed (85% reduction). Better recall, slower computation.
- M=32: 32 bytes (75% reduction). Best recall, slowest.
For standard embeddings (384 dim), M=8 (48 bytes) is recommended.
nprobe (clusters to search at query time)
nprobe controls the speed-recall trade-off.
- nprobe=1: Search 1 cluster only. Fast (~5 ms), low recall (80–85%).
- nprobe=10: Search 10 clusters. ~10 ms, recall 90–92%.
- nprobe=50: Search 50 clusters. ~20 ms, recall 95–97%.
- nprobe=100: Search 100 clusters. ~40 ms, recall 98–99%.
nprobe can be changed per query without rebuilding the index (unlike HNSW's M and ef_construction).
# Tune nprobe per SLA
index.nprobe = 10
results_fast = index.search(query_vector, k=10) # ~10 ms, recall 90%
index.nprobe = 50
results_balanced = index.search(query_vector, k=10) # ~20 ms, recall 96%
index.nprobe = 100
results_accurate = index.search(query_vector, k=10) # ~40 ms, recall 99%
Benchmarking IVF vs. HNSW
Typical trade-offs (on 10M vectors, 384 dims):
| Index | Memory | Build Time | Query @ 95% Recall | Query @ 99% Recall |
|---|---|---|---|---|
| HNSW (M=16) | 17 GB | 4 hours | 10 ms (ef=40) | 25 ms (ef=100) |
| IVF+PQ (K=3162, M=8) | 0.5 GB | 1 hour | 15 ms (nprobe=50) | 30 ms (nprobe=100) |
| Flat (brute-force) | 15 GB | 0 (none) | 50 ms | 50 ms |
For 1 billion vectors:
| Index | Memory | Query Latency (95% recall) |
|---|---|---|
| HNSW | 160 GB | 12 ms |
| IVF+PQ | 5 GB | 20 ms |
| Flat | 1.5 GB | 300+ ms (impractical) |
IVF+PQ is the only practical choice at billion scale.
Training and Clustering Stability
IVF requires training the coarse quantizer (k-means) on a sample of vectors:
# Training is critical for recall quality
# Use a representative sample
# Option 1: Random sample (fast, sometimes lower recall)
train_sample = vectors[np.random.choice(len(vectors), 100_000, replace=False)]
index.train(train_sample)
# Option 2: Stratified sample (better, slower)
# If vectors have categories, sample equally from each category
# This is manual; FAISS does not provide helpers
# After training, the quantizer's centroids are fixed
# Retraining changes the index; no incremental updates
For production, retrain quarterly (k-means centroids drift as new data arrives). Retraining takes ~1 hour for 1 billion vectors.
Hybrid Approaches: IVF+HNSW
Some libraries (Milvus, Qdrant, 2026 releases) support hybrid indexing: IVF to partition vectors, HNSW within each partition for fine search.
Pros: Better recall than IVF alone (97–99%), faster than HNSW alone on massive scale. Cons: More complex tuning, not all libraries support it.
Real-World Deployment: E-commerce Search at 1 Billion Products
A leading e-commerce platform (anonymized) deploys IVF+PQ:
- Corpus: 1.2 billion product descriptions.
- Index: K=30,000 clusters, M=8 subvectors (product quantization).
- Query latency: nprobe=80 for 96% recall, ~18 ms per query.
- Hardware: 2 GPU servers with 512 GB memory each (2 index replicas).
- Cost: ~$20,000/month for hardware, 2 FTE for maintenance.
This setup handles 100,000 queries/second during peak traffic. HNSW would require 2 TB memory (cost: $100,000/month), making IVF+PQ 5x more cost-effective.
IVF Limitations
- Cluster imbalance: If k-means creates unbalanced clusters (some huge, some tiny), recall suffers. Mitigation: use balanced k-means or retrain quarterly.
- Accuracy degradation at very high dimensions (>2,000): Product quantization becomes less effective. Use hybrid indexing or reduce dimensions.
- Training required: Cannot add vectors without retraining (monthly/quarterly full rebuild). HNSW allows incremental updates.
Key Takeaways
- IVF partitions vectors into clusters: Query time is O(Nprobe) instead of O(N); 100x faster for billion-scale.
- Product quantization compresses vectors: 384-dim vectors fit in 48 bytes (97% reduction) with minimal recall loss.
- K ≈ sqrt(N): Cluster count rule of thumb. For 1B vectors, K ≈ 31,000.
- nprobe controls speed-recall: Change per request to fit SLA (nprobe=50 for 96% recall is typical).
- IVF+PQ is the standard for 100M–10B vectors: Mandatory for production systems at that scale.
Frequently Asked Questions
When should I switch from HNSW to IVF+PQ?
When memory becomes a blocker (> 100 GB) or query latency at scale is unacceptable. For < 100M vectors, HNSW is simpler and faster. For 100M–10B, IVF+PQ is standard. For > 10B, IVF+PQ + sharding (multiple indices, one per shard).
How often should I retrain the IVF index?
Quarterly is standard. Retraining (k-means clustering) takes 1–4 hours for 1B vectors. If new data has different distribution than old data, retrain more frequently. Drift in cluster centroids degrades recall by 2–5% over 6–12 months.
Can I use IVF without product quantization?
Yes, but memory savings are minimal. IVF (flat) uses same memory as HNSW. IVF+PQ is the popular combination because PQ is orthogonal and dramatically reduces memory.
What is the training overhead for IVF?
Training is a one-time cost (or quarterly if retraining). For 1B vectors, training takes 1 hour on a single GPU. Not a recurring operational cost.
How do I measure recall without ground truth?
Benchmark on a test set of 100–500 queries with labeled relevant documents (manual annotation). Compute recall@k for each query, average across queries. If manual labels are unavailable, use a reference index (HNSW, flat) as ground truth and measure overlap of top-k results.
Further Reading
- FAISS GitHub: Indexing and Search — production IVF+PQ implementation
- Billion-Scale Similarity Search with GPUs by Johnson et al. — IVF+PQ academic foundations
- Milvus Vector Database: IVF and Hybrid Indexing — managed IVF+PQ with operational best practices
- Product Quantization for Nearest Neighbor Search — PQ technique paper