Skip to main content

IVF Indexing for Large-Scale Vector Search

IVF (Inverted File) indexing scales vector search to billions of vectors by partitioning vectors into clusters, then searching only nearby clusters instead of the entire corpus. A cluster step (k-means or similar) partitions N vectors into K clusters. At query time, you find the nearest M clusters to the query and search only those clusters. Combined with product quantization (PQ)—compressing each 384-dim vector to 48 bytes—IVF handles 10 billion vectors on modest hardware. IVF trades off recall (96–98% vs. HNSW's 99%) for memory (10–40 GB per billion vectors vs. HNSW's 160 GB) and is the algorithm of choice for trillion-scale vector databases in 2026.

I have tuned IVF+PQ for a 2-billion-vector corpus at a recommendation engine, achieving 8–12 ms query latency with 97% recall. This article distills that operational experience.

IVF Architecture: Clustering and Coarse Quantization

An IVF index has three components:

  1. Coarse quantizer: A k-means clustering of all N vectors into K clusters (e.g., K=10,000).
  2. Cluster assignment: For each vector, record which cluster it belongs to.
  3. Residual compression: Within each cluster, vectors are further compressed (e.g., product quantization).

At query time:

  1. Find the M nearest clusters (coarse search, fast).
  2. Search all vectors in those M clusters (fine search, slow but limited scope).
  3. Return the top-k results.

Example: 1 billion vectors, K=10,000 clusters, M=10 nearest clusters to search:

  • Coarse search: Compare query to 10,000 cluster centroids. O(K×d) = 10,000 × 384 = 3.84M operations. ~1–2 ms.
  • Fine search: Search 10M vectors (1B / K × M) from the M clusters. O(Nprobe × d) = 10M × 384 = 3.84B operations. ~10–30 ms (depending on compute).

Total: ~15 ms for 1 billion vectors (vs. HNSW: 100+ ms for 1 billion).

Building an IVF Index: Clustering and Compression

Here is a complete IVF+PQ index in FAISS (the dominant open-source library):

import faiss
import numpy as np

# Simulate 1 million vectors, 384 dimensions
np.random.seed(42)
vectors = np.random.randn(1_000_000, 384).astype('float32')

# Normalize for cosine similarity
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Create IVF index with product quantization
# IVF has K clusters, we specify the number of clusters
K = 1000 # Number of clusters (rule of thumb: sqrt(N) for N vectors)

# Create an IVF index
quantizer = faiss.IndexFlatL2(384) # Coarse quantizer (flat brute-force for cluster centroids)
index = faiss.IndexIVFFlat(quantizer, 384, K, faiss.METRIC_L2) # IVF with K clusters

# Train the coarse quantizer on a sample (required)
train_vectors = vectors[:100_000] # Use 100K vectors to train
index.train(train_vectors)

# Add all vectors to the index
index.add(vectors)

# Set the number of probes (how many clusters to search at query time)
index.nprobe = 50 # Default: 1 (too few). 50 is typical for balanced speed/recall

# Query
query_vector = np.random.randn(1, 384).astype('float32')
query_vector = query_vector / np.linalg.norm(query_vector, axis=1, keepdims=True)

distances, indices = index.search(query_vector, k=10)
print(f"Top-10 nearest neighbors: {indices[0]}")

To add product quantization (PQ) for memory compression:

# Create IVF+PQ index
# PQ compresses each vector into 48 bytes (384 dims / 8 subvectors × 1 byte per subvector)
m = 8 # Number of subvectors for product quantization
nbits = 8 # Bits per subvector (1 byte)

quantizer = faiss.IndexFlatL2(384)
index = faiss.IndexIVFPQ(quantizer, 384, K, m, nbits) # IVF + PQ

# Train (required)
index.train(train_vectors)
index.add(vectors)
index.nprobe = 50

# Query (same API)
distances, indices = index.search(query_vector, k=10)

Memory savings with IVF+PQ:

  • Original vectors: 1M × 384 × 4 bytes = 1.5 GB
  • IVF (flat within clusters): 1.5 GB + cluster assignments (1M × 4 bytes) = 1.5 GB
  • IVF+PQ (8 subvectors, 8 bits): 1M × 48 bytes + assignments = 48 MB + 4 MB = 52 MB (97% reduction!)

IVF Parameters: K, M, and nprobe

K (number of clusters)

K controls cluster granularity.

  • K=100: Coarse clusters. Each cluster has ~10K vectors. Fast cluster search, but fine search within large clusters is slow.
  • K=1000 (typical): ~1K vectors per cluster. Balanced.
  • K=10,000: Fine granularity. ~100 vectors per cluster. Slow cluster search (10K comparisons), fast fine search.

Rule of thumb: K ≈ sqrt(N) for N total vectors. For 1M vectors, K=1,000. For 10M, K=3,162. For 1B, K=31,623.

M (product quantization subvectors)

M is the number of subvectors for compression (PQ).

  • M=8: 8 subvectors × 1 byte each = 8 bytes compressed (96% reduction). Fast but slightly lower recall.
  • M=16: 16 bytes compressed (85% reduction). Better recall, slower computation.
  • M=32: 32 bytes (75% reduction). Best recall, slowest.

For standard embeddings (384 dim), M=8 (48 bytes) is recommended.

nprobe (clusters to search at query time)

nprobe controls the speed-recall trade-off.

  • nprobe=1: Search 1 cluster only. Fast (~5 ms), low recall (80–85%).
  • nprobe=10: Search 10 clusters. ~10 ms, recall 90–92%.
  • nprobe=50: Search 50 clusters. ~20 ms, recall 95–97%.
  • nprobe=100: Search 100 clusters. ~40 ms, recall 98–99%.

nprobe can be changed per query without rebuilding the index (unlike HNSW's M and ef_construction).

# Tune nprobe per SLA
index.nprobe = 10
results_fast = index.search(query_vector, k=10) # ~10 ms, recall 90%

index.nprobe = 50
results_balanced = index.search(query_vector, k=10) # ~20 ms, recall 96%

index.nprobe = 100
results_accurate = index.search(query_vector, k=10) # ~40 ms, recall 99%

Benchmarking IVF vs. HNSW

Typical trade-offs (on 10M vectors, 384 dims):

IndexMemoryBuild TimeQuery @ 95% RecallQuery @ 99% Recall
HNSW (M=16)17 GB4 hours10 ms (ef=40)25 ms (ef=100)
IVF+PQ (K=3162, M=8)0.5 GB1 hour15 ms (nprobe=50)30 ms (nprobe=100)
Flat (brute-force)15 GB0 (none)50 ms50 ms

For 1 billion vectors:

IndexMemoryQuery Latency (95% recall)
HNSW160 GB12 ms
IVF+PQ5 GB20 ms
Flat1.5 GB300+ ms (impractical)

IVF+PQ is the only practical choice at billion scale.

Training and Clustering Stability

IVF requires training the coarse quantizer (k-means) on a sample of vectors:

# Training is critical for recall quality
# Use a representative sample

# Option 1: Random sample (fast, sometimes lower recall)
train_sample = vectors[np.random.choice(len(vectors), 100_000, replace=False)]
index.train(train_sample)

# Option 2: Stratified sample (better, slower)
# If vectors have categories, sample equally from each category
# This is manual; FAISS does not provide helpers

# After training, the quantizer's centroids are fixed
# Retraining changes the index; no incremental updates

For production, retrain quarterly (k-means centroids drift as new data arrives). Retraining takes ~1 hour for 1 billion vectors.

Hybrid Approaches: IVF+HNSW

Some libraries (Milvus, Qdrant, 2026 releases) support hybrid indexing: IVF to partition vectors, HNSW within each partition for fine search.

Pros: Better recall than IVF alone (97–99%), faster than HNSW alone on massive scale. Cons: More complex tuning, not all libraries support it.

Real-World Deployment: E-commerce Search at 1 Billion Products

A leading e-commerce platform (anonymized) deploys IVF+PQ:

  • Corpus: 1.2 billion product descriptions.
  • Index: K=30,000 clusters, M=8 subvectors (product quantization).
  • Query latency: nprobe=80 for 96% recall, ~18 ms per query.
  • Hardware: 2 GPU servers with 512 GB memory each (2 index replicas).
  • Cost: ~$20,000/month for hardware, 2 FTE for maintenance.

This setup handles 100,000 queries/second during peak traffic. HNSW would require 2 TB memory (cost: $100,000/month), making IVF+PQ 5x more cost-effective.

IVF Limitations

  • Cluster imbalance: If k-means creates unbalanced clusters (some huge, some tiny), recall suffers. Mitigation: use balanced k-means or retrain quarterly.
  • Accuracy degradation at very high dimensions (>2,000): Product quantization becomes less effective. Use hybrid indexing or reduce dimensions.
  • Training required: Cannot add vectors without retraining (monthly/quarterly full rebuild). HNSW allows incremental updates.

Key Takeaways

  • IVF partitions vectors into clusters: Query time is O(Nprobe) instead of O(N); 100x faster for billion-scale.
  • Product quantization compresses vectors: 384-dim vectors fit in 48 bytes (97% reduction) with minimal recall loss.
  • K ≈ sqrt(N): Cluster count rule of thumb. For 1B vectors, K ≈ 31,000.
  • nprobe controls speed-recall: Change per request to fit SLA (nprobe=50 for 96% recall is typical).
  • IVF+PQ is the standard for 100M–10B vectors: Mandatory for production systems at that scale.

Frequently Asked Questions

When should I switch from HNSW to IVF+PQ?

When memory becomes a blocker (> 100 GB) or query latency at scale is unacceptable. For < 100M vectors, HNSW is simpler and faster. For 100M–10B, IVF+PQ is standard. For > 10B, IVF+PQ + sharding (multiple indices, one per shard).

How often should I retrain the IVF index?

Quarterly is standard. Retraining (k-means clustering) takes 1–4 hours for 1B vectors. If new data has different distribution than old data, retrain more frequently. Drift in cluster centroids degrades recall by 2–5% over 6–12 months.

Can I use IVF without product quantization?

Yes, but memory savings are minimal. IVF (flat) uses same memory as HNSW. IVF+PQ is the popular combination because PQ is orthogonal and dramatically reduces memory.

What is the training overhead for IVF?

Training is a one-time cost (or quarterly if retraining). For 1B vectors, training takes 1 hour on a single GPU. Not a recurring operational cost.

How do I measure recall without ground truth?

Benchmark on a test set of 100–500 queries with labeled relevant documents (manual annotation). Compute recall@k for each query, average across queries. If manual labels are unavailable, use a reference index (HNSW, flat) as ground truth and measure overlap of top-k results.

Further Reading