Vector Indexing: HNSW vs IVF Guide
Vector indexing is the core mechanism that makes semantic search fast. Two algorithms dominate production vector databases: HNSW (Hierarchical Navigable Small World) and IVF (Inverted File). Both are approximate nearest neighbor (ANN) algorithms that trade recall (accuracy) for speed. Understanding their trade-offs, parameters, and tuning lets you design indexes that meet your latency and recall SLAs.
HNSW: Graph-Based Indexing
HNSW builds a hierarchical network graph where each vector is a node, and edges connect nearby vectors. Search starts at the top of the hierarchy and progressively descends to denser layers, visiting only nearby nodes until it converges on the approximate nearest neighbors.
How HNSW works:
-
Insert phase: New vectors are inserted into the graph. Edges are created to the M nearest neighbors at each layer. Layer assignment is probabilistic; most vectors land in lower layers (dense neighborhoods), a few in upper layers (sparse global structure).
-
Search phase: A query vector enters at the top layer, greedily visits the nearest neighbor, and moves to the next layer. This repeats until the bottom layer is reached. Candidates are collected and re-ranked by true distance.
-
Key parameters:
- M: Number of bidirectional links (edges) per node. Default: 16. Higher M = more connections, better recall, slower search and higher memory.
- ef_construction: How many candidates to consider during insertion. Default: 200. Higher ef_construction = better recall at insertion time, slower ingestion.
- ef_search: How many candidates to consider during query. Default: 100. Higher ef_search = better recall, slower queries. Tunable per query.
Strengths:
- Excellent recall at moderate latency. HNSW achieves
>95%recall at10–50mslatency. - Simple, well-understood algorithm with few hyperparameters.
- Works well for small to medium datasets (
<1Bvectors). - Logarithmic search complexity:
O(log n).
Limitations:
- Memory overhead. HNSW stores edges (metadata) in addition to embeddings. Approximately
2 × M × dimensionality × 4 bytesper vector overhead. - Graph construction is greedy; not globally optimal. Rebuilding the entire index is slow; incremental updates degrade quality over time.
- Latency increases with dimension count (1536-dim embeddings search slower than 384-dim).
HNSW example in Qdrant:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff
client = QdrantClient("localhost", port=6333)
# Create collection with HNSW index
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
# Configure HNSW parameters
client.update_collection(
collection_name="documents",
hnsw_config=HnswConfigDiff(
m=16, # Default: 16. Increase to 24–32 for higher recall.
ef_construct=200, # Default: 200. Increase to 400+ for better recall at insertion.
ef_search=100, # Default: 100. Tune per query via search parameter 'hnsw_ef'.
)
)
# Search with custom ef_search
results = client.search(
collection_name="documents",
query_vector=[0.1, 0.2, 0.3, ...],
limit=10,
query_params=QueryParams(hnsw_ef=150), # Higher ef_search = more accurate, slower
)
IVF: Clustering-Based Indexing
IVF (Inverted File Index) partitions the vector space into clusters. Search first finds the nearest clusters via a centroid search, then does exact ANN search within those clusters, avoiding examination of far-away clusters.
How IVF works:
-
Clustering phase: Run k-means on the full dataset to identify k clusters. Vectors are assigned to their nearest cluster centroid and stored in a bucket.
-
Search phase: Find the nearest cluster centroids to the query. Examine vectors in the nearest nprobe clusters in detail. Return the top-k candidates across examined clusters.
-
Key parameters:
- n_list (or ncentroids): Number of clusters. Default:
sqrt(N)where N is total vectors. More clusters = finer granularity, better for large N, slower to index. - nprobe: How many clusters to search during query. Default: 1. Higher nprobe = better recall, slower search. Tunable per query.
- n_list (or ncentroids): Number of clusters. Default:
Strengths:
- Highly scalable. Can efficiently handle billions of vectors via distributed clustering.
- Low memory overhead. Stores only cluster assignments per vector, no graph structure.
- Fast clustering setup. k-means is fast; re-clustering is feasible for re-indexing.
- Excellent for very large, sparse datasets.
Limitations:
- Recall is sensitive to nprobe and n_list. Poor parameter choices degrade search quality significantly.
- Cluster imbalance. If clusters are unbalanced, some vectors are searched exhaustively.
- Latency increases with nprobe.
nprobe = kbecomes exhaustive search. - Dimensionality curse. k-means clustering becomes harder in very high dimensions (1536+).
IVF example in Milvus:
from pymilvus import Collection, connections
from pymilvus.orm.types import DataType
# Connect to Milvus
connections.connect("default", host="localhost", port="19530")
# Define collection schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=100),
]
schema = CollectionSchema(fields, "Document vectors")
# Create collection
collection = Collection("documents", schema)
# Create IVF index
index_params = {
"metric_type": "L2",
"index_type": "IVF_FLAT",
"params": {
"nlist": 128, # Number of clusters. Increase for larger datasets.
}
}
collection.create_index("vector", index_params)
# Search with nprobe
search_params = {"metric_type": "L2", "params": {"nprobe": 8}}
results = collection.search(
data=[query_vector],
anns_field="vector",
param=search_params,
limit=10,
output_fields=["doc_id"],
)
HNSW vs IVF: Head-to-Head Comparison
| Aspect | HNSW | IVF |
|---|---|---|
| Search latency | 5–50ms (small–medium datasets) | 1–20ms (larger datasets) |
| Maximum scale | 1B–10B | 10B–100B+ |
| Memory overhead | 10–20% per vector | 1–5% per vector |
| Recall @ latency | 95%+ @ 50ms | 85%+ @ 20ms (tuning needed) |
| Parameter tuning | Simple (M, ef) | Complex (nlist, nprobe) |
| Insertion time | Slow (graph construction) | Fast (cluster assignment) |
| Index rebuild | Expensive (full rebuild) | Fast (re-cluster) |
| Sparse vectors | Poor | Excellent |
| Recommended for | <1B vectors, high recall | >1B vectors, high throughput |
Tuning Strategy: Finding the Sweet Spot
For HNSW:
- Start with defaults:
M = 16,ef_construct = 200,ef_search = 100. - Measure recall on a test set (compare search results to brute-force exact ANN). Aim for
>95%recall. - If recall is low, increase
Mto 24 or 32 (10–20% latency increase, 5–10% recall gain). - If recall is still low, increase
ef_constructto 400. - At query time, increase
ef_searchfor specific queries where accuracy matters more than latency.
For IVF:
- Start with
nlist = sqrt(total_vectors). For 1M vectors,nlist ~= 1000. - Set
nprobe = sqrt(nlist). For 1000 clusters,nprobe ~= 32. - Measure recall. Aim for
>90%recall. - If recall is low, increase
nprobegradually (1.5x, 2x, etc.) and re-measure. - If latency becomes unacceptable, fine-tune
nlistand re-cluster.
Advanced: Quantization to Reduce Storage
Both HNSW and IVF can compress embeddings to save storage and improve cache locality:
-
Product Quantization (PQ): Partition each dimension into sub-vectors and encode each sub-vector as a byte. 384-dim float32 becomes 48-byte compressed vector (
384 / 8 = 48 bytes). Trade-off: ~5–10% recall loss. -
Scalar Quantization (SQ): Reduce float32 (4 bytes) to int8 (1 byte) per dimension. 384-dim vector becomes 384 bytes. Minimal recall loss if scale is computed correctly.
Example in Qdrant (using Scalar Quantization):
from qdrant_client.models import ScalarQuantization, QuantizationConfig
client.create_collection(
collection_name="documents_quantized",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
quantization_config=QuantizationConfig(
scalar=ScalarQuantization(
type="int8",
always_ram=True, # Keep original vectors in RAM for high recall.
)
),
)
Quantization reduces storage by 4x and improves latency by 2–3x with minimal recall loss.
Monitoring Index Health
In production, monitor:
- Index size: Track memory and disk usage. Alert if exceeding thresholds.
- Query latency distribution: Monitor p50, p95, p99 latencies. Alert on spike.
- Recall metric: Periodically run a test suite of known queries and measure recall. Alert if recall drops below SLA.
- Index staleness: Track time since last full rebuild. Schedule rebuilds if quality degrades.
# Simple recall test
true_neighbors = brute_force_search(query_vector, all_vectors, k=100)
approximate_neighbors = index_search(query_vector, k=100)
recall = len(set(true_neighbors) & set(approximate_neighbors)) / len(true_neighbors)
assert recall >= 0.95, f"Recall dropped to {recall}!"
Key Takeaways
- HNSW excels for
<1Bvectors with high recall; IVF excels for>1Bvectors and sparse data. - Tuning parameters (M, ef_search for HNSW; nlist, nprobe for IVF) is essential for production SLAs.
- Quantization reduces storage and latency with minimal accuracy loss; use it for large embeddings (1536 dims).
- Monitor recall and latency in production; rebuild indexes incrementally to maintain quality.
Frequently Asked Questions
How do I choose between HNSW and IVF?
Start with HNSW if your dataset is <1B vectors and recall matters more than throughput. Choose IVF if you have >1B vectors or extreme throughput needs and can tolerate lower recall. Many teams use both: HNSW for dense, high-recall workloads and IVF for large, archive workloads.
Can I change index parameters without rebuilding the entire index?
Some parameters (ef_search for HNSW, nprobe for IVF) are tunable per query without rebuilding. Others (M, nlist) require a full index rebuild. Check your vector database documentation. Rebuilding typically involves re-ingesting all vectors.
What is the difference between exact ANN and approximate ANN?
Exact ANN returns the true k nearest neighbors. Approximate ANN (HNSW, IVF) returns an approximation, trading accuracy for speed. Most production systems accept >95% recall (missing 1 in 20 true neighbors) for 10–100x speed gain.
Should I use quantization?
Yes, for large-scale systems with tight latency budgets. Quantization (int8, PQ) reduces storage by 4–8x and improves latency by 2–3x with <5% recall loss. Always test on your workload before production deployment.