Skip to main content

Vector Indexing: HNSW vs IVF Guide

Vector indexing is the core mechanism that makes semantic search fast. Two algorithms dominate production vector databases: HNSW (Hierarchical Navigable Small World) and IVF (Inverted File). Both are approximate nearest neighbor (ANN) algorithms that trade recall (accuracy) for speed. Understanding their trade-offs, parameters, and tuning lets you design indexes that meet your latency and recall SLAs.

HNSW: Graph-Based Indexing

HNSW builds a hierarchical network graph where each vector is a node, and edges connect nearby vectors. Search starts at the top of the hierarchy and progressively descends to denser layers, visiting only nearby nodes until it converges on the approximate nearest neighbors.

How HNSW works:

  1. Insert phase: New vectors are inserted into the graph. Edges are created to the M nearest neighbors at each layer. Layer assignment is probabilistic; most vectors land in lower layers (dense neighborhoods), a few in upper layers (sparse global structure).

  2. Search phase: A query vector enters at the top layer, greedily visits the nearest neighbor, and moves to the next layer. This repeats until the bottom layer is reached. Candidates are collected and re-ranked by true distance.

  3. Key parameters:

    • M: Number of bidirectional links (edges) per node. Default: 16. Higher M = more connections, better recall, slower search and higher memory.
    • ef_construction: How many candidates to consider during insertion. Default: 200. Higher ef_construction = better recall at insertion time, slower ingestion.
    • ef_search: How many candidates to consider during query. Default: 100. Higher ef_search = better recall, slower queries. Tunable per query.

Strengths:

  • Excellent recall at moderate latency. HNSW achieves >95% recall at 10–50ms latency.
  • Simple, well-understood algorithm with few hyperparameters.
  • Works well for small to medium datasets (<1B vectors).
  • Logarithmic search complexity: O(log n).

Limitations:

  • Memory overhead. HNSW stores edges (metadata) in addition to embeddings. Approximately 2 × M × dimensionality × 4 bytes per vector overhead.
  • Graph construction is greedy; not globally optimal. Rebuilding the entire index is slow; incremental updates degrade quality over time.
  • Latency increases with dimension count (1536-dim embeddings search slower than 384-dim).

HNSW example in Qdrant:

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

# Create collection with HNSW index
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Configure HNSW parameters
client.update_collection(
collection_name="documents",
hnsw_config=HnswConfigDiff(
m=16, # Default: 16. Increase to 24–32 for higher recall.
ef_construct=200, # Default: 200. Increase to 400+ for better recall at insertion.
ef_search=100, # Default: 100. Tune per query via search parameter 'hnsw_ef'.
)
)

# Search with custom ef_search
results = client.search(
collection_name="documents",
query_vector=[0.1, 0.2, 0.3, ...],
limit=10,
query_params=QueryParams(hnsw_ef=150), # Higher ef_search = more accurate, slower
)

IVF: Clustering-Based Indexing

IVF (Inverted File Index) partitions the vector space into clusters. Search first finds the nearest clusters via a centroid search, then does exact ANN search within those clusters, avoiding examination of far-away clusters.

How IVF works:

  1. Clustering phase: Run k-means on the full dataset to identify k clusters. Vectors are assigned to their nearest cluster centroid and stored in a bucket.

  2. Search phase: Find the nearest cluster centroids to the query. Examine vectors in the nearest nprobe clusters in detail. Return the top-k candidates across examined clusters.

  3. Key parameters:

    • n_list (or ncentroids): Number of clusters. Default: sqrt(N) where N is total vectors. More clusters = finer granularity, better for large N, slower to index.
    • nprobe: How many clusters to search during query. Default: 1. Higher nprobe = better recall, slower search. Tunable per query.

Strengths:

  • Highly scalable. Can efficiently handle billions of vectors via distributed clustering.
  • Low memory overhead. Stores only cluster assignments per vector, no graph structure.
  • Fast clustering setup. k-means is fast; re-clustering is feasible for re-indexing.
  • Excellent for very large, sparse datasets.

Limitations:

  • Recall is sensitive to nprobe and n_list. Poor parameter choices degrade search quality significantly.
  • Cluster imbalance. If clusters are unbalanced, some vectors are searched exhaustively.
  • Latency increases with nprobe. nprobe = k becomes exhaustive search.
  • Dimensionality curse. k-means clustering becomes harder in very high dimensions (1536+).

IVF example in Milvus:

from pymilvus import Collection, connections
from pymilvus.orm.types import DataType

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Define collection schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=100),
]
schema = CollectionSchema(fields, "Document vectors")

# Create collection
collection = Collection("documents", schema)

# Create IVF index
index_params = {
"metric_type": "L2",
"index_type": "IVF_FLAT",
"params": {
"nlist": 128, # Number of clusters. Increase for larger datasets.
}
}
collection.create_index("vector", index_params)

# Search with nprobe
search_params = {"metric_type": "L2", "params": {"nprobe": 8}}
results = collection.search(
data=[query_vector],
anns_field="vector",
param=search_params,
limit=10,
output_fields=["doc_id"],
)

HNSW vs IVF: Head-to-Head Comparison

AspectHNSWIVF
Search latency5–50ms (small–medium datasets)1–20ms (larger datasets)
Maximum scale1B–10B10B–100B+
Memory overhead10–20% per vector1–5% per vector
Recall @ latency95%+ @ 50ms85%+ @ 20ms (tuning needed)
Parameter tuningSimple (M, ef)Complex (nlist, nprobe)
Insertion timeSlow (graph construction)Fast (cluster assignment)
Index rebuildExpensive (full rebuild)Fast (re-cluster)
Sparse vectorsPoorExcellent
Recommended for<1B vectors, high recall>1B vectors, high throughput

Tuning Strategy: Finding the Sweet Spot

For HNSW:

  1. Start with defaults: M = 16, ef_construct = 200, ef_search = 100.
  2. Measure recall on a test set (compare search results to brute-force exact ANN). Aim for >95% recall.
  3. If recall is low, increase M to 24 or 32 (10–20% latency increase, 5–10% recall gain).
  4. If recall is still low, increase ef_construct to 400.
  5. At query time, increase ef_search for specific queries where accuracy matters more than latency.

For IVF:

  1. Start with nlist = sqrt(total_vectors). For 1M vectors, nlist ~= 1000.
  2. Set nprobe = sqrt(nlist). For 1000 clusters, nprobe ~= 32.
  3. Measure recall. Aim for >90% recall.
  4. If recall is low, increase nprobe gradually (1.5x, 2x, etc.) and re-measure.
  5. If latency becomes unacceptable, fine-tune nlist and re-cluster.

Advanced: Quantization to Reduce Storage

Both HNSW and IVF can compress embeddings to save storage and improve cache locality:

  • Product Quantization (PQ): Partition each dimension into sub-vectors and encode each sub-vector as a byte. 384-dim float32 becomes 48-byte compressed vector (384 / 8 = 48 bytes). Trade-off: ~5–10% recall loss.

  • Scalar Quantization (SQ): Reduce float32 (4 bytes) to int8 (1 byte) per dimension. 384-dim vector becomes 384 bytes. Minimal recall loss if scale is computed correctly.

Example in Qdrant (using Scalar Quantization):

from qdrant_client.models import ScalarQuantization, QuantizationConfig

client.create_collection(
collection_name="documents_quantized",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
quantization_config=QuantizationConfig(
scalar=ScalarQuantization(
type="int8",
always_ram=True, # Keep original vectors in RAM for high recall.
)
),
)

Quantization reduces storage by 4x and improves latency by 2–3x with minimal recall loss.

Monitoring Index Health

In production, monitor:

  1. Index size: Track memory and disk usage. Alert if exceeding thresholds.
  2. Query latency distribution: Monitor p50, p95, p99 latencies. Alert on spike.
  3. Recall metric: Periodically run a test suite of known queries and measure recall. Alert if recall drops below SLA.
  4. Index staleness: Track time since last full rebuild. Schedule rebuilds if quality degrades.
# Simple recall test
true_neighbors = brute_force_search(query_vector, all_vectors, k=100)
approximate_neighbors = index_search(query_vector, k=100)
recall = len(set(true_neighbors) & set(approximate_neighbors)) / len(true_neighbors)
assert recall >= 0.95, f"Recall dropped to {recall}!"

Key Takeaways

  • HNSW excels for <1B vectors with high recall; IVF excels for >1B vectors and sparse data.
  • Tuning parameters (M, ef_search for HNSW; nlist, nprobe for IVF) is essential for production SLAs.
  • Quantization reduces storage and latency with minimal accuracy loss; use it for large embeddings (1536 dims).
  • Monitor recall and latency in production; rebuild indexes incrementally to maintain quality.

Frequently Asked Questions

How do I choose between HNSW and IVF?

Start with HNSW if your dataset is <1B vectors and recall matters more than throughput. Choose IVF if you have >1B vectors or extreme throughput needs and can tolerate lower recall. Many teams use both: HNSW for dense, high-recall workloads and IVF for large, archive workloads.

Can I change index parameters without rebuilding the entire index?

Some parameters (ef_search for HNSW, nprobe for IVF) are tunable per query without rebuilding. Others (M, nlist) require a full index rebuild. Check your vector database documentation. Rebuilding typically involves re-ingesting all vectors.

What is the difference between exact ANN and approximate ANN?

Exact ANN returns the true k nearest neighbors. Approximate ANN (HNSW, IVF) returns an approximation, trading accuracy for speed. Most production systems accept >95% recall (missing 1 in 20 true neighbors) for 10–100x speed gain.

Should I use quantization?

Yes, for large-scale systems with tight latency budgets. Quantization (int8, PQ) reduces storage by 4–8x and improves latency by 2–3x with <5% recall loss. Always test on your workload before production deployment.

Further Reading