Skip to main content

Billion-Scale Vector Operations: Advanced Guide

Scaling vector databases beyond 1 billion embeddings introduces qualitatively different challenges: single-node indexes cannot hold 10B vectors in memory; latency optimization requires multi-stage indexing; and operational complexity explodes. Billion-scale systems require deep infrastructure expertise, but they enable search over the entire web, product catalogs, or user behavior histories at millisecond latency.

Understanding Billion-Scale Challenges

At 1 billion vectors (384-dim embeddings, ~12 TB raw storage):

  • Memory: A single HNSW index requires 10–30 GB of graph metadata. A 10-billion vector index (120 TB) requires proportionally more graph data: 100–300 GB in the best case.
  • Latency: Traversing a larger graph increases search time. HNSW search is O(log n) theoretically, but constants matter. At 10B vectors, a single node's search latency may exceed 500ms.
  • Indexing: Building an HNSW index on 10B vectors takes weeks on a single machine. Distributed indexing is mandatory.
  • Sharding complexity: Distributing across 100+ shards introduces consistency, failover, and routing challenges.

Solution: Multiple strategies must be combined.

Strategy 1: Quantization to Reduce Storage

Quantization compresses embeddings, reducing storage and latency dramatically.

Scalar Quantization (INT8):

Reduce float32 (4 bytes per dimension) to int8 (1 byte):

import numpy as np

def quantize_scalar_int8(embeddings):
"""Quantize float32 embeddings to int8."""
# embeddings.shape = (N, D), dtype = float32

# Find min/max across all embeddings
emb_min = np.min(embeddings)
emb_max = np.max(embeddings)

# Scale to [0, 255]
scaled = (embeddings - emb_min) / (emb_max - emb_min) * 255

# Convert to int8
quantized = scaled.astype(np.int8)

return quantized, emb_min, emb_max

# Quantize 1B embeddings
embeddings = np.random.randn(1_000_000_000, 384).astype(np.float32)
quantized, emb_min, emb_max = quantize_scalar_int8(embeddings)

# Storage reduction
original_size_gb = embeddings.nbytes / 1e9
quantized_size_gb = quantized.nbytes / 1e9
print(f"Original: {original_size_gb:.1f} GB, Quantized: {quantized_size_gb:.1f} GB")
# Output: Original: 384.0 GB, Quantized: 96.0 GB (4x reduction)

Impact: 4–8x storage reduction, 2–3x search latency improvement, 5% recall loss.

Product Quantization (PQ):

Partition each embedding into m subvectors and quantize independently. More complex but higher compression:

def product_quantization(embeddings, m=4, k=256):
"""
Divide embedding into m subvectors, quantize each to k centroids.
Result: 1 byte per subvector = 384/4 = 96 bytes per embedding (from 1536 bytes).
"""
N, D = embeddings.shape
sub_dim = D // m

codebooks = [] # m codebooks, each with k centroids
encoded = np.zeros((N, m), dtype=np.uint8)

for i in range(m):
# Get subvector
sub_embeddings = embeddings[:, i*sub_dim:(i+1)*sub_dim]

# Quantize subvector: find k means, assign each to nearest
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10000)
kmeans.fit(sub_embeddings)

# Encode: replace vector with nearest centroid index (0–255)
encoded[:, i] = kmeans.predict(sub_embeddings)
codebooks.append(kmeans.cluster_centers_)

return encoded, codebooks

# Compress 1B embeddings to 96 bytes each
quantized, codebooks = product_quantization(embeddings, m=16, k=256)
# Each vector is now 16 bytes (16 uint8 values) vs. 1536 bytes originally (98x compression!)

Trade-off: Compression is extreme (10–20x), but accurate distance computation requires multiple distance lookups. Use PQ for billion-scale static archives where storage is the bottleneck.

Qdrant with Quantization:

from qdrant_client.models import QuantizationConfig, ScalarQuantization

client.create_collection(
collection_name="documents_compressed",
vectors_config=VectorParams(size=384, distance=Distance.COSINE),
quantization_config=QuantizationConfig(
scalar=ScalarQuantization(
type="int8",
always_ram=False, # Load quantized vectors on demand from disk
)
),
)

# Upsert vectors; Qdrant compresses automatically
client.upsert(
collection_name="documents_compressed",
points=[
PointStruct(id=i, vector=embeddings[i], payload={...})
for i in range(len(embeddings))
]
)

Strategy 2: Hierarchical Indexing (Coarse-to-Fine)

Instead of a flat ANN index over 10B vectors, use a two-stage index:

Stage 1 (coarse): Cluster vectors into buckets.

Run k-means to partition vectors into (say) 1M coarse clusters. Coarse index is tiny and fast.

Stage 2 (fine): Index vectors within each bucket separately.

Each bucket has HNSW or IVF. Searching is: (1) find top-k coarse clusters, (2) search fine indexes in those clusters.

def hierarchical_search(query, num_coarse=1_000_000, k_coarse=10, k_fine=100):
"""
Stage 1: Assign query to top-k coarse clusters (fast, exact).
Stage 2: Search fine index within those clusters (precise).
"""

# Stage 1: Compute distance to coarse cluster centroids
coarse_dists = compute_distances(query, coarse_centroids) # O(1M)

# Find top-k closest clusters
top_cluster_ids = np.argsort(coarse_dists)[:k_coarse] # O(1M log k)

# Stage 2: Search fine index within top clusters
candidates = []
for cluster_id in top_cluster_ids:
fine_results = fine_indexes[cluster_id].search(query, limit=k_fine)
candidates.extend(fine_results)

# Return top-k globally
all_results = sorted(candidates, key=lambda x: x.distance)
return all_results[:k_fine]

Latency: Stage 1 is O(n) where n is number of coarse clusters (~1M), but distance computation is cheap. Stage 2 searches only top-k clusters (say, 10), each with 10M vectors. Total latency: ~5ms (stage 1) + 20ms (stage 2) = 25ms vs. 100–200ms for flat index.

Implementation: Milvus supports hierarchical indexing natively. Qdrant uses collections to partition data by cluster.

Strategy 3: Approximate Distance Computation

At billion-scale, even distance computation (comparing query to all vectors) is slow. Use approximations:

Bit-level operations (SIMD):

Use vectorized CPU instructions (AVX-512) for fast distance computation:

import numpy as np
from scipy.spatial.distance import cdist

# Standard (slow): O(nd^2) where n=1B, d=384
# distances = cdist([query], all_vectors)[0]

# Fast: Use vectorized BLAS
# queries.shape = (batch, d), vectors.shape = (n, d)
# distances.shape = (batch, n)
from scipy.spatial.distance import cdist as scipy_cdist

distances = scipy_cdist(
[query], all_vectors, metric='cosine' # Uses BLAS under the hood
)[0]
# With AVX-512, processes multiple vectors in parallel

Locality Sensitive Hashing (LSH):

Hash vectors to bucketes using random projections. Nearby vectors hash to same bucket:

def lsh_hash(vector, num_tables=10, bits_per_table=20):
"""Hash vector into multiple hash tables."""
hashes = []

for table in range(num_tables):
# Random projection: dot product with random bits
random_bits = np.random.randn(len(vector), bits_per_table)
projections = np.dot(vector, random_bits) # Shape: (bits_per_table,)

# Binarize: sign of projection
hash_bits = (projections > 0).astype(int)
hash_val = int(''.join(map(str, hash_bits)), 2) # Convert binary to int

hashes.append(hash_val)

return hashes

# Query: hash it, retrieve vectors from matching buckets
query_hashes = lsh_hash(query)
candidates = set()
for table, hash_val in enumerate(query_hashes):
candidates.update(hash_tables[table][hash_val]) # Retrieve bucket contents

# Compute true distance only for candidates (much smaller set)
distances = cdist([query], vectors[list(candidates)])[0]

Trade-off: LSH is approximate; only vectors in matching buckets are considered, possibly missing some neighbors. Benefit: retrieval is O(1), avoiding the O(n) scan.

Strategy 4: Distributed Query Processing

With 100+ shards, coordinate queries efficiently:

import asyncio
from aiohttp import ClientSession

async def distributed_search(query, shards):
"""Search all shards in parallel, aggregate results."""

async def search_shard(session, shard):
async with session.post(
f"http://{shard.host}:{shard.port}/search",
json={"vector": query, "limit": 100}
) as resp:
return await resp.json()

async with ClientSession() as session:
tasks = [search_shard(session, shard) for shard in shards]
shard_results = await asyncio.gather(*tasks)

# Aggregate: merge all results, re-rank, return top-k
all_results = []
for results in shard_results:
all_results.extend(results["matches"])

# Global re-ranking (expensive if thousands of results)
all_results.sort(key=lambda x: x["distance"])

return all_results[:100]

Optimization: Bloom filters to prune shards before searching.

If you know shard 0 contains only documents from 2024, skip it when searching for documents from 2026:

class ShardMetadata:
def __init__(self, shard_id, date_range, payload_ranges):
self.shard_id = shard_id
self.date_range = date_range # (start_date, end_date)
self.payload_ranges = payload_ranges # {"category": ["a", "b", ...]}

def might_contain(self, filter_condition):
"""Fast check: does this shard *possibly* contain matching vectors?"""
start, end = filter_condition["created_at"]

# If date range doesn't overlap, skip shard
if end < self.date_range[0] or start > self.date_range[1]:
return False

return True

# Prune shards before searching
shards_to_search = [
shard for shard in all_shards
if shard.might_contain(filter_condition)
]

Example: Milvus at Billion-Scale

Milvus is designed for billion-scale:

from pymilvus import Collection, connections

# Connect to Milvus cluster (multiple nodes)
connections.connect("default", host="milvus-coordinator", port=19530)

# Create collection with sharding
collection = Collection(
name="documents_1b",
schema=schema,
using="default"
)

# Milvus automatically shards across coordinator/data nodes
# Insert 1B vectors
for batch in batches_of_10m_vectors:
collection.insert(batch)

# Index with quantization to reduce memory
index_params = {
"metric_type": "COSINE",
"index_type": "HNSW",
"params": {
"M": 16,
"ef_construct": 200,
}
}
collection.create_index("vector", index_params)

# Milvus handles:
# - Sharding across 20+ data nodes
# - Replication for HA
# - Hierarchical indexing
# - Query aggregation

# Search
results = collection.search(
data=[query_vector],
anns_field="vector",
param={"metric_type": "COSINE"},
limit=100,
)

Operational Challenges at Billion-Scale

Compaction/garbage collection: With 1B vectors, deleting or updating old data requires expensive compaction. Schedule this during low-traffic windows.

Backup/restore: A 1 TB+ backup cannot be restored quickly. Use incremental backups + WAL instead.

Cluster growth: Adding 10 new nodes requires re-sharding, data migration, and careful coordination.

Debugging: With 100+ nodes, failures are inevitable. Maintain excellent logging, tracing, and runbooks.

Key Takeaways

  • Quantization (INT8, PQ) reduces storage and latency by 4–20x with acceptable recall loss.
  • Hierarchical indexing (coarse-to-fine clustering) keeps search latency under 100ms at 10B scale.
  • Distributed query processing across shards requires aggregation and global re-ranking.
  • Milvus, with Kubernetes, automates much of the complexity; manage it carefully.
  • Billion-scale requires 100+ person-hours of infrastructure engineering. Start smaller if possible.

Frequently Asked Questions

Can I reach billion-scale without sharding?

Practically, no. A single HNSW node can hold ~1B vectors at extreme latency (hundreds of ms). Queries across 1B vectors require sharding to stay under 100ms SLA. The exception: batch-style systems where 1+ second latency is acceptable.

How much memory do I need for 1B vectors?

Embeddings (384 dims): 1B × 384 × 4 bytes = ~1.5 TB. HNSW graph structure (M=16): 1B × 16 × 8 bytes = ~128 GB. Total: ~1.6 TB storage, ~150 GB RAM for working set. Quantization (INT8) reduces to ~400 GB storage, ~50 GB RAM.

How long does backup/restore take at billion-scale?

Snapshot backup: 2–8 hours to write 1 TB to S3. Restore: 4–12 hours to transfer and build indexes. Use incremental snapshots and WAL replay for faster recovery.

Should I use PQ (Product Quantization) at billion-scale?

Only if storage is the bottleneck. PQ achieves 10–20x compression but requires additional distance computations at search time. For most billion-scale systems, INT8 quantization is the sweet spot: 4x compression, minimal overhead.

Further Reading