Scaling Semantic Caches with Vector Databases
In-memory semantic caches (Article 3) work for <1M cached entries. At 10M+, you need a vector database: a specialized system optimized for approximate nearest-neighbor (ANN) search across billions of vectors. This article compares four production-ready options, covers sharding strategies, and teaches consistency patterns for multi-region deployments. By the end, you will know when to use Pinecone, Weaviate, Milvus, or Postgres pgvector, and how to architect a global semantic cache.
Vector Database Comparison
| Database | Type | Managed? | Scalability | Query Latency (p99) | Cost | Best For |
|---|---|---|---|---|---|---|
| Pinecone | Cloud-native | Yes | Billions of vectors | 50–100 ms | USD 100–5,000/mo | High-volume SaaS, minimal ops |
| Weaviate | Open-source + Cloud | Both | Hundreds of millions | 30–80 ms | Free (OSS) to USD 1,000+/mo | Flexibility, custom filters |
| Milvus | Open-source | Self-hosted | Billions | 20–60 ms | Cost of infrastructure | Cost-sensitive, full control |
| Postgres pgvector | Extension | Self-hosted | Millions | 100–500 ms (no indexing) | Cost of Postgres instance | Existing PG infrastructure |
When to Use Each
Pinecone: You have 10M+ vectors, no custom logic, and want zero ops. Ideal for startups and SaaS. Managed scaling, built-in replication, sub-100ms latency. Cost: USD 2–3K/month for 100M vectors.
Weaviate: You need complex filtering (where topic = "physics") alongside similarity search, or you want both open-source and managed options. Good for knowledge graphs, document retrieval. Cost: USD 500–2K/month (managed), free (self-hosted).
Milvus: You need extreme cost efficiency or full control over infrastructure. Deploy in your data center or Kubernetes. Billions of vectors on modest hardware. Cost: Minimal (just infrastructure). Tradeoff: 3–6 month operational ramp-up.
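Postgres pgvector: You already run Postgres, cache size is <10M entries, and simplicity beats performance. Lowest-friction adoption. Cost: Minimal (existing Postgres). Latency is the tradeoff: the 100–500 ms figure in the table assumes no vector index, and adding one of pgvector's IVFFlat or HNSW indexes reduces it.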
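The rules of thumb above can be condensed into a small helper. This is only a sketch: the function name `recommend_backend` and its parameters are hypothetical, and the thresholds are the ones quoted in this article rather than hard limits.

```python
def recommend_backend(num_vectors: int,
                      has_postgres: bool = False,
                      needs_rich_filters: bool = False,
                      wants_managed: bool = True) -> str:
    """Map this article's rules of thumb onto a single suggestion."""
    if num_vectors < 1_000_000:
        # Small caches still fit comfortably in memory.
        return "in-memory"
    if has_postgres and num_vectors < 10_000_000:
        # Reuse existing Postgres while the cache is modest in size.
        return "pgvector"
    if needs_rich_filters:
        # Filtering alongside similarity search is Weaviate's niche here.
        return "weaviate"
    if wants_managed:
        return "pinecone"
    # Self-hosted with full control over infrastructure.
    return "milvus"


print(recommend_backend(100_000_000))
```

Treat the result as a starting point for evaluation, not a final decision; real choices also depend on team experience and budget.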
Example: Scaling Semantic Cache to 100M Vectors with Pinecone
import hashlib
from datetime import datetime, timezone
from typing import Optional

import numpy as np
import pinecone  # Legacy pinecone-client v2 API (pinecone.init); newer clients use a Pinecone(...) object


class ScaledSemanticCache:
    """Semantic cache backed by Pinecone (cloud-hosted vector database)."""

    def __init__(self, pinecone_api_key: str, index_name: str = "semantic-cache",
                 dimension: int = 1536):
        # Initialize Pinecone
        pinecone.init(api_key=pinecone_api_key, environment="us-west1-gcp")
        # Create or connect to index
        if index_name not in pinecone.list_indexes():
            pinecone.create_index(
                name=index_name,
                dimension=dimension,
                metric="cosine",
                pods=1,  # Start with 1 pod; add pods or replicas as load grows
                pod_type="s1"  # Storage-optimized pod type
            )
        self.index = pinecone.Index(index_name)
        self.threshold = 0.95

    @staticmethod
    def _vector_id(tenant_id: str, query: str) -> str:
        """Deterministic ID: the same tenant and query always map to the same vector."""
        return f"{tenant_id}#{hashlib.md5(query.encode()).hexdigest()}"

    def store(self, query: str, embedding: np.ndarray, response: str,
              tenant_id: str, metadata: Optional[dict] = None):
        """Store cache entry in Pinecone."""
        vector_id = self._vector_id(tenant_id, query)
        # Metadata (filterable and returned with matches)
        meta = dict(metadata or {})
        meta.update({
            "query": query[:500],  # Pinecone metadata has size limits
            "tenant_id": tenant_id,
            "response": response,  # Needed by find_similar; see the size note there
            "response_length": len(response),
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
        # Upsert to Pinecone (insert or update if exists)
        self.index.upsert(
            vectors=[(vector_id, embedding.tolist(), meta)]
        )

    def find_similar(self, query_embedding: np.ndarray,
                     tenant_id: str) -> Optional[tuple[str, float]]:
        """
        Search for similar cached responses (scoped to tenant).
        Returns: (cached_response, similarity) or None.
        """
        # Query Pinecone for the single nearest neighbor
        # Include tenant filter to enforce isolation
        results = self.index.query(
            vector=query_embedding.tolist(),
            top_k=1,
            filter={"tenant_id": {"$eq": tenant_id}},
            include_metadata=True
        )
        if not results["matches"]:
            return None
        match = results["matches"][0]
        similarity = match["score"]  # Cosine similarity (higher means closer)
        if similarity < self.threshold:
            return None
        # Read the response back from metadata
        # Note: Pinecone metadata is limited; for large responses, store in separate DB
        response = match["metadata"].get("response", "")
        return response, similarity

    def batch_store(self, entries: list[dict], batch_size: int = 100):
        """Efficiently store multiple cache entries."""
        vectors = []
        for i, entry in enumerate(entries):
            vectors.append((
                self._vector_id(entry["tenant_id"], entry["query"]),
                entry["embedding"].tolist(),
                {
                    "query": entry["query"][:500],
                    "tenant_id": entry["tenant_id"],
                    "response": entry["response"],
                    "response_length": len(entry["response"])
                }
            ))
            # Upsert when the batch is full or the input is exhausted
            if (i + 1) % batch_size == 0 or i == len(entries) - 1:
                self.index.upsert(vectors=vectors)
                vectors = []

    def stats(self) -> dict:
        """Get index statistics from Pinecone."""
        index_stats = self.index.describe_index_stats()
        return {
            "total_vectors": index_stats["total_vector_count"],
            # Per-namespace counts; this class isolates tenants with a metadata
            # filter, so everything lands in the default namespace
            "namespaces": index_stats["namespaces"],
            "dimension": index_stats["dimension"]
        }
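The note in `find_similar` about large responses suggests keeping the full text outside the vector database. A minimal sketch of that pattern follows, assuming a hypothetical `ResponseStore` (a dict standing in for Redis, S3, or a relational table) and hypothetical helper names; only a short `response_key` travels in the vector metadata.

```python
import hashlib


class ResponseStore:
    """Hypothetical key-value store for full responses (a dict stands in for a real service)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, response: str) -> None:
        self._data[key] = response

    def get(self, key: str):
        return self._data.get(key)


def make_vector_id(tenant_id: str, query: str) -> str:
    # Same ID scheme as the cache class above.
    return f"{tenant_id}#{hashlib.md5(query.encode()).hexdigest()}"


def build_metadata(tenant_id: str, query: str, response: str, store: ResponseStore):
    """Write the response to the side store and return (vector_id, small metadata dict)."""
    key = make_vector_id(tenant_id, query)
    store.put(key, response)
    meta = {
        "tenant_id": tenant_id,
        "query": query[:500],
        "response_key": key,  # pointer to the full text, not the text itself
        "response_length": len(response),
    }
    return key, meta


def resolve_response(match_metadata: dict, store: ResponseStore):
    """Turn the metadata of a vector match back into the full response."""
    return store.get(match_metadata["response_key"])


store = ResponseStore()
vector_id, meta = build_metadata("tenant-a", "What is ANN search?", "long answer " * 10_000, store)
print(len(resolve_response(meta, store)))
```

The cost of this design is a second lookup on every cache hit, so it makes most sense when responses are routinely too large for metadata.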
Sharding and Distributed Consistency
At scale (billions of vectors across regions), shard by tenant or by vector hash.
Example: Shard by tenant
class ShardedSemanticCache:
    """Multi-shard semantic cache for global deployment."""

    def __init__(self, pinecone_keys: dict[str, str]):
        # Map tenant -> Pinecone index
        # Ensures one tenant's data does not move between shards
        # Note: the legacy pinecone.init() call sets process-wide configuration,
        # so using a different API key per tenant requires separate processes
        # or a client that accepts per-instance credentials.
        self.shards = {
            tenant_id: ScaledSemanticCache(api_key, f"cache-{tenant_id}")
            for tenant_id, api_key in pinecone_keys.items()
        }

    def get_shard(self, tenant_id: str) -> ScaledSemanticCache:
        """Route to correct shard based on tenant."""
        if tenant_id not in self.shards:
            raise ValueError(f"Unknown tenant: {tenant_id}")
        return self.shards[tenant_id]

    def store(self, query: str, embedding: np.ndarray, response: str, tenant_id: str):
        """Store in tenant-specific shard."""
        shard = self.get_shard(tenant_id)
        shard.store(query, embedding, response, tenant_id)

    def find_similar(self, query_embedding: np.ndarray, tenant_id: str):
        """Search within tenant's shard only."""
        shard = self.get_shard(tenant_id)
        return shard.find_similar(query_embedding, tenant_id)
Alternative: Shard by vector hash (for global cache shared across tenants)
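def get_shard_index(vector_hash: str, num_shards: int) -> int:
    """Deterministic routing: same vector always goes to same shard."""
    # vector_hash must be a hex string, for example a digest of the embedding bytes
    return int(vector_hash[:8], 16) % num_shards

# On store: hash embedding, route to shard
# On search: search all shards in parallel (or use a leader/follower pattern),
# because a byte-level hash does not place similar vectors on the same shard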
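The store and search comments above can be sketched end to end. Everything here other than `get_shard_index` is an assumption made for illustration: `InMemoryShard` stands in for a real vector index, `embedding_hash` uses SHA-256 over float32 bytes as one possible hash, and the fan-out uses a thread pool.

```python
import hashlib
import math
import struct
from concurrent.futures import ThreadPoolExecutor


def embedding_hash(embedding) -> str:
    """Hex digest of the embedding packed as float32 (an illustrative choice)."""
    packed = struct.pack(f"{len(embedding)}f", *embedding)
    return hashlib.sha256(packed).hexdigest()


def get_shard_index(vector_hash: str, num_shards: int) -> int:
    return int(vector_hash[:8], 16) % num_shards


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryShard:
    """Stand-in for one vector index; does a brute-force scan."""

    def __init__(self):
        self.items = []  # list of (embedding, response)

    def store(self, embedding, response):
        self.items.append((list(embedding), response))

    def search(self, embedding):
        if not self.items:
            return None
        vec, response = max(self.items, key=lambda item: cosine(item[0], embedding))
        return response, cosine(vec, embedding)


def store_routed(shards, embedding, response) -> int:
    """Write to exactly one shard, chosen by the embedding hash."""
    idx = get_shard_index(embedding_hash(embedding), len(shards))
    shards[idx].store(embedding, response)
    return idx


def search_all(shards, embedding):
    """Query every shard in parallel and keep the best-scoring hit."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = list(pool.map(lambda shard: shard.search(embedding), shards))
    hits = [r for r in results if r is not None]
    return max(hits, key=lambda r: r[1]) if hits else None


shards = [InMemoryShard() for _ in range(4)]
store_routed(shards, [1.0, 0.0], "answer A")
store_routed(shards, [0.0, 1.0], "answer B")
print(search_all(shards, [0.9, 0.1]))
```

Writes touch one shard while reads touch all of them, so this layout favors write-heavy workloads; read latency is bounded by the slowest shard.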
Handling Consistency and Staleness at Scale
In a distributed system, cache entries are replicated across nodes, and replicas can lag behind the primary. Define the consistency guarantees you need up front.
Strong consistency (all replicas immediately synchronized):
- Implementation: Write to primary, wait for replication before returning.
- Latency impact: +50–200 ms per write.
- Use case: Financial, medical data where freshness is critical.
Eventual consistency (replicas synchronized within minutes):
- Implementation: Write to primary, replicate asynchronously. Read from nearest replica.
- Latency impact: +0 ms (best-case), but data may be stale.
- Use case: General Q&A, content generation (tolerance for stale cached responses).
Example: Consistency policy
import asyncio


class ConsistentSemanticCache:
    """Semantic cache with configurable consistency guarantees."""

    def __init__(self, primary_shard, replica_shards=None, consistency="eventual"):
        self.primary = primary_shard
        self.replicas = replica_shards or []
        # "strong", "eventual", or "local" ("local" writes to the primary only)
        self.consistency = consistency

    def store(self, query: str, embedding: np.ndarray, response: str, tenant_id: str):
        """Store with configurable consistency."""
        # Write to primary
        self.primary.store(query, embedding, response, tenant_id)
        if self.consistency == "strong":
            # Write synchronously to every replica before returning
            for replica in self.replicas:
                replica.store(query, embedding, response, tenant_id)
        elif self.consistency == "eventual":
            # Fire-and-forget async replication.
            # Assumes store() is called from inside a running asyncio event loop
            # and that each replica exposes a store_async() coroutine, which
            # ScaledSemanticCache above does not define.
            for replica in self.replicas:
                asyncio.create_task(
                    replica.store_async(query, embedding, response, tenant_id)
                )
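For a version of the consistency policy that runs without an event loop, one option is a thread pool. The sketch below swaps in `concurrent.futures` for asyncio and uses a hypothetical `DictShard` in place of a real vector index; the class and method names are illustrative and not part of any library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class DictShard:
    """Stand-in for one cache shard; a lock guards the dict against concurrent writes."""

    def __init__(self):
        self.data = {}
        self._lock = threading.Lock()

    def store(self, key, value):
        with self._lock:
            self.data[key] = value


class ReplicatedStore:
    def __init__(self, primary, replicas, consistency="eventual"):
        if consistency not in ("strong", "eventual", "local"):
            raise ValueError(f"Unknown consistency mode: {consistency}")
        self.primary = primary
        self.replicas = list(replicas)
        self.consistency = consistency
        self._pool = ThreadPoolExecutor(max_workers=4)

    def store(self, key, value):
        """Write to the primary, then replicate according to the policy.

        Returns the replication futures so callers (or tests) can wait on them.
        """
        self.primary.store(key, value)
        if self.consistency == "local":
            return []  # primary only, no replication
        futures = [self._pool.submit(r.store, key, value) for r in self.replicas]
        if self.consistency == "strong":
            wait(futures)
            for f in futures:
                f.result()  # re-raise any replica error to the caller
        return futures

    def close(self):
        self._pool.shutdown(wait=True)


primary, replicas = DictShard(), [DictShard(), DictShard()]
cache = ReplicatedStore(primary, replicas, consistency="strong")
cache.store("tenant-a#abc", "cached answer")
print(all("tenant-a#abc" in r.data for r in replicas))
cache.close()
```

In "eventual" mode, replica failures surface only through the returned futures, so a production version would want logging or a retry queue around them.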
Migration: In-Memory to Vector Database
Moving a production cache from memory to Pinecone:
- Shadow traffic (1–2 weeks): Write to both in-memory and Pinecone. Compare responses.
- Gradual migration (1 week): Route 10% of reads to Pinecone, 90% to memory. Monitor latency and correctness.
- Flip traffic (1 day): Switch 100% to Pinecone. Keep in-memory as hot backup.
- Cleanup (1 week): Retire in-memory cache after confirming stability.
Code: Shadow write pattern
import logging

logger = logging.getLogger(__name__)


class MigrationCache:
    """Dual-write cache during migration."""

    def __init__(self, memory_cache, vector_db_cache):
        self.memory = memory_cache
        self.vector_db = vector_db_cache

    def store(self, query, embedding, response, tenant_id):
        """Write to both systems."""
        # The in-memory cache remains the source of truth during migration
        self.memory.store(query, embedding, response, tenant_id)
        try:
            self.vector_db.store(query, embedding, response, tenant_id)
        except Exception as e:
            # A failed shadow write must not fail the request
            logger.warning("Vector DB write failed (acceptable during migration): %s", e)

    def find_similar(self, query_embedding, tenant_id, use_vector_db: bool = False):
        """Read from vector_db or memory based on migration flag."""
        if use_vector_db:
            return self.vector_db.find_similar(query_embedding, tenant_id)
        else:
            return self.memory.find_similar(query_embedding, tenant_id)
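The gradual-migration step (10% of reads to Pinecone) needs some way to set the `use_vector_db` flag per request. One possible approach is to bucket tenants deterministically, so a given tenant always reads from the same backend at a given rollout level. The helper name `in_rollout` is hypothetical.

```python
import hashlib


def in_rollout(tenant_id: str, rollout_percent: int) -> bool:
    """Return True if this tenant falls inside the rollout cohort.

    The tenant ID is hashed into one of 100 buckets; tenants in buckets
    below rollout_percent read from the vector database.
    """
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent


tenants = [f"tenant-{i}" for i in range(10_000)]
share = sum(in_rollout(t, 10) for t in tenants) / len(tenants)
print(f"{share:.1%} of tenants routed to the vector DB")
```

With the class above, a read would look like `cache.find_similar(embedding, tenant_id, use_vector_db=in_rollout(tenant_id, 10))`. Raising the percentage only adds tenants to the cohort, so nobody flips back and forth between backends as the rollout widens.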
Key Takeaways
- Vector databases (Pinecone, Weaviate, Milvus) are essential at scale (>10M entries), where an in-memory cache is no longer practical.
- Pinecone is simplest (managed, sub-100ms, no ops). Milvus is most cost-efficient (self-hosted, billions of vectors).
- Shard by tenant for isolation; replicate for high availability; define consistency guarantees (strong vs. eventual).
- Migrate gradually: shadow traffic → A/B test → full traffic → retire old system.
Frequently Asked Questions
How much does Pinecone cost for 100M vectors?
Pricing depends on the plan and pod type and changes over time, so check Pinecone's current pricing before budgeting. This article assumes roughly USD 2–3K/month for 100M vectors. The cost-benefit depends on query volume and avoided LLM costs: at a 40% hit rate and USD 0.003 per LLM request, 50M requests per month avoid 20M LLM calls and save USD 60K/month, so under these assumptions the cache pays for itself within the first month.
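The savings arithmetic in this answer is easy to rerun with your own numbers. The function names below are hypothetical, and the `cache_cost` value is an input you should take from a current quote; the example uses the upper end of the USD 2–3K/month estimate given earlier in this article for 100M vectors.

```python
def monthly_llm_savings(requests: int, hit_rate: float, llm_cost_per_request: float) -> float:
    """LLM spend avoided per month: every cache hit is one LLM call not made."""
    return requests * hit_rate * llm_cost_per_request


def net_monthly_benefit(requests: int, hit_rate: float,
                        llm_cost_per_request: float, cache_cost: float) -> float:
    """Avoided LLM spend minus what the vector database costs per month."""
    return monthly_llm_savings(requests, hit_rate, llm_cost_per_request) - cache_cost


savings = monthly_llm_savings(50_000_000, 0.40, 0.003)
net = net_monthly_benefit(50_000_000, 0.40, 0.003, cache_cost=3_000)
print(f"avoided LLM spend: USD {savings:,.0f}/month, net benefit: USD {net:,.0f}/month")
```

A negative net benefit means the cache costs more than it saves at that traffic level, which is the signal to reduce the index size or raise the hit rate before scaling further.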
Can I use Postgres pgvector in production?
Yes, if cache size <10M entries and query latency <500ms is acceptable. For higher throughput, switch to Pinecone or Weaviate. Postgres is a great starting point for proof-of-concept.
How do I handle cache eviction in vector databases?
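Eviction features differ by database and version, so confirm what your deployment supports. Pinecone supports deleting by ID and clearing an entire namespace; Milvus allows partition-based purging; check the current Weaviate and Milvus documentation for built-in TTL options. A portable approach is to store a timestamp in each entry's metadata and run a periodic cleanup job. Define a policy: delete entries older than 30 days, or delete the least recently used entries when the cache exceeds a size limit.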
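A 30-day age policy can be sketched independently of any particular database. The example below assumes each entry carries an ISO 8601 `timestamp` in its metadata, as in the Pinecone example earlier; the function name `expired_ids` is hypothetical, and the returned IDs would be passed to whatever delete-by-ID call your database provides.

```python
from datetime import datetime, timedelta, timezone


def expired_ids(entries, max_age_days: int = 30, now=None):
    """Return the IDs of entries older than max_age_days.

    entries: iterable of (vector_id, iso_timestamp) pairs.
    Timestamps without a UTC offset are treated as UTC.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for vector_id, iso_timestamp in entries:
        written_at = datetime.fromisoformat(iso_timestamp)
        if written_at.tzinfo is None:
            written_at = written_at.replace(tzinfo=timezone.utc)
        if written_at < cutoff:
            stale.append(vector_id)
    return stale


entries = [
    ("tenant-a#1", "2024-01-01T00:00:00"),
    ("tenant-a#2", "2024-02-20T00:00:00+00:00"),
]
print(expired_ids(entries, now=datetime(2024, 3, 1, tzinfo=timezone.utc)))
```

Running this as a scheduled job keeps eviction logic in one place and works the same way whichever vector database sits underneath.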
Is cross-region replication necessary?
Only if you serve users globally and latency to a single region is unacceptable (>100 ms). For most use cases, a single region is sufficient; users tolerate 50–100 ms latency if it saves costs.
What if a vector database goes down?
Graceful degradation: if vector DB is unavailable, fall back to LLM inference (cost impact but service continues). Use circuit breakers to detect failures and auto-switch. Replicate to a backup region or use managed services with SLAs.
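A circuit breaker with LLM fallback might look like the following. This is a minimal sketch: the class, its thresholds, and the `lookup_with_fallback` helper are all hypothetical, and `cache_lookup` and `llm_call` are zero-argument callables you would supply.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and allows trial calls once a cool-down has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let calls through again to probe for recovery.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def lookup_with_fallback(breaker: CircuitBreaker, cache_lookup, llm_call):
    """Try the cache unless the breaker is open; fall back to the LLM on miss or failure."""
    if breaker.allow():
        try:
            hit = cache_lookup()
            breaker.record_success()
            if hit is not None:
                return hit, "cache"
        except Exception:
            breaker.record_failure()
    return llm_call(), "llm"


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
print(lookup_with_fallback(breaker, lambda: None, lambda: "fresh LLM answer"))
```

While the breaker is open, requests skip the vector database entirely, which avoids paying a timeout on every request during an outage.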
Further Reading
- Pinecone Architecture and Scaling (2024) — Official scaling patterns.
- Weaviate: Vector Search at Scale — Alternatives to managed services.
- Milvus Documentation: Multi-Node Cluster Setup — Deploying at scale.
- Approximate Nearest Neighbor Search Algorithms (HNSW, IVF) — Academic foundation for scaling vector search.