Vector DB Cost Optimization and Multi-Tenancy
Vector database costs can spiral quickly. As a rough illustration, a 1-billion-vector index on Pinecone can cost about $2.6k/month at 100 QPS, and the same workload on self-hosted Milvus about $5k/month in infrastructure; actual figures depend on pricing tier, vector dimension, and replication. For SaaS products, hosting a separate vector database per tenant is prohibitively expensive. Understanding the cost drivers and designing efficient multi-tenant architectures can cut spending by 50% or more while maintaining performance and security isolation.
Cost Drivers: Where Money Goes
Managed (Pinecone):
- Per-vector storage: $0.10 per month per 1M vectors.
- Query throughput: $0.00010 per 1,000 reads.
- Write operations: Usually included or cheaper than reads.
- Metadata filtering: Included, no extra cost.
Example: 100M vectors at 100 QPS (8.6M queries/day, ~260M queries/month).
- Storage: 100M × $0.10 / 1M = $10/month.
- Reads: 260M × $0.00010 / 1000 = $26/month.
- Total: ~$36/month (surprisingly cheap, but scales quickly).
100B vectors at 1k QPS (~2.6B queries/month):
- Storage: 100B × $0.10 / 1M = $10k/month.
- Reads: 2.6B × $0.00010 / 1000 = $260/month.
- Total: ~$10.3k/month (expensive!).
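The managed-pricing arithmetic above can be wrapped in a small helper. This is a minimal sketch that assumes the illustrative rates quoted in this section ($0.10 per 1M vectors per month, $0.00010 per 1,000 reads) and a 30-day month; the function and parameter names are hypothetical, and real pricing should be taken from the provider's current price list.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month, matching the ~260M queries/month figure


def managed_monthly_cost(num_vectors, qps,
                         storage_per_million=0.10,
                         read_price_per_thousand=0.00010):
    """Return (storage, reads, total) in dollars per month under the assumed rates."""
    storage = num_vectors / 1_000_000 * storage_per_million
    queries_per_month = qps * SECONDS_PER_MONTH
    reads = queries_per_month / 1_000 * read_price_per_thousand
    return storage, reads, storage + reads


# The 100M-vector, 100 QPS example from the text
storage, reads, total = managed_monthly_cost(100_000_000, 100)
print(f"storage=${storage:.2f} reads=${reads:.2f} total=${total:.2f}")
```

Keeping the rates as keyword arguments makes it easy to re-run the estimate when the price list changes.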
Self-hosted (on Kubernetes):
- Compute: Milvus requires CPU + RAM for indexing and search. 16 CPU + 128 GB RAM ~ $2k/month per node (AWS EC2 on-demand).
- Storage: EBS volumes. ~$0.10 per GB per month. 1.5 TB index = $150/month (but you need redundancy, so 3 replicas = $450/month).
- Network: Egress is expensive. $0.12 per GB beyond free tier. 1 TB egress/month = $120/month.
- Total for 1B vectors: 10 nodes (100M vectors each) × $2.5k/month = $25k/month.
Conclusion: Self-hosted is cheaper at 100B+ scale but requires operational overhead.
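The self-hosted line items can be combined the same way. The sketch below is an assumption-laden estimate, not a sizing tool: it treats storage and egress as cluster-wide totals and uses the per-unit prices listed above, whereas the $25k figure in the text rounds each node up to about $2.5k, so the two totals differ. All names and defaults are hypothetical.

```python
import math


def self_hosted_monthly_cost(num_vectors,
                             vectors_per_node=100_000_000,
                             node_cost=2_000.0,      # assumed 16 CPU + 128 GB RAM node
                             index_gb=1_500.0,       # assumed 1.5 TB index
                             storage_per_gb=0.10,
                             replicas=3,
                             egress_gb=1_000.0,      # assumed 1 TB egress per month
                             egress_per_gb=0.12):
    """Break a rough monthly self-hosting bill into compute, storage, and egress."""
    nodes = math.ceil(num_vectors / vectors_per_node)
    compute = nodes * node_cost
    storage = index_gb * storage_per_gb * replicas
    egress = egress_gb * egress_per_gb
    return {
        "nodes": nodes,
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "total": compute + storage + egress,
    }


estimate = self_hosted_monthly_cost(1_000_000_000)
for item, value in estimate.items():
    print(f"{item}: {value}")
```

Compute dominates the bill in this breakdown, which is why the strategies that follow focus on shrinking the index rather than the disks.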
Cost Reduction Strategy 1: Quantization
Quantization reduces storage and query costs dramatically:
| Quantization | Storage (1B vectors, 384-dim) | Storage Cost (at $0.10/GB) | Speedup | Recall Loss |
|---|---|---|---|---|
| Float32 | 1.5 TB | $150/month | Baseline | 0% |
| INT8 | 375 GB | $37.50/month | 2–3x | < 5% |
| PQ (8-byte) | 8 GB | $0.80/month | 5–10x | 10–15% |
Switching 1B vectors from float32 to INT8 saves $112.50/month (75% reduction). For 10B vectors, savings are about $1.1k/month.
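# Enable scalar (INT8) quantization in Qdrant
from qdrant_client.models import (
    Distance,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

# client is an existing QdrantClient instance
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=False)
    ),
)
# Cost: same ingestion; quantized vectors are ~4x smaller than float32
# (Qdrant also keeps the original vectors, which can stay on disk)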
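To see where the 4x figure comes from, here is a minimal pure-Python sketch of 8-bit scalar quantization. It is an illustration of the idea only, not how Qdrant or any other engine implements it: each component is mapped linearly onto one byte using the vector's own minimum and maximum, and the helper names are hypothetical.

```python
import struct


def quantize_8bit(vector):
    """Map each float onto one byte (0..255) using the vector's own value range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximately reconstruct the original floats from the byte codes."""
    return [lo + c * scale for c in codes]


# A synthetic 384-dimensional vector with values spread over [-0.5, 0.5]
vector = [i / 383 - 0.5 for i in range(384)]
codes, lo, scale = quantize_8bit(vector)

float32_bytes = len(struct.pack(f"{len(vector)}f", *vector))
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(f"float32: {float32_bytes} bytes, quantized: {len(codes)} bytes")
print(f"max reconstruction error: {max_error:.5f}")
```

The reconstruction error is bounded by half a quantization step, which is the source of the small recall loss reported in the table.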
Cost Reduction Strategy 2: Batch Queries and Caching
Cache repeated queries and run concurrent ones in parallel so that fewer requests reach the database:
from concurrent.futures import ThreadPoolExecutor
import functools

from qdrant_client import QdrantClient


class CachingVectorClient:
    def __init__(self, client):
        self.client = client
        # (query vector as tuple, limit) -> results; unbounded, so add eviction in production
        self.cache = {}

    def search(self, query_vector, limit=10, use_cache=True):
        """Search with caching."""
        # Exact-match cache key; include limit so different limits do not collide
        key = (tuple(query_vector), limit)
        if use_cache and key in self.cache:
            return self.cache[key]
        # Query database
        results = self.client.search(
            collection_name="documents",
            query_vector=query_vector,
            limit=limit,
        )
        # Cache result
        self.cache[key] = results
        return results

    def batch_search(self, query_vectors, limit=10):
        """
        Batch search: amortize query cost across multiple queries.
        Some DBs support batch operations; if not, parallelize.
        """
        with ThreadPoolExecutor(max_workers=8) as executor:
            results = list(executor.map(
                functools.partial(self.search, limit=limit, use_cache=True),
                query_vectors,
            ))
        return results


# Usage: cache reduces redundant queries
# (qdrant_client here is an existing QdrantClient instance)
client = CachingVectorClient(qdrant_client)
# Same query twice -> second is free (cached)
results1 = client.search([0.1, 0.2, 0.3, ...])
results2 = client.search([0.1, 0.2, 0.3, ...])  # Cache hit
Cost impact: Cache hit rate of 20% reduces query volume by 20%, saving 20% of query costs.
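An unbounded dictionary grows without limit, so a production cache usually needs eviction and a way to measure the hit rate. The following is a minimal sketch of a bounded least-recently-used cache with hit-rate tracking. The class name, the `compute` callback, and the stand-in `fake_search` function are all hypothetical; it caches exact-match query vectors only.

```python
from collections import OrderedDict


class LRUQueryCache:
    """Bounded cache keyed on (query vector, limit), evicting the least recently used entry."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query_vector, limit, compute):
        key = (tuple(query_vector), limit)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        result = compute(query_vector, limit)  # only misses reach the database
        self.entries[key] = result
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # drop the least recently used entry
        return result

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Stand-in for a real database call, so the billed query count can be observed
db_calls = []

def fake_search(vector, limit):
    db_calls.append(vector)
    return [("doc", 1.0)]


cache = LRUQueryCache(max_entries=2)
queries = [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.5, 0.6], [0.7, 0.8]]
for q in queries:
    cache.get_or_compute(q, 10, fake_search)

print(f"queries={len(queries)} db_calls={len(db_calls)} hit_rate={cache.hit_rate:.0%}")
```

Tracking `hit_rate` in production shows directly what fraction of billed reads the cache is absorbing.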
Cost Reduction Strategy 3: Tiered Storage
Store hot data (frequently accessed) in the main index; cold data (archival) in cheaper storage:
import numpy as np
from scipy.spatial.distance import cdist


class TieredVectorStorage:
    def __init__(self, hot_client, cold_storage_s3):
        self.hot = hot_client  # Qdrant, with expensive storage
        self.cold = cold_storage_s3  # S3 with cheap storage

    def search(self, query_vector, limit=10):
        """Search hot tier first, fall back to cold if needed."""
        # Search hot tier; Qdrant returns scored points (higher cosine score = closer)
        hot_hits = self.hot.search(
            collection_name="documents_hot",
            query_vector=query_vector,
            limit=limit,
        )
        hot_results = [{"id": p.id, "score": p.score} for p in hot_hits]
        # If results are < limit, search cold tier
        if len(hot_results) < limit:
            # Retrieve cold data from S3 (slower)
            cold_results = self._search_cold(query_vector, limit=limit - len(hot_results))
            # Merge and re-rank by similarity, best first
            all_results = hot_results + cold_results
            all_results.sort(key=lambda r: r["score"], reverse=True)
            return all_results[:limit]
        return hot_results

    def _search_cold(self, query_vector, limit):
        """Search cold storage (S3). Much slower, but cheaper."""
        # Load cold index from S3, search, cache result temporarily
        # (Simplified: assume cold_vectors are pre-indexed; the loader is not shown)
        cold_vectors = self._load_from_s3()  # 2-D array, one row per vector
        distances = cdist([query_vector], cold_vectors, metric="cosine")[0]
        top_ids = np.argsort(distances)[:limit]
        # Convert cosine distance to similarity so scores are comparable with the hot tier;
        # ids here are row positions in the cold array
        return [{"id": int(i), "score": 1.0 - float(distances[i])} for i in top_ids]


# Usage
tiered = TieredVectorStorage(hot_qdrant, cold_s3)
# Old documents go to S3 (cold tier); serialize() is a placeholder
tiered.cold.put_object(
    Bucket="cold-vectors",
    Key="documents/2024-01-01.parquet",
    Body=serialize(old_vectors),
)
# Recent documents are in hot tier (Qdrant)
# Cost: the hot tier is expensive per vector but small;
# the cold tier is cheap per vector and holds the archive
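Deciding which vectors belong in which tier needs some measure of "frequently accessed". One simple option, sketched here under the assumption that the application records a last-access timestamp per vector, is to demote anything idle for longer than a threshold. The record type, field names, and the 30-day threshold are hypothetical choices, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    id: str
    last_access: float  # Unix timestamp of the most recent query hit


def split_hot_cold(records, now, max_idle_seconds):
    """Return (hot_ids, cold_ids): records idle longer than the threshold go cold."""
    hot, cold = [], []
    for record in records:
        idle = now - record.last_access
        (cold if idle > max_idle_seconds else hot).append(record.id)
    return hot, cold


DAY = 24 * 60 * 60
now = 1_700_000_000.0
records = [
    VectorRecord("doc-1", now - 1 * DAY),
    VectorRecord("doc-2", now - 45 * DAY),
    VectorRecord("doc-3", now - 10 * DAY),
    VectorRecord("doc-4", now - 90 * DAY),
]

hot_ids, cold_ids = split_hot_cold(records, now, max_idle_seconds=30 * DAY)
print("hot:", hot_ids)
print("cold:", cold_ids)
```

A nightly job could run this split and move the cold IDs out of the main index.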
Cost Reduction Strategy 4: Multi-Tenancy and Resource Sharing
For SaaS, hosting separate vector DBs per tenant is expensive. Instead, share a single database across tenants with logical isolation:
Approach 1: Collections per tenant (simple isolation):
from qdrant_client.models import Distance, VectorParams


class MultiTenantVectorDB:
    def __init__(self, client):
        self.client = client

    def get_tenant_collection(self, tenant_id):
        """Get or create collection for tenant."""
        collection_name = f"documents_tenant_{tenant_id}"
        # Create if not exists (explicit check instead of a bare except,
        # which would also hide network and auth errors)
        if not self.client.collection_exists(collection_name):
            self.client.create_collection(
                collection_name=collection_name,
                vectors_config=VectorParams(size=384, distance=Distance.COSINE),
            )
        return collection_name

    def upsert_for_tenant(self, tenant_id, points):
        """Upsert points for a tenant."""
        collection = self.get_tenant_collection(tenant_id)
        self.client.upsert(collection_name=collection, points=points)

    def search_for_tenant(self, tenant_id, query_vector, limit=10):
        """Search within tenant's collection."""
        collection = self.get_tenant_collection(tenant_id)
        return self.client.search(
            collection_name=collection,
            query_vector=query_vector,
            limit=limit,
        )


# Cost model:
# - Pinecone: $10/month per tenant (small) + query costs
# - Self-hosted: 1 shared cluster for all tenants (amortized)
db = MultiTenantVectorDB(client)
# Tenant A and B each have separate collections
db.upsert_for_tenant("tenant_a", points_a)
db.upsert_for_tenant("tenant_b", points_b)
# Queries are isolated per collection
results_a = db.search_for_tenant("tenant_a", query)
results_b = db.search_for_tenant("tenant_b", query)
Approach 2: Shared collection with payload-based isolation (compact):
from qdrant_client.models import FieldCondition, Filter, MatchValue


class SharedMultiTenantVectorDB:
    def __init__(self, client):
        self.client = client
        self.collection_name = "documents_multi_tenant"
        # A keyword payload index on tenant_id keeps filtered search fast as the collection grows

    def upsert_for_tenant(self, tenant_id, points):
        """Upsert with tenant_id in payload."""
        for point in points:
            # payload may be None; always set tenant_id server-side, never trust the caller's value
            point.payload = {**(point.payload or {}), "tenant_id": tenant_id}
        self.client.upsert(
            collection_name=self.collection_name,
            points=points,
        )

    def search_for_tenant(self, tenant_id, query_vector, limit=10):
        """Search with tenant isolation filter."""
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            query_filter=Filter(
                must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]
            ),
            limit=limit,
        )
        return results


# Cost model:
# Single collection shared across all tenants
# Cost: 1 index for all data (e.g., $100/month for 1B vectors across 1000 tenants)
# Per-tenant cost: $0.10/month (amortized)
db = SharedMultiTenantVectorDB(client)
# All data in one collection, but filtered by tenant_id at query time
db.upsert_for_tenant("tenant_a", points_a)
db.upsert_for_tenant("tenant_b", points_b)
results_a = db.search_for_tenant("tenant_a", query)  # Only tenant_a results
results_b = db.search_for_tenant("tenant_b", query)  # Only tenant_b results
Isolation comparison:
| Isolation Type | Storage Cost | Metadata Filtering | Security | Query Latency |
|---|---|---|---|---|
| Separate collections | High (separate indexes) | Native, fast | Excellent | Fast (small index) |
| Shared collection + filter | Low (1 index) | Via payload filter | Good (at query time) | Slower (filter + search) |
| Separate databases | Very high (separate clusters) | Native | Excellent | Fast |
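Recommendation: For a small number of tenants or strict isolation requirements, use separate collections. For thousands of tenants, use a shared collection with payload filtering, because per-collection overhead grows with the number of collections.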
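The isolation guarantee of the shared-collection approach rests entirely on the filter being applied on every query. The sketch below illustrates those semantics with a brute-force, in-memory search in plain Python; it is not how a vector database executes filtered search internally, and the data layout and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_for_tenant(points, tenant_id, query, limit=10):
    """Filter by tenant first, then rank only that tenant's points by similarity."""
    candidates = [p for p in points if p["payload"].get("tenant_id") == tenant_id]
    ranked = sorted(candidates, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return ranked[:limit]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"tenant_id": "tenant_a"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"tenant_id": "tenant_b"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"tenant_id": "tenant_a"}},
]

results = search_for_tenant(points, "tenant_a", [1.0, 0.0], limit=2)
print([p["id"] for p in results])  # [1, 3]
```

Point 2 is the second-closest vector overall but never appears, because it belongs to another tenant. A test of this kind against the real database is a cheap guard against a forgotten filter.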
Cost Reduction Strategy 5: Batch Ingestion and Scheduled Indexing
Instead of real-time ingestion (expensive re-indexing), batch vectors and re-index nightly:
import glob
import os
import time
import uuid

import pandas as pd
import schedule
from qdrant_client.models import OptimizersConfigDiff, PointStruct


class BatchIngestionVectorDB:
    def __init__(self, client, batch_dir):
        self.client = client
        self.batch_dir = batch_dir

    def ingest_async(self, vectors, metadata):
        """Queue vectors for batch ingestion (async)."""
        batch_id = str(uuid.uuid4())
        # Write to disk; metadata maps column name -> list with one value per vector
        df = pd.DataFrame({"vector": list(vectors), **metadata})
        df.to_parquet(f"{self.batch_dir}/{batch_id}.parquet")
        return batch_id  # Acknowledge immediately

    def process_batch(self):
        """Process all queued batches (run nightly)."""
        # Disable indexing during load (ef_construct=0 does not do this;
        # an indexing threshold of 0 does)
        self.client.update_collection(
            collection_name="documents",
            optimizers_config=OptimizersConfigDiff(indexing_threshold=0),
        )
        try:
            # Bulk load all batches
            for batch_file in glob.glob(f"{self.batch_dir}/*.parquet"):
                df = pd.read_parquet(batch_file)
                points = [
                    PointStruct(
                        id=str(uuid.uuid4()),  # row indexes would collide across batches
                        vector=[float(x) for x in row["vector"]],
                        payload=row.drop("vector").to_dict(),  # remaining metadata columns
                    )
                    for _, row in df.iterrows()
                ]
                self.client.upsert(collection_name="documents", points=points)
                # Delete batch file
                os.remove(batch_file)
        finally:
            # Re-enable indexing even if a batch fails (20000 is Qdrant's default threshold)
            self.client.update_collection(
                collection_name="documents",
                optimizers_config=OptimizersConfigDiff(indexing_threshold=20000),
            )


# Cost model:
# Real-time ingestion: ~100K vec/sec, with repeated re-indexing multiplying the cost
# Batch ingestion: ~1M vec/sec (10x faster), single index rebuild per day
# Daily cost: same, but throughput 10x higher -> cost per vector 10x lower
db = BatchIngestionVectorDB(client, "/tmp/vector_batches")
# Users ingest vectors asynchronously (immediate return)
db.ingest_async(vectors, metadata)
# Nightly batch process (low-cost, off-peak)
schedule.every().day.at("02:00").do(db.process_batch)
while True:
    schedule.run_pending()  # the schedule only fires while this loop is running
    time.sleep(60)
Cost impact: Batch ingestion can reduce write costs by 50–70% by amortizing indexing.
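When a full nightly cycle is too slow for the product, a middle ground is to buffer writes in memory and flush them in fixed-size batches. The sketch below shows that pattern in plain Python; `BatchBuffer` and its `flush_fn` callback are hypothetical names, and in a real system `flush_fn` would wrap the database's bulk upsert call.

```python
class BatchBuffer:
    """Accumulate points and hand them to flush_fn in batches of up to max_points."""

    def __init__(self, flush_fn, max_points=1000):
        self.flush_fn = flush_fn
        self.max_points = max_points
        self.pending = []
        self.flush_count = 0

    def add(self, point):
        self.pending.append(point)
        if len(self.pending) >= self.max_points:
            self.flush()

    def flush(self):
        """Send whatever is pending; safe to call when the buffer is empty."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.flush_fn(batch)
        self.flush_count += 1


# Stand-in for a bulk upsert: collect everything that was written
written = []
buffer = BatchBuffer(written.extend, max_points=3)

for i in range(7):
    buffer.add({"id": i})
buffer.flush()  # flush the final partial batch, e.g. on shutdown or on a timer

print(f"points written: {len(written)}, write requests: {buffer.flush_count}")
```

Seven single-point writes become three requests here; the ratio improves as `max_points` grows, at the cost of higher ingest latency.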
Comparing Cost Models
Scenario: SaaS RAG system with 100M vectors, 1000 QPS, 100 tenants.
| Model | Monthly Cost | Cost per Tenant |
|---|---|---|
| Managed (Pinecone), 1 shared collection | $10 (storage) + $260 (queries) = $270 | $2.70 |
| Managed (Pinecone), 1 per tenant | 100 × $100 = $10k | $100 |
| Self-hosted (Milvus), shared cluster | $5k (infrastructure) | $50 |
| Self-hosted + Quantization | $2k (half storage) | $20 |
| Self-hosted + Batch + Tiering | $1.5k (optimized) | $15 |
Conclusion: Smart architecture (shared, quantized, batched) reduces costs 5–10x.
Key Takeaways
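- Quantization (INT8) cuts storage costs by 75% with under 5% recall loss; PQ cuts them by 95%+ at the price of 10–15% recall loss.
- Caching reduces query volume in proportion to the hit rate (a 20% hit rate saves 20% of query costs), and batching amortizes per-request overhead.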
- Shared multi-tenant collections (with payload filtering) reduce per-tenant costs 10–100x.
- Batch ingestion nightly (vs. real-time) reduces write costs 50–70%.
- Self-hosted is cheaper at 100B+ scale; managed is cheaper for startups (< 1B vectors).
Frequently Asked Questions
Should I use Pinecone or self-hosted to minimize cost?
Below 100M vectors: Pinecone (minimal ops cost). 100M–1B vectors: Pinecone if you value simplicity, or self-hosted if an ops team is available. Above 1B vectors: Self-hosted (Milvus) can be 50–70% cheaper.
How much can I save by multi-tenancy?
If you currently host a separate index per tenant, switching to a shared collection can reduce storage costs 10–50x and query costs 5–10x, depending on tenant size distribution.
Is payload filtering slower than separate collections?
Yes, typically around 10–20% slower, because payload filtering requires evaluating the filter predicate during search. If speed is critical, use separate collections. If cost is critical, pay the latency penalty.
How do I estimate cost for a new workload?
- Estimate vectors: 1M docs × 5 chunks = 5M vectors.
- Estimate QPS: 100 users × 10 queries/day = 1k queries/day ≈ 0.01 QPS.
- Pinecone: 5M × $0.10 / 1M + 1k × 30 days × $0.00010 / 1000 = $0.50 + $0.003 ≈ $0.50/month.
Multiply the baseline by a scale factor as you grow.
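The sizing steps above translate into a short script that starts from product-level numbers (documents, chunks, users) rather than vectors and QPS. It is a sketch under the same illustrative rates used throughout this article and a 30-day month; the function names, defaults, and scale factors are assumptions for demonstration.

```python
def size_workload(num_docs, chunks_per_doc, users, queries_per_user_per_day):
    """Turn product-level numbers into vector count, daily queries, and average QPS."""
    vectors = num_docs * chunks_per_doc
    queries_per_day = users * queries_per_user_per_day
    qps = queries_per_day / 86_400  # seconds per day
    return vectors, queries_per_day, qps


def monthly_cost(vectors, queries_per_day,
                 storage_per_million=0.10,
                 read_price_per_thousand=0.00010,
                 days=30):
    """Storage plus read cost per month under the assumed per-unit rates."""
    storage = vectors / 1_000_000 * storage_per_million
    reads = queries_per_day * days / 1_000 * read_price_per_thousand
    return storage + reads


vectors, queries_per_day, qps = size_workload(
    num_docs=1_000_000, chunks_per_doc=5, users=100, queries_per_user_per_day=10
)
print(f"vectors={vectors:,} queries/day={queries_per_day:,} qps={qps:.4f}")

# Project the baseline forward by scaling both data volume and traffic
for factor in (1, 10, 100):
    cost = monthly_cost(vectors * factor, queries_per_day * factor)
    print(f"{factor:>3}x scale: ${cost:,.2f}/month")
```

Because both terms are linear in this pricing model, cost scales directly with the growth factor; any per-index minimums or tier changes would break that linearity and should be added where they apply.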