
Vector DB Cost Optimization and Multi-Tenancy

Vector database costs can spiral quickly. A 1-billion vector index on Pinecone costs $2.6k/month at 100 QPS. The same workload on self-hosted Milvus costs $5k/month in infrastructure. For SaaS products, hosting a separate vector database per tenant is prohibitively expensive. Understanding cost drivers and designing efficient multi-tenant architectures can reduce spending by 50% or more while maintaining performance and security isolation.

Cost Drivers: Where Money Goes

Managed (Pinecone):

  • Per-vector storage: $0.10 per month per 1M vectors.
  • Query throughput: $0.00010 per 1,000 reads.
  • Write operations: Usually included or cheaper than reads.
  • Metadata filtering: Included, no extra cost.

Example: 100M vectors at 100 QPS (8.6M queries/day, ~260M queries/month).

  • Storage: 100M × $0.10 / 1M = $10/month.
  • Reads: 260M × $0.00010 / 1000 = $26/month.
  • Total: ~$36/month (surprisingly cheap, but scales quickly).

100B vectors at 10k QPS (~26B queries/month):

  • Storage: 100B × $0.10 / 1M = $10k/month.
  • Reads: 26B × $0.00010 / 1000 = $2.6k/month.
  • Total: $12.6k/month (expensive!).
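The arithmetic above can be wrapped in a small helper. This is a minimal sketch that assumes the per-vector and per-read rates quoted in this section and a 30-day month; the function name and parameters are illustrative, not part of any vendor SDK, and real pricing should be checked against the provider's current price list.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month, an approximation


def managed_monthly_cost(num_vectors, qps,
                         storage_per_million=0.10,
                         price_per_1k_reads=0.00010):
    """Return (storage, reads, total) in dollars per month.

    The default rates are the illustrative figures from this article.
    """
    storage = num_vectors / 1_000_000 * storage_per_million
    queries = qps * SECONDS_PER_MONTH
    reads = queries / 1_000 * price_per_1k_reads
    return storage, reads, storage + reads


# The 100M-vector, 100 QPS example from the text (about $36/month)
storage, reads, total = managed_monthly_cost(100_000_000, 100)
print(f"storage=${storage:.2f} reads=${reads:.2f} total=${total:.2f}")
```

Keeping the rates as parameters makes it easy to rerun the estimate when pricing changes.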

Self-hosted (on Kubernetes):

  • Compute: Milvus requires CPU + RAM for indexing and search. 16 CPU + 128 GB RAM ~ $2k/month per node (AWS EC2 on-demand).
  • Storage: EBS volumes. ~$0.10 per GB per month. 1.5 TB index = $150/month (but you need redundancy, so 3 replicas = $450/month).
  • Network: Egress is expensive. $0.12 per GB beyond free tier. 1 TB egress/month = $120/month.
  • Total for 1B vectors: 10 nodes (100M vectors each) × roughly $2.5k/month per node (compute plus a share of storage and egress) = $25k/month.

Conclusion: Self-hosted is cheaper at 100B+ scale but requires operational overhead.
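A comparable sketch for the self-hosted line items, assuming the per-node, per-GB, and egress figures listed above. The function and parameter names are hypothetical, and the result is very sensitive to the per-node price, so treat the output as a rough estimate rather than a quote.

```python
import math


def self_hosted_monthly_cost(num_vectors, index_gb, egress_gb,
                             vectors_per_node=100_000_000,
                             node_cost=2_000.0,
                             replicas=3,
                             storage_per_gb=0.10,
                             egress_per_gb=0.12):
    """Break a self-hosted bill into compute, storage, and egress (dollars/month)."""
    nodes = math.ceil(num_vectors / vectors_per_node)
    compute = nodes * node_cost
    storage = index_gb * replicas * storage_per_gb  # every replica stores the full index
    egress = egress_gb * egress_per_gb
    return {
        "nodes": nodes,
        "compute": compute,
        "storage": storage,
        "egress": egress,
        "total": compute + storage + egress,
    }


# 1B vectors, 1.5 TB index, 1 TB egress per month
estimate = self_hosted_monthly_cost(1_000_000_000, index_gb=1_500, egress_gb=1_000)
for item, value in estimate.items():
    print(item, value)
```

Compute dominates the total, which is why node count and instance pricing matter far more than disk or network in this model.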

Cost Reduction Strategy 1: Quantization

Quantization reduces storage and query costs dramatically:

| Quantization | Storage (1B vectors, 384-dim) | Pinecone Storage Cost | Speedup | Recall Loss |
|---|---|---|---|---|
| Float32 | 1.5 TB | $150/month | Baseline | 0% |
| INT8 | 375 GB | $37.50/month | 2–3x | < 5% |
| PQ (8-byte) | 8 GB | $0.80/month | 5–10x | 10–15% |

Switching 1B vectors from float32 to INT8 saves $112.50/month (75% reduction). For 10B vectors, savings are about $1.1k/month.
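The storage column follows directly from vector count, dimensionality, and bytes per component. A minimal sketch of that calculation, assuming raw vector data only (no index graph or payload overhead) and decimal gigabytes; the helper name is illustrative.

```python
def index_size_gb(num_vectors, dim, bytes_per_component=4, pq_bytes_per_vector=None):
    """Raw vector storage in decimal GB.

    For product quantization, each vector is stored as a fixed-size code,
    so the size no longer depends on the dimensionality.
    """
    if pq_bytes_per_vector is not None:
        total_bytes = num_vectors * pq_bytes_per_vector
    else:
        total_bytes = num_vectors * dim * bytes_per_component
    return total_bytes / 1e9


n, dim = 1_000_000_000, 384
sizes = {
    "float32": index_size_gb(n, dim, bytes_per_component=4),
    "int8": index_size_gb(n, dim, bytes_per_component=1),
    "pq8": index_size_gb(n, dim, pq_bytes_per_vector=8),
}
for name, gb in sizes.items():
    print(f"{name}: {gb:,.0f} GB")
```

The exact float32 figure is 1,536 GB, which the table rounds to 1.5 TB (and the INT8 row to a quarter of that).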

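# Enable scalar (INT8) quantization in Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client = QdrantClient(url="http://localhost:6333")  # assumed local instance; adjust to your deployment

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=False)
    ),
)

# Cost: same ingestion; quantized vectors are about 4x smaller than float32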
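To make the 4x figure concrete, here is a minimal pure-Python sketch of scalar quantization: each float component is mapped linearly onto one of 256 integer levels, so it needs 1 byte instead of 4. This illustrates the general idea only and is not how Qdrant (or any particular engine) implements it; the function names are hypothetical.

```python
def quantize_int8(vec):
    """Map floats linearly onto the int8 range [-128, 127]."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = [round((x - lo) / scale) - 128 for x in vec]
    return codes, lo, scale


def dequantize_int8(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [(c + 128) * scale + lo for c in codes]


vec = [0.0, 0.5, 1.0, -1.0]
codes, lo, scale = quantize_int8(vec)
restored = dequantize_int8(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, f"max reconstruction error: {max_error:.4f}")
```

The reconstruction error per component is bounded by half a quantization step, which is why recall typically drops only slightly.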

Cost Reduction Strategy 2: Batch Queries and Caching

Cache repeated queries so they never reach the database, and issue independent queries together as a batch:

from concurrent.futures import ThreadPoolExecutor
import functools

from qdrant_client import QdrantClient


class CachingVectorClient:
    def __init__(self, client):
        self.client = client
        self.cache = {}  # (query vector as tuple, limit) -> results

    def search(self, query_vector, limit=10, use_cache=True):
        """Search with exact-match caching."""
        # Include limit in the key so a cached top-5 is never
        # returned for a top-10 request
        key = (tuple(query_vector), limit)

        if use_cache and key in self.cache:
            return self.cache[key]

        # Cache miss: query the database
        results = self.client.search(
            collection_name="documents",
            query_vector=query_vector,
            limit=limit,
        )

        # Cache result
        self.cache[key] = results
        return results

    def batch_search(self, query_vectors, limit=10):
        """
        Batch search: amortize query cost across multiple queries.
        Some DBs support batch operations; if not, parallelize.
        """
        with ThreadPoolExecutor(max_workers=8) as executor:
            return list(executor.map(
                functools.partial(self.search, limit=limit, use_cache=True),
                query_vectors,
            ))


# Usage: cache reduces redundant queries
qdrant_client = QdrantClient(url="http://localhost:6333")  # assumed local instance
client = CachingVectorClient(qdrant_client)

# Same query twice -> second is free (cached)
results1 = client.search([0.1, 0.2, 0.3, ...])
results2 = client.search([0.1, 0.2, 0.3, ...])  # Cache hit

Cost impact: Cache hit rate of 20% reduces query volume by 20%, saving 20% of query costs.
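An unbounded dictionary grows without limit, so in practice the cache usually needs an eviction policy and a way to measure the hit rate that drives the savings. The following is a minimal sketch of a bounded LRU cache with hit-rate tracking; the class name, the `compute` callback, and the rounding of vector components to six decimals are all illustrative assumptions.

```python
from collections import OrderedDict


class LRUQueryCache:
    """Bounded exact-match cache keyed on (rounded query vector, limit)."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query_vector, limit, compute):
        key = (tuple(round(x, 6) for x in query_vector), limit)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(query_vector, limit)
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return result

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


db_calls = []


def fake_search(vector, limit):
    """Stand-in for a real database query; records each call."""
    db_calls.append(tuple(vector))
    return [f"hit-{i}" for i in range(limit)]


cache = LRUQueryCache(max_entries=2)
for vector in ([0.1, 0.2], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.1, 0.2]):
    cache.get_or_compute(vector, 3, fake_search)
print(f"database calls: {len(db_calls)}, hit rate: {cache.hit_rate:.0%}")
```

Under per-read pricing, the measured hit rate translates directly into the fraction of query cost avoided. Exact-match caching only helps when identical vectors recur, for example popular queries embedded deterministically.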

Cost Reduction Strategy 3: Tiered Storage

Store hot data (frequently accessed) in the main index; cold data (archival) in cheaper storage:

import numpy as np
from scipy.spatial.distance import cdist


class TieredVectorStorage:
    def __init__(self, hot_client, cold_storage_s3):
        self.hot = hot_client  # Qdrant, with expensive storage
        self.cold = cold_storage_s3  # S3 client, with cheap storage

    def search(self, query_vector, limit=10):
        """Search hot tier first, fall back to cold if needed."""
        # Search hot tier. Qdrant returns scored points; with cosine
        # distance (assumed here), a higher .score is a better match.
        hot_hits = self.hot.search(
            collection_name="documents_hot",
            query_vector=query_vector,
            limit=limit,
        )
        results = [{"id": p.id, "score": p.score} for p in hot_hits]

        # If the hot tier returned fewer than `limit` hits, search the cold tier
        if len(results) < limit:
            # Retrieve cold data from S3 (slower)
            results += self._search_cold(query_vector, limit=limit - len(results))
            # Merge and re-rank: both tiers report similarity, so sort descending
            results.sort(key=lambda r: r["score"], reverse=True)

        return results[:limit]

    def _search_cold(self, query_vector, limit):
        """Search cold storage (S3). Much slower, but cheaper."""
        # Load cold vectors from S3 and brute-force search them
        # (Simplified: assume cold_vectors is an (n, dim) array)
        cold_vectors = self._load_from_s3()
        distances = cdist([query_vector], cold_vectors, metric="cosine")[0]

        top_ids = np.argsort(distances)[:limit]
        # Convert cosine distance to similarity to match the hot tier's scores
        return [{"id": int(i), "score": 1.0 - float(distances[i])} for i in top_ids]

    def _load_from_s3(self):
        """Placeholder: fetch and deserialize the archived vectors."""
        raise NotImplementedError("storage format is deployment-specific")


# Usage
tiered = TieredVectorStorage(hot_qdrant, cold_s3)

# Old documents are archived to S3 (cold tier);
# serialize() stands in for your own serialization helper
tiered.cold.put_object(
    Bucket="cold-vectors",
    Key="documents/2024-01-01.parquet",
    Body=serialize(old_vectors),
)

# Recent documents are in hot tier (Qdrant)
# Cost: hot tier is expensive but small (1M vectors = $10/month)
# Cold tier is cheap (100M vectors archived = $1/month)
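The savings from tiering come from the gap between the two per-vector rates. A minimal sketch of the comparison, assuming the illustrative rates in the comments above ($10 per 1M hot vectors, $1 per 100M cold vectors); the function name is hypothetical.

```python
def tiered_monthly_cost(hot_vectors, cold_vectors,
                        hot_per_million=10.0,
                        cold_per_million=0.01):
    """Return (hot, cold, total) storage cost in dollars per month."""
    hot = hot_vectors / 1_000_000 * hot_per_million
    cold = cold_vectors / 1_000_000 * cold_per_million
    return hot, cold, hot + cold


# 1M recent vectors kept hot, 100M archived
hot, cold, tiered_total = tiered_monthly_cost(1_000_000, 100_000_000)

# The same 101M vectors kept entirely in the hot tier
_, _, all_hot_total = tiered_monthly_cost(101_000_000, 0)

print(f"tiered: ${tiered_total:.2f}/month, all hot: ${all_hot_total:.2f}/month")
```

The trade-off is latency: any query that falls through to the cold tier pays for an S3 fetch and a brute-force scan.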

Cost Reduction Strategy 4: Multi-Tenancy and Resource Sharing

For SaaS, hosting separate vector DBs per tenant is expensive. Instead, share a single database across tenants with logical isolation:

Approach 1: Collections per tenant (simple isolation):

from qdrant_client.models import Distance, VectorParams


class MultiTenantVectorDB:
    def __init__(self, client):
        self.client = client

    def get_tenant_collection(self, tenant_id):
        """Get or create collection for tenant."""
        collection_name = f"documents_tenant_{tenant_id}"

        # Create if not exists (an explicit check avoids a bare except
        # that would also swallow network and auth errors)
        if not self.client.collection_exists(collection_name):
            self.client.create_collection(
                collection_name=collection_name,
                vectors_config=VectorParams(size=384, distance=Distance.COSINE),
            )

        return collection_name

    def upsert_for_tenant(self, tenant_id, points):
        """Upsert points for a tenant."""
        collection = self.get_tenant_collection(tenant_id)
        self.client.upsert(collection_name=collection, points=points)

    def search_for_tenant(self, tenant_id, query_vector, limit=10):
        """Search within tenant's collection."""
        collection = self.get_tenant_collection(tenant_id)
        return self.client.search(
            collection_name=collection,
            query_vector=query_vector,
            limit=limit,
        )


# Cost model:
# - Pinecone: $10/month per tenant (small) + query costs
# - Self-hosted: 1 shared cluster for all tenants (amortized)

db = MultiTenantVectorDB(client)

# Tenant A and B each have separate collections
db.upsert_for_tenant("tenant_a", points_a)
db.upsert_for_tenant("tenant_b", points_b)

# Queries are isolated per collection
results_a = db.search_for_tenant("tenant_a", query)
results_b = db.search_for_tenant("tenant_b", query)

Approach 2: Shared collection with payload-based isolation (compact):

from qdrant_client.models import FieldCondition, Filter, MatchValue


class SharedMultiTenantVectorDB:
    def __init__(self, client):
        self.client = client
        self.collection_name = "documents_multi_tenant"

    def upsert_for_tenant(self, tenant_id, points):
        """Upsert with tenant_id in payload."""
        for point in points:
            # payload may be None on points created without metadata
            point.payload = {**(point.payload or {}), "tenant_id": tenant_id}

        self.client.upsert(
            collection_name=self.collection_name,
            points=points,
        )

    def search_for_tenant(self, tenant_id, query_vector, limit=10):
        """Search with tenant isolation filter."""
        return self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            # Always set the filter server-side; never accept it from the caller
            query_filter=Filter(
                must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]
            ),
            limit=limit,
        )


# Cost model:
# Single collection shared across all tenants
# Cost: 1 index for all data (e.g., $100/month for 1B vectors across 1000 tenants)
# Per-tenant cost: $0.10/month (amortized)

db = SharedMultiTenantVectorDB(client)

# All data in one collection, but filtered by tenant_id at query time
db.upsert_for_tenant("tenant_a", points_a)
db.upsert_for_tenant("tenant_b", points_b)

results_a = db.search_for_tenant("tenant_a", query)  # Only tenant_a results
results_b = db.search_for_tenant("tenant_b", query)  # Only tenant_b results
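The isolation guarantee rests entirely on the filter being applied before ranking. A minimal in-memory sketch of that logic, using brute-force cosine similarity in place of a real index; the data layout and function names are assumptions made for illustration and do not reflect any database's internals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_for_tenant(points, tenant_id, query_vector, limit=10):
    """Filter by tenant first, then rank only the tenant's own points."""
    candidates = [p for p in points if p["payload"].get("tenant_id") == tenant_id]
    ranked = sorted(candidates,
                    key=lambda p: cosine(query_vector, p["vector"]),
                    reverse=True)
    return ranked[:limit]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"tenant_id": "tenant_a"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"tenant_id": "tenant_b"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"tenant_id": "tenant_a"}},
]

hits = search_for_tenant(points, "tenant_a", [1.0, 0.0], limit=2)
print([p["id"] for p in hits])
```

Point 2 is a close match to the query but belongs to another tenant, so it never enters the candidate set. Filtering after ranking would risk both leaking data and returning fewer than `limit` results.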

Isolation comparison:

| Isolation Type | Storage Cost | Metadata Filtering | Security | Query Latency |
|---|---|---|---|---|
| Separate collections | High (separate indexes) | Native, fast | Excellent | Fast (small index) |
| Shared collection + filter | Low (1 index) | Via payload filter | Good (at query time) | Slower (filter + search) |
| Separate databases | Very high (separate clusters) | Native | Excellent | Fast |

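Recommendation: For many small tenants (hundreds to thousands or more), use a shared collection with a tenant filter, because per-collection overhead grows with tenant count. Reserve separate collections for a small number of large tenants or for high-security requirements.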

Cost Reduction Strategy 5: Batch Ingestion and Scheduled Indexing

Instead of real-time ingestion (expensive re-indexing), batch vectors and re-index nightly:

import glob
import os
import uuid

import pandas as pd
import schedule
from qdrant_client.models import HnswConfigDiff, PointStruct


class BatchIngestionVectorDB:
    def __init__(self, client, batch_dir):
        self.client = client
        self.batch_dir = batch_dir

    def ingest_async(self, vectors, metadata):
        """Queue vectors for batch ingestion (async)."""
        batch_id = str(uuid.uuid4())

        # Write to disk
        df = pd.DataFrame({"vector": vectors, **metadata})
        df.to_parquet(f"{self.batch_dir}/{batch_id}.parquet")

        return batch_id  # Acknowledge immediately

    def process_batch(self):
        """Process all queued batches (run nightly)."""
        # Disable HNSW graph building during load (m=0 turns it off)
        self.client.update_collection(
            collection_name="documents",
            hnsw_config=HnswConfigDiff(m=0),
        )
        try:
            # Bulk load all batches
            for batch_file in glob.glob(f"{self.batch_dir}/*.parquet"):
                df = pd.read_parquet(batch_file)

                # Use fresh UUIDs as IDs: row indexes restart at 0 in every
                # batch file and would overwrite previously loaded points
                points = [
                    PointStruct(
                        id=str(uuid.uuid4()),
                        vector=[float(x) for x in row["vector"]],
                        payload={...},  # fill in your payload fields
                    )
                    for _, row in df.iterrows()
                ]

                self.client.upsert(collection_name="documents", points=points)

                # Delete batch file
                os.remove(batch_file)
        finally:
            # Re-enable indexing even if a batch failed (m=16 is Qdrant's default)
            self.client.update_collection(
                collection_name="documents",
                hnsw_config=HnswConfigDiff(m=16, ef_construct=200),
            )


# Cost model:
# Real-time ingestion: ~100K vec/sec; cost is multiplied by repeated re-indexing
# Batch ingestion: ~1M vec/sec (10x faster), single index rebuild per day
# Daily cost: same, but throughput 10x higher -> cost per vector 10x lower

db = BatchIngestionVectorDB(client, "/tmp/vector_batches")

# Users ingest vectors asynchronously (immediate return)
db.ingest_async(vectors, metadata)

# Nightly batch process (low-cost, off-peak).
# The job only fires while a loop calls schedule.run_pending().
schedule.every().day.at("02:00").do(db.process_batch)

Cost impact: Batch ingestion can reduce write costs by 50–70% by amortizing indexing.
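The same amortization idea applies on the write path of an application even without a nightly job: buffer incoming items and hand them to the database in fixed-size batches. A minimal sketch, where `flush_fn` is a hypothetical callback standing in for a bulk upsert.

```python
class BatchBuffer:
    """Collect items and pass them to flush_fn in batches of max_size."""

    def __init__(self, flush_fn, max_size=1000):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self._pending = []

    def add(self, item):
        self._pending.append(item)
        if len(self._pending) >= self.max_size:
            self.flush()

    def flush(self):
        """Send whatever is pending; safe to call when empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self.flush_fn(batch)


calls = []  # each entry is one simulated bulk write
buffer = BatchBuffer(flush_fn=calls.append, max_size=3)
for i in range(7):
    buffer.add(i)
buffer.flush()  # drain the remainder, e.g. on shutdown or on a timer

print(f"{len(calls)} bulk writes instead of 7 single writes: {calls}")
```

A production version would also flush on a time interval so that a slow trickle of writes does not sit in memory indefinitely.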

Comparing Cost Models

Scenario: SaaS RAG system with 100M vectors, 1000 QPS, 100 tenants.

| Model | Monthly Cost | Cost per Tenant |
|---|---|---|
| Managed (Pinecone), 1 shared collection | $10 (storage) + $100 (QPS) = $110 | $1.10 |
| Managed (Pinecone), 1 per tenant | $10k (100 × $100) | $100 |
| Self-hosted (Milvus), shared cluster | $5k (infrastructure) | $50 |
| Self-hosted + Quantization | $2k (half storage) | $20 |
| Self-hosted + Batch + Tiering | $1.5k (optimized) | $15 |

Conclusion: Smart architecture (shared, quantized, batched) reduces costs 5–10x.

Key Takeaways

  • Quantization (INT8) cuts storage costs by 75% with under 5% recall loss; PQ cuts them by 95%+ at the price of 10–15% recall loss.
  • Caching and batching reduce query/write volume by 20–50%, cutting egress costs.
  • Shared multi-tenant collections (with payload filtering) reduce per-tenant costs 10–100x.
  • Batch ingestion nightly (vs. real-time) reduces write costs 50–70%.
  • Self-hosted is cheaper at 100B+ scale; managed is cheaper for startups (< 1B vectors).

Frequently Asked Questions

Should I use Pinecone or self-hosted to minimize cost?

Below 100M vectors: Pinecone (minimal ops cost). 100M–1B vectors: Pinecone if you value simplicity, or self-hosted if an ops team is available. Above 1B vectors: Self-hosted (Milvus) is 50–70% cheaper.

How much can I save by multi-tenancy?

If you currently host a separate index per tenant, switching to a shared collection can reduce storage costs 10–50x and query costs 5–10x, depending on tenant size distribution.

Is payload filtering slower than separate collections?

Yes, 10–20% slower. Payload filtering requires evaluating the filter predicate during search. If speed is critical, use separate collections. If cost is critical, pay the latency penalty.

How do I estimate cost for a new workload?

  1. Estimate vectors: 1M docs × 5 chunks = 5M vectors.
  2. Estimate QPS: 100 users × 10 queries/day = 1k queries/day = 0.01 QPS.
  3. Pinecone: 5M × $0.10 / 1M + (1k queries/day × 30 days) × $0.00010 / 1000 = $0.50 + $0.003 ≈ $0.50/month.

Multiply baseline by scale factor as you grow.
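The three estimation steps can be sketched as a single function that goes from product-level inputs (documents, users) to a monthly figure. The rates default to the illustrative ones used in this article and the function name is hypothetical; substitute your provider's actual pricing.

```python
def estimate_workload(docs, chunks_per_doc, users, queries_per_user_per_day,
                      storage_per_million=0.10,
                      price_per_1k_reads=0.00010,
                      days=30):
    """Turn product-level inputs into vector count, QPS, and monthly cost."""
    vectors = docs * chunks_per_doc
    queries_per_day = users * queries_per_user_per_day
    storage = vectors / 1_000_000 * storage_per_million
    reads = queries_per_day * days / 1_000 * price_per_1k_reads
    return {
        "vectors": vectors,
        "queries_per_day": queries_per_day,
        "qps": queries_per_day / 86_400,  # seconds per day
        "storage": storage,
        "reads": reads,
        "total": storage + reads,
    }


est = estimate_workload(docs=1_000_000, chunks_per_doc=5,
                        users=100, queries_per_user_per_day=10)
for name, value in est.items():
    print(f"{name}: {value}")
```

At this scale storage dominates and query cost is negligible; rerunning the function with projected user counts shows when that balance shifts.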

Further Reading