Vector Database Backup and Disaster Recovery
A vector database often holds some of your most valuable data: potentially billions of embeddings representing your entire knowledge base or user corpus. Loss or corruption could halt your RAG system, recommendation engine, or search product for days. A backup and disaster recovery (DR) strategy is built around two targets: the Recovery Point Objective (RPO, the acceptable data loss) and the Recovery Time Objective (RTO, the acceptable downtime). Poor DR planning can cause serious financial and reputational damage.
RPO and RTO: Defining Your SLA
RPO (Recovery Point Objective): Maximum acceptable data loss, measured in time. If RPO is 1 hour, you can afford to lose up to 1 hour of writes. If RPO is 1 minute, changes must be captured (by backup or log shipping) at least once a minute.
RTO (Recovery Time Objective): Maximum acceptable downtime. If RTO is 1 hour, you must restore service within 1 hour of a failure.
Example SLAs:
| Use Case | RPO | RTO |
|---|---|---|
| Non-critical prototype | 24 hours | 24 hours |
| Standard production RAG | 1 hour | 4 hours |
| Customer-facing search | 15 minutes | 1 hour |
| High-availability trading system | 1 minute | 15 minutes |
| Mission-critical (healthcare, finance) | < 1 minute | < 10 minutes |
Tighter RPO/RTO requires more expensive, complex infrastructure.
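As a rough illustration of how these targets become checks, the sketch below tests whether a backup schedule and a measured restore time satisfy a given RPO and RTO. The function name `meets_sla` and its parameters are hypothetical, and the check assumes the worst case, where a failure loses one full interval between backups.

```python
from datetime import timedelta


def meets_sla(backup_interval, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a backup schedule.

    Assumes the worst case: a failure just before the next backup loses
    one full backup_interval of writes.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = restore_time <= rto
    return rpo_ok, rto_ok


# Daily snapshots with a 3-hour restore, checked against the
# "Standard production RAG" row (RPO 1 hour, RTO 4 hours).
result = meets_sla(
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # (False, True): the restore is fast enough, but backups are too infrequent
```

A check like this can run in CI against the measured results of each DR drill, so a drifting restore time is caught before it breaches the SLA.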
Backup Types and Strategies
Snapshot backups (point-in-time):
A snapshot is a complete copy of the database state at a specific moment. Typical frequency: daily, weekly.
Advantages:
- Simple to understand: one consistent state, easy to restore.
- No log replay needed: a restore returns the database to exactly the snapshot time.
- Easy to test: restore snapshot to a staging environment.
Limitations:
- RPO is large (if daily snapshots, RPO is 24 hours).
- Storage cost: multiple snapshots can consume 10x+ the database size.
- Restore latency: copying 10 TB snapshot across the network takes hours.
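The restore-latency point can be checked with simple arithmetic. The sketch below estimates copy time from snapshot size and link bandwidth; `transfer_hours` and its `efficiency` parameter are hypothetical names, and the estimate ignores disk throughput and protocol overhead unless you lower `efficiency`.

```python
def transfer_hours(size_tb, bandwidth_gbps, efficiency=1.0):
    """Estimate hours to copy size_tb terabytes over a bandwidth_gbps link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s).
    efficiency scales the nominal bandwidth to account for overhead.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (bandwidth_gbps * 1e9 * efficiency)
    return seconds / 3600


# A 10 TB snapshot over a fully utilized 10 Gbps link
print(f"{transfer_hours(10, 10):.2f} hours")  # 2.22 hours
```

Real transfers rarely saturate the link, so treat the result as a lower bound on the copy portion of your RTO.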
Continuous backups (WAL replication):
A Write-Ahead Log (WAL) records every mutation (upsert, delete). Replaying the WAL on top of a snapshot recovers all data up to the last logged operation.
Advantages:
- RPO is small: typically the 1–5 minute lag of log shipping.
- Storage cost: only log storage (10–20 KB per operation).
- Enables point-in-time recovery: restore to any point within retention window.
Limitations:
- More complex operationally: manage snapshots + WAL streams.
- Restore latency: replay WAL operations sequentially (slow for 1B+ operations).
- Requires robust WAL infrastructure (distributed logging, replication).
Hybrid strategy (recommended):
Daily snapshots + continuous WAL backup. Restore snapshot, then replay WAL to recover lost operations.
Example RTO: snapshot_restore_time (1–2 hours) + WAL_replay_time (15–30 min) = 1.25–2.5 hours RTO
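To make the restore-then-replay idea concrete, here is a toy, in-memory sketch. It is not how any particular database stores its WAL: the record layout, the integer timestamps, and the function name `restore_to` are all assumptions made for illustration.

```python
import copy


def restore_to(snapshot, wal, target_ts):
    """Rebuild state as of target_ts from a snapshot plus a WAL.

    snapshot: {"ts": int, "points": {point_id: vector}}
    wal: list of (ts, op, point_id, vector) tuples in timestamp order,
         where op is "upsert" or "delete".
    """
    points = copy.deepcopy(snapshot["points"])
    for ts, op, point_id, vector in wal:
        if ts <= snapshot["ts"]:
            continue  # already reflected in the snapshot
        if ts > target_ts:
            break  # stop at the requested point in time
        if op == "upsert":
            points[point_id] = vector
        elif op == "delete":
            points.pop(point_id, None)
    return points


snapshot = {"ts": 100, "points": {1: [0.1, 0.2], 2: [0.3, 0.4]}}
wal = [
    (90, "upsert", 2, [0.3, 0.4]),   # before the snapshot, skipped
    (110, "upsert", 3, [0.5, 0.6]),
    (120, "delete", 1, None),
    (130, "upsert", 2, [9.9, 9.9]),  # after the target time, not applied
]
state = restore_to(snapshot, wal, target_ts=125)
print(sorted(state))  # [2, 3]
```

Because replay is sequential, its cost grows with the number of operations since the last snapshot, which is why more frequent snapshots shorten the WAL portion of the RTO.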
Implementing Snapshots in Qdrant
Qdrant supports native snapshots:
from qdrant_client import QdrantClient
import boto3

client = QdrantClient("localhost", port=6333)

# Create a snapshot; the call returns a SnapshotDescription
snapshot = client.create_snapshot(collection_name="documents")
snapshot_name = snapshot.name
print(f"Snapshot created: {snapshot_name}")

# List snapshots
for snap in client.list_snapshots(collection_name="documents"):
    print(f"{snap.name} ({snap.size} bytes) created {snap.creation_time}")

# Upload the snapshot to S3
s3 = boto3.client("s3")

# Qdrant writes snapshots to its local snapshots directory. This path is
# deployment-specific; adjust it to match your snapshots_path setting.
with open(f"/var/lib/qdrant/snapshots/{snapshot_name}", "rb") as f:
    s3.upload_fileobj(
        f,
        Bucket="vector-db-backups",
        Key=f"qdrant/snapshots/{snapshot_name}",
    )
print("Snapshot uploaded to S3")
Automated daily snapshots:
import time
from datetime import datetime, timezone

import boto3
import schedule
from qdrant_client import QdrantClient

s3 = boto3.client("s3")

def backup_snapshot():
    client = QdrantClient("localhost", port=6333)
    snapshot = client.create_snapshot(collection_name="documents")
    # Upload to S3 under a date-based key
    s3.upload_file(
        Filename=f"/var/lib/qdrant/snapshots/{snapshot.name}",
        Bucket="vector-db-backups",
        Key=f"qdrant/daily/{datetime.now(timezone.utc).strftime('%Y-%m-%d')}.snap",
    )
    print(f"Backup complete: {snapshot.name}")

# Schedule a daily backup at 02:00 in the host's local time
# (this is 2 AM UTC only if the host clock is set to UTC)
schedule.every().day.at("02:00").do(backup_snapshot)

while True:
    schedule.run_pending()
    time.sleep(60)
Point-in-Time Recovery with WAL
Qdrant records every write in a WAL, which it replays automatically after a crash. The WAL is tuned through the storage section of the configuration:
# qdrant/config/production.yaml
storage:
  storage_path: "./storage"      # Collection data, including WAL segments
  snapshots_path: "./snapshots"
  wal:
    wal_capacity_mb: 200         # Size of a single WAL segment
    wal_segments_ahead: 0        # Number of WAL segments to create ahead of use
To recover to a point in time:
- Restore the latest snapshot before the desired time.
- Identify WAL files created after the snapshot.
- Replay WAL operations up to the desired timestamp.
from datetime import datetime, timezone

import boto3
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)
s3 = boto3.client("s3")

# Point in time to recover to
target_time = datetime(2026, 6, 2, 10, 0, 0, tzinfo=timezone.utc)  # 10 AM UTC

# 1. Download the latest snapshot taken before target_time (the 9 AM snapshot)
s3.download_file(
    Bucket="vector-db-backups",
    Key="qdrant/snapshots/2026-06-02-09-00-00.snap",
    Filename="/var/lib/qdrant/snapshots/restore.snap",
)

# 2. Delete the corrupted collection (only after the download has succeeded)
client.delete_collection(collection_name="documents")

# 3. Restore the snapshot. The location must be readable by the Qdrant server.
client.recover_snapshot(
    collection_name="documents",
    location="file:///var/lib/qdrant/snapshots/restore.snap",
)

# 4. Replay writes from 9 AM up to target_time.
# Qdrant replays its own WAL automatically on startup if the WAL files exist;
# to stop at a specific timestamp, re-apply operations from your application logs.
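The first recovery step, choosing the latest snapshot before the desired time, is easy to get wrong under pressure, so it is worth scripting. The sketch below is one way to do it; `latest_snapshot_before` is a hypothetical helper, and the snapshot keys other than the 9 AM one are made up for the example.

```python
from datetime import datetime, timezone


def latest_snapshot_before(snapshot_times, target_time):
    """Return the key of the newest snapshot taken at or before target_time.

    snapshot_times maps a backup key to the datetime the snapshot was taken.
    Returns None if no snapshot is old enough.
    """
    candidates = {k: t for k, t in snapshot_times.items() if t <= target_time}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


utc = timezone.utc
snapshots = {
    "qdrant/snapshots/2026-06-01-09-00-00.snap": datetime(2026, 6, 1, 9, tzinfo=utc),
    "qdrant/snapshots/2026-06-02-09-00-00.snap": datetime(2026, 6, 2, 9, tzinfo=utc),
    "qdrant/snapshots/2026-06-03-09-00-00.snap": datetime(2026, 6, 3, 9, tzinfo=utc),
}
target = datetime(2026, 6, 2, 10, tzinfo=utc)
chosen = latest_snapshot_before(snapshots, target)
print(chosen)
# qdrant/snapshots/2026-06-02-09-00-00.snap
```

Using timezone-aware datetimes throughout avoids comparing a local-time target against UTC snapshot timestamps.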
Cross-Region Replication for Disaster Recovery
For mission-critical systems, replicate backups across regions:
import boto3
import schedule

def replicate_backup_to_other_regions():
    """Copy the latest snapshot to multiple AWS regions."""
    source_region = "us-east-1"
    target_regions = ["eu-west-1", "ap-southeast-1"]
    source_s3 = boto3.client("s3", region_name=source_region)

    # Find the latest snapshot in the source bucket
    # (list_objects_v2 returns at most 1,000 keys per call)
    response = source_s3.list_objects_v2(
        Bucket="vector-db-backups",
        Prefix="qdrant/daily/",
    )
    contents = response.get("Contents", [])
    if not contents:
        print("No snapshots found; nothing to replicate")
        return
    latest_snapshot = max(contents, key=lambda obj: obj["LastModified"])

    # Copy to each target region
    for target_region in target_regions:
        target_s3 = boto3.client("s3", region_name=target_region)
        copy_source = {
            "Bucket": "vector-db-backups",
            "Key": latest_snapshot["Key"],
        }
        # copy() performs a managed multipart copy; copy_object is limited to 5 GB
        target_s3.copy(
            CopySource=copy_source,
            Bucket="vector-db-backups-" + target_region,
            Key=latest_snapshot["Key"],
            SourceClient=source_s3,
        )
        print(f"Snapshot replicated to {target_region}")

# Schedule replication every 6 hours
schedule.every(6).hours.do(replicate_backup_to_other_regions)
This ensures that if an entire AWS region fails, a recent backup is already available in another region, so the restore can start immediately instead of waiting for the failed region to recover.
Testing Disaster Recovery: Runbooks
A backup untested is a disaster waiting to happen. Schedule regular DR drills:
"""
Disaster Recovery Runbook: Full Cluster Failure
Objective: Restore documents collection from snapshot + WAL within RTO (4 hours).
Date: 2026-06-02 (monthly DR drill)
"""
import time

import boto3
from qdrant_client import QdrantClient

RTO_SLA_HOURS = 4

def dr_drill(test_queries, expected_count=1_000_000_000):
    """test_queries: list of dicts, each with an "embedding" vector."""
    start = time.time()

    # 1. Stop production cluster (intentionally simulate failure)
    # kubectl scale deployment qdrant --replicas=0

    # 2. Provision new cluster in standby region
    # terraform apply -var region=eu-west-1

    # 3. Download latest snapshot from backup
    s3 = boto3.client("s3")
    snapshot_key = "qdrant/daily/2026-06-02.snap"
    s3.download_file(
        Bucket="vector-db-backups",
        Key=snapshot_key,
        Filename="/tmp/restore.snap",
    )

    # 4. Restore snapshot to new cluster.
    # The location must be readable by the Qdrant server, so this assumes
    # the drill runs on the new cluster's host.
    client = QdrantClient("new-cluster.eu-west-1.internal", port=6333)
    client.recover_snapshot(
        collection_name="documents",
        location="file:///tmp/restore.snap",
    )

    # 5. Verify data integrity (default expected_count: 1B vectors)
    count = client.count(collection_name="documents").count
    assert count == expected_count, f"Count mismatch: {count} vs {expected_count}"

    # 6. Run sample searches
    for test_query in test_queries:
        results = client.search(
            collection_name="documents",
            query_vector=test_query["embedding"],
            limit=10,
        )
        assert len(results) == 10, "Search returned fewer results than expected"

    # 7. Validate consistency
    # Compare search results on restored cluster vs. original
    # within +/- 1% recall tolerance

    # 8. Measure RTO and document results
    elapsed_hours = (time.time() - start) / 3600
    status = "PASS" if elapsed_hours <= RTO_SLA_HOURS else "FAIL"
    print(f"DR drill complete: RTO = {elapsed_hours:.2f} hours (SLA: {RTO_SLA_HOURS} hours) {status}")

    # 9. Rollback to original cluster
    # kubectl scale deployment qdrant --replicas=3
Run this monthly. Document failures and refine the process.
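The consistency step of the drill (comparing search results between the restored and original clusters within a recall tolerance) can be expressed as a small pure function. The sketch below assumes you have already collected the top-k result IDs for each test query from both clusters; `recall_at_k` and `consistent` are hypothetical names, and the 0.99 threshold mirrors the 1% tolerance mentioned in the runbook.

```python
def recall_at_k(reference_ids, restored_ids):
    """Fraction of the reference result IDs that also appear in the restored results."""
    if not reference_ids:
        return 1.0
    return len(set(reference_ids) & set(restored_ids)) / len(reference_ids)


def consistent(reference_runs, restored_runs, min_recall=0.99):
    """True if mean recall across all test queries is at least min_recall."""
    recalls = [recall_at_k(ref, res) for ref, res in zip(reference_runs, restored_runs)]
    if not recalls:
        raise ValueError("no test queries to compare")
    return sum(recalls) / len(recalls) >= min_recall


reference = [[1, 2, 3, 4], [5, 6, 7, 8]]
restored = [[1, 2, 3, 4], [5, 6, 7, 9]]   # one miss in the second query
print(recall_at_k(reference[1], restored[1]))  # 0.75
print(consistent(reference, restored))          # False: mean recall is 0.875
```

Capture the reference results before the drill stops the production cluster, since they cannot be collected afterwards.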
Backup Storage and Retention
Storage location: Use S3-compatible storage with encryption:
# Backup with encryption
import hashlib

import boto3

s3 = boto3.client("s3")

def backup_encrypted(client, collection_name, bucket, key_prefix):
    snapshot = client.create_snapshot(collection_name=collection_name)
    file_path = f"/var/lib/qdrant/snapshots/{snapshot.name}"

    # Calculate checksum in 1 MiB chunks so large snapshots never sit in memory
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(chunk)
    checksum = sha256.hexdigest()

    # Upload with server-side encryption
    s3.upload_file(
        file_path,
        bucket,
        f"{key_prefix}/{snapshot.name}",
        ExtraArgs={"ServerSideEncryption": "AES256"},
    )

    # Store checksum for integrity verification
    s3.put_object(
        Bucket=bucket,
        Key=f"{key_prefix}/{snapshot.name}.sha256",
        Body=checksum.encode("utf-8"),
    )
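The stored checksum is only useful if it is checked at restore time. A minimal sketch of that check follows, assuming the snapshot and the contents of its `.sha256` object have already been downloaded; `sha256_of_file` and `verify_snapshot` are hypothetical helper names, and a small temporary file stands in for a real snapshot.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(path, expected_checksum):
    """Return True if the file's SHA-256 matches the stored checksum."""
    return sha256_of_file(path) == expected_checksum.strip().lower()


# Demonstration with a small temporary file standing in for a snapshot
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"snapshot bytes")
demo_path = tmp.name
expected = hashlib.sha256(b"snapshot bytes").hexdigest()
ok = verify_snapshot(demo_path, expected)
tampered = verify_snapshot(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, tampered)  # True False
```

Run the verification before deleting or overwriting the damaged collection, so a corrupt download never replaces data that is still partly usable.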
Retention policy: Delete old backups after N days to control costs:
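from datetime import datetime, timedelta, timezone

import boto3

def cleanup_old_backups(bucket, retention_days=30):
    """Delete snapshots older than retention_days."""
    s3 = boto3.client("s3")
    # S3 reports LastModified as a timezone-aware UTC datetime
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=retention_days)
    # Paginate: a single list_objects_v2 call returns at most 1,000 keys
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="qdrant/daily/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff_date:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
                print(f"Deleted old backup: {obj['Key']}")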
Key Takeaways
- Define RPO and RTO based on business impact. Tighter SLAs require more complex, expensive infrastructure.
- Combine daily snapshots + continuous WAL backup for balanced cost and recovery capability.
- Test DR procedures monthly. A runbook untested is worthless.
- Store backups in a separate, geographically distributed system (S3 across regions).
- Encrypt backups at rest and verify checksums on restore.
Frequently Asked Questions
How long should a snapshot take for 1B vectors?
A 1B-vector index (assuming 384-dim float32 embeddings) holds about 1.5 TB of raw vectors (1B × 384 × 4 bytes); the graph index and payloads add to that. Snapshot time depends on disk I/O and network. As a rough guide, expect 30 minutes–2 hours for a local disk snapshot and 2–6 hours for a network-backed snapshot (EBS, cloud storage).
What is the cost of backing up 1B vectors daily?
Storage cost (S3): 30 retained daily snapshots × 1.5 TB × $0.023 per GB-month ≈ $1k/month for 30-day retention, and more once index and payload overhead are included. Add egress charges if you restore from a different region. Full snapshots are expensive; consider less frequent snapshots (weekly) or incremental/differential backups to reduce costs.
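The same arithmetic can be wrapped in a small helper to compare retention policies. This is a sketch: `monthly_storage_cost` is a hypothetical function, the 1,000 GB snapshot size is an illustrative input rather than a measurement, and it counts only storage (no request or egress charges).

```python
def monthly_storage_cost(snapshot_gb, retained_copies, price_per_gb_month=0.023):
    """Monthly cost in dollars of keeping retained_copies full snapshots.

    price_per_gb_month defaults to the $0.023/GB figure used above; actual
    pricing varies by provider, region, and storage class.
    """
    return snapshot_gb * retained_copies * price_per_gb_month


daily = monthly_storage_cost(snapshot_gb=1_000, retained_copies=30)
weekly = monthly_storage_cost(snapshot_gb=1_000, retained_copies=4)
print(f"daily: ${daily:,.0f}/month, weekly: ${weekly:,.0f}/month")
# daily: $690/month, weekly: $92/month
```

Cost scales linearly with the number of retained copies, so retention length is usually the first lever to adjust.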
Can I restore a single point or collection, or must I restore the entire database?
Most vector databases support collection-level snapshots (restore one collection without affecting others). Point-level recovery is rarely supported; you would restore a snapshot and manually delete unwanted points.
What if my WAL logs are corrupted?
You lose point-in-time recovery. Your only option is to restore from the last clean snapshot, accepting the loss of all writes made after it. This is why cross-region replication is critical: if your local backup is corrupted, you can restore from a replicated copy in another region or cloud account.