Vector DB Monitoring and Observability: Guide
Without instrumentation, a production vector database is a black box. When users complain that search is slow, you cannot tell whether the cause is the network, a query timeout, or poor index quality. That lack of observability leads to slow diagnosis and cascading failures. Monitoring latency, recall, throughput, and resource metrics lets you detect and fix problems before they reach users.
Core Metrics to Track
Latency metrics (SLA-critical):
- p50 search latency: Median query time. Normal: 5–50 ms.
- p95 search latency: 95th percentile. Alert if > 2x baseline.
- p99 search latency: 99th percentile. Alert if > 3x baseline.
- Upsert latency: 90th percentile should be < 100 ms for batched upserts.
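As a rough illustration of the latency thresholds above, the sketch below computes nearest-rank percentiles from raw latency samples and checks them against a baseline. The function names, the sample values, and the baseline numbers are assumptions made for this example, not part of any vector database API.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_alerts(samples_ms, baseline_p95_ms, baseline_p99_ms):
    """Return which of the p95 (> 2x baseline) and p99 (> 3x baseline) rules are breached."""
    breached = []
    if percentile(samples_ms, 95) > 2 * baseline_p95_ms:
        breached.append("p95")
    if percentile(samples_ms, 99) > 3 * baseline_p99_ms:
        breached.append("p99")
    return breached

# Hypothetical search latencies in milliseconds, with one slow outlier
samples = [8, 9, 10, 11, 12, 12, 13, 14, 15, 400]
print(percentile(samples, 50))   # median
print(latency_alerts(samples, baseline_p95_ms=40, baseline_p99_ms=80))
```

With only ten samples the tail percentiles collapse onto the single outlier, which is why production systems compute them over much larger windows.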
Accuracy metrics (correctness):
- Recall: Proportion of true k-nearest neighbors returned. Target: > 95% for HNSW, > 90% for IVF.
- False-positive rate: Vectors returned that should not match. Target: < 1%.
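The recall definition above reduces to a set intersection. A minimal sketch, assuming you already have the IDs returned by the approximate search and the IDs of the true nearest neighbors (the function name and the ID values are illustrative):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & exact) / len(exact)

# Hypothetical result sets for k = 10: the approximate search missed one neighbor
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]
print(recall_at_k(approx, exact))  # 0.9
```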
Throughput metrics (capacity):
- Queries per second (QPS): Current load. Alert if > 80% of peak capacity.
- Upserts per second: Current write load.
- Average batch size: Upserts are more efficient in larger batches.
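To make the batch-size metric concrete, here is a small sketch of a batching helper together with the average batch size it produces. The helper name `batched` and the batch size of 4 are assumptions for illustration; the actual upsert call depends on your client and is left out.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

# Ten hypothetical points grouped into batches of four
sizes = [len(batch) for batch in batched(range(10), 4)]
average_batch_size = sum(sizes) / len(sizes)
print(sizes, round(average_batch_size, 2))
```

Reporting `average_batch_size` as a gauge makes it easy to spot writers that fall back to single-point upserts.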
Resource metrics (infrastructure health):
- Memory usage: Alert if > 80% of node memory.
- Disk I/O: Monitor read/write latency. Alert on sustained high I/O.
- CPU usage: Alert if > 70% sustained.
- Network bandwidth: Monitor ingress/egress. Alert on saturation.
Index health metrics (quality):
- Index size: Track size of vector index structures.
- Time since last rebuild: Alert if the index is stale and recall has degraded.
- Replication lag: For replicated deployments, alert if replicas lag the leader by > 10 s.
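The static thresholds listed above (QPS at 80% of peak, memory at 80%, CPU at 70%, replication lag at 10 s) can be checked with a simple lookup table. This is a sketch only: the snapshot keys are hypothetical names chosen for the example, and it ignores the "sustained" qualifier, which in practice is handled by the alerting system's evaluation window.

```python
# Limits taken from the metric lists above; the key names are illustrative
THRESHOLDS = {
    "qps_fraction_of_peak": 0.80,
    "memory_fraction": 0.80,
    "cpu_fraction": 0.70,
    "replication_lag_seconds": 10.0,
}

def breached_thresholds(snapshot):
    """Return the sorted names of metrics in the snapshot that exceed their limit."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if snapshot.get(name, 0.0) > limit
    )

# Hypothetical point-in-time readings from one node
snapshot = {
    "qps_fraction_of_peak": 0.65,
    "memory_fraction": 0.86,
    "cpu_fraction": 0.40,
    "replication_lag_seconds": 12.5,
}
print(breached_thresholds(snapshot))
```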
Implementing Metrics Collection: Prometheus + Grafana
Most vector databases expose metrics in Prometheus format. Qdrant, for example, serves them at /metrics on its HTTP port, so no separate metrics port needs to be configured:
# qdrant/config/production.yaml
service:
  http_port: 6333        # REST API; also serves /metrics
  grpc_port: 6334
  api_key: "secret-key"  # when set, scrape requests may need to send this key too
Prometheus scrapes metrics every 15 seconds:
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'vector-db'
    metrics_path: /metrics
    static_configs:
      - targets: ['qdrant-prod.internal:6333']
        labels:
          env: 'production'
          service: 'qdrant'
Example metrics (the names below are illustrative; check your deployment's /metrics output for the exact names your version exposes):
# Example Prometheus metrics
qdrant_http_request_duration_seconds{quantile="0.5", endpoint="/search"} 0.012
qdrant_http_request_duration_seconds{quantile="0.95", endpoint="/search"} 0.045
qdrant_http_request_duration_seconds{quantile="0.99", endpoint="/search"} 0.089
qdrant_index_operations_total{operation="upsert", collection="documents"} 1234567
qdrant_collection_vectors_total{collection="documents"} 1000000000
qdrant_collection_memory_bytes{collection="documents"} 1.2e12
qdrant_request_failures_total{endpoint="/search", reason="timeout"} 42
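Each sample line in the Prometheus text format is a metric name, an optional label set in braces, and a numeric value. The sketch below parses lines of that shape with a regular expression. It is a simplified illustration (no escaped quotes in label values, no timestamps), and `parse_exposition` is a name invented for this example; for real use, a maintained parser such as the one shipped with the official Prometheus client libraries is the safer choice.

```python
import re

_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_exposition(text):
    """Parse simple exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP/TYPE and comments
            continue
        match = _SAMPLE.match(line)
        if match is None:
            raise ValueError(f"unparseable sample line: {line!r}")
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples

text = '''
# illustrative metric names, as in the example above
qdrant_http_request_duration_seconds{quantile="0.99", endpoint="/search"} 0.089
qdrant_request_failures_total{endpoint="/search", reason="timeout"} 42
'''
parsed = parse_exposition(text)
for name, labels, value in parsed:
    print(name, labels, value)
```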
Visualize in Grafana:
Grafana Dashboard: Vector DB Production Health
Queries:
1. Search Latency (p50, p95, p99):
   - qdrant_http_request_duration_seconds{quantile="0.5", endpoint="/search"}
   - qdrant_http_request_duration_seconds{quantile="0.95", endpoint="/search"}
   - qdrant_http_request_duration_seconds{quantile="0.99", endpoint="/search"}
   (This metric is a summary with precomputed quantile labels, so select the quantile directly. histogram_quantile() applies only to histogram _bucket series.)
2. QPS (queries per second):
   - rate(qdrant_http_requests_total{endpoint="/search"}[1m])
3. Upsert Throughput:
   - rate(qdrant_index_operations_total{operation="upsert"}[1m])
4. Memory Usage:
   - qdrant_collection_memory_bytes / 1e9  # Convert to GB
5. Index Operations (insert/update/delete):
   - rate(qdrant_index_operations_total[5m])
6. Error Rate (failures per second):
   - rate(qdrant_request_failures_total[5m])
Alert rules (in Prometheus):
# prometheus/alert_rules.yml
groups:
  - name: vector_db_alerts
    rules:
      - alert: VectorDBHighLatency
        # The latency metric is a summary, so select the precomputed quantile directly
        expr: qdrant_http_request_duration_seconds{quantile="0.99", endpoint="/search"} > 0.1
        for: 5m
        annotations:
          summary: "Vector DB p99 search latency > 100ms"
      - alert: VectorDBHighErrorRate
        # Failures divided by total requests gives an error ratio; rate() alone is failures per second
        expr: sum(rate(qdrant_request_failures_total[5m])) / sum(rate(qdrant_http_requests_total[5m])) > 0.01
        for: 2m
        annotations:
          summary: "Vector DB error rate > 1%"
      - alert: VectorDBHighMemory
        expr: sum(qdrant_collection_memory_bytes) / 1e9 > 900  # > 900 GB
        for: 10m
        annotations:
          summary: "Vector DB memory usage > 900 GB"
      - alert: VectorDBReplicationLag
        expr: qdrant_collection_replication_lag_seconds > 10
        for: 5m
        annotations:
          summary: "Vector DB replica lag > 10s"
Recall Testing: Continuous Measurement
Recall is the hardest metric to measure. True nearest neighbors require expensive brute-force search (comparing query to all vectors). Test on a subset:
import logging
import random
import time

import schedule
from prometheus_client import Gauge
from qdrant_client import QdrantClient, models

RECALL_AVG = Gauge("vector_db_recall_avg", "Average recall@k on the test sample")
RECALL_MIN = Gauge("vector_db_recall_min", "Minimum recall@k on the test sample")

class RecallTester:
    def __init__(self, client, collection_name, test_size=1000):
        self.client = client
        self.collection_name = collection_name
        self.test_size = test_size

    def measure_recall(self, k=100):
        """
        Measure recall by comparing HNSW search to exact search
        on a random sample of points from the collection.
        """
        # Sample random points.
        # Assumes point IDs are the integers 0..count-1; adapt the sampling otherwise.
        count = self.client.count(collection_name=self.collection_name).count
        sampled_ids = random.sample(range(count), min(self.test_size, count))
        points = self.client.retrieve(
            collection_name=self.collection_name,
            ids=sampled_ids,
            with_vectors=True,
        )
        recalls = []
        for point in points:
            # Approximate search (HNSW with default ef_search)
            approx_ids = self._search_ids(point.vector, k, exact=False)
            # Exact search (full scan, slow on large collections)
            exact_ids = self._search_ids(point.vector, k, exact=True)
            if exact_ids:  # guard against division by zero
                recalls.append(len(approx_ids & exact_ids) / len(exact_ids))
        if not recalls:
            raise RuntimeError("No recall samples collected")
        recalls.sort()
        return {
            "avg_recall": sum(recalls) / len(recalls),
            "min_recall": recalls[0],
            "recall_p50": recalls[len(recalls) // 2],
        }

    def _search_ids(self, query_vector, k, exact):
        """
        Return the IDs of the top-k hits.
        Qdrant supports exact search via SearchParams(exact=True). For databases
        without such a flag, keep a small exact-search index for recall testing.
        """
        hits = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            limit=k,
            search_params=models.SearchParams(exact=exact),
        )
        return {hit.id for hit in hits}

def alert(name, message):
    """Placeholder: wire this to your paging system."""
    logging.critical("%s: %s", name, message)

# Continuous recall testing (hourly)
client = QdrantClient(url="http://qdrant-prod.internal:6333")
tester = RecallTester(client, "documents", test_size=100)

def test_recall():
    metrics = tester.measure_recall(k=100)
    # Export to the observability system
    RECALL_AVG.set(metrics["avg_recall"])
    RECALL_MIN.set(metrics["min_recall"])
    # Alert if recall drops
    if metrics["avg_recall"] < 0.95:
        logging.error(f"Recall dropped below 95%: {metrics['avg_recall']}")
        alert("vector-db-recall-drop", f"Avg recall: {metrics['avg_recall']}")

# Schedule hourly and keep the scheduler running
schedule.every().hour.do(test_recall)
while True:
    schedule.run_pending()
    time.sleep(60)
Logging: Structured Logs for Debugging
Log with structured fields for easy filtering and debugging:
import json
import logging
import time
from datetime import datetime, timezone

# Attributes every LogRecord has; anything else was passed in via `extra=`
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}

class JSONLogFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
        }
        # logging copies `extra` keys onto the record as attributes
        # (there is no record.extra), so collect the non-standard ones here
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                log_obj[key] = value
        return json.dumps(log_obj, default=str)

# Configure logging
logger = logging.getLogger("vector-db")
logger.setLevel(logging.INFO)  # the default level (WARNING) would drop info logs
handler = logging.StreamHandler()
handler.setFormatter(JSONLogFormatter())
logger.addHandler(handler)

# Log searches with context
def search_with_logging(client, query_vector, limit=10):
    start = time.perf_counter()
    try:
        results = client.search(
            collection_name="documents",
            query_vector=query_vector,
            limit=limit
        )
        elapsed = time.perf_counter() - start
        logger.info(
            "search_completed",
            extra={
                "duration_ms": elapsed * 1000,
                "results_count": len(results),
                "limit": limit,
                "status": "success",
            }
        )
        return results
    except Exception as e:
        elapsed = time.perf_counter() - start
        logger.error(
            "search_failed",
            extra={
                "duration_ms": elapsed * 1000,
                "error_type": type(e).__name__,
                "error_message": str(e),
                "status": "failure",
            }
        )
        raise
Ship logs to a centralized logging system (ELK, Datadog, CloudWatch):
# Example: Query Elasticsearch for slow searches in the last hour
GET logs/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "level": "info" } },
        { "term": { "message": "search_completed" } },
        { "range": { "duration_ms": { "gte": 100 } } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "size": 100
}
# Example result: 10 slow searches in the last hour with latencies 120–250ms
Distributed Tracing: End-to-End Request Flow
For complex RAG systems, trace requests across components:
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger tracing.
# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry
# releases; newer setups export OTLP to Jaeger instead.
jaeger_exporter = JaegerExporter(agent_host_name="jaeger", agent_port=6831)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))
tracer = trace.get_tracer("vector-db-rag")

# embedding_model, client, and llm are assumed to be initialized elsewhere
def rag_query(user_query):
    """RAG pipeline with tracing."""
    with tracer.start_as_current_span("rag_query") as span:
        span.set_attribute("user_query", user_query)
        # 1. Embed query
        with tracer.start_as_current_span("embed_query"):
            query_embedding = embedding_model.encode(user_query)
        # 2. Search vector DB (separate name so the outer span is not shadowed)
        with tracer.start_as_current_span("vector_search") as search_span:
            search_span.set_attribute("limit", 5)
            results = client.search(
                collection_name="documents",
                query_vector=query_embedding,
                limit=5
            )
            search_span.set_attribute("results_count", len(results))
        # 3. Prompt LLM
        with tracer.start_as_current_span("llm_prompt"):
            context = "\n".join([r.payload["content"] for r in results])
            prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
            response = llm.generate(prompt)
        return response

# View trace in Jaeger UI
# http://localhost:16686 -> search for traces with operation "rag_query"
# The trace timeline shows time spent in each component
Key Takeaways
- Track p50/p95/p99 latencies, QPS, and recall. Alert when any of them breaches its SLA threshold.
- Measure recall continuously on a test subset; exact (brute-force) nearest-neighbor search is too expensive to run against every query.
- Log structured JSON with context fields (request ID, duration, error_type) for debugging.
- Implement distributed tracing (Jaeger, Datadog) to understand end-to-end latency across components.
- Build a monitoring dashboard in Grafana. Update it when SLAs change.
Frequently Asked Questions
How do I measure recall without a brute-force ground truth?
Maintain a small reference index (1K–10K vectors) on which you run exact search. Compare approximate results on this subset to the exact results to estimate recall. Refresh the reference index periodically.
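For a reference set of this size, exact search can be a plain linear scan. The sketch below ranks every reference vector by cosine similarity in pure Python; the function names and the toy two-dimensional vectors are assumptions for illustration, and a real setup would more likely use a vectorized library for speed.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_top_k(query, reference, k):
    """Return the IDs of the k reference vectors most similar to query, best first.

    reference maps point ID -> vector. Every vector is scored, so the result
    is exact by construction.
    """
    ranked = sorted(
        reference,
        key=lambda point_id: cosine_similarity(query, reference[point_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy reference index with hypothetical IDs
reference = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(exact_top_k([1.0, 0.0], reference, k=2))  # ['a', 'b']
```

The IDs returned here serve as the ground truth against which the approximate results for the same query are compared.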
What is an acceptable p99 latency for vector search?
Depends on your application. Real-time search (user-facing): < 100 ms. Background tasks: < 500 ms. If your p99 exceeds 500 ms, investigate index configuration, network latency, and hardware.
Should I alert on absolute latency or relative change?
Both. Alert on absolute latency (p99 > 100 ms) and relative increase (p99 increased by 50% from baseline). Relative alerts catch subtle degradation; absolute alerts catch catastrophic failures.
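The two alert types can be combined in one check. A minimal sketch, using the 100 ms absolute limit and 50% relative increase from the answer above as defaults; the function name and the sample readings are hypothetical:

```python
def p99_alerts(current_ms, baseline_ms, absolute_limit_ms=100.0, relative_increase=0.5):
    """Return which alert types fire for the current p99 latency."""
    alerts = []
    if current_ms > absolute_limit_ms:
        alerts.append("absolute")
    # Relative check: current p99 is more than 50% above the baseline
    if baseline_ms > 0 and current_ms > baseline_ms * (1 + relative_increase):
        alerts.append("relative")
    return alerts

print(p99_alerts(current_ms=45, baseline_ms=25))    # subtle degradation
print(p99_alerts(current_ms=120, baseline_ms=90))   # over the absolute limit only
print(p99_alerts(current_ms=300, baseline_ms=60))   # both fire
```

The first case shows why the relative rule matters: 45 ms is well inside the absolute limit, yet it is 80% slower than the 25 ms baseline.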
How often should I run recall tests?
Run hourly or at least daily. Recall can degrade over time as data skews (all new data in one shard, old data fragmented). Continuous monitoring catches degradation early.