Measuring Cache Latency and Cost Savings
You cannot improve what you do not measure. Semantic caching's value comes from reducing latency and cost, but both are invisible without instrumentation. This article teaches you how to measure cache hit rate, latency improvements, token savings, and cost impact. You will learn to emit metrics to Prometheus or Datadog, build dashboards, and set alerts on cache health.
By the end, you will have the observability infrastructure to justify caching investments to stakeholders and to detect when cache performance degrades.
Core Metrics to Track
Every semantic cache should emit these metrics:
- Cache hit rate (%): (hits / (hits + misses)) * 100. Directly correlates to cost savings.
- Cache hit latency (ms): Time from query to response for cache hits.
- Cache miss latency (ms): Time from query to response for cache misses (includes LLM inference).
- Token savings: Tokens saved per hit = tokens_in_cached_response (the inference tokens avoided).
- Cost per request (USD): (api_cost_miss + cache_lookup_cost) / (hits + misses), where api_cost_miss is the total API spend on misses and cache_lookup_cost is the total embedding and lookup spend on hits.
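To make the relationship between these metrics concrete, here is a minimal sketch that computes hit rate and blended cost per request from raw counts. The class name and the per-request prices are assumptions chosen for illustration, not measured values.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int
    misses: int
    cost_per_hit_usd: float   # assumed: embedding + cache lookup
    cost_per_miss_usd: float  # assumed: embedding + LLM inference

    @property
    def requests(self) -> int:
        return self.hits + self.misses

    def hit_rate_pct(self) -> float:
        """(hits / (hits + misses)) * 100; 0.0 when there is no traffic."""
        return 100.0 * self.hits / self.requests if self.requests else 0.0

    def cost_per_request_usd(self) -> float:
        """Total spend divided by total requests."""
        if not self.requests:
            return 0.0
        total = (self.hits * self.cost_per_hit_usd
                 + self.misses * self.cost_per_miss_usd)
        return total / self.requests


stats = CacheStats(hits=380, misses=620,
                   cost_per_hit_usd=0.00002, cost_per_miss_usd=0.0003)
print(f"hit rate: {stats.hit_rate_pct():.1f}%")
print(f"cost/request: ${stats.cost_per_request_usd():.6f}")
```

Because a hit costs far less than a miss, the blended cost falls almost linearly as the hit rate rises.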
Additional diagnostic metrics:
- Similarity score distribution (percentiles of cache hit similarities).
- Cache size (GB): Monitor memory or storage usage.
- False-positive rate (%): Sampled mismatch between cached and fresh responses (Article 4).
Implementing Metrics Collection
Use a metrics library (Prometheus Python client, StatsD, or native Datadog) to emit measurements at every cache operation.
Example: Prometheus-based metrics collection
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time


class MetricsCollector:
    """Emit cache metrics to Prometheus."""

    def __init__(self):
        # Counters (monotonically increasing)
        self.cache_hits = Counter(
            "cache_hits_total",
            "Total cache hits",
            ["query_category"]  # Label: allows segmentation by query type
        )
        self.cache_misses = Counter(
            "cache_misses_total",
            "Total cache misses",
            ["query_category"]
        )
        # Histograms (latency, tokens, costs)
        self.hit_latency = Histogram(
            "cache_hit_latency_ms",
            "Latency of cache hits in milliseconds",
            buckets=[1, 5, 10, 25, 50, 100, 500]
        )
        self.miss_latency = Histogram(
            "cache_miss_latency_ms",
            "Latency of cache misses (with LLM inference) in milliseconds",
            buckets=[100, 500, 1000, 2000, 5000, 10000]
        )
        self.tokens_saved = Histogram(
            "tokens_saved_per_hit",
            "Tokens saved per cache hit",
            buckets=[10, 50, 100, 500, 1000, 5000]
        )
        self.cost_per_request = Histogram(
            "cost_per_request_usd",
            "Cost per request in USD",
            buckets=[0.0001, 0.001, 0.003, 0.01, 0.03, 0.1]
        )
        # Similarity is a histogram so that percentiles can be derived
        self.similarity_score = Histogram(
            "similarity_score",
            "Cosine similarity of cache hit",
            buckets=[0.5, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99]
        )
        # Gauges (point-in-time snapshots)
        self.cache_size_bytes = Gauge(
            "cache_size_bytes",
            "Total cache size in bytes"
        )
# SemanticCache, embed_text, and call_llm are assumed to be defined elsewhere.
def query_with_metrics(query_text: str, user_id: str,
                       cache: "SemanticCache",
                       metrics: MetricsCollector,
                       category: str = "general"):
    """
    Execute a query with full metrics collection.

    Returns (response, is_cached, latency_ms, cost_usd).
    """
    # perf_counter is monotonic, so it is safer than time.time() for durations
    start_time = time.perf_counter()
    # Embed query
    query_embedding = embed_text(query_text)
    # Search cache
    match = cache.find_similar(query_embedding)
    if match:
        # Cache hit
        cached_response, similarity = match
        hit_latency_ms = (time.perf_counter() - start_time) * 1000
        # Count response tokens (rough estimate: words, not true tokens)
        tokens = len(cached_response.split())
        # Record metrics
        metrics.cache_hits.labels(query_category=category).inc()
        metrics.hit_latency.observe(hit_latency_ms)
        metrics.tokens_saved.observe(tokens)
        metrics.similarity_score.observe(similarity)
        # Cost: embedding only (~0.00001 USD) + cache lookup (~0.00001 USD)
        hit_cost = 0.00002
        metrics.cost_per_request.observe(hit_cost)
        return cached_response, True, hit_latency_ms, hit_cost
    else:
        # Cache miss: invoke LLM
        response = call_llm(query_text)
        miss_latency_ms = (time.perf_counter() - start_time) * 1000
        # Store for future hits
        cache.store(query_text, query_embedding, response)
        # Estimate cost: embedding + LLM inference (word counts approximate tokens)
        tokens_in = len(query_text.split())
        tokens_out = len(response.split())
        # Illustrative per-token prices; substitute your model's actual rates
        llm_cost = tokens_in * 0.000003 + tokens_out * 0.00001
        embedding_cost = 0.00001
        total_cost = llm_cost + embedding_cost
        # Record metrics
        metrics.cache_misses.labels(query_category=category).inc()
        metrics.miss_latency.observe(miss_latency_ms)
        metrics.cost_per_request.observe(total_cost)
        return response, False, miss_latency_ms, total_cost
# Start Prometheus metrics server (on port 8000)
if __name__ == "__main__":
    start_http_server(8000)
    metrics = MetricsCollector()
    cache = SemanticCache()
    # Simulate requests
    for i in range(100):
        query = ["What is async/await?", "How do I use asyncio?", "Tell me about async"][i % 3]
        response, is_cached, latency, cost = query_with_metrics(
            query, f"user_{i}", cache, metrics,
            category="technical" if "async" in query else "general"
        )
        print(f"Request {i}: cached={is_cached}, latency={latency:.1f}ms, cost=${cost:.5f}")
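The example above references SemanticCache, embed_text, and call_llm without defining them. To run it locally, one option is a set of toy stand-ins like the sketch below. Everything here is an assumption made for demonstration: the hashed bag-of-words embedding is not a real embedding model, the linear scan is not a real vector index, and call_llm returns a canned string.

```python
import hashlib
import math
from typing import List, Optional, Tuple


def embed_text(text: str, dim: int = 64) -> List[float]:
    """Toy embedding: hash each word into one of `dim` buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory cache with a linear scan; for demos only."""

    def __init__(self, threshold: float = 0.94):
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def find_similar(self, embedding: List[float]) -> Optional[Tuple[str, float]]:
        """Return (response, similarity) for the best match above the threshold."""
        best: Optional[Tuple[str, float]] = None
        for stored, response in self.entries:
            score = cosine(embedding, stored)
            if score >= self.threshold and (best is None or score > best[1]):
                best = (response, score)
        return best

    def store(self, query_text: str, embedding: List[float], response: str) -> None:
        self.entries.append((embedding, response))


def call_llm(query_text: str) -> str:
    """Placeholder for a real LLM call."""
    return f"Stub answer to: {query_text}"
```

With these stand-ins, only exact or near-exact rewordings will hit the cache, so expect a different hit rate than a real embedding model would give.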
Prometheus scrape config:
# prometheus.yml
scrape_configs:
  - job_name: "semantic-cache"
    static_configs:
      - targets: ["localhost:8000"]
Deriving Actionable Metrics
From raw metrics, compute derived metrics for dashboards and alerts.
Example: Derived metrics (PromQL queries)
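# Cache hit rate (%) over the last 5 minutes, summed across query_category labels
sum(rate(cache_hits_total[5m]))
  / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100
# Average latency saved per cache hit (ms): mean miss latency minus mean hit latency
(sum(rate(cache_miss_latency_ms_sum[5m])) / sum(rate(cache_miss_latency_ms_count[5m])))
  - (sum(rate(cache_hit_latency_ms_sum[5m])) / sum(rate(cache_hit_latency_ms_count[5m])))
# Daily cost savings (USD) for a 1M request/day service,
# assuming USD 0.0003 per miss and USD 0.00002 per hit
sum(increase(cache_hits_total[1d]))
  / (sum(increase(cache_hits_total[1d])) + sum(increase(cache_misses_total[1d])))
  * 1e6
  * (0.0003 - 0.00002)
# Tokens saved daily (histograms expose a _sum series for the observed total)
sum(increase(tokens_saved_per_hit_sum[1d]))
# Alert: Cache hit rate drops below 20% (possible misconfiguration)
sum(increase(cache_hits_total[1h]))
  / (sum(increase(cache_hits_total[1h])) + sum(increase(cache_misses_total[1h]))) < 0.20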
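To show the arithmetic behind these derived metrics, the sketch below computes them from two snapshots of the raw counters. This is roughly what increase() does over a window, ignoring counter resets and extrapolation. The Snapshot structure and the default per-request costs are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Snapshot:
    hits: float
    misses: float
    hit_latency_sum_ms: float   # running total of hit latencies
    miss_latency_sum_ms: float  # running total of miss latencies


def derive(start: Snapshot, end: Snapshot,
           miss_cost_usd: float = 0.0003,
           hit_cost_usd: float = 0.00002) -> Dict[str, float]:
    """Derive window metrics from counter deltas between two snapshots."""
    hits = end.hits - start.hits
    misses = end.misses - start.misses
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    avg_hit = (end.hit_latency_sum_ms - start.hit_latency_sum_ms) / hits if hits else 0.0
    avg_miss = (end.miss_latency_sum_ms - start.miss_latency_sum_ms) / misses if misses else 0.0
    return {
        "hit_rate_pct": hit_rate * 100,
        "latency_saved_per_hit_ms": avg_miss - avg_hit,
        # Each hit avoids one miss-priced request but still pays the hit price
        "cost_saved_usd": hits * (miss_cost_usd - hit_cost_usd),
    }


window = derive(
    Snapshot(hits=100, misses=100, hit_latency_sum_ms=3_000, miss_latency_sum_ms=200_000),
    Snapshot(hits=480, misses=720, hit_latency_sum_ms=14_400, miss_latency_sum_ms=1_440_000),
)
print(window)
```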
Building a Cache Dashboard
Use Grafana (free, integrates with Prometheus) to visualize metrics in near real time.
Example: Dashboard JSON (Grafana)
{
  "dashboard": {
    "title": "Semantic Cache Performance",
    "panels": [
      {
        "title": "Cache Hit Rate (%)",
        "targets": [
          {"expr": "sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100"}
        ]
      },
      {
        "title": "Latency: Hit vs. Miss",
        "targets": [
          {"expr": "histogram_quantile(0.95, sum by (le) (rate(cache_hit_latency_ms_bucket[5m])))", "legendFormat": "Hit P95"},
          {"expr": "histogram_quantile(0.95, sum by (le) (rate(cache_miss_latency_ms_bucket[5m])))", "legendFormat": "Miss P95"}
        ]
      },
      {
        "title": "Daily Cost Savings (USD)",
        "description": "Assumes roughly USD 0.002 saved per hit",
        "targets": [
          {"expr": "sum(increase(cache_hits_total[1d])) * 0.002"}
        ]
      },
      {
        "title": "Tokens Saved (Daily)",
        "targets": [
          {"expr": "sum(increase(tokens_saved_per_hit_sum[1d]))"}
        ]
      }
    ]
  }
}
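The percentile panels depend on how a quantile is estimated from histogram bucket counts. The sketch below illustrates the general idea with linear interpolation inside the bucket that contains the target rank. It is a simplified illustration, not the exact Prometheus algorithm, and the bucket counts in the example are made up.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by
    upper_bound, mirroring Prometheus `le` buckets.
    """
    if not buckets or buckets[-1][1] == 0:
        raise ValueError("no observations")
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Hypothetical cumulative counts for the hit-latency buckets (ms)
hit_buckets = [(1, 0), (5, 10), (10, 40), (25, 80), (50, 95), (100, 100), (500, 100)]
print(f"P50 ~ {estimate_quantile(0.50, hit_buckets):.2f} ms")
print(f"P95 ~ {estimate_quantile(0.95, hit_buckets):.2f} ms")
```

The estimate can never be more precise than the bucket boundaries, which is why bucket choice matters: if most hits land in a single wide bucket, the reported P95 is mostly interpolation.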
Case Study: Real Metrics from a Production System
A customer support chatbot (10M requests/month):
- Baseline (no cache): 1.8 s latency, USD 3,000/month in API costs (10M requests * USD 0.0003 per LLM request).
- With semantic cache (threshold 0.94):
  - Hit rate: 38% (3.8M hits/month, 6.2M misses).
  - Hit latency: 35 ms (99th percentile).
  - Miss latency: 2.1 s (includes LLM inference).
  - Average latency: 0.38 * 0.035 + 0.62 * 2.1 ≈ 1.31 s (27% improvement).
  - Cost with cache: hits cost USD 0.00002 each (embedding + lookup), misses cost USD 0.0003 each. 3.8M hits * 0.00002 + 6.2M misses * 0.0003 = USD 76 + USD 1,860 = USD 1,936/month.
  - Savings: USD 3,000 - USD 1,936 = USD 1,064/month (35% reduction).
  - Cache infrastructure (Pinecone at USD 100/month): covered roughly 10 times over by the monthly savings, leaving a net saving of about USD 964/month.
Dashboard snapshot:
Cache Hit Rate: 38% (3.8M/month)
Avg Latency: 1.31 s (27% better than baseline)
Cost/Request: USD 0.000194 (with cache), USD 0.0003 (without cache)
Monthly Savings: USD 1,064 (35% reduction)
Cache Size: 142 GB (1.4M entries, 1536-dim embeddings)
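The case-study arithmetic can be reproduced with a few lines of Python. This sketch only recomputes the figures from the inputs stated above (request volume, hit rate, latencies, and per-request costs); the variable names are illustrative.

```python
requests = 10_000_000
hit_rate = 0.38
hits = round(requests * hit_rate)
misses = requests - hits

hit_latency_s, miss_latency_s, baseline_latency_s = 0.035, 2.1, 1.8
hit_cost_usd, miss_cost_usd = 0.00002, 0.0003

# Latency: traffic-weighted average of hit and miss latency
avg_latency_s = hit_rate * hit_latency_s + (1 - hit_rate) * miss_latency_s
latency_gain_pct = 100 * (baseline_latency_s - avg_latency_s) / baseline_latency_s

# Cost: every request is a miss at baseline; with the cache, hits are cheap
baseline_cost = requests * miss_cost_usd
cached_cost = hits * hit_cost_usd + misses * miss_cost_usd
savings = baseline_cost - cached_cost
savings_pct = 100 * savings / baseline_cost

print(f"avg latency: {avg_latency_s:.3f} s ({latency_gain_pct:.1f}% better)")
print(f"cost: ${cached_cost:,.0f}/month vs ${baseline_cost:,.0f}/month baseline")
print(f"savings: ${savings:,.0f}/month ({savings_pct:.1f}%)")
```

Keeping this calculation in a script makes it easy to rerun with your own hit rate and prices before presenting numbers to stakeholders.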
Alerting on Cache Health
Define alert thresholds and notification rules.
Example: Alert rules (Prometheus rule file; Alertmanager routes the notifications)
groups:
  - name: cache_health
    rules:
      # Alert 1: Hit rate drops below 15%
      - alert: CacheHitRateLow
        expr: sum(increase(cache_hits_total[1h])) / (sum(increase(cache_hits_total[1h])) + sum(increase(cache_misses_total[1h]))) < 0.15
        for: 10m
        annotations:
          summary: "Cache hit rate below 15%"
      # Alert 2: Hit latency P95 exceeds 500ms (possible degradation)
      - alert: CacheHitLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(cache_hit_latency_ms_bucket[5m]))) > 500
        for: 5m
        annotations:
          summary: "Cache hit latency P95 > 500ms"
      # Alert 3: Cache size exceeds 90% capacity
      - alert: CacheSizeWarning
        expr: cache_size_bytes > (1024 * 1024 * 1024 * 0.9)  # 90% of a 1 GiB capacity; adjust to your limit
        annotations:
          summary: "Cache size approaching limit"
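The `for:` clause keeps an alert from firing on a brief dip. The sketch below is a simplified model of that behavior: the alert fires only after the condition has held for a number of consecutive evaluations. Real Prometheus tracks elapsed time rather than an evaluation count, so treat `for_evals` as an illustrative stand-in.

```python
from typing import Iterable, List


def firing_states(hit_rates: Iterable[float], threshold: float,
                  for_evals: int) -> List[bool]:
    """Return, per evaluation, whether a low-hit-rate alert would fire.

    The alert fires once `hit_rate < threshold` has held for at least
    `for_evals` consecutive evaluations, and resets on any recovery.
    """
    states: List[bool] = []
    streak = 0
    for rate in hit_rates:
        streak = streak + 1 if rate < threshold else 0
        states.append(streak >= for_evals)
    return states


# Hit rate dips briefly, recovers, then dips again
observed = [0.30, 0.10, 0.10, 0.10, 0.30, 0.10]
print(firing_states(observed, threshold=0.15, for_evals=3))
```

A longer `for:` window trades detection speed for fewer false alarms, which is usually the right trade for a cache whose hit rate fluctuates with traffic mix.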
Key Takeaways
- Measure hit rate, latency (hit vs. miss), tokens saved, and cost per request; emit to Prometheus or Datadog.
- Derive actionable metrics: cost savings per day, latency improvement (average and percentiles), and similarity score distribution.
- Build dashboards in Grafana to visualize trends and spot anomalies.
- Set alerts on hit rate drops, latency increases, or cache overflow; act quickly to prevent performance regression.
- Monitor and publish metrics to stakeholders monthly; results like the case study's 35% cost reduction and 27% latency improvement help justify further investment.
Frequently Asked Questions
What is a healthy cache hit rate?
It depends on the domain. Q&A systems: 40–60%. Code generation: 20–35%. Real-time analytics: 5–15%. Start by measuring your current baseline, then aim to improve it by 2–3 percentage points month over month.
How do I measure false-positive rate in production?
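Sample 1% of cache hits. For each, re-run the LLM and compare the fresh response with the cached one (token overlap, semantic distance). If more than 5% of sampled hits mismatch, raise your similarity threshold (Article 6) immediately so that only closer matches are served from the cache.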
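A minimal sketch of that sampling check follows, assuming token overlap (Jaccard similarity over word sets) as the comparison. The function names, the 0.5 overlap cutoff, and the `fresh_answer` callback are assumptions for illustration. Since LLM output varies between runs, word overlap is a crude signal and a semantic distance measure would likely be more reliable.

```python
import random
from typing import Callable, List, Optional, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase word sets (1.0 = same vocabulary)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def false_positive_rate(hits: List[Tuple[str, str]],
                        fresh_answer: Callable[[str], str],
                        sample_rate: float = 0.01,
                        min_overlap: float = 0.5,
                        rng: Optional[random.Random] = None) -> float:
    """Estimate the share of sampled cache hits whose cached response
    diverges from a freshly generated one.

    `hits` holds (query, cached_response) pairs; `fresh_answer` re-runs the LLM.
    """
    rng = rng or random.Random()
    sampled = [h for h in hits if rng.random() < sample_rate]
    if not sampled:
        return 0.0
    mismatches = sum(
        1 for query, cached in sampled
        if token_overlap(cached, fresh_answer(query)) < min_overlap
    )
    return mismatches / len(sampled)


# Demo with a stubbed LLM and 100% sampling so the result is deterministic
demo_hits = [("q1", "the cat sat"), ("q2", "completely different")]
rate = false_positive_rate(demo_hits, lambda q: "the cat sat", sample_rate=1.0)
print(f"false-positive rate: {rate:.0%}")
```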
Should I alert on every cache miss, or aggregate?
Aggregate: alert on hit rate drops over 1–6 hours. Per-request alerts are noise. Volume matters: a single miss is normal; 10% miss rate increase is actionable.
How do I factor storage and infrastructure costs into per-request cost?
Include all costs: embedding API, LLM API, cache storage (Pinecone, Redis, etc.), and lookup compute (CPU). Embedding, lookup, and storage apply to every request, while the LLM cost applies only to misses. Example: embedding_cost + lookup_cost + (storage_monthly_cost / monthly_requests) + llm_cost * cache_miss_fraction.
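One way to express a fully loaded per-request cost is sketched below, under the assumption that embedding, lookup, and amortized storage are paid on every request and the LLM cost only on misses. The function name and the example figures are illustrative.

```python
def loaded_cost_per_request(embedding_cost: float,
                            lookup_cost: float,
                            llm_cost: float,
                            storage_monthly_cost: float,
                            monthly_requests: int,
                            miss_fraction: float) -> float:
    """Per-request cost in USD, including amortized storage."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    if not 0.0 <= miss_fraction <= 1.0:
        raise ValueError("miss_fraction must be between 0 and 1")
    storage_per_request = storage_monthly_cost / monthly_requests
    # The LLM is only invoked on misses; everything else applies to all requests
    return embedding_cost + lookup_cost + storage_per_request + llm_cost * miss_fraction


cost = loaded_cost_per_request(
    embedding_cost=0.00001,
    lookup_cost=0.00001,
    llm_cost=0.0003,
    storage_monthly_cost=100.0,
    monthly_requests=10_000_000,
    miss_fraction=0.62,
)
print(f"fully loaded cost/request: ${cost:.6f}")
```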
What if I see high latency on cache hits (>100ms)?
Possible causes: (1) a large cache with slow ANN search; (2) network overhead (vector DB round-trip); (3) embedding API latency included in the measurement. Measure the breakdown (latency = embedding + search + serving), then optimize the largest component.
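A simple way to get that breakdown is to time each stage separately. The sketch below uses a small context-manager timer; the StageTimer class is hypothetical, and the sleep calls stand in for the real embedding, search, and serving steps.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulate wall-clock time per named stage, in milliseconds."""

    def __init__(self) -> None:
        self.stages_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def bottleneck(self) -> str:
        """Name of the stage with the largest accumulated time."""
        return max(self.stages_ms, key=self.stages_ms.get)


timer = StageTimer()
with timer.stage("embedding"):
    time.sleep(0.05)    # stand-in for the embedding API call
with timer.stage("search"):
    time.sleep(0.001)   # stand-in for the vector search
with timer.stage("serving"):
    time.sleep(0.001)   # stand-in for response serialization

for name, ms in timer.stages_ms.items():
    print(f"{name}: {ms:.1f} ms")
print(f"bottleneck: {timer.bottleneck()}")
```

Each stage's time could also be observed into its own Prometheus histogram so the breakdown shows up on the dashboard rather than only in logs.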
Further Reading
- Prometheus Metrics Types and Best Practices — Official guide to metric design.
- Google SRE Handbook: Monitoring and Observability — Industry standards for metrics and alerts.
- Grafana Dashboard Design Patterns — Building effective monitoring dashboards.
- Cost Optimization in ML Systems (MLOps Community, 2025) — Frameworks for measuring and optimizing inference costs.