Hybrid Caching: Exact Match plus Semantic
A hybrid cache combines exact-match (Redis, O(1) lookup) with semantic caching (ANN, O(log N)). Check exact first; if miss, check semantic. This two-tier strategy maximizes hit rate while minimizing latency: identical repeated queries hit Redis in microseconds, and paraphrases hit semantic cache in milliseconds. At 50M requests per month, this approach is standard in production systems from major cloud platforms to SaaS startups.
This article covers the architecture, implementation, and a case study from a real production pipeline. You will learn when each tier is profitable and how to tune both simultaneously.
Architecture: Two-Tier Lookup
Query arrives
↓
Tier 1: Check exact-match cache (Redis)
↓
Hit? → Return (latency: <1 ms)
Miss? ↓
↓
Tier 2: Compute embedding, check semantic cache (vector DB)
↓
Hit? → Return (latency: 5–20 ms)
Miss? ↓
↓
Tier 3: Call LLM (latency: 1–5 s)
↓
Store in both Redis (exact) and semantic cache
↓
Return response
The key insight: repeated queries (common in customer support, Q&A, documentation) are identical. Redis catches these 100% hits at near-zero cost. Paraphrases are rare enough that semantic-only caching on misses is acceptable.
Implementation: HybridCache Class
import redis
import hashlib
from datetime import datetime, timedelta
class HybridCache:
"""Two-tier cache: Redis exact-match + semantic fallback."""
def __init__(self, redis_url: str = "redis://localhost:6379",
semantic_cache=None, threshold: float = 0.95):
# Tier 1: Redis
self.redis_client = redis.from_url(redis_url)
self.redis_ttl_seconds = 86400 # 24 hours
# Tier 2: Semantic
self.semantic_cache = semantic_cache # SemanticCache instance from Article 3
self.threshold = threshold
# Metrics
self.exact_hits = 0
self.semantic_hits = 0
self.misses = 0
def _make_exact_key(self, query: str) -> str:
"""Generate a deterministic Redis key from query text."""
# Normalize whitespace; case-insensitive
normalized = " ".join(query.lower().split())
return f"query:{hashlib.md5(normalized.encode()).hexdigest()}"
def get_or_compute(self, query: str) -> tuple[str, str]:
"""
Hybrid lookup: exact first, then semantic, then LLM.
Returns: (response, cache_tier) where tier = "exact", "semantic", or "miss"
"""
# Tier 1: Exact-match cache (Redis)
exact_key = self._make_exact_key(query)
cached_response = self.redis_client.get(exact_key)
if cached_response:
self.exact_hits += 1
return cached_response.decode('utf-8'), "exact"
# Tier 2: Semantic cache (if available)
if self.semantic_cache:
query_embedding = embed_text(query)
match = self.semantic_cache.find_similar(query_embedding)
if match:
cached_response, similarity = match
self.semantic_hits += 1
return cached_response, "semantic"
# Tier 3: Cache miss — compute via LLM
self.misses += 1
response = call_llm(query)
# Store in both tiers for future hits
self._store_both_tiers(query, response)
return response, "miss"
def _store_both_tiers(self, query: str, response: str):
"""Store response in both Redis and semantic cache."""
# Tier 1: Redis exact-match
exact_key = self._make_exact_key(query)
self.redis_client.setex(
exact_key,
self.redis_ttl_seconds,
response
)
# Tier 2: Semantic cache
if self.semantic_cache:
embedding = embed_text(query)
self.semantic_cache.store(query, embedding, response)
def stats(self) -> dict:
"""Return cache performance statistics."""
total = self.exact_hits + self.semantic_hits + self.misses
return {
"exact_hits": self.exact_hits,
"semantic_hits": self.semantic_hits,
"misses": self.misses,
"total": total,
"exact_hit_rate": self.exact_hits / total if total > 0 else 0.0,
"semantic_hit_rate": self.semantic_hits / total if total > 0 else 0.0,
"miss_rate": self.misses / total if total > 0 else 0.0,
"combined_hit_rate": (self.exact_hits + self.semantic_hits) / total if total > 0 else 0.0
}
Case Study: 50M Request/Month Production Pipeline
A content generation platform (customer-facing AI writing assistant) serves 50M requests/month with 20K concurrent users.
Baseline (no caching):
- Latency: avg 2.1s, p99 5.2s.
- Cost: 50M * USD 0.003 (GPT-4o) = USD 150K/month.
After deploying hybrid cache (Redis + semantic, threshold 0.95):
| Metric | Value | Explanation |
|---|---|---|
| Tier 1 (exact) hit rate | 18% | Repeated queries (user re-runs same prompt) |
| Tier 2 (semantic) hit rate | 22% | Paraphrased queries (same intent, different words) |
| Combined hit rate | 40% | 20M requests served from cache |
| Cache misses | 30M | 60% require LLM inference |
| Exact-hit latency | 0.8 ms (p99: 2 ms) | Redis lookup time |
| Semantic-hit latency | 12 ms (p99: 35 ms) | Embedding + vector search |
| Miss latency | 2.1 s (unchanged) | LLM inference time |
| Average latency | 1.29 s | 0.18 * 0.0008 + 0.22 * 0.012 + 0.60 * 2.1 |
| Latency improvement | 39% | Baseline 2.1s → 1.29s |
Cost impact:
- Tier 1 (Redis): USD 200/month (self-hosted on AWS ElastiCache, 50 GB).
- Tier 2 (vector DB): USD 800/month (Pinecone, 1.5M vectors * 1536 dims).
- LLM costs:
30M misses * 0.003 = USD 90K. - Embedding API (tier 2 misses only):
30M * 0.00001 = USD 300. - Total with cache: USD 91.3K/month (vs. USD 150K baseline).
- Savings: USD 58.7K/month (39% reduction).
- ROI: Breakeven in 0.15 months (< 1 week).
Tuning Both Tiers
Redis tier tuning:
- TTL: Balance freshness vs. storage. 24 hours works for most content. For real-time data (stock prices, news), use shorter TTL (1–6 hours).
- Key design: Use normalized query text (case-insensitive, whitespace-collapsed) to maximize exact matches.
- Size management: Monitor Redis memory; set eviction policy (LRU, LFU) to auto-evict stale entries if near capacity.
Semantic tier tuning:
- Threshold: 0.95 is standard. A/B test (Article 6) to find your domain's sweet spot.
- Embedding batch size: Batch 50–100 queries/call to reduce embedding API cost. Use asynchronous workers to avoid blocking the request path.
- Vector DB scaling: Monitor query latency (p99 should be <50 ms). If it creeps above 100 ms, add indexing (IVF, HNSW) or shard vectors across multiple DB instances.
Metrics and Monitoring for Hybrid Caches
Track each tier separately to identify bottlenecks.
Example: Prometheus metrics for hybrid cache
from prometheus_client import Counter, Histogram
class HybridCacheMetrics:
def __init__(self):
self.exact_hits = Counter("cache_exact_hits_total", "Exact-match cache hits")
self.semantic_hits = Counter("cache_semantic_hits_total", "Semantic cache hits")
self.misses = Counter("cache_misses_total", "Cache misses (LLM inference required)")
self.exact_latency = Histogram(
"cache_exact_latency_ms",
"Exact-match lookup latency",
buckets=[0.1, 0.5, 1, 5, 10]
)
self.semantic_latency = Histogram(
"cache_semantic_latency_ms",
"Semantic lookup latency",
buckets=[1, 5, 10, 20, 50, 100]
)
def query_hybrid(query_text: str, cache: HybridCache, metrics: HybridCacheMetrics):
"""Execute query with hybrid cache and detailed metrics."""
start = time.time()
response, tier = cache.get_or_compute(query_text)
elapsed_ms = (time.time() - start) * 1000
if tier == "exact":
metrics.exact_hits.inc()
metrics.exact_latency.observe(elapsed_ms)
elif tier == "semantic":
metrics.semantic_hits.inc()
metrics.semantic_latency.observe(elapsed_ms)
else: # miss
metrics.misses.inc()
return response
Common Pitfalls and Fixes
Pitfall 1: Redis key collisions from poor normalization
- Issue: "What is async?" and "what is ASYNC?" treated as different keys.
- Fix: Normalize query before hashing: lowercase, strip punctuation, collapse whitespace.
Pitfall 2: Semantic cache fills with stale data
- Issue: After 3 months, 90% of cached responses are outdated (model changed, facts shifted).
- Fix: Implement TTL in semantic cache (Article 4); purge entries > 30 days old; periodically re-embed and recompute.
Pitfall 3: Exact tier dominates, semantic tier unused
- Issue: Hit rate is 35%, but exact tier accounts for 34%, semantic < 1%.
- Cause: Either threshold is too high or embeddings are low quality.
- Fix: Lower threshold by 0.02–0.05; benchmark embedding model on your domain.
Pitfall 4: Combined hit rate does not improve with semantic cache
- Issue: Exact + semantic = 35%, but exact alone = 32%. Semantic adds only 3%.
- Cause: Queries are too diverse (little paraphrasing); users ask truly different questions.
- Action: Semantic caching may not be cost-effective for your workload. Focus on exact matches or increase TTL in semantic tier.
Key Takeaways
- Hybrid cache architecture (exact + semantic) is the production standard, combining O(1) Redis lookups with O(log N) ANN search.
- Typical performance: 15–30% exact hits, 15–40% semantic hits, 30–70% misses. Combined hit rates of 40–70% are common.
- Cost savings scale linearly: 40% hit rate saves 40% of LLM costs minus cache infrastructure (typically 20–30% net savings after Redis and vector DB).
- Monitor both tiers independently; one tier may saturate before the other.
Frequently Asked Questions
Should I use Redis or a hosted cache service (Memcached, DynamoDB)?
Redis is fastest (in-process or low-latency network, <1ms). DynamoDB adds 5–10 ms per lookup. For latency-sensitive applications (user-facing), self-hosted Redis or ElastiCache is worth it. For batch/background jobs, DynamoDB is acceptable.
Can I use semantic caching without exact-match?
Yes, if you have limited storage or want to simplify. You lose 15–30% of hit rate and 1–2 orders of magnitude in latency (10s of ms instead of microseconds), but one cache system is easier to operate. Cost savings are roughly proportional to hit rate; 40% combined (exact + semantic) vs. 22% (semantic-only) = 82% more cost savings.
How do I migrate from exact-only to hybrid?
- Deploy semantic cache alongside Redis. 2. Route 10% of traffic to hybrid, 90% to exact-only for 1 week. 3. Monitor hit rates and latencies. 4. If metrics improve, roll out to 100%. 5. Retire pure exact-match tier after 30 days.
What if Redis or vector DB fails?
Graceful degradation: if exact cache is down, skip to semantic (add 10 ms). If semantic is down, skip to LLM (add 2s). Implement circuit breakers to detect failures and fall through automatically.
Can I combine hybrid caching with prompt caching (Anthropic)?
Yes. Hybrid caching (this article) is full-response caching. Prompt caching (Anthropic's feature) caches the embedding of a prompt prefix to avoid recomputing it on each request. Use prompt caching within semantic cache misses to further reduce LLM inference time.
Further Reading
- Redis Persistence and Eviction Policies — Tuning Redis for production caching.
- Two-Level Cache Coherence in Distributed Systems — Academic framework for multi-tier caching.
- ElastiCache Best Practices (AWS) — Operational patterns for managed Redis.
- Vector Database Performance Comparison 2026 — Benchmarks on latency and cost across Pinecone, Weaviate, Milvus.