What Is Semantic Caching for LLMs?
Semantic caching for LLMs stores LLM responses and reuses them when a new query is semantically similar enough to a previously cached query, rather than only when it matches one exactly. Instead of comparing raw text strings, a semantic cache uses vector embeddings (numerical representations of the meaning of text) to measure similarity, so paraphrases and minor variations of a question can hit the cache and return a pre-computed response without running model inference.
Unlike traditional key-value caches, which return a hit only on byte-for-byte duplication, semantic caches can reduce latency and API costs by 40–80% in production systems by recognizing that "What are the benefits of Rust?" and "Why should I learn Rust?" are asking essentially the same question, even though the text differs. In 2026, caching is a standard feature around commercial LLM APIs, though not all of it is semantic: Anthropic's Prompt Caching, for instance, reuses an exactly repeated prompt prefix rather than matching queries by meaning, and is complementary to the approach described here. Semantic caching itself is typically implemented at the application or gateway layer and is a widely used technique for reducing per-request inference cost in production.
Exact-Match Caching vs. Semantic Caching
Exact-match caching (Redis, Memcached) requires the full input string to be identical. A cache hit on "What is async/await?" does not apply to "Tell me about async/await in Rust" even though the intent is related.
Semantic caching, by contrast, converts every cached query into a numerical vector (embedding) and stores both the embedding and the cached response. When a new query arrives, the system computes its embedding, finds the closest cached embedding, and returns the cached response if that closest match meets a configurable similarity threshold (typically cosine similarity >= 0.95). This is nearest-neighbor search as used in machine learning: instead of an exact key lookup, a semantic cache searches a vector space, and at scale it uses an approximate nearest-neighbor (ANN) index so that it does not have to compare the query against every cached entry.
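The lookup step described above can be sketched with NumPy. This is a minimal illustration, not a production design: it uses an exact brute-force search where a real deployment would use an ANN index, the function name `nearest_cached` is hypothetical, and the 3-dimensional vectors are toy stand-ins for real embeddings.

```python
import numpy as np


def nearest_cached(query_vec, cached_vecs, threshold=0.95):
    """Return (index, similarity) of the closest cached vector; index is None on a miss."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    sims = c @ q                     # one similarity score per cached entry
    best = int(np.argmax(sims))      # exact (brute-force) nearest neighbor
    best_sim = float(sims[best])
    return (best if best_sim >= threshold else None), best_sim


# Toy 3-dimensional "embeddings" (assumption: real ones have hundreds of dimensions).
cached = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])

hit_index, hit_sim = nearest_cached(np.array([0.62, 0.78, 0.05]), cached)
miss_index, miss_sim = nearest_cached(np.array([0.0, 0.0, 1.0]), cached)
print(hit_index, round(hit_sim, 3))    # closest entry is row 2, above the threshold
print(miss_index, round(miss_sim, 3))  # nothing is close enough, so this is a miss
```

The brute-force scan costs one dot product per cached entry, which is why larger caches move to an ANN index. The function also assumes the cache is non-empty.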
Comparison table: Key differences between exact-match and semantic caching
| Aspect | Exact-Match Cache (Redis) | Semantic Cache (Vector-Based) |
|---|---|---|
| Lookup method | Hash table (O(1)) | Approximate nearest-neighbor (O(log N)) |
| Match required | 100% text identity | Configurable similarity threshold (e.g., >= 0.95) |
| Paraphrases hit | No | Yes, if above threshold |
| False positives | Zero | Possible (both acceptable and problematic cases) |
| Typical hit rate | 10–25% (high variance) | 35–60% (more consistent) |
| Latency per lookup | < 1 ms | 2–20 ms (depends on cache size and index type) |
| Infrastructure | Redis, Memcached | Vector DB (Pinecone, Weaviate) or in-memory (FAISS, numpy) |
| Storage per cached response | 0.1–0.5 MB | 0.1–1.0 MB (embedding + response) |
When Semantic Caching Pays Off
Semantic caching is highest-impact in workflows where:
- The same intent is phrased differently (customer support, Q&A, tutoring systems): "How do I reset my password?" and "I forgot my password, what now?" are one question.
- Users ask variations of the same problem (code generation, analysis): "Generate a Python function to sort a list" and "Write Python code that sorts an array" should share a cached response.
- Cost-per-request is high (GPT-4 at USD 0.03/1K tokens, Sonnet at USD 0.003/1K): each cache hit saves USD 0.15–3.00 per request across a fleet.
- Latency is a user-facing metric (chatbots, search, customer-facing APIs): cutting inference time from 2 seconds to 50 milliseconds is the difference between usable and abandoned.
In internal studies by Anthropic (2024–2026), semantic caching reduced cache invalidation overhead by 40% compared to time-based TTL and improved consistency in multi-tenant systems. A 10M-request/month customer support chatbot saved USD 240K/year by combining semantic + exact-match caching and reducing average response latency from 1.8s to 0.3s.
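To see how such savings figures are derived, a rough estimate multiplies the number of avoided LLM calls by the cost of each call, then subtracts the cost of the lookups themselves. The sketch below uses hypothetical inputs (1M requests/month, a 40% hit rate, USD 0.01 per LLM call, USD 0.0001 per lookup); none of them are measurements from the deployments mentioned above.

```python
def monthly_savings(requests_per_month, hit_rate, cost_per_llm_call, cost_per_lookup=0.0):
    """Estimated monthly savings in USD from serving cache hits instead of calling the LLM."""
    avoided_calls = requests_per_month * hit_rate
    # Every request pays for an embedding + cache lookup, whether it hits or misses.
    lookup_overhead = requests_per_month * cost_per_lookup
    return avoided_calls * cost_per_llm_call - lookup_overhead


# Hypothetical inputs for illustration only.
example = monthly_savings(
    requests_per_month=1_000_000,
    hit_rate=0.40,
    cost_per_llm_call=0.01,
    cost_per_lookup=0.0001,
)
print(f"Estimated savings: USD {example:,.0f} per month")
```

With these inputs the estimate comes to about USD 3,900 per month. The lookup overhead term matters most when the hit rate is low, because every miss pays for both the lookup and the LLM call.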
Embeddings: The Foundation
An embedding is a fixed-length vector (array of numbers) representing the semantic meaning of text. Modern embedding models, such as OpenAI's text-embedding-3-small (1536 dimensions) or the Voyage AI models that Anthropic's documentation recommends (Anthropic does not offer its own embedding model), are trained on billions of tokens to capture semantic relationships. Two texts with similar meanings produce embeddings that are close in vector space, measurable by cosine similarity (a value from -1 to 1, where 1.0 = identical direction, 0.95 = very similar, 0.5 = somewhat related).
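Cosine similarity is the dot product of two vectors divided by the product of their lengths. The snippet below is a small dependency-free sketch of that formula; the short vectors are toy values chosen to show the boundary cases, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # opposite directions
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector (as in the first pair) does not change the score.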
The insight behind semantic caching is this: if embedding_A and embedding_B are very similar (cosine similarity >= threshold), then the original texts A and B are likely paraphrases, and serving the cached response for A to query B is acceptable. The threshold is a tunable parameter that trades off hit rate (lower threshold = more hits, but a higher risk of serving an answer to a different question) against accuracy (higher threshold = fewer hits, fewer wrong responses).
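The trade-off can be made concrete by sweeping the threshold over a set of labeled query pairs. The sketch below assumes such a set exists; the similarity scores and human labels here are made-up illustration data, and `evaluate` is a hypothetical helper.

```python
# Each pair: (cosine similarity of two queries, whether a reviewer judged them the same question).
# Made-up illustration data, not measurements.
labeled_pairs = [
    (0.99, True), (0.97, True), (0.96, True), (0.96, False),
    (0.93, True), (0.92, False), (0.90, True), (0.85, False),
]


def evaluate(threshold, pairs):
    """Count how many pairs would be served from cache, and how many of those are wrong."""
    served = [same for sim, same in pairs if sim >= threshold]
    return {
        "threshold": threshold,
        "hit_rate": len(served) / len(pairs),
        "wrong_answers": served.count(False),
    }


results = [evaluate(t, labeled_pairs) for t in (0.90, 0.95, 0.98)]
for row in results:
    print(row)
```

On this toy data, lowering the threshold from 0.98 to 0.90 raises the hit rate from 12.5% to 87.5% but also lets two mismatched pairs through, which is the tension the threshold controls.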
Basic Pseudocode: Semantic Cache Lookup
The simplest semantic cache follows this pattern:
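```python
# Sketch: semantic cache lookup and storage (linear scan, no ANN index)
import time

import numpy as np


def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    def __init__(self, embedding_model, llm_call, similarity_threshold=0.95):
        self.embeddings = []  # List of cached (embedding, response, metadata)
        self.model = embedding_model  # Any object exposing embed(text) -> vector
        self.llm_call = llm_call  # Callable that sends the query to the actual LLM
        self.threshold = similarity_threshold

    def get_or_compute(self, query: str) -> str:
        # Step 1: Embed the incoming query
        query_embedding = self.model.embed(query)
        # Step 2: Find the closest cached entry (not just the first one above the threshold)
        best_similarity, best_response = -1.0, None
        for cached_embedding, cached_response, _metadata in self.embeddings:
            similarity = cosine_similarity(query_embedding, cached_embedding)
            if similarity > best_similarity:
                best_similarity, best_response = similarity, cached_response
        if best_response is not None and best_similarity >= self.threshold:
            # Cache hit: return stored response
            return best_response
        # Step 3: Cache miss, so compute the response, store it, and return it
        response = self.llm_call(query)
        metadata = {"query": query, "timestamp": time.time()}
        self.embeddings.append((query_embedding, response, metadata))
        return response
```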
Key Takeaways
- Semantic caching uses embeddings (vectors) to match queries by meaning, not exact text, achieving cache hit rates 2–4x higher than exact-match approaches.
- Cosine similarity >= threshold (often 0.95–0.98) determines a cache hit; this threshold is tunable to balance hit rate (and therefore cost) against answer accuracy.
- Semantic caching excels in high-volume, intent-driven workloads (Q&A, support, code generation) where cost-per-request is significant.
- Infrastructure moves from Redis (hash table, O(1) lookup) to vector databases or in-memory ANN indices (O(log N), 2–20 ms latency per lookup).
- Real-world deployments save 40–80% of inference API costs and reduce latency by 50–90% for cached requests.
Frequently Asked Questions
Does semantic caching replace exact-match caching?
No. Best-practice production systems use both: exact-match first (Redis, O(1) lookup, zero false positives), then semantic as a fallback. This two-tier approach gets 60–80% of the performance upside with minimal additional complexity.
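A minimal sketch of the two-tier lookup follows, under stated assumptions: a plain dict stands in for Redis, the semantic tier is a linear scan, and `toy_embed` is a hypothetical keyword-counting stand-in for a real embedding model. The class and method names are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts a few hand-picked concepts."""
    concepts = {"reset": 0, "forgot": 0, "password": 1, "refund": 2, "order": 3}
    words = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).split()
    vec = [0.01] * 4  # small offset avoids a zero-length vector
    for word in words:
        if word in concepts:
            vec[concepts[word]] += 1.0
    return vec


class TwoTierCache:
    def __init__(self, embed, threshold=0.95):
        self.exact = {}      # tier 1: stand-in for Redis, raw query text -> response
        self.semantic = []   # tier 2: list of (embedding, response)
        self.embed = embed
        self.threshold = threshold

    def lookup(self, query):
        if query in self.exact:  # tier 1: O(1), no false positives
            return self.exact[query], "exact"
        q = self.embed(query)    # the embedding cost is paid only on a tier-1 miss
        best = max(self.semantic, key=lambda entry: cosine(q, entry[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"

    def store(self, query, response):
        self.exact[query] = response
        self.semantic.append((self.embed(query), response))


cache = TwoTierCache(toy_embed)
cache.store("How do I reset my password?", "Use the reset link on the sign-in page.")
print(cache.lookup("How do I reset my password?"))       # served by the exact tier
print(cache.lookup("I forgot my password, what now?"))   # served by the semantic tier
print(cache.lookup("Where is my refund for my order?"))  # miss in both tiers
```

Checking the exact tier first means repeated identical queries never pay for an embedding call, and only tier-1 misses carry any false-positive risk.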
Can I use semantic caching for sensitive data?
Yes, with strong multi-tenant isolation (covered in Article 5). Segment cache entries by organization and user; enforce encryption at rest; and audit all cache accesses. The same isolation techniques used in SaaS databases apply.
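One way to picture the segmentation is to partition cache entries by an (organization, user) key, so that a similarity search can only ever see the caller's own entries. The sketch below is an in-memory illustration with hypothetical names; encryption at rest and durable audit storage are left out.

```python
from collections import defaultdict


class TenantScopedCache:
    def __init__(self):
        # (org_id, user_id) -> list of (embedding, response)
        self._partitions = defaultdict(list)
        self.audit_log = []  # in-memory stand-in for a real audit trail

    def store(self, org_id, user_id, embedding, response):
        self._partitions[(org_id, user_id)].append((embedding, response))
        self.audit_log.append(("store", org_id, user_id))

    def candidates(self, org_id, user_id):
        """Entries eligible for similarity search: only the caller's own partition."""
        self.audit_log.append(("lookup", org_id, user_id))
        return list(self._partitions.get((org_id, user_id), []))


tenant_cache = TenantScopedCache()
tenant_cache.store("acme", "u1", [0.1, 0.9], "Acme-specific answer")
own_entries = tenant_cache.candidates("acme", "u1")
other_entries = tenant_cache.candidates("globex", "u1")
print(len(own_entries), len(other_entries))
```

Because the partition is selected before any similarity computation, a query from one tenant cannot match another tenant's cached response, however similar the text is.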
What embedding model should I use?
OpenAI's text-embedding-3-small (1536 dims) is a common default that suits most use cases. Anthropic does not offer its own embedding model; its documentation points to Voyage AI instead. For cost-sensitive deployments, open models like sentence-transformers/all-MiniLM-L6-v2 (384 dims, runs locally) are fast and free.
What similarity threshold should I start with?
Begin with 0.95–0.96 cosine similarity for general Q&A. Monitor false-positive rate (mismatches) and hit rate (correct matches) for your domain; adjust ±0.01 each week until the cost-quality tradeoff suits your SLOs. Article 6 covers tuning in detail.
How much does semantic caching infrastructure cost?
An in-memory index (for <1M cached responses) adds no service cost beyond the RAM it uses. Vector databases (Pinecone, Weaviate) charge USD 0.03–0.10 per 1M vector-days. For 100K active cached responses with 1536-dim embeddings, expect USD 5–50/month for storage plus ingestion/query API costs. Full cost-benefit analysis is in Article 10.
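The raw size of the embeddings in that 100K-response example can be checked with simple arithmetic. The sketch assumes each dimension is stored as a 4-byte float32 and ignores index overhead, metadata, and the cached response text.

```python
def embedding_storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw bytes needed to hold the vectors alone (float32 by default)."""
    return num_vectors * dims * bytes_per_value


raw_bytes = embedding_storage_bytes(100_000, 1536)
raw_megabytes = raw_bytes / 1_000_000
print(f"{raw_bytes:,} bytes = {raw_megabytes:.1f} MB")
```

That comes to roughly 614 MB of vector data, which fits comfortably in memory on a single machine.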