Build Your First Semantic Cache in Python

Building a semantic cache requires three steps: (1) embed every incoming query using an embedding model, (2) store the embedding alongside the cached LLM response, and (3) when a new query arrives, embed it and search for the nearest cached embedding; if it is close enough (above threshold), return the cached response. This article walks you through a complete, production-starter implementation in Python.

By the end, you will have a working semantic cache that reduces duplicate LLM calls, handles caches of up to roughly 100K entries in memory, and serves as a foundation for production scaling.

Architecture Overview

A minimal semantic cache has four components:

  1. Embedding service: Calls an embedding API (OpenAI, Cohere, or a local model) to convert text to vectors.
  2. Storage: In-memory list or database table storing (embedding, response, metadata) tuples.
  3. Similarity search: Scans stored embeddings, computes the cosine similarity between each one and the query embedding, and returns the best match if its similarity meets the threshold.
  4. LLM inference fallback: If the cache misses, calls the actual LLM (Claude, GPT-4, etc.), caches the response, and returns it.

For this tutorial, we will use in-memory storage (Python list) and OpenAI embeddings. The code scales to vector databases (Article 9) without changing the API.
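
To make the threshold check in component 3 concrete, here is a minimal sketch of cosine similarity on toy 3-dimensional vectors. The vectors and the `cosine_similarity` helper are illustrative assumptions; real embeddings from text-embedding-3-small have 1536 dimensions, and the scores here say nothing about what real queries will produce.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), a value in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95

cached = [1.0, 2.0, 3.0]
paraphrase = [1.1, 2.0, 2.9]   # points in nearly the same direction as `cached`
unrelated = [3.0, -1.0, 0.5]   # points in a clearly different direction

for name, vec in [("paraphrase", paraphrase), ("unrelated", unrelated)]:
    score = cosine_similarity(cached, vec)
    print(f"{name}: {score:.4f} hit={score >= THRESHOLD}")
```

Only the direction of the vectors matters, not their length, which is why the paraphrase vector counts as a hit and the unrelated one does not.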

Step 1: Set Up Embedding and LLM Clients

First, install the required libraries and initialize clients:

# Install dependencies
# pip install openai numpy

from openai import OpenAI
import numpy as np
from datetime import datetime
from typing import Optional, Tuple

# Initialize clients
embedding_client = OpenAI(api_key="sk-...") # or use env var
llm_client = OpenAI(api_key="sk-...") # same or different org

EMBEDDING_MODEL = "text-embedding-3-small" # 1536 dimensions
LLM_MODEL = "gpt-4o" # or claude-3-5-sonnet if using Anthropic
SIMILARITY_THRESHOLD = 0.95 # Cosine similarity >= 0.95 is a cache hit


def embed_text(text: str) -> np.ndarray:
    """Convert text to embedding vector (1536 dims for text-embedding-3-small)."""
    response = embedding_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=text
    )
    return np.array(response.data[0].embedding, dtype=np.float32)


def call_llm(query: str) -> str:
    """Call the LLM and return the response text."""
    response = llm_client.chat.completions.create(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": query}],
        max_tokens=500
    )
    return response.choices[0].message.content

Step 2: Implement the Cache Data Structure

Create a simple in-memory cache that stores embeddings and responses:

class SemanticCache:
    """In-memory semantic cache using cosine similarity."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD):
        self.threshold = threshold
        # Store tuples of (embedding, response, metadata)
        self.cache = []
        self.hits = 0
        self.misses = 0

    def _cosine_similarity(self, emb1: np.ndarray, emb2: np.ndarray) -> float:
        """Compute cosine similarity between two unit-normalized embeddings."""
        # If embeddings are unit-normalized (as OpenAI returns), this is just dot product
        return float(np.dot(emb1, emb2))

    def find_similar(self, query_embedding: np.ndarray) -> Optional[Tuple[str, float]]:
        """
        Search cache for the most similar cached response.
        Returns: (cached_response, similarity_score) if found, else None.
        """
        best_match = None
        best_similarity = -1.0

        for cached_embedding, cached_response, _metadata in self.cache:
            similarity = self._cosine_similarity(query_embedding, cached_embedding)
            # A hit needs similarity >= threshold; keep the highest-scoring one
            if similarity >= self.threshold and similarity > best_similarity:
                best_similarity = similarity
                best_match = (cached_response, similarity)

        return best_match

    def store(self, query: str, embedding: np.ndarray, response: str):
        """Store a new cache entry: (embedding, response, metadata)."""
        metadata = {
            "query": query,
            "timestamp": datetime.utcnow().isoformat(),
            "embedding_dim": len(embedding)
        }
        self.cache.append((embedding, response, metadata))

    def get_or_compute(self, query: str) -> Tuple[str, bool]:
        """
        Main cache lookup + fallback logic.
        Returns: (response, is_cached) where is_cached=True if hit, False if recomputed.
        """
        # Step 1: Embed the query
        query_embedding = embed_text(query)

        # Step 2: Search for similar cached responses
        match = self.find_similar(query_embedding)
        if match:
            cached_response, similarity = match
            self.hits += 1
            print(f"[CACHE HIT] Similarity: {similarity:.4f}")
            return cached_response, True

        # Step 3: Cache miss, so compute the response via the LLM
        self.misses += 1
        print("[CACHE MISS] Computing response via LLM...")
        response = call_llm(query)

        # Step 4: Store the result for future hits
        self.store(query, query_embedding, response)

        return response, False

    def stats(self) -> dict:
        """Return cache statistics."""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0.0
        return {
            "cached_entries": len(self.cache),
            "hits": self.hits,
            "misses": self.misses,
            "total_requests": total,
            "hit_rate": hit_rate
        }
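
The loop in find_similar() performs one dot product per cached entry. As the cache grows, a common alternative is to stack the embeddings into a single matrix and score all of them with one matrix-vector product. The sketch below is one way this might look, not a drop-in replacement: `VectorIndex` and its methods are hypothetical names, and the toy 3-dimensional vectors stand in for real embeddings. It normalizes vectors itself so the dot product equals cosine similarity.

```python
import numpy as np
from typing import Optional, Tuple

class VectorIndex:
    """Hypothetical helper: stores unit-normalized embeddings as rows of one matrix."""

    def __init__(self, dim: int):
        self.matrix = np.empty((0, dim), dtype=np.float32)
        self.responses: list[str] = []

    def add(self, embedding: np.ndarray, response: str) -> None:
        # Normalize defensively so that dot product == cosine similarity.
        unit = (embedding / np.linalg.norm(embedding)).astype(np.float32)
        self.matrix = np.vstack([self.matrix, unit])
        self.responses.append(response)

    def search(self, query: np.ndarray, threshold: float) -> Optional[Tuple[str, float]]:
        if not self.responses:
            return None
        unit = (query / np.linalg.norm(query)).astype(np.float32)
        scores = self.matrix @ unit          # one similarity score per cached entry
        best = int(np.argmax(scores))
        if scores[best] >= threshold:
            return self.responses[best], float(scores[best])
        return None


index = VectorIndex(dim=3)
index.add(np.array([1.0, 0.0, 0.0]), "answer A")
index.add(np.array([0.0, 1.0, 0.0]), "answer B")
print(index.search(np.array([0.9, 0.1, 0.0]), threshold=0.95))  # close to "answer A"
print(index.search(np.array([1.0, 1.0, 0.0]), threshold=0.95))  # halfway between both
```

Note that np.vstack copies the whole matrix on every add, so a longer-lived version would likely preallocate rows or grow the matrix in chunks.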

Step 3: Use the Cache in Your Application

Now integrate the cache into a simple query loop:

def main():
    """Example: Interactive query session with caching."""
    cache = SemanticCache(threshold=0.95)

    # Simulate a sequence of user queries
    queries = [
        "What is async/await in Python?",
        "Explain async and await to me",  # Paraphrase of query 1
        "How do I use asyncio?",  # Related but different
        "Tell me about asynchronous programming",  # Another paraphrase
        "What is a REST API?",  # Entirely different topic
    ]

    for i, query in enumerate(queries, 1):
        print(f"\n--- Query {i} ---")
        print(f"Q: {query}")
        response, is_cached = cache.get_or_compute(query)
        print(f"Response (cached={is_cached}): {response[:100]}...")

    # Print final statistics
    print("\n--- Cache Statistics ---")
    stats = cache.stats()
    print(f"Cached entries: {stats['cached_entries']}")
    print(f"Hits: {stats['hits']}, Misses: {stats['misses']}")
    print(f"Hit rate: {stats['hit_rate']:.1%}")


if __name__ == "__main__":
    main()

Expected output (illustrative; exact similarity scores and response text will vary by run):

--- Query 1 ---
Q: What is async/await in Python?
[CACHE MISS] Computing response via LLM...
Response (cached=False): Async/await is a pattern in Python for writing asynchronous...

--- Query 2 ---
Q: Explain async and await to me
[CACHE HIT] Similarity: 0.9752
Response (cached=True): Async/await is a pattern in Python for writing asynchronous...

--- Query 3 ---
Q: How do I use asyncio?
[CACHE MISS] Computing response via LLM...
Response (cached=False): To use asyncio, import it and define async functions...

--- Query 4 ---
Q: Tell me about asynchronous programming
[CACHE HIT] Similarity: 0.9512
Response (cached=True): Async/await is a pattern in Python for writing asynchronous...

--- Query 5 ---
Q: What is a REST API?
[CACHE MISS] Computing response via LLM...
Response (cached=False): A REST API is an architectural style for web services...

--- Cache Statistics ---
Cached entries: 3
Hits: 2, Misses: 3
Hit rate: 40.0%

Optimization: Batch Embedding for Reduced API Calls

When adding multiple queries to the cache (e.g., during initialization with common FAQs), batch the embedding requests to cut the number of API round trips:

def embed_batch(texts: list[str]) -> list[np.ndarray]:
    """Embed multiple texts in a single API call (subject to the API's per-request limits)."""
    response = embedding_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texts
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]


# Example: Pre-populate cache with common questions
cache = SemanticCache()

common_faqs = [
    "What is async/await?",
    "How do I use asyncio?",
    "What is a context manager?",
    "How do I handle exceptions?",
]

embeddings = embed_batch(common_faqs)
for query, emb in zip(common_faqs, embeddings):
    # The LLM is still called once per question; only the embeddings are batched
    faq_response = call_llm(query)
    cache.store(query, emb, faq_response)
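
Embedding APIs cap how much a single request may carry, so a large FAQ list has to be split before it is handed to embed_batch(). The helper below is a minimal sketch of that splitting step: `chunked` is a hypothetical name, and the batch size of 100 is an arbitrary assumption rather than a documented limit, so check your provider's current limits before choosing a value.

```python
from typing import Iterator

def chunked(items: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

faqs = [f"question {i}" for i in range(250)]
batches = list(chunked(faqs, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch can then be passed to embed_batch() in turn and the resulting lists concatenated in order, which keeps every embedding aligned with its question.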

Key Takeaways

  • A semantic cache requires embedding the query, searching for similar cached embeddings, and falling back to LLM inference on a miss.
  • In-memory storage with list iteration works well for <100K entries; for larger caches, move to vector databases (Article 9).
  • Cosine similarity >= threshold (default 0.95) determines a cache hit; tune based on your domain's tolerance for false positives.
  • Batch embedding API calls to reduce request count and latency; a single call with 50 queries replaces 50 round trips, although token-based pricing means the embedding cost itself stays roughly the same.
  • Expected hit rates: 30–50% on general domains, 60–80% on repetitive Q&A systems. Monitor and adjust threshold weekly.
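
One way to tune the threshold is to hand-label a sample of query pairs as "same intent" or "different intent", record their similarity scores, and check how each candidate threshold would have behaved. The sketch below uses made-up scores purely for illustration; `evaluate_threshold` and the sample data are assumptions, not measurements from a real system.

```python
def evaluate_threshold(pairs: list[tuple[float, bool]], threshold: float) -> dict:
    """
    pairs: (similarity, same_intent) for hand-labeled query pairs.
    Counts correct cache hits, wrong cache hits, and missed opportunities.
    """
    true_hits = sum(1 for sim, same in pairs if sim >= threshold and same)
    false_hits = sum(1 for sim, same in pairs if sim >= threshold and not same)
    missed = sum(1 for sim, same in pairs if sim < threshold and same)
    return {"threshold": threshold, "true_hits": true_hits,
            "false_hits": false_hits, "missed": missed}

# Illustrative, made-up labels: (similarity score, do the two queries mean the same thing?)
labeled_pairs = [
    (0.98, True), (0.96, True), (0.93, True), (0.91, False),
    (0.89, True), (0.85, False), (0.70, False),
]

for t in (0.85, 0.90, 0.95):
    print(evaluate_threshold(labeled_pairs, t))
```

Lowering the threshold trades missed hits for false hits; which side matters more depends on how costly a wrong answer is in your domain.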

Frequently Asked Questions

How do I handle user-specific responses that should not be cached across users?

Store a user_id in metadata and filter by it in find_similar(). See Article 5 for full multi-tenant isolation patterns; the key is namespacing cache entries by organization and user.
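
A minimal sketch of that filter follows. It assumes each entry's metadata dict carries a `user_id` key (the store() method in this tutorial does not add one) and that embeddings are unit-normalized. `find_similar_for_user` is a hypothetical standalone function, and the toy 2-dimensional vectors stand in for real embeddings.

```python
from typing import Optional, Tuple

def find_similar_for_user(
    entries: list, query_embedding: list, user_id: str, threshold: float = 0.95
) -> Optional[Tuple[str, float]]:
    """Return the best match at or above `threshold`, looking only at this user's entries."""
    best: Optional[Tuple[str, float]] = None
    for embedding, response, metadata in entries:
        if metadata.get("user_id") != user_id:
            continue  # never serve another user's cached response
        # Dot product equals cosine similarity for unit-normalized vectors
        similarity = sum(q * e for q, e in zip(query_embedding, embedding))
        if similarity >= threshold and (best is None or similarity > best[1]):
            best = (response, similarity)
    return best

entries = [
    ([1.0, 0.0], "Alice's account summary", {"user_id": "alice"}),
    ([1.0, 0.0], "Bob's account summary", {"user_id": "bob"}),
]
print(find_similar_for_user(entries, [1.0, 0.0], user_id="bob"))
print(find_similar_for_user(entries, [1.0, 0.0], user_id="carol"))  # None: no entries for carol
```

Filtering before scoring means an identical query from another user can never produce a hit, even at similarity 1.0.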

What if the LLM response is very long (>500 tokens)?

Caching works for any response length. Longer responses increase storage cost per entry, but they also increase LLM API cost, making caching savings higher. Typical cached responses are 50–500 tokens; if very long, consider truncating or summarizing before storage.

Can I update a cached response without deleting and re-adding it?

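Not directly in the basic cache: it has no update method, and each entry is an immutable tuple, so updating means replacing the whole entry in the list (or rebuilding the list). With a database, use UPDATE instead. Article 9 covers versioning and conditional updates in production systems.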
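
Because entries are immutable tuples, a response cannot be edited inside an entry, but the entry can be swapped for a new tuple at the same list index. The sketch below shows one way to do that with the (embedding, response, metadata) layout used in this tutorial; `update_response` is a hypothetical helper that matches entries by the exact query string stored in metadata.

```python
def update_response(cache_entries: list, query: str, new_response: str) -> bool:
    """Replace the response of the entry whose stored query matches exactly.

    Returns True if an entry was replaced, False if no entry matched.
    """
    for i, (embedding, _old_response, metadata) in enumerate(cache_entries):
        if metadata.get("query") == query:
            # Build a new tuple, reusing the existing embedding and metadata
            cache_entries[i] = (embedding, new_response, metadata)
            return True
    return False

entries = [([0.6, 0.8], "old answer", {"query": "What is a REST API?"})]
print(update_response(entries, "What is a REST API?", "new answer"))  # True
print(entries[0][1])  # new answer
```

With the class from Step 2, the list to pass in would be the cache's own `cache` attribute. The embedding is reused because the query text has not changed.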

How do I measure the latency improvement from caching?

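Track request time with and without cache hits. A typical pattern: LLM call = 1–5 seconds, cache lookup = 10–50 milliseconds. Keep in mind that in this design every query, hit or miss, also pays for one embedding request, so measure the full request path rather than the lookup alone. Log both timings and compute the fractional savings: (t_llm - t_cache) / t_llm. Article 7 covers comprehensive observability.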
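
The measurement can be sketched with time.perf_counter() from the standard library. The helper names `timed` and `latency_savings` are hypothetical, and the time.sleep() calls are stand-ins for a real LLM call and a real cache hit.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def latency_savings(t_llm: float, t_cache: float) -> float:
    """Fraction of latency saved by a cache hit: (t_llm - t_cache) / t_llm."""
    if t_llm <= 0:
        raise ValueError("t_llm must be positive")
    return (t_llm - t_cache) / t_llm

# Stand-ins for a slow LLM call and a fast cache hit
_, t_llm = timed(lambda: time.sleep(0.05))
_, t_cache = timed(lambda: time.sleep(0.005))
print(f"saved {latency_savings(t_llm, t_cache):.0%} of latency")
```

In the tutorial's code, the callable passed to `timed` would wrap cache.get_or_compute(query), with the returned is_cached flag deciding which timing bucket the sample belongs to.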

What embedding model should I use for production?

Start with OpenAI's text-embedding-3-small (USD 0.02/1M tokens, 1536 dims, standard quality). Monitor your hit rate. If hit rate is low (<30%), try a higher-quality model or fine-tune embeddings (Article 2).

Further Reading