
How Embeddings Enable Semantic Cache Keys

An embedding is a fixed-size vector (list of numbers) that captures the semantic meaning of a piece of text. Rather than storing prompts as strings in your cache, semantic caching stores embeddings (vectors) as keys, and compares new queries to cached embeddings using similarity metrics. The most common metric is cosine similarity, which measures the angle between two vectors: a value near 1.0 means they point in nearly the same direction (semantically similar), while a value near 0.0 or negative means they are unrelated or opposite.

This shift from string-based keys to vector-based keys is the core innovation enabling semantic caching. Where traditional caches compute a hash of the string and look for an exact match, semantic caches compute distances between high-dimensional vectors and find near-neighbors, tolerating small variations in phrasing while still identifying the same intent.
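To make the contrast concrete, here is a minimal sketch of how an exact-match key behaves on paraphrases. The helper name `exact_key` and its lowercase/whitespace normalization are illustrative assumptions, not part of any particular cache library.

```python
import hashlib

def exact_key(prompt: str) -> str:
    """Hypothetical exact-match cache key: SHA-256 of the lightly normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

q1 = "What is async/await in Rust?"
q2 = "Tell me about async and await in Rust"
q3 = "what is   async/await in rust?"  # same words, different case and spacing

print(exact_key(q1) == exact_key(q3))  # True: normalization absorbs case and spacing
print(exact_key(q1) == exact_key(q2))  # False: a paraphrase yields an unrelated hash
```

String normalization can only absorb surface differences; the paraphrase in `q2` needs a similarity comparison rather than an equality check.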

What Are Embeddings and How Are They Created?

An embedding model is a neural network trained on vast amounts of text to learn a mapping from strings to fixed-length vectors. For example, OpenAI's text-embedding-3-small encodes an input string of up to 8,191 tokens into a 1,536-dimensional vector by default. Internally, the model processes the text through transformer layers that capture semantic relationships, and it outputs a single vector that summarizes the meaning.

Example: Embedding generation with OpenAI's API

import os

from openai import OpenAI

# Read the API key from the environment instead of hardcoding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Two different phrasings of the same question
query_1 = "What is async/await in Rust?"
query_2 = "Tell me about async and await in Rust"

# Generate embeddings
emb_1 = client.embeddings.create(
    model="text-embedding-3-small",
    input=query_1,
).data[0].embedding

emb_2 = client.embeddings.create(
    model="text-embedding-3-small",
    input=query_2,
).data[0].embedding

print(f"Embedding 1 length: {len(emb_1)}")  # Output: 1536
print(f"Embedding 2 length: {len(emb_2)}")  # Output: 1536

# Both are lists of 1536 floats; the values differ, but the vectors are semantically close

Modern embedding models (2024–2026) have several properties crucial for caching:

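  • Dimensionality: Varies by model, for example 384 (all-MiniLM-L6-v2), 768 (all-mpnet-base-v2), 1536 (OpenAI text-embedding-3-small), and 3072 (OpenAI text-embedding-3-large). Higher dimensionality often improves quality but increases storage and lookup cost.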
  • Training data: Trained on billions of diverse tokens, so embeddings capture both common knowledge and domain-specific relationships.
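  • Stability: For a fixed model version, the same input yields the same embedding up to small floating-point differences (hosted APIs do not always return bit-identical vectors), so lookups are repeatable. Embeddings from different models or model versions are not comparable, so changing the model invalidates existing cache keys.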
  • Generalization: Embeddings from unseen text still cluster meaningfully near semantically related cached entries, even if the exact phrasing was never seen during training.

Cosine Similarity: Measuring Cache Key Distance

Cosine similarity is the standard metric for comparing embeddings in cache lookups. It measures the cosine of the angle between two vectors, returning a value between -1.0 (opposite direction) and 1.0 (same direction). The formula is:

cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)

Where A · B is the dot product and ||A|| is the magnitude. In practice, for normalized embeddings (which most models produce), this simplifies to a direct dot product.
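The simplification can be checked numerically. The sketch below uses random vectors as stand-ins for embeddings (an assumption made so the example runs offline) and compares the full formula with a plain dot product of the normalized vectors.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Full formula: dot product divided by the product of magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for two embeddings (not real model output)
rng = np.random.default_rng(seed=0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Normalize each vector to magnitude 1.0
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

full = cosine_similarity(a, b)
shortcut = float(np.dot(a_unit, b_unit))  # dot product only

print(np.isclose(full, shortcut))  # True
```

Normalizing once at write time means every later lookup needs only the dot product.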

Practical example: Similarity calculation

import numpy as np

def cosine_similarity(emb_1, emb_2):
    """Compute cosine similarity between two embedding vectors."""
    norm_1 = np.linalg.norm(emb_1)
    norm_2 = np.linalg.norm(emb_2)
    dot_product = np.dot(emb_1, emb_2)
    return dot_product / (norm_1 * norm_2)

# Three queries: A and B are paraphrases, C is a different topic
query_A = "What is async/await?"
query_B = "Tell me about async/await"
query_C = "What is a REST API?"

# Toy 3-dimensional stand-ins for illustration (real embeddings are 1536-dim)
emb_A = np.array([0.1, 0.8, 0.5])     # Related to async/await
emb_B = np.array([0.12, 0.79, 0.51])  # Very similar to A
emb_C = np.array([0.6, 0.2, 0.3])     # About APIs, different topic

sim_A_B = cosine_similarity(emb_A, emb_B)  # High: vectors point almost the same way
sim_A_C = cosine_similarity(emb_A, emb_C)  # Much lower: different direction

print(f"Similarity A–B: {sim_A_B:.3f}")  # 1.000 for these toy vectors
print(f"Similarity A–C: {sim_A_C:.3f}")  # 0.557 for these toy vectors

In a semantic cache with threshold >= 0.95, the A–B pair would be a cache hit (serve cached response for A when B arrives), but A–C would miss (recompute for C because they are about different topics).
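The hit/miss decision described above can be sketched as a small in-memory cache. The class name `SemanticCache`, its methods, and the brute-force search are assumptions for illustration; a production system would more likely use a vector index. The toy 3-dimensional vectors stand in for real embeddings, and zero vectors are not handled.

```python
import numpy as np

class SemanticCache:
    """Toy in-memory semantic cache using brute-force cosine search."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.keys = []    # unit-normalized embeddings
        self.values = []  # cached responses, aligned with self.keys

    @staticmethod
    def _unit(v) -> np.ndarray:
        v = np.asarray(v, dtype=np.float32)
        return v / np.linalg.norm(v)

    def put(self, embedding, response: str) -> None:
        self.keys.append(self._unit(embedding))
        self.values.append(response)

    def get(self, embedding):
        """Return (response, score) on a hit, or (None, score) on a miss."""
        if not self.keys:
            return None, 0.0
        # Keys are unit vectors, so a matrix-vector product gives cosine similarities
        sims = np.stack(self.keys) @ self._unit(embedding)
        best = int(np.argmax(sims))
        score = float(sims[best])
        if score >= self.threshold:
            return self.values[best], score
        return None, score

cache = SemanticCache(threshold=0.95)
cache.put([0.1, 0.8, 0.5], "cached answer about async/await")

hit, hit_score = cache.get([0.12, 0.79, 0.51])  # close to the stored key
miss, miss_score = cache.get([0.6, 0.2, 0.3])   # different direction
print(hit, round(hit_score, 3))
print(miss, round(miss_score, 3))
```

Returning the score alongside the response makes it easier to log near-misses and revisit the threshold later.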

Choosing an Embedding Model for Your Cache

OpenAI text-embedding-3-small (recommended for most use cases):

  • Cost: USD 0.02 per 1M tokens (as of 2026).
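  • Speed: One network round trip per request, so expect latency in the tens to hundreds of milliseconds rather than sub-millisecond. The API accepts a list of inputs, so batching 10–100 queries per request amortizes the overhead.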
  • Quality: 1536 dimensions, trained on diverse public and proprietary data; strong on general knowledge, code, and technical writing.
  • Stability: API-backed; requires authentication and internet connectivity but guarantees consistency.

Open-source alternatives (sentence-transformers):

  • Models like all-MiniLM-L6-v2 (384 dims, free, runs locally) or all-mpnet-base-v2 (768 dims).
  • Speed: 1–3 ms on CPU, better on GPU.
  • Quality: Good for general similarity; may underperform on highly specialized domains without fine-tuning.
  • Use case: Cost-sensitive internal systems, no API dependency, low latency requirements.

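Anthropic ecosystem (via Voyage AI):

  • Availability: Anthropic's documentation has historically recommended Voyage AI embedding models rather than a first-party embedding endpoint; confirm current availability and pricing before planning around it.
  • Dimensions: 1024 by default for recent general-purpose Voyage models.
  • Adoption: A natural fit for production systems that already use Anthropic Claude for LLM inference.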

For semantic caching, the choice of embedding model affects cache hit rate and responsiveness. Empirically, higher-quality models (OpenAI, Anthropic) improve hit rates by 5–15% because they better capture semantic relationships; the cost is slight latency per embedding and API charges. For internal caching, deploying an open model locally is often cost-optimal.

Similarity Thresholds and False Positives

A semantic cache lookup returns a cached response when the similarity exceeds a threshold (commonly 0.95–0.98). Lower thresholds increase cache hit rate but risk false positives (serving a wrong cached response). Higher thresholds reduce false positives but miss true semantically equivalent queries.

Threshold guidance (from production deployments):

  • >= 0.98: Conservative; ~10–15% hit rate, virtually no mismatches. Best for high-sensitivity domains (medical, legal).
  • >= 0.95: Standard; ~40–50% hit rate, rare mismatches. Suitable for Q&A, support, general use.
  • >= 0.90: Aggressive; ~60–70% hit rate, noticeable false positives (5–10% of returned answers are unrelated). Use only if re-checking or user feedback is available.
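One way to choose among these thresholds is to sweep them over a labeled sample from your own traffic. The sketch below uses a small, entirely hypothetical evaluation set (each entry is the similarity to the nearest cached key plus a human label for whether the intent matched), so the printed rates illustrate the method, not real-world performance.

```python
# Hypothetical labeled sample: (similarity to nearest cached key, same intent?)
labeled = [
    (0.99, True), (0.97, True), (0.96, True), (0.96, False),
    (0.93, True), (0.92, False), (0.91, True), (0.88, False),
    (0.80, False), (0.65, False),
]

def sweep(pairs, thresholds):
    """For each threshold, return (threshold, hit rate, share of hits that were wrong)."""
    rows = []
    for t in thresholds:
        hits = [same for sim, same in pairs if sim >= t]
        hit_rate = len(hits) / len(pairs)
        wrong_share = (hits.count(False) / len(hits)) if hits else 0.0
        rows.append((t, hit_rate, wrong_share))
    return rows

results = sweep(labeled, [0.98, 0.95, 0.90])
for t, hit_rate, wrong_share in results:
    print(f"threshold={t:.2f}  hit_rate={hit_rate:.0%}  wrong_among_hits={wrong_share:.0%}")
```

Reporting the share of wrong answers among served hits, rather than among all queries, keeps the number comparable to the false-positive figures quoted above.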

Article 6 covers empirical tuning for your specific domain.

Distance Metrics Beyond Cosine Similarity

While cosine similarity dominates semantic caching, alternatives exist:

| Metric | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| Cosine similarity | Standard for all embeddings | Fast, intuitive (1.0 = identical), scale-invariant | Ignores magnitude differences |
| Euclidean distance | Clustering, exact retrieval | Geometric intuition | Slower, magnitude-sensitive; harder to threshold |
| Manhattan distance | High-dimensional sparse data | Faster on sparse vecs | Less intuitive for dense embeddings |
| Learned distance (e.g., contrastive loss) | Domain-specific tuning | Optimized for your domain | Requires labeled training data |

Cosine similarity is the default because embeddings are usually unit-normalized (magnitude 1.0), making cosine similarity equivalent to a fast dot-product lookup. Scores range from -1.0 to 1.0, and for text embeddings they mostly fall between 0.0 and 1.0, which makes them easy to interpret and threshold.
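For unit-normalized vectors, Euclidean distance and cosine similarity carry the same information, because the squared distance equals 2 - 2 * cos. The sketch below checks this identity on random unit vectors (stand-ins for embeddings) and converts a 0.95 cosine threshold into the equivalent distance cutoff.

```python
import numpy as np

# Random stand-ins for two embeddings, normalized to unit length
rng = np.random.default_rng(seed=1)
a = rng.normal(size=16)
b = rng.normal(size=16)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos = float(np.dot(a, b))
sq_dist = float(np.sum((a - b) ** 2))

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(sq_dist, 2.0 - 2.0 * cos))  # True

# A cosine threshold of 0.95 corresponds to this maximum Euclidean distance
max_dist = float(np.sqrt(2.0 - 2.0 * 0.95))
print(round(max_dist, 3))  # 0.316
```

This is useful when a vector store only exposes Euclidean distance: the same cache policy can be expressed as a distance cutoff, provided the stored vectors are normalized.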

Key Takeaways

  • Embeddings convert text to fixed-size vectors (e.g., 1536 dimensions); semantic caches store embeddings as keys instead of strings, enabling fuzzy matching.
  • Cosine similarity (values from -1.0 to 1.0) measures how closely two vectors point in the same direction; a threshold >= 0.95 is standard for cache hits, with trade-offs between hit rate and accuracy.
  • OpenAI's text-embedding-3-small is the production standard (cost, quality, speed balance); open models like all-MiniLM-L6-v2 are free alternatives for cost-sensitive systems.
  • Embedding quality directly impacts cache hit rate and false-positive rate; benchmarking your embedding model on your domain is essential before production deployment.

Frequently Asked Questions

Can I cache embeddings so I don't re-embed the same query twice?

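Yes, and you should. Store each embedding in Redis or memory alongside its cached response, keyed by a hash of the exact query text, so that repeated identical queries skip the embedding call. This can reduce embedding API calls by 30–50% when queries repeat verbatim, at a modest storage cost (1536 floats at 4 bytes each is about 6 KB per entry).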
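A minimal sketch of such an embedding cache follows. The class name `EmbeddingCache`, the in-process dictionary (standing in for Redis), and the `fake_embed` function are all assumptions so the example runs offline; in practice `embed_fn` would wrap a real embedding call.

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by a hash of the exact input text (hypothetical helper)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any callable: str -> list of floats
        self.store = {}           # in-process dict; could be Redis in production
        self.calls = 0            # how many times embed_fn was actually invoked

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.store[key] = self.embed_fn(text)
            self.calls += 1
        return self.store[key]

def fake_embed(text: str):
    """Stand-in for a real embedding call so the sketch runs offline."""
    return [float(len(text)), float(text.count(" "))]

embeddings = EmbeddingCache(fake_embed)
first = embeddings.get("What is async/await in Rust?")
second = embeddings.get("What is async/await in Rust?")
print(embeddings.calls)  # 1: the second lookup reused the stored vector
```

This layer only helps with verbatim repeats; paraphrases still need one embedding call each before the semantic lookup can run.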

What is the storage cost for embeddings?

A 1536-dimensional embedding is roughly 6 KB (1536 floats at 4 bytes each). For 100K cached queries, expect ~600 MB disk space. Vector databases add overhead, so plan for 1–2 GB per 100K entries including metadata and indexing.
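The arithmetic behind these figures is easy to reproduce for other sizes. The function name below is hypothetical, and the half-precision variant is shown only as an option whose effect on match quality you would need to measure.

```python
def embedding_storage_bytes(num_entries: int, dims: int = 1536, bytes_per_value: int = 4) -> int:
    """Raw vector storage, excluding metadata and index overhead."""
    return num_entries * dims * bytes_per_value

per_vector = embedding_storage_bytes(1)        # one float32 embedding
raw_100k = embedding_storage_bytes(100_000)    # 100K float32 embeddings
half_precision = embedding_storage_bytes(100_000, bytes_per_value=2)  # float16 option

print(per_vector, raw_100k / 1e6)  # 6144 614.4  (bytes, megabytes)
print(half_precision / 1e6)        # 307.2
```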

Why not just use full-text search instead of embeddings?

Full-text search (Elasticsearch, keyword matching) matches on shared terms, so it handles exact phrases well but tends to miss paraphrases and synonyms. Embeddings capture intent: a user asking "How do I debug async code?" may rank poorly or miss entirely against a cached "Async debugging tips" in keyword search, depending on stemming and analyzer settings, whereas the two embeddings are likely to be close enough to match in a semantic cache.

Do normalized embeddings always have magnitude 1.0?

Most modern embedding APIs (OpenAI, Anthropic) return unit-normalized vectors. Check your model's documentation. If not normalized, normalize before similarity computation: normalized = v / ||v||.
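If you are unsure whether a model's output is normalized, it is cheap to check a batch and normalize when needed. The helper names and the tolerance below are assumptions for illustration.

```python
import numpy as np

def is_unit_norm(vectors: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has magnitude within tol of 1.0."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) <= tol))

def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to magnitude 1.0; zero vectors are rejected."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cannot normalize a zero vector")
    return vectors / norms

raw = np.array([[3.0, 4.0], [1.0, 0.0]])  # first row has magnitude 5
print(is_unit_norm(raw))                  # False
print(is_unit_norm(normalize_rows(raw)))  # True
```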

Can I fine-tune embeddings for my domain?

Yes. After deploying a base model, collect cache hits and misses, label them, and fine-tune the embedding model or train a learned distance metric using contrastive loss. This can improve hit rate by 10–20% but requires labeled data and retraining.

Further Reading