Prompt Caching Strategies (LLM Guide)
Prompt caching eliminates redundant work by reusing the KV cache (the attention key and value tensors computed during prefill) for prompt text that is shared across requests. If ten users ask questions about the same document (a PDF, a code repository, a legal contract), then without caching the model processes the document context ten separate times, wasting compute. Prompt caching stores the computed context once and reuses it for all ten requests, which can cut time to first token (TTFT) substantially for cache-hit requests, for example from roughly 500ms to 50ms (10x) when the shared context dominates the prompt. This is one of the highest-leverage optimizations for applications with repeated context (RAG systems, document analysis, multi-turn chat).
How Prompt Caching Works
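Prompt caching exploits a key fact: if two requests have identical prompt prefixes, the KV cache entries for that prefix (the per-layer key and value tensors produced while processing it) are also identical, because under causal attention each token's keys and values depend only on the tokens before it. The caching strategy: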
- Hash the prompt prefix: Compute a deterministic hash of the first N tokens of the prompt.
- Check cache: Look up whether this prefix's KV cache has been computed before.
- Cache hit: If found, load the cached KV and skip prefill for that prefix. Only the uncached suffix (typically the new query) is prefilled before decoding begins.
- Cache miss: If not found, compute prefill normally, store the resulting KV cache, and move to decode.
A typical setup caches the first 2000-4000 tokens (the "context") and processes only new queries against that cached context. For example:
Request 1: [Context: 2000 tokens] [Query: "What is the main topic?"] → miss, compute cache
Request 2: [Context: 2000 tokens] [Query: "Summarize in 50 words"] → hit, reuse cache
Request 3: [Context: 2000 tokens] [Query: "List key points"] → hit, reuse cache
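The hash, check, hit, and miss steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular engine implements it: tokens are plain integers, the "KV" is a placeholder string, and the names `ToyPrefixCache`, `block_hashes`, and `serve` are hypothetical. It hashes fixed-size blocks of tokens and chains each block's hash with the previous one, so a block hash identifies the whole prefix up to that point.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cached block; tiny so the example is easy to follow


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block, chaining in the previous block's hash so that a
    block's hash identifies the entire prefix up to and including it."""
    hashes: List[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + "|" + ",".join(str(t) for t in block)
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    """Maps prefix hashes to a stand-in for stored KV tensors."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def lookup(self, token_ids: List[int]) -> int:
        """Return how many leading tokens already have cached KV."""
        cached = 0
        for h in block_hashes(token_ids):
            if h not in self._store:
                break
            cached += BLOCK_SIZE
        return cached

    def insert(self, token_ids: List[int]) -> None:
        """Store a placeholder KV entry for every full block of the prompt."""
        for h in block_hashes(token_ids):
            self._store.setdefault(h, "kv-placeholder")


def serve(cache: ToyPrefixCache, token_ids: List[int]) -> Tuple[int, int]:
    """Return (tokens_reused, tokens_prefilled) for one request."""
    reused = cache.lookup(token_ids)
    cache.insert(token_ids)
    return reused, len(token_ids) - reused


cache = ToyPrefixCache()
context = list(range(100, 108))            # 8 shared "context" tokens
print(serve(cache, context + [1, 2, 3]))   # (0, 11): miss, everything is prefilled
print(serve(cache, context + [7, 8, 9]))   # (8, 3): hit, only the query is prefilled
```

Matching on whole blocks means a partially filled final block is always recomputed, which is why the second request still prefills its three query tokens.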
Prompt Caching with Claude API
Anthropic's Claude API (claude-3-5-sonnet) includes built-in prompt caching, so you do not have to run any caching infrastructure yourself. You mark the end of the cacheable prefix with the cache_control parameter; everything up to and including the marked block is cached:
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

# Long context document to cache. The bracketed placeholders stand in for the
# real text; a prefix shorter than the model's minimum cacheable length
# (1,024 tokens for this model) is processed normally without being cached.
context = """
# Technical Documentation: Distributed Systems
## Chapter 1: Consensus Algorithms
[10,000 words about Paxos, Raft, etc...]
## Chapter 2: Fault Tolerance
[10,000 words about Byzantine failures, recovery, etc...]
"""

# Both requests must send an identical prefix: same blocks, same order
system_blocks = [
    {
        "type": "text",
        "text": "You are a technical documentation expert.",
    },
    {
        "type": "text",
        "text": context,
        "cache_control": {"type": "ephemeral"},  # Cache everything up to and including this block
    },
]

# First request: cache miss (context is computed and stored)
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=system_blocks,
    messages=[
        {
            "role": "user",
            "content": "What is Raft consensus and how does it ensure fault tolerance?",
        }
    ],
)
print("First request (cache miss):")
# input_tokens counts only tokens that were neither written to nor read from the cache
print(f"Uncached input tokens: {response.usage.input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Response: {response.content[0].text[:200]}...")

# Second request: cache hit (context reused, only query tokens processed)
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=system_blocks,  # Identical prefix, so the cached block is reused
    messages=[
        {
            "role": "user",
            "content": "Explain Byzantine fault tolerance in simple terms.",
        }
    ],
)
usage2 = response2.usage
total_prompt_tokens = (
    usage2.input_tokens
    + usage2.cache_read_input_tokens
    + usage2.cache_creation_input_tokens
)
print("\nSecond request (cache hit):")
print(f"Cache read tokens (reused): {usage2.cache_read_input_tokens}")
print(f"New input tokens: {usage2.input_tokens}")
print(f"Share of prompt served from cache: {usage2.cache_read_input_tokens / total_prompt_tokens:.1%}")
Illustrative output (token counts depend on the actual document and tokenizer):
First request (cache miss):
Uncached input tokens: 150
Cache write tokens: 10000
Response: Raft is a consensus algorithm that...

Second request (cache hit):
Cache read tokens (reused): 10000
New input tokens: 150
Share of prompt served from cache: 98.5%
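The Claude API supports a single cache type, ephemeral, with two lifetimes:
- 5 minutes (default): The lifetime is refreshed each time the cached prefix is read, so the cache stays warm under steady traffic. Use for rapid multi-turn queries on the same document.
- 1 hour: Set "ttl": "1h" inside cache_control. Cache writes cost more at this lifetime. Suited to large shared documents (company handbook, legal contracts, etc.) that are queried less frequently.
There is no indefinitely persistent mode; a prefix that has expired is written to the cache again on the next request that includes it.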
Prefix Caching in vLLM
For self-hosted models, vLLM's prefix caching provides similar functionality:
import time

from vllm import LLM, SamplingParams

# Enable prefix caching in the engine
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,  # Enable automatic prefix caching
    dtype="float16",
)
# Generate a single token so that wall-clock time approximates TTFT
sampling_params = SamplingParams(max_tokens=1)

# Shared context (e.g., a document). The full prompt must fit in the model's
# context window (4,096 tokens for Llama 2).
context = """
# Deep Learning Fundamentals
## Backpropagation
[5000 words explaining backprop, gradients, optimization...]
## Regularization
[3000 words on L1/L2, dropout, batch norm...]
"""


def timed_generate(prompt):
    """Run one request and return (outputs, elapsed time in milliseconds)."""
    t0 = time.perf_counter()
    outputs = llm.generate(prompts=[prompt], sampling_params=sampling_params)
    return outputs, (time.perf_counter() - t0) * 1000


# Request 1: Cache miss
prompt1 = f"{context}\n\nQuestion: Explain backpropagation."
output1, ttft1 = timed_generate(prompt1)
print(f"Request 1 TTFT: {ttft1:.1f}ms")

# Request 2: Cache hit (same context, different query)
prompt2 = f"{context}\n\nQuestion: What is dropout?"
output2, ttft2 = timed_generate(prompt2)
print(f"Request 2 TTFT: {ttft2:.1f}ms (cached)")
# Illustrative expectation: Request 1 ~300ms, Request 2 ~50ms (6x speedup).
# Actual numbers depend on hardware, model, and prompt length.
vLLM automatically detects identical prompt prefixes and reuses their KV caches. You do not need explicit hash or cache-lookup logic; the engine handles it internally.
Prompt Caching Patterns for Real Applications
Pattern 1: Document Q&A (RAG)
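Shared prefix: [System prompt] [Retrieval result 1] [Retrieval result 2] ... [Retrieval result K]
Per-request suffix: [User question]
TTFT with caching: ~100ms (only the user question is prefilled)
TTFT without caching: ~1500ms (entire prompt reprocessed)
Savings: ~1400ms per query (illustrative figures)
The prefix is reusable only when the same retrieval results appear in the same order, so this pattern fits workloads where many questions hit the same documents.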
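One way to keep the shared prefix stable is to assemble the prompt deterministically. The sketch below is an illustration under assumptions: `SYSTEM_PROMPT`, `build_prompt`, and the document IDs are hypothetical, and sorting by ID deliberately discards the retriever's relevance ranking in exchange for a prefix that does not depend on it.

```python
from typing import Dict, List

SYSTEM_PROMPT = "Answer using only the documents provided."  # hypothetical


def build_prompt(retrieved: Dict[str, str], question: str) -> str:
    """Place documents in a stable order (sorted by ID) ahead of the question."""
    parts: List[str] = [SYSTEM_PROMPT]
    for doc_id in sorted(retrieved):
        parts.append(f"[doc {doc_id}]\n{retrieved[doc_id]}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# The same two documents come back in a different rank order for each query.
p1 = build_prompt({"b": "Beta text", "a": "Alpha text"}, "What is alpha?")
p2 = build_prompt({"a": "Alpha text", "b": "Beta text"}, "What is beta?")
shared = shared_prefix_len(p1, p2)
print(f"{shared} of {len(p1)} characters are shared")
print(repr(p1[:shared][-20:]))  # the shared part runs into the question text
```

If ranking matters for answer quality, a middle ground is to keep the system prompt and any always-present documents first and append the ranked, query-specific results after them.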
Pattern 2: Multi-Turn Chat
Request 1: [System prompt] [User message 1]
Request 2: [System prompt] [User message 1] [Assistant response 1] [User message 2]
Request 3: [System prompt] [User message 1] [Assistant response 1] [User message 2] [Assistant response 2] [User message 3]
With prompt caching:
- Request 1: ~200ms TTFT (no cache)
- Request 2: ~50ms TTFT (reuse system + msg 1, process new response + msg 2)
- Request 3: ~50ms TTFT (reuse system + msg 1-2, process new response + msg 3)
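The reuse pattern across turns can be checked with a small sketch. It treats each message as one opaque segment (a stand-in for its token sequence) and counts how many leading segments a request shares with the previous one; `common_prefix_len` and the segment labels are hypothetical names used only for illustration.

```python
from typing import List


def common_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


turns = [
    ["system", "user1"],
    ["system", "user1", "assistant1", "user2"],
    ["system", "user1", "assistant1", "user2", "assistant2", "user3"],
]

previous: List[str] = []
for i, request in enumerate(turns, start=1):
    reused = common_prefix_len(previous, request)
    print(f"Request {i}: reuse {reused} segments, prefill {len(request) - reused}")
    previous = request
# Request 1: reuse 0 segments, prefill 2
# Request 2: reuse 2 segments, prefill 2
# Request 3: reuse 4 segments, prefill 2
```

The amount of new work per turn stays roughly constant while the reused portion grows, which is why TTFT stays flat as the conversation gets longer.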
Pattern 3: Batch Analysis
Shared context: [System prompt describing a code review task]
Per-request: [Code snippet N]
Process 100 code snippets with the same system prompt (illustrative figures, assuming the system prompt dominates prefill time):
- First request: Full TTFT (~500ms)
- Requests 2-100: ~50ms TTFT each (system prompt cached)
Total TTFT with caching: 500ms + 99 * 50ms = 5.45 seconds
Without caching: 100 * 500ms = 50 seconds (about 9x slower)
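The arithmetic above generalizes to any batch size and latency pair. The helper below is a minimal sketch; the function name is hypothetical, and it assumes requests run sequentially with one miss followed by hits only.

```python
from typing import Tuple


def batch_ttft_seconds(n_requests: int, miss_ms: float, hit_ms: float) -> Tuple[float, float]:
    """Return (total TTFT with caching, total TTFT without caching) in seconds."""
    cached = (miss_ms + (n_requests - 1) * hit_ms) / 1000
    uncached = n_requests * miss_ms / 1000
    return cached, uncached


cached, uncached = batch_ttft_seconds(n_requests=100, miss_ms=500, hit_ms=50)
print(f"With caching: {cached:.2f}s, without: {uncached:.2f}s, "
      f"ratio: {uncached / cached:.1f}x")
# With caching: 5.45s, without: 50.00s, ratio: 9.2x
```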
Managing Prompt Cache Eviction
Cached prompts consume GPU memory (KV cache), so you cannot cache everything indefinitely. Common eviction strategies:
| Strategy | Pros | Cons |
|---|---|---|
| LRU (Least Recently Used) | Fair; unused caches evicted first | Unpredictable for bursty traffic |
| LFU (Least Frequently Used) | Popular contexts stay cached | Complex bookkeeping |
| TTL (Time-to-Live) | Simple; caches expire after 5-60 mins | Misses if reused after TTL |
| Size-aware | Prioritize large contexts | Requires tuning per application |
For the Claude API, ephemeral caching expires automatically (by default after 5 minutes of disuse), so you do not manage eviction. For vLLM, configure cache capacity:
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.8,  # Use up to 80% of GPU memory for model weights + KV cache
)
vLLM's scheduler manages eviction automatically when GPU memory fills.
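If you run your own cache layer, the LRU and TTL rows of the table can be combined in a small structure. The sketch below is an assumption-laden illustration, not vLLM's or Anthropic's implementation: `LruTtlCache` is a hypothetical class, values are arbitrary Python objects standing in for KV tensors, and the clock is injectable so the behavior can be demonstrated without sleeping. Reading an entry refreshes its timer, mirroring the "expires after a period of disuse" behavior described above.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class LruTtlCache:
    """Evicts the least recently used entry when full and expires idle entries."""

    def __init__(self, capacity: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: OrderedDict = OrderedDict()  # key -> (value, last_used)

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        value, last_used = item
        if self.clock() - last_used > self.ttl:
            del self._entries[key]          # idle for too long: expire
            return None
        self._entries[key] = (value, self.clock())  # refresh the timer on use
        self._entries.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (value, self.clock())
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used


now = [0.0]                                  # fake clock for the demonstration
cache = LruTtlCache(capacity=2, ttl_seconds=300, clock=lambda: now[0])
cache.put("doc-a", "kv-a")
cache.put("doc-b", "kv-b")
now[0] = 100.0
cache.get("doc-a")                           # doc-a becomes most recently used
cache.put("doc-c", "kv-c")                   # capacity exceeded: doc-b is evicted
print(cache.get("doc-b"))                    # None
now[0] = 350.0
print(cache.get("doc-a"))                    # kv-a (idle 250s, within the TTL)
now[0] = 700.0
print(cache.get("doc-a"))                    # None (idle 350s, expired)
```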
Measuring Prompt Cache Effectiveness
To quantify cache impact on your workload:
import hashlib


def benchmark_caching(requests_with_contexts):
    """
    Estimate cache hit rate and average TTFT for a stream of requests.
    requests_with_contexts: list of (context, query) tuples
    TTFT values are simulated constants, not measurements, and the
    cache is assumed to be unbounded (no eviction).
    """
    HIT_TTFT_MS = 50    # Simulated; in reality, only the query is prefilled
    MISS_TTFT_MS = 300  # Simulated; full context processing
    if not requests_with_contexts:
        return 0.0, 0.0
    seen_contexts = set()
    ttfts = []
    cache_hits = 0
    for context, _query in requests_with_contexts:
        # Stable content hash; the built-in hash() is salted per process
        context_hash = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if context_hash in seen_contexts:
            # Cache hit: skip prefill for the shared context
            cache_hits += 1
            ttfts.append(HIT_TTFT_MS)
        else:
            # Cache miss: full prefill, then remember the context
            seen_contexts.add(context_hash)
            ttfts.append(MISS_TTFT_MS)
    hit_rate = cache_hits / len(requests_with_contexts)
    avg_ttft = sum(ttfts) / len(ttfts)
    print(f"Cache hit rate: {hit_rate * 100:.1f}%")
    print(f"Avg TTFT: {avg_ttft:.1f}ms")
    return hit_rate, avg_ttft
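As a rough target, a RAG system whose traffic concentrates on a small set of documents can reach a 70-90% cache hit rate. The achievable rate depends on how often the same context repeats within the cache lifetime.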
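Against a real API you can compute a token-level hit rate from the usage counters instead of simulating it. The sketch below assumes each request's usage has been logged as a plain dict with the three fields used in the Claude example earlier (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); the function name and the sample records are hypothetical.

```python
from typing import Dict, Iterable


def token_hit_rate(usages: Iterable[Dict[str, int]]) -> float:
    """Fraction of all prompt tokens that were served from the cache."""
    read = written = uncached = 0
    for usage in usages:
        read += usage.get("cache_read_input_tokens", 0)
        written += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical log: one cache write followed by two cache hits
logged_usage = [
    {"input_tokens": 150, "cache_creation_input_tokens": 10000, "cache_read_input_tokens": 0},
    {"input_tokens": 150, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10000},
    {"input_tokens": 150, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10000},
]
print(f"Token-level cache hit rate: {token_hit_rate(logged_usage):.1%}")
```

A token-level rate weights large contexts more heavily than a per-request rate, which tracks cost and latency savings more closely.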
Key Takeaways
- Prompt caching reuses KV for identical prefixes: Eliminates redundant prefill compute.
- Up to ~10x TTFT speedup for cache hits: The gain is largest when the shared context dominates the prompt.
- Claude API: use cache_control: {"type": "ephemeral"}: Automatic, no infrastructure required.
- vLLM: set enable_prefix_caching=True: Automatic detection and reuse.
- Ideal for RAG, multi-turn chat, batch analysis: Anywhere context is shared across requests.
Frequently Asked Questions
How much does prompt caching cost with Claude API?
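The Claude API prices cached tokens differently from regular input tokens. Cache reads cost 10% of the base input price (a 90% discount). Cache writes cost 25% more than the base input price for the default 5-minute lifetime (and 2x the base price for the 1-hour lifetime). Example: 10,000-token context, 1,000-token query. First request (write): 10,000 * 1.25 + 1,000 * 1 = 13,500 token-equivalents. Second request (hit): 10,000 * 0.1 + 1,000 * 1 = 2,000 token-equivalents, versus 11,000 without caching. Savings: 9,000 token-equivalents per cache hit, so the 2,500 write premium is recovered on the first hit.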
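A short calculation makes the break-even point concrete. This sketch measures cost in base-price input-token equivalents and assumes Anthropic's multipliers for the default 5-minute cache (1.25x for writes, 0.1x for reads), which may change over time. It also assumes every request after the first arrives while the cache is still warm. The function and constant names are hypothetical.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # assumed price multiplier for 5-minute cache writes
CACHE_READ_MULTIPLIER = 0.10   # assumed price multiplier for cache reads


def input_cost_units(context_tokens: int, query_tokens: int,
                     n_requests: int, cached: bool) -> float:
    """Total input cost in base-price token equivalents (output tokens excluded)."""
    if not cached:
        return float(n_requests * (context_tokens + query_tokens))
    first = context_tokens * CACHE_WRITE_MULTIPLIER + query_tokens
    later = context_tokens * CACHE_READ_MULTIPLIER + query_tokens
    return first + (n_requests - 1) * later


for n in (1, 2, 10):
    without = input_cost_units(10_000, 1_000, n, cached=False)
    with_cache = input_cost_units(10_000, 1_000, n, cached=True)
    print(f"{n:>2} requests: {without:>9,.0f} uncached vs {with_cache:>9,.0f} cached")
```

With a single request, caching costs more than not caching (13,500 versus 11,000 units), so it only pays off when at least one later request reuses the prefix.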
Can I cache prompts that differ slightly (e.g., different formatting)?
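No. Caching requires an exact match on the prompt prefix, token for token: "What is AI?" and "what is AI?" are different strings and produce different tokens. Only apply normalization that does not change meaning (for example, consistent whitespace and a stable ordering and serialization of documents); lowercasing the prompt is not safe in general because it alters what the model reads. Place content that varies per request, such as timestamps or user names, after the shared prefix.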
Does caching work across different users or only within a user session?
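With the Claude API, caches are scoped to your organization, not to an end user or conversation: any request from the same organization that sends an identical prefix within the cache lifetime can hit the cache, and caches are not shared between organizations. With vLLM, the prefix cache is shared across all requests served by the same engine instance, regardless of user.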
What if my context is larger than GPU memory?
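Each prompt must fit in the model's context window and in the KV memory available to the engine. In vLLM, a prompt longer than max_model_len is rejected, and when KV memory fills up the engine evicts cached prefix blocks rather than failing; evicted prefixes are recomputed on their next use. Whether KV blocks can be offloaded to CPU memory (slower) depends on the vLLM version and configuration. The Claude API caps the prompt at the model's context window (200K tokens for claude-3-5-sonnet) and also enforces a minimum cacheable prefix length (1,024 tokens for claude-3-5-sonnet); shorter prefixes are processed without caching. Design around these limits.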