
Prompt Caching for Consistent LLM Behavior

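Prompt caching is a technique in which an LLM API stores the processed form of a large, static prompt prefix server-side and reuses it on later requests that begin with the same prefix. You still send the full prompt each time; the provider skips re-processing the part it has cached. This delivers two direct benefits: lower latency and lower input-token cost (cached reads are billed at a steep discount, up to roughly 90% on some providers and models). It also has an indirect benefit for reproducibility: designing for the cache pushes you to separate and version the static part of the prompt, so you always know exactly which prompt produced a response. Caching by itself does not change what the model outputs. This article shows you how to implement prompt caching and pair it with prompt versioning for more reproducible LLM applications.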

Why Caching Improves Reproducibility

Consider a common scenario: you have a 10,000-token system prompt (a long set of instructions, examples, and context) that you use for every user query. Without caching, every request re-processes those 10,000 tokens, consuming time and money. More importantly, if you change even one character in the system prompt, all future requests may behave differently, and old requests become impossible to reproduce unless you kept the exact prompt text that produced them.

Prompt caching forces you to be explicit about what changes and what stays the same. You version your static prompt independently from dynamic user input. This separation makes reproducibility explicit: if you record prompt version 1.0 and a user query, you can later repeat the same request (same prompt version + same query) and expect closely matching output, provided the model version and sampling parameters are also the same. Exact identity is not guaranteed: sampling at nonzero temperature is random, and where a provider offers a seed it is described as best effort.
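One way to make "same prompt version + same query" checkable is to log a fingerprint for every request. The sketch below is a minimal illustration using only the standard library; the function name request_fingerprint and the choice of fields are assumptions for this example, not part of any provider's API.

```python
import hashlib
import json


def request_fingerprint(prompt_version, prompt_text, user_query, model, temperature):
    """Return a stable SHA-256 hex digest identifying one logical request."""
    record = {
        "prompt_version": prompt_version,
        # Hash the prompt text so a silent edit under the same version is detectable.
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "user_query": user_query,
        "model": model,
        "temperature": temperature,
    }
    # sort_keys gives a canonical serialization, so equal records hash equally.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


prompt = "You are a helpful Python expert..."
fp_a = request_fingerprint("1.0", prompt, "How do I use decorators?", "claude-3-5-sonnet-20241022", 0.7)
fp_b = request_fingerprint("1.0", prompt, "How do I use decorators?", "claude-3-5-sonnet-20241022", 0.7)
fp_edited = request_fingerprint("1.0", prompt + " ", "How do I use decorators?", "claude-3-5-sonnet-20241022", 0.7)

print(fp_a == fp_b)       # same inputs give the same fingerprint
print(fp_a == fp_edited)  # a one-character prompt edit changes it
```

Two log entries with the same fingerprint were produced from the same prompt text, query, model, and temperature, which is the precondition for comparing their outputs.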

Implementing Prompt Caching with Anthropic Claude

Anthropic's Claude API supports prompt caching via a cache_control field that you attach to a content block. Here's how to use it:

import anthropic

client = anthropic.Anthropic(api_key="your-key")

# Static system prompt (version 1.0)
system_prompt = """You are an expert technical writer specializing in Python performance optimization.
Your goal is to provide clear, accurate, and actionable advice.

Always follow these rules:
1. Use concrete examples with measured benchmarks.
2. Provide step-by-step optimization steps.
3. Explain the trade-offs for each optimization.

Example Q&A:
Q: How do I optimize list comprehensions?
A: Use list comprehensions instead of loops when possible; they're 2-3x faster.
Benchmark on your data because micro-optimizations vary by workload."""

def cached_query(user_question, use_cache=True):
    # Build the system block; attach cache_control only when caching is requested
    system_block = {"type": "text", "text": system_prompt}
    if use_cache:
        system_block["cache_control"] = {"type": "ephemeral"}

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        system=[system_block],
        messages=[
            {"role": "user", "content": user_question}
        ],
        temperature=0.7
    )

    # Log cache metrics for transparency
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")

    return response.content[0].text

# First request: writes the cache (cache writes are billed at a premium over normal input tokens)
result1 = cached_query("What's the fastest way to sort large lists in Python?")
print("First request (cache miss):")
print(result1)
print()

# Second request with same system prompt: reads from cache (cached tokens are billed at a steep discount)
result2 = cached_query("How do I optimize NumPy operations?")
print("Second request (cache hit):")
print(result2)

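The key is the cache_control field set to {"type": "ephemeral"}. This tells Anthropic to cache the prompt prefix up to and including that block for 5 minutes, and the timer is refreshed each time the cache is read. Subsequent requests that reuse the same system prompt within that window will hit the cache. Note that Anthropic only caches prompts above a minimum length (1,024 tokens for Sonnet-class models at the time of writing), so the short system prompt shown here is illustrative: with a prompt this small, both cache counters would report zero and the request would be processed normally.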
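To see what the three usage counters mean for billing, here is a rough cost sketch. It assumes cache writes are billed at 1.25x and cache reads at 0.10x the base input price, which matches Anthropic's published multipliers for the 5-minute cache at the time of writing; the $3 per million tokens base price and the token counts are hypothetical, so check current pricing before relying on the numbers.

```python
def estimate_input_cost(input_tokens, cache_creation_tokens, cache_read_tokens,
                        base_price_per_mtok, write_multiplier=1.25, read_multiplier=0.10):
    """Estimate the input-side cost in dollars for one request.

    input_tokens counts only the uncached part of the prompt, which is how
    the usage object in the example above reports it.
    """
    per_token = base_price_per_mtok / 1_000_000
    return per_token * (
        input_tokens
        + cache_creation_tokens * write_multiplier
        + cache_read_tokens * read_multiplier
    )


# Hypothetical workload: 10,000-token cached prefix plus a 50-token question.
first_request = estimate_input_cost(50, 10_000, 0, base_price_per_mtok=3.0)
later_request = estimate_input_cost(50, 0, 10_000, base_price_per_mtok=3.0)
no_caching = estimate_input_cost(10_050, 0, 0, base_price_per_mtok=3.0)

print(f"first (cache write): ${first_request:.5f}")
print(f"later (cache read):  ${later_request:.5f}")
print(f"without caching:     ${no_caching:.5f}")
```

Under these assumptions the first request costs more than an uncached one, so caching pays off only when the prefix is reused before it expires.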

Implementing Prompt Caching with OpenAI GPT

OpenAI uses a simpler model: caching is automatic for long prompts. Any prompt of 1,024 tokens or more is eligible for caching, and cache hits are matched on the prompt prefix, so put static content first and variable content last. Here's an example:

from openai import OpenAI

client = OpenAI(api_key="your-key")

# Static prompt: 2000+ tokens of instruction
system_prompt = """You are a code reviewer. Analyze the provided code for bugs, security issues, and performance problems.
[... 2000 tokens of detailed instructions, examples, rubric ...]"""

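def review_with_cache(code_snippet, seed=42):
    # Chat Completions endpoint; caching is applied automatically to long prefixes
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        seed=seed,  # best-effort determinism only
        messages=[
            # Static system prompt first so it forms a cacheable prefix
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Review this code:\n\n{code_snippet}"}
        ]
    )
    return response.choices[0].message.content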

code1 = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

code2 = """
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
"""

# First review: caches the long system prompt
review1 = review_with_cache(code1, seed=42)
print("Review 1 (cache creation):")
print(review1)
print()

# Second review: cache is reused (faster, cheaper)
review2 = review_with_cache(code2, seed=43)
print("Review 2 (cache reuse):")
print(review2)

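OpenAI keeps cached prefixes for a short period of inactivity, typically on the order of 5 to 10 minutes (check the current docs for exact retention). The cache is transparent: you don't need to manually enable it. Just use the same system prompt (or message prefix) in multiple requests, and OpenAI automatically caches and reuses it. You can confirm a hit from the response: usage.prompt_tokens_details.cached_tokens reports how many prompt tokens were served from the cache.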
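Because OpenAI's caching is implicit, the usage data on the response is the main way to tell whether a request hit the cache. The helper below is a small sketch that reads usage.prompt_tokens_details.cached_tokens defensively; the helper name cached_prompt_tokens is made up for this example, and a SimpleNamespace stands in for a real response so the snippet runs without an API key.

```python
from types import SimpleNamespace


def cached_prompt_tokens(usage):
    """Return (cached_tokens, prompt_tokens) from a Chat Completions usage object."""
    details = getattr(usage, "prompt_tokens_details", None)
    # The details object or the field may be absent; treat that as zero cached tokens.
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached, usage.prompt_tokens


# Stand-in for response.usage from a real call (values are hypothetical).
fake_usage = SimpleNamespace(
    prompt_tokens=2100,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1920),
)

cached, total = cached_prompt_tokens(fake_usage)
print(f"{cached} of {total} prompt tokens came from the cache")
```

With a real client you would pass response.usage from review_with_cache's API call instead of the stand-in object.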

Versioning Cached Prompts for Reproducibility

To support reproducibility, version your static prompt independently. Use a semantic version (major.minor.patch) and include it in your logs:

import anthropic
import json
from datetime import datetime, timezone

class PromptCache:
    def __init__(self, version, prompt_text):
        self.version = version
        self.prompt_text = prompt_text
        self.cached_at = None
        # Create the client once instead of on every query
        self.client = anthropic.Anthropic(api_key="your-key")

    def query(self, user_question, temperature=0.7):
        # The Anthropic Messages API has no seed parameter, so only the
        # prompt version, model, and temperature are pinned here.
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            system=[
                {
                    "type": "text",
                    "text": self.prompt_text,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            messages=[{"role": "user", "content": user_question}],
            temperature=temperature
        )

        # Log cache info alongside everything needed to replay the request
        metadata = {
            "prompt_version": self.version,
            "model": response.model,
            "temperature": temperature,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "cache_creation_tokens": response.usage.cache_creation_input_tokens,
            "cache_read_tokens": response.usage.cache_read_input_tokens
        }

        return response.content[0].text, metadata

# Prompt v1.0
prompt_v1 = PromptCache("1.0", "You are a helpful Python expert...")
result, meta = prompt_v1.query("How do I use decorators?")
print(json.dumps(meta, indent=2))

# If you update the prompt, bump the version
# Prompt v1.1
prompt_v1_1 = PromptCache("1.1", "You are a world-class Python expert specializing in async code...")
result, meta = prompt_v1_1.query("How do I use decorators?")
# Different version = different cache = possibly different output

This versioning strategy ensures:

  • You can always re-run a request against the exact prompt text that produced the original response.
  • Cache invalidation is explicit (bump the version).
  • Audit logs track which prompt version generated which response.
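The guarantees above only hold if a version string always maps to the same text. A small registry can enforce that by hashing each prompt and refusing a changed text under an existing version. This is a sketch under that assumption; PromptRegistry and its methods are hypothetical names, not part of any SDK.

```python
import hashlib


class PromptRegistry:
    """Maps a version string to prompt text and rejects silent edits."""

    def __init__(self):
        self._prompts = {}  # version -> (sha256 hex digest, prompt text)

    def register(self, version, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        existing = self._prompts.get(version)
        if existing is not None and existing[0] != digest:
            # Same version, different text: force the caller to bump the version.
            raise ValueError(f"prompt text changed without bumping version {version}")
        self._prompts[version] = (digest, text)
        return digest

    def get(self, version):
        return self._prompts[version][1]


registry = PromptRegistry()
registry.register("1.0", "You are a helpful Python expert...")
registry.register("1.1", "You are a world-class Python expert specializing in async code...")

try:
    registry.register("1.0", "You are a helpful Python expert!")
except ValueError as err:
    print(err)
```

Logging the returned digest next to the version gives audit logs a second, content-based check on which prompt was used.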

Cache Eviction and Consistency

Be aware of cache expiration:

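  • Anthropic: 5 minutes for ephemeral caches, refreshed each time the cached prefix is read (sufficient for most use cases; check current docs for longer TTL options).
  • OpenAI: a short period of inactivity, typically around 5 to 10 minutes, with no TTL setting in the basic flow (behavior may vary; check current docs).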

If your cache expires between requests, you lose the cost savings but not reproducibility, because the output does not depend on whether the prefix was served from the cache. However, if your application expects consistent latency (e.g., SLA of 500ms per request), cache expiration may cause outliers. Design your system to tolerate cache misses or pre-warm caches if latency is critical.
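Pre-warming can be reduced to one decision: has the cached prefix been idle long enough that the next real request is likely to miss? The sketch below shows only that decision, assuming a 5-minute TTL that is refreshed on each use and a hypothetical 30-second safety margin. Sending the warm-up request itself (a small call with the same cached prefix) is left out.

```python
import time


def needs_prewarm(last_used_at, now, ttl_seconds=300.0, safety_margin=30.0):
    """Return True if the cached prefix is unused or close to expiring.

    last_used_at and now are timestamps in seconds from the same clock;
    last_used_at is None if the prefix has never been sent.
    """
    if last_used_at is None:
        return True
    idle = now - last_used_at
    return idle >= (ttl_seconds - safety_margin)


# Example with a monotonic clock: a prefix last used 280 seconds ago.
now = time.monotonic()
print(needs_prewarm(now - 280.0, now))  # inside the safety margin, so warm it
print(needs_prewarm(now - 60.0, now))   # recently used, no action needed
```

A background task could call this on a timer and send the warm-up request when it returns True; whether that is worth the extra cost depends on your traffic pattern.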

Best Practices

1. Separate static from dynamic content. Put all unchanging instructions, examples, and context in the system prompt and enable caching. Keep user queries separate.

2. Version prompts like code. Use semantic versioning; document changes. Don't modify cached prompts without incrementing the version.

3. Log cache metrics. Track cache hits, misses, and cost savings. Use this data to optimize prompt size and TTL.

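4. Combine caching with fixed sampling parameters. Pin the model version and temperature, set a seed where the API supports one (OpenAI accepts seed on a best-effort basis; the Anthropic Messages API has no seed parameter), and cache the versioned prompt. Together these give you layered control over reproducibility.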

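5. Test cache behavior. Verify from the usage fields that back-to-back requests with the same prompt report cache reads. Separately, compare outputs from repeated identical requests: some variation is normal even with a fixed seed, but large differences are worth investigating.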
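Practices 3 and 5 both come down to aggregating the per-request cache counters. The sketch below summarizes a list of metadata dicts that use the cache_creation_tokens and cache_read_tokens keys from the versioning example earlier in this article; the function name summarize_cache_log and the sample records are hypothetical.

```python
def summarize_cache_log(records):
    """Aggregate per-request metadata dicts into hit/miss counts and token totals."""
    hits = misses = read_tokens = written_tokens = 0
    for rec in records:
        read = rec.get("cache_read_tokens") or 0
        written = rec.get("cache_creation_tokens") or 0
        read_tokens += read
        written_tokens += written
        # A request counts as a hit if any tokens were read from the cache.
        if read > 0:
            hits += 1
        else:
            misses += 1
    total = hits + misses
    return {
        "requests": total,
        "hits": hits,
        "misses": misses,
        "hit_rate": hits / total if total else 0.0,
        "cache_read_tokens": read_tokens,
        "cache_creation_tokens": written_tokens,
    }


# Hypothetical log: one cache write followed by three reads.
log = [
    {"prompt_version": "1.0", "cache_creation_tokens": 10000, "cache_read_tokens": 0},
    {"prompt_version": "1.0", "cache_creation_tokens": 0, "cache_read_tokens": 10000},
    {"prompt_version": "1.0", "cache_creation_tokens": 0, "cache_read_tokens": 10000},
    {"prompt_version": "1.0", "cache_creation_tokens": 0, "cache_read_tokens": 10000},
]
summary = summarize_cache_log(log)
print(summary)
```

A hit rate that stays low despite steady traffic usually means the prefix is changing between requests or the gaps between requests exceed the cache lifetime.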

Key Takeaways

  • Prompt caching stores large, static prompt prefixes server-side, reducing latency and cutting the cost of cached tokens (by up to roughly 90%, depending on provider and model).
  • Caching supports reproducibility by making prompt versioning explicit: old requests can be replayed with the same prompt version + query + sampling parameters, though exact output identity is not guaranteed.
  • Anthropic and OpenAI both support caching: Anthropic uses an explicit cache_control annotation, while OpenAI caches long prompts automatically.
  • Version your static prompts independently from user queries. Bump the version when you change the prompt.
  • Combine caching with a pinned model, fixed temperature, and a seed where the API supports one for maximum reproducibility control.

Frequently Asked Questions

How long does a cached prompt stay cached?

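Anthropic: 5 minutes by default, refreshed each time the cached prefix is read. OpenAI: a short period of inactivity, typically around 5 to 10 minutes. Check your provider's current documentation for longer retention options.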

If my cache expires, does my output change?

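No. Caching does not change what the model generates; cache expiration only affects latency and cost. The output depends on the prompt text, the model, and the sampling parameters, not on whether the prefix was cached. Two runs can still differ for ordinary sampling reasons, with or without a cache.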

What if I accidentally modify a cached prompt while it's still in cache?

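You cannot modify a cached prompt in place, because cache lookups match on the exact prompt prefix. If you change the text, the next request simply misses the cache, is processed in full, and (with caching enabled) creates a new cache entry for the new text. The entry for the old text is left to expire on its own. A request is never answered against a stale prompt.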
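The behavior is easier to reason about with a toy model in which the cache key is derived from the exact prompt text. The class below is only an illustration of content-keyed lookup, not how any provider implements its cache; ToyPrefixCache is a made-up name and real caches also handle expiry and partial prefixes.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of a cache keyed by the exact prompt text (no expiry)."""

    def __init__(self):
        self._keys = set()

    def lookup(self, prompt_text):
        key = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
        if key in self._keys:
            return "hit"
        # A miss processes the prompt in full and stores a new entry.
        self._keys.add(key)
        return "miss"


cache = ToyPrefixCache()
original = "You are a helpful Python expert..."
edited = "You are a helpful Python expert specializing in async code..."

results = [
    cache.lookup(original),  # first use of the original text
    cache.lookup(original),  # same text again
    cache.lookup(edited),    # edited text has a different key
    cache.lookup(original),  # the original entry is unaffected
]
print(results)
```

In this model an edit can never return the old entry, because the edited text hashes to a different key.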

Is caching compatible with streaming?

Yes. Both Anthropic and OpenAI support caching with streaming and non-streaming responses. The cached prefix is handled before any output tokens are generated, so with streaming the main effect of a cache hit is a shorter time to first token.

Further Reading