
Snapshot Testing LLM Responses: Step-by-Step

Snapshot testing captures the expected output of an LLM call in a "golden file" (a reference version of correct behavior), then compares future outputs against that snapshot. If the output changes, the test fails, alerting you to a regression. This article teaches you how to implement snapshot testing for LLM applications, automate QA, and catch breaking changes before they reach users.

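Snapshot testing suits LLM systems because outputs are human-readable but hard to validate algorithmically, and they can be made largely repeatable by pinning the model version and setting temperature to 0. Instead of writing brittle regex assertions, you capture the expected output once, have a human verify it, and then let the test framework handle comparisons. Exact repeatability is not guaranteed even at temperature 0, so treat an occasional mismatch as a reason to investigate, not as proof of a regression.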

Why Snapshot Testing Matters for LLMs

Without snapshots, you might write assertions like:

def test_summarization():
    response = llm_summarize("A long article about climate change...")
    assert "climate" in response.lower()    # Too weak
    assert len(response) < 500              # Too loose
    assert response.startswith("Summary:")  # Too specific

These assertions are unreliable. The first passes for almost any output, the second ignores actual quality, and the third breaks on a harmless change of phrasing. Snapshots flip the paradigm:

def test_summarization(snapshot):
    response = llm_summarize("A long article about climate change...")
    assert response == snapshot  # Exact match; fails if output changes

The snapshot captures the actual output once (reviewed by a human), then future runs must match exactly. If the model, temperature, or prompt changes, the test fails, protecting you from silent regressions.
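To make the mechanism concrete, here is a minimal sketch of a golden-file comparison using only the standard library. The helper name assert_matches_golden and its update flag are hypothetical and not part of any snapshot library; real plugins add serialization, naming, and reporting on top of the same idea.

```python
import tempfile
from pathlib import Path


def assert_matches_golden(actual: str, golden_path: Path, update: bool = False) -> None:
    """Compare `actual` with a golden file; record it only when `update` is set."""
    if update:
        # Explicit approval: store the output as the new baseline.
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    # A missing baseline is a failure, so nothing is recorded by accident.
    assert golden_path.exists(), f"No golden file at {golden_path}; rerun with update=True"
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, (
        f"Output differs from {golden_path}\n"
        f"expected: {expected!r}\n"
        f"actual:   {actual!r}"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        golden = Path(tmp) / "summary.txt"
        assert_matches_golden("Summary: the climate is warming.", golden, update=True)
        assert_matches_golden("Summary: the climate is warming.", golden)  # passes
        try:
            assert_matches_golden("Summary: something else.", golden)
        except AssertionError:
            print("regression detected")
```

Requiring an explicit update step is a deliberate choice in this sketch: a baseline is only written when a human asks for it.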

Setting Up Snapshot Testing with Pytest

A widely used snapshot plugin for pytest is syrupy. Install it along with pytest and the Anthropic SDK used in the examples:

pip install pytest syrupy anthropic

Here's a basic snapshot test for an LLM prompt:

import pytest
from anthropic import Anthropic

@pytest.fixture
def llm_client():
    # Reads the API key from the ANTHROPIC_API_KEY environment variable.
    return Anthropic()

def test_summarize_article_snapshot(llm_client, snapshot):
    """Test that article summarization matches the golden snapshot."""

    article_text = """
    The history of artificial intelligence spans decades. In the 1950s, Alan Turing
    proposed the famous Turing Test as a measure of machine intelligence. Since then,
    AI has evolved through multiple waves: expert systems in the 1980s, machine learning
    in the 2000s, and deep learning in the 2010s. Today, large language models represent
    the cutting edge of AI research and deployment.
    """

    response = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        temperature=0.0,  # Minimizes sampling variation; not a determinism guarantee
        messages=[
            {
                "role": "user",
                "content": f"Summarize the following in 3 sentences:\n\n{article_text}"
            }
        ]
    )

    summary = response.content[0].text
    assert summary == snapshot

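syrupy does not write snapshots during a normal run; a test with no stored snapshot fails. Record the baseline by running pytest --snapshot-update once, which creates a snapshot file next to the test module:

tests/__snapshots__/test_llm_responses.ambr

Inside, in syrupy's Amber format (the exact header depends on the syrupy version):

# serializer version: 1
# name: test_summarize_article_snapshot
  'Large language models have evolved through multiple waves of AI research. From the Turing Test in the 1950s to expert systems and machine learning, AI technology has advanced significantly. Today, modern LLMs represent the state-of-the-art in AI capabilities and deployment.'
# ---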

On subsequent runs, the test passes if the output matches the stored snapshot. If it differs, the test fails and pytest prints a diff along these lines (the exact formatting varies by syrupy version):

AssertionError: Snapshot does not match.
- Large language models have evolved...
+ Large language models have advanced...
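For intuition about where such a diff comes from, the sketch below builds one with Python's standard difflib module. This is not syrupy's own reporter; the function name snapshot_diff and the labels "snapshot" and "received" are illustrative choices.

```python
import difflib


def snapshot_diff(expected: str, actual: str) -> str:
    """Return a unified diff between the stored snapshot and the new output."""
    lines = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="snapshot",
        tofile="received",
        lineterm="",
    )
    # An empty string means the two texts are identical.
    return "\n".join(lines)


if __name__ == "__main__":
    old = "Large language models have evolved..."
    new = "Large language models have advanced..."
    print(snapshot_diff(old, new))
```

Lines prefixed with "-" come from the stored snapshot and lines prefixed with "+" from the new output, which is the convention to keep in mind when reading a failing test.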

Approving Updated Snapshots

When you intentionally change the prompt, temperature, or model, the snapshot test fails. You must approve the new output:

pytest --snapshot-update

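This overwrites the stored snapshot with the new output. Critical: always review the change before committing it. syrupy has no interactive review mode, so inspect the snapshot files with your version control diff:

git diff tests/__snapshots__/

Commit the updated .ambr file only if the new output is acceptable. Otherwise, revert the file and fix the prompt or code. To re-record a single test instead of the whole suite, combine the flag with pytest's -k filter.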

Advanced: Snapshot Testing Multi-Turn Conversations

For chatbot snapshots, you often want to test the entire conversation history. Here's a pattern:

def test_multi_turn_conversation_snapshot(llm_client, snapshot):
    """Test a multi-turn conversation with snapshots."""

    messages = []

    # Turn 1: User asks about Python
    messages.append({"role": "user", "content": "What is Python?"})
    response1 = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.0,  # Keep sampling variation as low as possible
        messages=messages
    )
    assistant_reply1 = response1.content[0].text
    messages.append({"role": "assistant", "content": assistant_reply1})

    # Turn 2: User asks a follow-up
    messages.append({"role": "user", "content": "What are its main use cases?"})
    response2 = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.0,
        messages=messages
    )
    assistant_reply2 = response2.content[0].text

    # Snapshot both assistant answers together
    conversation = {
        "turn_1_answer": assistant_reply1,
        "turn_2_answer": assistant_reply2
    }

    assert conversation == snapshot

The snapshot file stores the dictionary in Amber format:

# name: test_multi_turn_conversation_snapshot
  dict({
    'turn_1_answer': 'Python is a high-level, interpreted programming language...',
    'turn_2_answer': 'Python is widely used in web development, data science...',
  })
# ---
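With more than two turns, the append-and-call pattern gets repetitive. The following sketch factors it into a loop that returns the full transcript, user turns included. The names run_conversation, reply_fn, and fake_reply are hypothetical; in a real test, reply_fn would wrap llm_client.messages.create and return the reply text.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_conversation(
    user_turns: List[str],
    reply_fn: Callable[[List[Message]], str],
) -> List[Message]:
    """Play scripted user turns and return the whole transcript."""
    messages: List[Message] = []
    for user_text in user_turns:
        messages.append({"role": "user", "content": user_text})
        # reply_fn receives a copy of the history so far, as a model call would.
        reply = reply_fn(list(messages))
        messages.append({"role": "assistant", "content": reply})
    return messages


def fake_reply(history: List[Message]) -> str:
    """Stand-in for a model call: reports which user turn it is answering."""
    user_count = sum(1 for m in history if m["role"] == "user")
    return f"reply to turn {user_count}"


if __name__ == "__main__":
    transcript = run_conversation(
        ["What is Python?", "What are its main use cases?"], fake_reply
    )
    for message in transcript:
        print(f"{message['role']}: {message['content']}")
```

Snapshotting the returned list instead of only the answers means a change to the scripted user turns also shows up in the snapshot diff.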

Handling Acceptable Variations

Sometimes small content variations are acceptable (e.g., "Python 3.11" vs. "Python 3.12" in the output). For these cases, normalize the output with a regex before comparing, or use a custom matcher:

import re
from syrupy.assertion import SnapshotAssertion

def test_version_agnostic_snapshot(llm_client, snapshot: SnapshotAssertion):
    """Test snapshot while normalizing version numbers."""

    response = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        temperature=0.0,  # The API default samples at a higher temperature
        messages=[{"role": "user", "content": "What Python version should I use?"}]
    )

    output = response.content[0].text
    # Normalize versions: "Python 3.12" -> "Python X.Y"
    normalized = re.sub(r'Python \d+\.\d+', 'Python X.Y', output)

    assert normalized == snapshot
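When several kinds of variation are acceptable, it can help to keep the rules in one place. The sketch below applies an ordered list of regex substitutions; the specific rules (ISO dates, Python versions, repeated spaces) and the placeholder strings are illustrative assumptions, not a recommended set.

```python
import re

# (pattern, placeholder) pairs applied in order; extend for your own outputs.
_RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "<DATE>"),
    (re.compile(r"\bPython \d+\.\d+(?:\.\d+)?"), "Python X.Y"),
    (re.compile(r"[ \t]+"), " "),
]


def normalize(text: str) -> str:
    """Replace parts of the output that are allowed to vary between runs."""
    for pattern, placeholder in _RULES:
        text = pattern.sub(placeholder, text)
    return text.strip()


if __name__ == "__main__":
    print(normalize("Use Python 3.12.1  as of 2024-10-22. "))
```

Every rule hides a class of changes from the test, so it is worth keeping the list short and reviewing it as carefully as the snapshots themselves.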

Integration with CI/CD

Add snapshot testing to your CI pipeline:

# .github/workflows/test.yml
name: Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install pytest syrupy anthropic
      - run: pytest
        env:
          # Assumes a repository secret with this name holds the API key
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

syrupy fails the test when a snapshot differs, so an unreviewed output change fails the build. To change a snapshot intentionally, run pytest --snapshot-update locally, commit the updated .ambr files, and document in the PR description why the outputs changed (e.g., "Upgraded to Claude 3.5 Sonnet"). Reviewers verify the snapshot diffs as part of the PR before approving.
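Calling a live model on every push costs money and can make CI flaky. One option, sketched here as an assumption and not something the workflow above requires, is to record responses once and replay them in CI. The names request_key and cached_completion, and the cache layout, are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def request_key(model: str, prompt: str, temperature: float) -> str:
    """Derive a stable file name from everything that influences the output."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cached_completion(
    cache_dir: Path,
    model: str,
    prompt: str,
    temperature: float,
    call_model: Callable[[str, str, float], str],
    allow_live: bool = True,
) -> str:
    """Return a recorded response if present; otherwise call the model and record it."""
    path = cache_dir / f"{request_key(model, prompt, temperature)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["text"]
    if not allow_live:
        # In CI, a missing recording should fail loudly, not hit the network.
        raise RuntimeError(f"No recorded response for this request: {path.name}")
    text = call_model(model, prompt, temperature)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"model": model, "text": text}, indent=2), encoding="utf-8")
    return text


if __name__ == "__main__":
    def stub(model: str, prompt: str, temperature: float) -> str:
        return "stub summary"  # stands in for a real API call

    with tempfile.TemporaryDirectory() as tmp:
        print(cached_completion(Path(tmp), "some-model", "Summarize...", 0.0, stub))
```

The trade-off is that replayed tests catch changes to prompts and surrounding code but not changes on the model side, so a periodic live run is still needed to detect those.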

Best Practices

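1. Pin sampling parameters before capturing snapshots. Randomness ruins snapshot testing. Set temperature to 0 and keep max_tokens, the prompt, and the model identifier fixed. Some providers accept a seed parameter for best-effort reproducibility; the Anthropic Messages API used in this article does not expose one.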

2. Capture snapshots on a stable version. Don't snapshot outputs from a beta model or unstable branch. Use the same model version as production.

3. Review snapshots in code review. Make snapshot changes visible in PRs. Reviewers must explicitly approve new outputs before merge.

4. Version your snapshots. Pin the exact, dated model identifier in the test so each snapshot is tied to a specific model version:

def test_summarize_snapshot(llm_client, snapshot):
    response = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",  # Document model version here
        max_tokens=150,
        temperature=0.0,
        messages=[...],
    )
    assert response.content[0].text == snapshot

5. Re-snapshot after model upgrades (intentionally). When upgrading from Claude 3 to Claude 3.5, run pytest --snapshot-update, review the diffs (expect wording changes in most snapshots), and commit the new baselines.
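Practices 1 and 4 can be combined by storing the generation settings next to the output, so that a change of model or temperature is visible in the snapshot diff itself. This is a sketch under assumptions: GenerationConfig and snapshot_record are hypothetical names, and the default values simply mirror the examples in this article.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Settings shared by all snapshot tests; frozen to prevent accidental edits."""
    model: str = "claude-3-5-sonnet-20241022"
    temperature: float = 0.0
    max_tokens: int = 150


def snapshot_record(config: GenerationConfig, text: str) -> dict:
    """Bundle the output with the settings that produced it."""
    return {"config": asdict(config), "text": text}


if __name__ == "__main__":
    config = GenerationConfig()
    print(snapshot_record(config, "Example summary text."))
```

In a test, the same object could supply the call arguments (for example, by passing asdict(config) as keyword arguments to messages.create) and the assertion would compare snapshot_record(config, summary) against the snapshot.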

Key Takeaways

  • Snapshot testing captures expected LLM outputs in golden files, then compares future runs against them. This detects regressions automatically.
  • Pin LLM settings (temperature 0, a dated model version, and a seed where the provider supports one) before capturing snapshots, or snapshot testing becomes flaky.
  • syrupy is a widely used snapshot plugin for pytest; record or approve snapshots with --snapshot-update and review the resulting file changes with your version control diff.
  • For multi-turn conversations, snapshot the entire conversation history to catch breaks in dialogue coherence.
  • Integrate snapshots into CI/CD and require code review before approving snapshot changes.

Frequently Asked Questions

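What if the model produces slightly different output each time despite fixed settings?

First confirm that temperature is 0 and that the model identifier is a dated version, not a moving alias. Hosted models can still vary slightly between calls at temperature 0, and a seed parameter, where a provider offers one, is typically best-effort. If differences persist, normalize the unstable parts of the output before comparing, or snapshot a more structured result (for example, extracted fields) instead of free text.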
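If small residual differences remain after pinning everything, one fallback is to relax exact matching to a similarity threshold. The sketch below uses word-level matching from the standard difflib module; the function names and the default threshold of 0.9 are arbitrary assumptions to tune for your own outputs.

```python
import difflib


def similarity(expected: str, actual: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the word sequences are identical."""
    return difflib.SequenceMatcher(None, expected.split(), actual.split()).ratio()


def assert_close_to_snapshot(expected: str, actual: str, threshold: float = 0.9) -> None:
    """Fail only when the output drifts further than the threshold allows."""
    score = similarity(expected, actual)
    assert score >= threshold, (
        f"similarity {score:.2f} is below threshold {threshold:.2f}"
    )


if __name__ == "__main__":
    stored = "Large language models have evolved through waves"
    received = "Large language models have advanced through waves"
    print(round(similarity(stored, received), 3))
```

This weakens the guarantee considerably: a one-word change that reverses the meaning can still score above the threshold, so fuzzy matching is better suited to long outputs where exact matching has proven impractical.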

Should I snapshot the entire API response or just the text content?

Just the text content. The full response includes metadata (usage, stop_reason) that may vary. Snapshot only what matters for your application.
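As a sketch of what "just the text" can mean in code, the helper below pulls the text blocks out of a plain dict shaped like a Messages API response and ignores everything else. The function name extract_text and the sample values are made up for illustration.

```python
from typing import Any, Dict


def extract_text(response: Dict[str, Any]) -> str:
    """Join the text blocks of a response dict, ignoring ids, usage, and stop_reason."""
    return "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )


if __name__ == "__main__":
    sample = {
        "id": "msg_example",  # varies per call
        "stop_reason": "end_turn",
        "usage": {"input_tokens": 12, "output_tokens": 5},  # may vary
        "content": [
            {"type": "text", "text": "Hello "},
            {"type": "text", "text": "world"},
        ],
    }
    print(extract_text(sample))
```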

How do I version snapshots when I have multiple models?

Use different test functions per model:

def test_summarize_with_claude(llm_client, snapshot):
    # Claude snapshots
    snapshot.assert_match(...)

def test_summarize_with_gpt4(llm_client, snapshot):
    # GPT-4 snapshots
    snapshot.assert_match(...)

Or include the model in the snapshot name:

assert response == snapshot(name="claude-3-5-sonnet")

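Can I snapshot images or other media?

Not with the default text (Amber) snapshots. syrupy also ships single-file extensions for binary data, including PNG and SVG images, but they compare content exactly, which is rarely useful for generated media. For multimodal outputs, it is usually more practical to snapshot a text description or metadata instead.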

Further Reading