Snapshot testing for LLM outputs
Snapshot testing for LLM applications captures baseline outputs (called golden files or snapshots) for a fixed set of test inputs, then compares new outputs against the snapshots when model or prompt changes occur. If a new model produces different outputs, the snapshot test fails and shows a diff. You then review the diff and decide: is this change intentional (approve and update the snapshot) or a regression (revert the change)? Snapshot testing is lightweight, catches unexpected behavior changes automatically, and provides a clear audit trail of what changed and when.
How Snapshot Testing Differs from Regression Testing
Snapshot testing and regression testing overlap but serve different purposes. Regression testing measures quality metrics (accuracy, toxicity) and gates deployment if metrics fall below thresholds. Snapshot testing compares outputs verbatim (or near-verbatim via semantic similarity) and requires explicit human approval for changes. Regression testing answers: "Is this good enough?" Snapshot testing answers: "Did this change unexpectedly?" Snapshot testing catches subtle changes that metrics might miss (tone, formatting, specific phrase presence) and ensures visibility: every change to LLM outputs appears in code review.
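The contrast can be sketched in a few lines. This is a minimal illustration rather than a real evaluation pipeline: `regression_gate`, `snapshot_check`, and the 0.90 accuracy floor are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    reason: str


def regression_gate(accuracy: float, min_accuracy: float = 0.90) -> CheckResult:
    """Metric gate: passes as long as an aggregate score clears a threshold."""
    ok = accuracy >= min_accuracy
    return CheckResult(ok, f"accuracy {accuracy:.2f} vs minimum {min_accuracy:.2f}")


def snapshot_check(expected: str, actual: str) -> CheckResult:
    """Snapshot check: any difference from the approved output needs review."""
    ok = expected == actual
    return CheckResult(ok, "matches snapshot" if ok else "output changed; needs human approval")


# A tone change can leave accuracy untouched but still shows up as a snapshot diff.
gate = regression_gate(accuracy=0.95)
snap = snapshot_check(
    expected="Refunds process within 5-7 business days.",
    actual="Hey! Refunds usually take about a week.",
)
print(gate)
print(snap)
```

Here the metric gate passes while the snapshot check fails, which is the kind of change snapshot testing is meant to surface.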
Setting Up Snapshots
Create one snapshot file per group of related test cases and store the baseline outputs in your Git repository. Use JSON or YAML so diffs are readable.
# tests/__snapshots__/customer_support.snap.json
[
  {
    "query": "How do I return an item?",
    "expected_output": "To return an item, visit our Returns portal at example.com/returns. Enter your order number and select items. Print the shipping label and drop off at any UPS location. Most refunds process within 5-7 business days."
  },
  {
    "query": "What is your shipping cost?",
    "expected_output": "Shipping costs depend on location and speed:\n- Standard (5-7 days): Free on orders over $50, otherwise $7.99\n- Express (2-3 days): $14.99\n- Overnight: $24.99\nInternational shipping available; calculated at checkout."
  }
]
Write a test harness that runs your LLM on each query and compares output to the snapshot.
import json
import sys
from difflib import unified_diff

from anthropic import Anthropic

client = Anthropic()


def test_llm_snapshots(model: str, system_prompt: str, snapshot_file: str):
    """Test LLM outputs against snapshots."""
    with open(snapshot_file) as f:
        snapshots = json.load(f)
    failed = []
    passed = []
    for snapshot in snapshots:
        query = snapshot["query"]
        expected = snapshot["expected_output"]
        # Generate output
        response = client.messages.create(
            model=model,
            max_tokens=500,
            system=system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        actual = response.content[0].text.strip()
        # Compare
        if actual == expected:
            passed.append(query)
        else:
            # Generate diff for review
            diff = "\n".join(
                unified_diff(
                    expected.split("\n"),
                    actual.split("\n"),
                    fromfile="expected",
                    tofile="actual",
                    lineterm="",
                )
            )
            failed.append({
                "query": query,
                "expected": expected,
                "actual": actual,
                "diff": diff,
            })
    print(f"Snapshot test results: {len(passed)} passed, {len(failed)} failed")
    for failure in failed:
        print(f"FAILED: {failure['query']}")
        print(failure["diff"])
    return len(failed) == 0


if __name__ == "__main__":
    passed = test_llm_snapshots(
        model="claude-3-5-sonnet-20241022",
        system_prompt="You are a helpful customer support assistant.",
        snapshot_file="tests/__snapshots__/customer_support.snap.json",
    )
    sys.exit(0 if passed else 1)
Semantic Snapshot Matching
Exact-match snapshots are too strict for LLMs: the same message in different phrasing would count as a regression. Use semantic similarity instead. Snapshots still store the expected text, but the test computes the embedding similarity between expected and actual output and passes if it is at or above a threshold.
import json

import numpy as np
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def test_llm_snapshots_semantic(model: str, system_prompt: str, snapshot_file: str, threshold: float = 0.85):
    """Test LLM outputs against snapshots using semantic similarity."""
    with open(snapshot_file) as f:
        snapshots = json.load(f)
    failed = []
    for snapshot in snapshots:
        query = snapshot["query"]
        expected = snapshot["expected_output"]
        response = client.messages.create(
            model=model,
            max_tokens=500,
            system=system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        actual = response.content[0].text.strip()
        # Cosine similarity: dot product divided by the product of the vector norms
        expected_emb = embedder.encode(expected)
        actual_emb = embedder.encode(actual)
        similarity = float(
            expected_emb @ actual_emb
            / (np.linalg.norm(expected_emb) * np.linalg.norm(actual_emb))
        )
        if similarity < threshold:
            failed.append({
                "query": query,
                "expected": expected,
                "actual": actual,
                "similarity": similarity,
                "threshold": threshold,
            })
    print(f"Snapshot test results: {len(snapshots) - len(failed)} passed, {len(failed)} failed")
    for failure in failed:
        print(f"FAILED: {failure['query']} (similarity {failure['similarity']:.2f} < {failure['threshold']:.2f})")
    return len(failed) == 0
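The score compared against the threshold is typically cosine similarity: the dot product of the two embeddings divided by the product of their norms (not their dimensionality). The sketch below works through it with toy vectors in pure Python; `cosine_similarity` is a hypothetical helper, and real embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vectors' L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch; zero vectors have no direction
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
expected_emb = [1.0, 2.0, 3.0]
scaled_copy = [2.0, 4.0, 6.0]   # same direction, different length
orthogonal = [3.0, 0.0, -1.0]   # dot product with expected_emb is zero

sim_same = cosine_similarity(expected_emb, scaled_copy)
sim_orth = cosine_similarity(expected_emb, orthogonal)
print(f"{sim_same:.3f} {sim_orth:.3f}")
```

Because the score depends only on direction, a scaled copy of a vector scores 1 (up to floating-point rounding) and an orthogonal vector scores 0.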
Snapshot Review Workflow
When snapshot tests fail, the output goes to code review. The pull request author sees the diff and decides:
- This is a good change → Update the snapshot by re-running the test with the --snapshot-update flag and committing the updated snapshot file. Add a comment explaining why the change is good.
- This is a regression → Revert the prompt or model change, or investigate and fix the underlying issue.
- This is acceptable but different → If outputs are acceptable despite differences, update the snapshot and document the reason in the commit.
Snapshot files should be committed to Git so that every snapshot change is auditable and reviewable.
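# Update snapshots on the command line (--snapshot-update is provided by
# pytest snapshot plugins such as syrupy; a custom harness needs its own update mode)
pytest test_snapshots.py --snapshot-update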
# Or manually review and accept in code review
git add tests/__snapshots__/customer_support.snap.json
git commit -m "chore: update snapshots after switching to claude-3-5-sonnet"
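If the harness is a plain script rather than a pytest plugin, an update mode can be a small function that re-captures outputs and rewrites the snapshot file. The sketch below assumes the JSON layout shown earlier; `update_snapshots` and `fake_generate` are hypothetical names, and the stub stands in for a real model call.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def update_snapshots(snapshot_file: Path, generate: Callable[[str], str]) -> int:
    """Re-capture every snapshot in the file; returns how many entries changed."""
    snapshots = json.loads(snapshot_file.read_text(encoding="utf-8"))
    changed = 0
    for snapshot in snapshots:
        new_output = generate(snapshot["query"]).strip()
        if new_output != snapshot["expected_output"]:
            snapshot["expected_output"] = new_output
            changed += 1
    # Stable formatting (fixed indent, trailing newline) keeps Git diffs small.
    snapshot_file.write_text(
        json.dumps(snapshots, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return changed


def fake_generate(query: str) -> str:
    """Stand-in for a model call so the demo runs offline."""
    return f"Stub answer for: {query}"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "customer_support.snap.json"
    path.write_text(
        json.dumps([{"query": "How do I return an item?", "expected_output": "old"}]),
        encoding="utf-8",
    )
    first_run = update_snapshots(path, fake_generate)
    second_run = update_snapshots(path, fake_generate)
    updated = json.loads(path.read_text(encoding="utf-8"))

print(first_run, second_run)
```

Returning the number of changed entries makes it easy to tell whether a run actually modified anything before committing.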
Organizing Snapshots by Category
Large snapshot files are hard to review. Organize by category or feature: one snapshot file per LLM capability.
tests/__snapshots__/
├── customer_support.snap.json (support replies)
├── content_generation.snap.json (blog post drafts)
├── code_generation.snap.json (Python code synthesis)
├── summarization.snap.json (document summaries)
└── safety_checks.snap.json (toxicity and harm detection)
When a model update changes only code generation, only the code_generation snapshot needs review, not the entire suite. This keeps code review focused and fast.
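One way to work with this layout is to load each file as its own category so that results can be reported and reviewed per file. A rough sketch, assuming the `.snap.json` naming shown above; `load_snapshots_by_category` is a hypothetical helper, and the demo writes throwaway files to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

SUFFIX = ".snap.json"


def load_snapshots_by_category(snapshot_dir: Path) -> dict[str, list[dict]]:
    """Map category name (file name without .snap.json) to its snapshot entries."""
    categories = {}
    for path in sorted(snapshot_dir.glob(f"*{SUFFIX}")):
        category = path.name[: -len(SUFFIX)]
        categories[category] = json.loads(path.read_text(encoding="utf-8"))
    return categories


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "customer_support.snap.json").write_text(
        json.dumps([{"query": "q1", "expected_output": "a1"}]), encoding="utf-8"
    )
    (root / "summarization.snap.json").write_text(
        json.dumps([
            {"query": "q2", "expected_output": "a2"},
            {"query": "q3", "expected_output": "a3"},
        ]),
        encoding="utf-8",
    )
    by_category = load_snapshots_by_category(root)

for name, entries in by_category.items():
    print(f"{name}: {len(entries)} snapshot(s)")
```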
Strategies for Snapshot Maintenance
Snapshots require maintenance: model updates, prompt changes, or new test cases all trigger snapshot updates. To keep snapshots manageable:
- Add snapshots only for critical paths: Test the most important use cases (high-traffic queries, safety-critical features). Don't snapshot everything.
- Regenerate snapshots regularly: On major model updates or quarterly, re-capture all snapshots with the current model. This prevents snapshot rot (stale golden files).
- Review snapshot diffs in code review: Always view snapshot updates in pull requests. If a diff looks wrong, ask the author to investigate.
- Keep snapshots in Git: Never store snapshots in an external database or API. Version control ensures history and blame.
- Test snapshot stability: Verify that running the same input twice produces identical or nearly identical results. High-variance snapshots are low-signal.
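The stability check in the last bullet can be sketched as follows. `stability_score` is a hypothetical helper, the two stub models stand in for real model calls, and character-level similarity from `difflib` is only a cheap stand-in for whatever comparison your harness uses.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def stability_score(generate: Callable[[str], str], query: str, runs: int = 3) -> float:
    """Lowest pairwise similarity across repeated runs; 1.0 means every run was identical."""
    outputs = [generate(query) for _ in range(runs)]
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(outputs, 2)]
    return min(ratios) if ratios else 1.0


def stable_model(query: str) -> str:
    return "Refunds process within 5-7 business days."


_varying = iter([
    "Refunds process within 5-7 business days.",
    "It depends on your bank.",
    "Refunds process within 5-7 business days.",
])


def flaky_model(query: str) -> str:
    return next(_varying)


stable = stability_score(stable_model, "How long do refunds take?")
flaky = stability_score(flaky_model, "How long do refunds take?")
print(f"stable={stable:.2f} flaky={flaky:.2f}")
```

Taking the minimum rather than the mean makes a single outlier run visible, which is usually what matters when deciding whether a query is worth snapshotting.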
Snapshot Testing in CI/CD
Integrate snapshot testing into your pipeline as a gate that runs on every pull request.
name: Snapshot Tests
on: [pull_request]
jobs:
  test_snapshots:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install -r requirements.txt
      - run: python -m pytest test_snapshots.py -v
      - name: Summarize snapshot failures
        if: failure()
        run: |
          # Posting the diffs as a PR comment is left as pseudo-code here;
          # this step only adds a note to the job summary.
          echo "Snapshot test failures: see logs" >> "$GITHUB_STEP_SUMMARY"
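To surface the actual diffs rather than a pointer to the logs, the harness could write a Markdown report to the file that GitHub Actions exposes through the `GITHUB_STEP_SUMMARY` environment variable. The sketch below is one possible shape; `write_step_summary` is a hypothetical helper that expects the `failed` entries produced by the exact-match harness, and the demo writes to a temporary file instead of the real summary.

```python
import tempfile
from pathlib import Path


def write_step_summary(failed: list[dict], summary_path: Path) -> None:
    """Append a Markdown report of failed snapshots to summary_path."""
    lines = [f"## Snapshot tests: {len(failed)} failed", ""]
    for item in failed:
        lines.append(f"### {item['query']}")
        lines.append("")
        # Indent by four spaces so Markdown renders the diff as a code block.
        lines.extend("    " + diff_line for diff_line in item["diff"].splitlines())
        lines.append("")
    with summary_path.open("a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


# Demo against a temporary file; in CI the path would come from
# os.environ["GITHUB_STEP_SUMMARY"].
failed = [{"query": "How do I return an item?", "diff": "-old line\n+new line"}]
with tempfile.TemporaryDirectory() as tmp:
    summary_file = Path(tmp) / "summary.md"
    write_step_summary(failed, summary_file)
    report = summary_file.read_text(encoding="utf-8")

print(report)
```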
Key Takeaways
- Snapshot testing captures baseline LLM outputs and compares them against new outputs, detecting unexpected changes.
- Use semantic similarity thresholds instead of exact matching for LLM snapshots.
- Snapshot diffs go to code review; the author decides whether changes are intentional and approves snapshot updates.
- Organize snapshots by category (customer support, code generation, etc.) for focused review and maintenance.
- Maintain snapshots by regenerating periodically and reviewing all snapshot changes in pull requests.
Frequently Asked Questions
How often should I regenerate snapshots?
Regenerate after major model updates, or on a fixed monthly or quarterly schedule. Routine deployments and minor prompt tweaks should leave snapshots unchanged. If you notice snapshot flakiness (outputs for the same input differ between runs), find the source of the variation in the model or system prompt and stabilize it before re-snapshotting.
Can I snapshot the intermediate outputs of a multi-step chain?
Yes. For RAG systems, snapshot retrieval results, re-ranking outputs, and final generation separately. This isolates which step regressed if a chain test fails. Organize snapshots by step: retrieval.snap.json, reranking.snap.json, generation.snap.json.
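A per-step comparison can be quite small. In the sketch below, `regressed_steps` is a hypothetical helper, and the document IDs and answers are made-up illustration data; each step's output is compared against its own snapshot, so the first differing step points to where the chain diverged.

```python
def regressed_steps(expected: dict[str, object], actual: dict[str, object]) -> list[str]:
    """Names of steps whose output differs from the snapshot, in pipeline order."""
    return [step for step in expected if actual.get(step) != expected[step]]


expected = {
    "retrieval": ["doc_12", "doc_7", "doc_3"],
    "reranking": ["doc_7", "doc_12"],
    "generation": "Returns are accepted within 30 days.",
}
actual = {
    "retrieval": ["doc_12", "doc_7", "doc_3"],  # unchanged
    "reranking": ["doc_12", "doc_7"],           # order flipped
    "generation": "Returns are accepted within 14 days.",
}

changed = regressed_steps(expected, actual)
print(changed)
```

In this example retrieval is unchanged, so the investigation would start at the re-ranking step; the generation diff is likely a downstream effect.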
Should I store snapshots in Git or a database?
Always Git. Version control provides history, blame, and audit trails. A database is harder to review and loses context. Git diffs show exactly what changed and why (via commit messages).
What similarity threshold should I use?
Start at 0.85 for high-similarity matches (e.g., factual QA where paraphrase is acceptable). For more permissive matching (creative writing, brainstorming), use 0.75. Tune based on the false positive rate: if snapshot tests fail five or more times per week on outputs that reviewers accept, lower the threshold.
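Tuning is easier with a record of past runs. The sketch below assumes you have logged, for each past comparison, the similarity score and whether a reviewer accepted the output; `threshold_report` is a hypothetical helper and the sample values are illustrative, not measurements.

```python
def threshold_report(
    samples: list[tuple[float, bool]], thresholds: list[float]
) -> dict[float, dict[str, int]]:
    """For each threshold, count false failures and missed regressions.

    samples: (similarity, reviewer_accepted) pairs from past snapshot runs.
    """
    report = {}
    for t in thresholds:
        # Test fails (similarity below threshold) although the reviewer accepted the output.
        false_failures = sum(1 for sim, ok in samples if sim < t and ok)
        # Test passes although the reviewer rejected the output.
        missed = sum(1 for sim, ok in samples if sim >= t and not ok)
        report[t] = {"false_failures": false_failures, "missed_regressions": missed}
    return report


# Illustrative history: four accepted outputs and two rejected ones.
history = [(0.95, True), (0.88, True), (0.82, True), (0.78, True), (0.80, False), (0.60, False)]
report = threshold_report(history, thresholds=[0.75, 0.85])
for t, counts in report.items():
    print(t, counts)
```

With this made-up history, lowering the threshold from 0.85 to 0.75 removes both false failures but lets one rejected output through, which is the trade-off to weigh when tuning.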
Can I use snapshots to compare models?
Yes. Run two models on the same test set, snapshot both, and compare via similarity scores. This shows which model is closer to your baseline behavior. Do this as an experiment (not a blocking gate) before deciding to upgrade.
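A comparison like this might look as follows. The helper names are hypothetical, the outputs are hard-coded stand-ins for real model responses, and `difflib` character similarity is used only so the sketch runs without an embedding model; in practice you would substitute the embedding similarity from your harness.

```python
from difflib import SequenceMatcher
from statistics import mean


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a, b).ratio()


def closeness_to_baseline(baseline: dict[str, str], candidate: dict[str, str]) -> float:
    """Mean similarity between baseline and candidate outputs over the same queries."""
    return mean(text_similarity(baseline[q], candidate[q]) for q in baseline)


baseline = {
    "How do I return an item?": "Visit the Returns portal and enter your order number.",
    "What is your shipping cost?": "Standard shipping is free on orders over $50.",
}
model_a_outputs = dict(baseline)  # reproduces the baseline exactly
model_b_outputs = {
    "How do I return an item?": "Please contact support.",
    "What is your shipping cost?": "Shipping prices vary.",
}

score_a = closeness_to_baseline(baseline, model_a_outputs)
score_b = closeness_to_baseline(baseline, model_b_outputs)
print(f"model A: {score_a:.2f}, model B: {score_b:.2f}")
```

A higher score only means the model is closer to the current baseline, not that it is better, which is why this works as an experiment rather than a gate.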