Debugging Non-Determinism in LLM Pipelines
Your LLM application worked yesterday but now produces different outputs. Tests pass locally, fail in CI. Snapshots fail randomly. Non-determinism is creeping in, but where? This article teaches you to diagnose the root cause: logging strategies, systematic checking, and targeted fixes.
The most common sources of non-determinism are: (1) missing or inconsistent seed, (2) temperature not pinned, (3) changed prompt text, (4) different model version, (5) API rate-limiting causing fallback behavior, (6) non-deterministic preprocessing. Systematic debugging eliminates these one by one.
Determinism Debugging Checklist
Start with this checklist. Check each item methodically:
- Is temperature pinned? Check your API call. Is temperature hardcoded (e.g., temperature=0.7)? Is it omitted, so the provider default applies (typically 1.0)? Or does it come from a variable that might change?
# BAD: temperature not pinned
response = client.messages.create(model="...", messages=[...])  # temperature defaults to 1.0
# GOOD: temperature pinned
response = client.messages.create(
    model="...",
    messages=[...],
    temperature=0.7,  # Explicit
)
# RISKY: temperature from config
response = client.messages.create(
    model="...",
    messages=[...],
    temperature=CONFIG["temperature"],  # Could change!
)
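- Is seed being used? If your API supports a seed parameter, are you passing it? OpenAI's Chat Completions API accepts seed on a best-effort basis. Anthropic's Messages API does not document a seed parameter at the time of writing, so check your provider's reference before relying on it.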
# Check if seed is passed (valid only where the provider's API accepts it)
response = client.messages.create(
    model="...",
    messages=[...],
    seed=42,  # Must be here for reproducibility, if the API supports it
)
- Is the model pinned? Avoid auto-updating aliases like gpt-4 or claude-3-opus (without a date suffix). These can point to different snapshots over time.
# BAD: auto-updating
response = client.messages.create(model="gpt-4", ...)
# GOOD: pinned version
response = client.messages.create(model="gpt-4-0125-preview", ...)
- Is the prompt text deterministic? Is the prompt constructed from variables that might change? Whitespace differences (extra newlines, spaces) matter.
# RISKY: prompt wording is written inline at each call site, so copies drift apart over time
prompt = f"Explain {user_topic} in 100 words."
# GOOD: prompt template is fixed in one place; only values change
PROMPT_TEMPLATE = "Explain {topic} in exactly 100 words."
prompt = PROMPT_TEMPLATE.format(topic=user_topic)  # user_topic is still variable
- Is preprocessing deterministic? Do you clean, tokenize, or transform input before sending to LLM? If so, is that transformation deterministic?
# INCOMPLETE: every step here is deterministic, but tabs and runs of spaces pass through
# unchanged, so inputs that look identical can still produce different prompts
cleaned = user_input.strip().lower().replace("\n", " ")
# GOOD: collapse every run of whitespace to a single space
import re
cleaned = re.sub(r'\s+', ' ', user_input.strip())  # Always collapse whitespace the same way
- Are you retrying with backoff? If so, are you maintaining the seed across retries?
import random
import time
# RateLimitError must be imported from your provider's SDK
# BAD: seed changes on retry
def query_with_new_seed_each_retry():
    for attempt in range(3):
        try:
            response = client.messages.create(
                model="...",
                messages=[...],
                seed=random.randint(0, 2**31 - 1),  # NEW seed each retry = non-determinism
            )
            return response
        except RateLimitError:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
# GOOD: seed stays constant across retries
def query_with_fixed_seed():
    for attempt in range(3):
        try:
            response = client.messages.create(
                model="...",
                messages=[...],
                seed=42,  # Same seed every retry
            )
            return response
        except RateLimitError:
            time.sleep(2 ** attempt)
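- Is the API actually honoring your parameters? Providers may apply sampling parameters on a best-effort basis, and most do not echo them back in the response. Log the parameters you sent alongside whatever metadata the response does expose.
# After getting response, log the metadata the provider returns
print(f"Model served: {getattr(response, 'model', 'unknown')}")
# system_fingerprint is returned by OpenAI's Chat Completions API; other SDKs may not have it
print(f"System fingerprint: {getattr(response, 'system_fingerprint', 'not provided')}")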
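Several checklist items come down to the same question: were the request inputs really identical? One way to answer it is to reduce each request to a single fingerprint and log that. The sketch below is only an illustration; `request_fingerprint` is a hypothetical helper, and the model name is a placeholder.

```python
import hashlib
import json


def request_fingerprint(model, temperature, seed, prompt):
    """Return a stable SHA-256 hex digest of the request parameters.

    json.dumps with sort_keys=True gives a canonical serialization, and
    SHA-256 (unlike Python's built-in hash()) is identical across processes.
    """
    payload = json.dumps(
        {"model": model, "temperature": temperature, "seed": seed, "prompt": prompt},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical values for illustration only
fp_a = request_fingerprint("example-model-2024-01-01", 0.7, 42, "Explain decorators.")
fp_b = request_fingerprint("example-model-2024-01-01", 0.7, 42, "Explain decorators. ")
print(fp_a == fp_b)  # False: a single trailing space changes the fingerprint
```

If two calls that should be identical log different fingerprints, the cause is in your own code (prompt, parameters, or model name) rather than in the API.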
Comprehensive Logging for Diagnosis
Add logging at every step to track where non-determinism enters:
import hashlib
import json
import logging
from datetime import datetime, timezone
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
def stable_hash(text):
    """Short SHA-256 digest. Unlike built-in hash(), it is the same in every process."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
def llm_query_with_logging(prompt, temperature=0.7, seed=None, model="claude-3-5-sonnet-20241022"):
    """LLM query with comprehensive logging. Assumes `client` is already constructed."""
    # Log input
    logger.debug(f"[LLM Query Start] {datetime.now(timezone.utc).isoformat()}")
    logger.debug(f"  model: {model}")
    logger.debug(f"  temperature: {temperature}")
    logger.debug(f"  seed: {seed}")
    logger.debug(f"  prompt_hash: {stable_hash(prompt)}")  # Summarize prompt
    logger.debug(f"  prompt_length: {len(prompt)}")
    params = {
        "model": model,
        "max_tokens": 200,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    if seed is not None:
        params["seed"] = seed  # Pass only if your provider's API accepts a seed
    try:
        response = client.messages.create(**params)
        output = response.content[0].text
        # Log output and metadata
        logger.debug(f"  output_hash: {stable_hash(output)}")
        logger.debug(f"  output_length: {len(output)}")
        logger.debug(f"  tokens_used: {response.usage.input_tokens + response.usage.output_tokens}")
        logger.debug("[LLM Query End] Success")
        return output
    except Exception as e:
        logger.error(f"  error: {type(e).__name__}: {e}")
        logger.error("[LLM Query End] Failed")
        raise
# For debugging, dump full responses to a file
import json
from datetime import datetime, timezone
def llm_query_with_full_dump(prompt, temperature=0.7, seed=None):
    """LLM query with full response dumped to file. Assumes `client` is already constructed."""
    model = "claude-3-5-sonnet-20241022"
    params = {
        "model": model,
        "max_tokens": 200,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    if seed is not None:
        params["seed"] = seed  # Pass only if your provider's API accepts a seed
    response = client.messages.create(**params)
    # Dump entire response for inspection
    dump = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": {
            "model": model,
            "temperature": temperature,
            "seed": seed,
            "prompt": prompt[:500],  # First 500 chars
        },
        "output": {
            "text": response.content[0].text,
            "usage": {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            },
        },
    }
    # One JSON object per line (JSONL), appended so earlier runs are kept
    with open("llm_responses.jsonl", "a") as f:
        f.write(json.dumps(dump) + "\n")
    return response.content[0].text
Run your application with the debug log redirected to a file (app.log below is a placeholder name) and collect the logs. Then analyze for patterns:
grep -c "LLM Query Start" app.log # How many queries?
grep -o "output_hash: .*" app.log | sort | uniq -c # How many unique outputs?
If the same prompt_hash, temperature, and seed produce different output_hash values, that's a red flag: either the API is not deterministic for those settings, or something else changed between calls.
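The same check can be run against the JSONL dump instead of the text log. The following is a minimal sketch that assumes each record has the `input` and `output.text` layout written by `llm_query_with_full_dump`; `find_unstable_inputs` is a hypothetical helper name, and the sample records are made up for illustration.

```python
import json
from collections import defaultdict


def find_unstable_inputs(jsonl_lines):
    """Return the inputs that produced more than one distinct output text."""
    outputs_by_input = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        record = json.loads(line)
        # Canonical JSON of the whole input dict serves as the grouping key
        key = json.dumps(record["input"], sort_keys=True)
        outputs_by_input[key].add(record["output"]["text"])
    return {key: texts for key, texts in outputs_by_input.items() if len(texts) > 1}


# Made-up sample records; in practice, pass open("llm_responses.jsonl")
base_input = {"model": "m", "temperature": 0.7, "seed": 42, "prompt": "hi"}
sample = [
    json.dumps({"input": base_input, "output": {"text": "Hello"}}),
    json.dumps({"input": base_input, "output": {"text": "Hello there"}}),
    json.dumps({"input": {**base_input, "prompt": "bye"}, "output": {"text": "Goodbye"}}),
]
unstable = find_unstable_inputs(sample)
print(len(unstable))  # 1: only the "hi" input produced two different outputs
```

Because the dump stores only the first 500 characters of each prompt, two long prompts that differ later would be grouped together here; logging a hash of the full prompt avoids that.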
Diff Analysis: Comparing Outputs
When outputs differ unexpectedly, compare them side-by-side:
import difflib
def compare_responses(response1: str, response2: str) -> None:
    """Print diff between two LLM responses."""
    lines1 = response1.split('\n')
    lines2 = response2.split('\n')
    diff = difflib.unified_diff(
        lines1,
        lines2,
        lineterm='',
        fromfile='Response 1',
        tofile='Response 2',
    )
    print('\n'.join(diff))
    # Similarity score (character-level ratio between 0.0 and 1.0)
    matcher = difflib.SequenceMatcher(None, response1, response2)
    similarity = matcher.ratio()
    print(f"\nSimilarity: {similarity:.1%}")
# Usage (llm_query stands for your own wrapper around the LLM call)
resp1 = llm_query("Explain Python decorators.", seed=42)
resp2 = llm_query("Explain Python decorators.", seed=42)
if resp1 != resp2:
    print("Outputs differ!")
    compare_responses(resp1, resp2)
else:
    print("Outputs are identical. Determinism is working.")
Output might look like:
--- Response 1
+++ Response 2
@@ -1,3 +1,3 @@
 A decorator is a function that takes another function as input...
-and returns a modified version with enhanced functionality.
+and returns an enhanced, modified function.
 Use @property for computed attributes.
Similarity: 95.0%
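A similarity around 95% with small phrasing changes usually means the inputs were identical and the remaining difference comes from sampling or backend variance, which some APIs exhibit even with a seed. Strictly deterministic behavior means the outputs are identical. If similarity is below roughly 70%, something upstream (prompt, model version, or parameters) has probably changed.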
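For long responses, or for prompts that look identical but are not, it can help to locate the first position where two strings diverge. This is a small sketch; `first_divergence` is a hypothetical helper, not part of difflib.

```python
def first_divergence(a: str, b: str, context: int = 20):
    """Return (index, snippet_a, snippet_b) at the first differing character.

    Returns None when the strings are identical. If one string is a prefix
    of the other, the index is the length of the shorter string.
    """
    for i, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            break
    else:
        # No mismatch within the common length
        if len(a) == len(b):
            return None
        i = min(len(a), len(b))
    start = max(0, i - context)
    return i, a[start:i + context], b[start:i + context]


result = first_divergence("Explain Python decorators.", "Explain  Python decorators.")
if result is not None:
    index, left, right = result
    # repr() makes whitespace differences visible
    print(index, repr(left), repr(right))
```

Printing the snippets with `repr()` exposes differences such as doubled spaces, tabs, or trailing newlines that a plain print would hide.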
Isolating the Source: Bisect Approach
If you can't pinpoint the issue, isolate it by testing components:
# construct_prompt, preprocess, and my_app stand for your own application code
def test_determinism_isolation():
    """Test each component separately."""
    # 1. Test LLM API directly (no app code)
    print("Testing raw API determinism...")
    params = dict(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,  # Required by the Messages API
        temperature=0.0,  # Least random setting; add seed=42 only if your provider accepts it
        messages=[{"role": "user", "content": "Say hello"}],
    )
    resp1 = client.messages.create(**params)
    resp2 = client.messages.create(**params)
    assert resp1.content[0].text == resp2.content[0].text, "API not deterministic"
    print("✓ Raw API is deterministic")
    # 2. Test prompt construction
    print("Testing prompt construction...")
    prompt1 = construct_prompt("input1")
    prompt2 = construct_prompt("input1")
    assert prompt1 == prompt2, "Prompt construction non-deterministic"
    print("✓ Prompt construction is deterministic")
    # 3. Test preprocessing
    print("Testing preprocessing...")
    processed1 = preprocess("test input")
    processed2 = preprocess("test input")
    assert processed1 == processed2, "Preprocessing non-deterministic"
    print("✓ Preprocessing is deterministic")
    # 4. Test end-to-end
    print("Testing end-to-end application...")
    app_result1 = my_app.process("query")
    app_result2 = my_app.process("query")
    if app_result1 != app_result2:
        print("✗ End-to-end not deterministic")
        print(f"  Result 1: {app_result1[:100]}...")
        print(f"  Result 2: {app_result2[:100]}...")
    else:
        print("✓ End-to-end is deterministic")
test_determinism_isolation()
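The local steps (prompt construction and preprocessing) all follow the same pattern: call the component repeatedly with the same arguments and compare results. A generic helper can express that once. This is a sketch only; `check_deterministic` and the two example prompt builders are hypothetical names.

```python
import itertools


def check_deterministic(fn, *args, runs=3, **kwargs):
    """Call fn `runs` times with the same arguments; True if all results are equal."""
    first = fn(*args, **kwargs)
    return all(fn(*args, **kwargs) == first for _ in range(runs - 1))


def stable_prompt(topic):
    # Depends only on its argument
    return f"Explain {topic} in exactly 100 words."


_request_counter = itertools.count()


def drifting_prompt(topic):
    # Embeds hidden state (a counter) that changes on every call
    return f"[request {next(_request_counter)}] Explain {topic}."


print(check_deterministic(stable_prompt, "decorators"))    # True
print(check_deterministic(drifting_prompt, "decorators"))  # False
```

A counter is used here as a stand-in for the usual culprits: timestamps, random IDs, or unordered collections interpolated into the prompt.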
Quick Fixes
Once you identify the source:
- Missing seed: Add seed=42 to your LLM call, if your provider's API accepts a seed.
- Temperature not pinned: Add temperature=0.7 (or your chosen value).
- Non-deterministic preprocessing: Use str.strip(), re.sub(r'\s+', ' ', ...), or similar to normalize consistently.
- Changing prompt: Build the prompt from a single fixed template (one constant or function) so the wording is the same every time.
- Model alias: Replace model="gpt-4" with model="gpt-4-0125-preview".
- Retry with new seed: Use the same seed across retry attempts.
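For the preprocessing fix, a single normalization function applied to all input keeps the behavior consistent. The sketch below goes slightly beyond the whitespace collapse mentioned above by also applying Unicode NFC normalization, which is an added assumption about what "visually identical" input should mean in your application; `normalize_text` is a hypothetical name.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize Unicode form and whitespace so equal-looking inputs compare equal."""
    # NFC composes sequences such as "e" + combining accent into one code point
    text = unicodedata.normalize("NFC", text)
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  Explain\t\tPython \n decorators  "))  # Explain Python decorators
print(normalize_text("cafe\u0301") == "caf\u00e9")            # True
```

Lowercasing is left out on purpose, since case often carries meaning in prompts; add it only if your application treats input as case-insensitive.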
Key Takeaways
- Non-determinism usually comes from: missing seed, unfixed temperature, model alias, changed prompt, or non-deterministic preprocessing.
- Use a systematic checklist to rule out causes one by one.
- Log comprehensively: every LLM query should log its inputs and outputs to a file for later analysis.
- Use diff tools to compare actual outputs; similarity scores reveal how different they are.
- Test components in isolation (API, prompt construction, preprocessing, end-to-end) to narrow down the issue.
Frequently Asked Questions
My seed works on one API but not another. Why?
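Not all APIs support seed (or honor it exactly). OpenAI's Chat Completions API accepts a seed on a best-effort basis, while Anthropic's Messages API does not document one at the time of writing, so check your provider's docs. Even with a seed, floating-point differences across hardware and backend changes can cause small variations. Tests that run against a live API should therefore assert similarity within a tolerance rather than exact equality.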
My logs show the same seed and temperature, but outputs still differ. What's happening?
Possible causes: (1) The prompt is actually different (check character-by-character), (2) The model was updated (version changed), (3) The API has a bug (rare but happens), (4) Network issues caused a different code path, such as a retry or fallback. Add a stable hash of the prompt to your logs (for example SHA-256, not Python's per-process hash()) to verify it's truly identical.
Is it okay to have minor variations even with seed set?
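With some APIs and models, seed reduces but doesn't eliminate variation (especially at higher temperatures). If outputs stay at least 95% similar, that is usually acceptable for non-critical uses. To get as close to determinism as the API allows, use temperature = 0.0 (effectively greedy decoding), keeping in mind that some providers still do not guarantee identical outputs at that setting.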
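A tolerance assertion for tests can be sketched with the same difflib ratio used in the diff analysis. The 0.95 default mirrors the 5% figure above and is an assumption to tune per application; `assert_similar` is a hypothetical helper.

```python
import difflib


def assert_similar(actual: str, expected: str, min_ratio: float = 0.95) -> float:
    """Raise AssertionError if the character-level similarity is below min_ratio."""
    ratio = difflib.SequenceMatcher(None, actual, expected).ratio()
    if ratio < min_ratio:
        raise AssertionError(
            f"similarity {ratio:.1%} is below threshold {min_ratio:.1%}"
        )
    return ratio


# Passes: the two strings differ in a single character
assert_similar("A decorator wraps a function.", "A decorator wraps a function!")

# Fails: the strings share no characters
try:
    assert_similar("abc", "xyz")
except AssertionError as exc:
    print(f"Rejected: {exc}")
```

Character-level similarity is a blunt measure: two answers can score high while disagreeing on a key fact, so reserve this for regression checks on phrasing and verify facts separately.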