Debugging Non-Determinism in LLM Pipelines

Your LLM application worked yesterday but now produces different outputs. Tests pass locally but fail in CI. Snapshot tests fail intermittently. Non-determinism is creeping in, but where? This article shows how to diagnose the root cause with a systematic checklist, logging strategies, and targeted fixes.

The most common sources of non-determinism are: (1) missing or inconsistent seed, (2) temperature not pinned, (3) changed prompt text, (4) different model version, (5) API rate-limiting causing fallback behavior, (6) non-deterministic preprocessing. Systematic debugging eliminates these one by one.

Determinism Debugging Checklist

Start with this checklist. Check each item methodically:

Is temperature pinned? Check your API call. Is temperature hardcoded (e.g., temperature=0.7)? Is it omitted, so that the provider default applies (1.0 for both the OpenAI and Anthropic APIs)? Or does it come from a variable that might change?

# BAD: temperature not pinned
response = client.messages.create(model="...", messages=[...])  # temperature defaults to 1.0

# GOOD: temperature pinned
response = client.messages.create(
    model="...",
    messages=[...],
    temperature=0.7  # Explicit
)

# RISKY: temperature from config
response = client.messages.create(
    model="...",
    messages=[...],
    temperature=CONFIG["temperature"]  # Could change!
)
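
One way to guard against the config-driven case is to freeze the sampling parameters in an immutable object at startup, so a later code path cannot change them silently. The following is a minimal sketch; SamplingConfig, sampling_kwargs, and the model name are hypothetical and not part of any SDK.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for sampling parameters; frozen so it cannot be mutated."""
    model: str
    temperature: float
    max_tokens: int


# Placeholder model name; substitute your own pinned version.
SAMPLING = SamplingConfig(model="example-model-2024-01-01", temperature=0.7, max_tokens=200)


def sampling_kwargs(config: SamplingConfig) -> dict:
    """Return the parameters as keyword arguments for an SDK call."""
    return asdict(config)


try:
    SAMPLING.temperature = 1.0  # any later attempt to change the value fails loudly
except FrozenInstanceError:
    print("temperature is frozen at", SAMPLING.temperature)
```

The keyword arguments can then be spread into the call, for example client.messages.create(messages=[...], **sampling_kwargs(SAMPLING)).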

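Is seed being used? If your API accepts a seed parameter, are you passing it? Support varies by provider: OpenAI's Chat Completions API accepts seed on a best-effort basis, while Anthropic's Messages API does not document a seed parameter, so check your provider's API reference before relying on it.

# Seed passed explicitly (OpenAI-style call; other providers may not accept seed)
response = client.chat.completions.create(
    model="...",
    messages=[...],
    seed=42  # Must be here for reproducibility
)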

Is the model pinned? Avoid auto-updating aliases such as gpt-4 or claude-3-opus without a date suffix; the snapshot behind an alias can change over time.

# BAD: auto-updating alias
response = client.chat.completions.create(model="gpt-4", messages=[...])

# GOOD: pinned, dated version
response = client.chat.completions.create(model="gpt-4-0125-preview", messages=[...])

Is the prompt text deterministic? Is the prompt assembled from variables that might change between runs? Even whitespace differences (extra newlines, trailing spaces) produce a different prompt.

# RISKY: template text is inlined at the call site, so copies can drift apart
prompt = f"Explain {user_topic} in 100 words."

# BETTER: one fixed template constant; only the filled-in values vary
PROMPT_TEMPLATE = "Explain {topic} in exactly 100 words."
prompt = PROMPT_TEMPLATE.format(topic=user_topic)  # user_topic must also be identical across runs
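
To confirm that two prompts really are identical, it can help to compare a content hash and to print the repr of the text. The sketch below assumes a hypothetical helper named prompt_fingerprint; it uses SHA-256 because the digest is the same on every machine and in every process.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Return a short SHA-256 digest that is stable across processes and machines."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


a = "Explain decorators in exactly 100 words."
b = "Explain decorators in exactly 100 words.\n"  # trailing newline, easy to miss in logs

print(prompt_fingerprint(a) == prompt_fingerprint(b))  # False: the prompts differ
print(repr(b))  # repr() makes the trailing newline visible
```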

Is preprocessing deterministic? Do you clean, tokenize, or transform input before sending it to the LLM? If so, does the same input always produce the same result?

# RISKY: only newlines are replaced, so tabs and runs of spaces pass through unchanged
cleaned = user_input.strip().lower().replace("\n", " ")
# Each call is deterministic, but inputs that differ only in whitespace yield different prompts

# GOOD: normalize whitespace the same way every time
import re
cleaned = re.sub(r'\s+', ' ', user_input.strip())  # Collapse every whitespace run to one space
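
Two further sources of preprocessing drift are worth checking: Unicode text that looks identical but is encoded differently, and iteration over a set of strings, whose order can vary between Python processes because of hash randomization. The sketch below illustrates one way to handle both; normalize_text and join_tags are hypothetical helper names.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Unicode-normalize to NFC, then collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def join_tags(tags) -> str:
    """Deduplicate and sort so the result does not depend on set iteration order."""
    return ", ".join(sorted(set(tags)))


print(normalize_text("  cafe\u0301 \t menu\n"))  # accent is composed, whitespace collapsed
print(join_tags(["python", "llm", "python"]))    # llm, python
```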

Are you retrying with backoff? If so, are you reusing the same seed across retries?

import random
import time
# RateLimitError comes from your provider's SDK, e.g. `from openai import RateLimitError`

# BAD: seed changes on retry
def query_with_new_seed_each_retry():
    for attempt in range(3):
        try:
            response = client.chat.completions.create(
                model="...",
                messages=[...],
                seed=random.randint(0, 2**31 - 1)  # NEW seed each retry = non-determinism
            )
            return response
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limited on every attempt")

# GOOD: seed stays constant across retries
def query_with_fixed_seed():
    for attempt in range(3):
        try:
            response = client.chat.completions.create(
                model="...",
                messages=[...],
                seed=42  # Same seed every retry
            )
            return response
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limited on every attempt")
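
If a single hardcoded seed is too coarse, one option is to derive the seed from the request itself, so every retry of the same request reuses the same value while different requests get different seeds. This is a sketch under that assumption; derive_seed and the model name are hypothetical.

```python
import hashlib


def derive_seed(model: str, prompt: str) -> int:
    """Map a request to a fixed integer in [0, 2**31 - 1] using SHA-256."""
    digest = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


seed = derive_seed("example-model-2024-01-01", "Explain Python decorators.")
print(seed == derive_seed("example-model-2024-01-01", "Explain Python decorators."))  # True
```

Because the value depends only on the model name and prompt text, it survives process restarts, unlike a seed drawn from random at startup.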

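Is the API actually honoring your parameters? Providers may ignore or reject parameters they do not support, and responses generally do not echo sampling parameters back. Log the values you sent, and inspect the metadata the response does return, such as the resolved model name (and system_fingerprint on OpenAI responses).

# After getting the response, log what the API reports about itself
print(f"Model reported: {response.model_dump().get('model', 'unknown')}")
print(f"System fingerprint: {response.model_dump().get('system_fingerprint', 'not provided')}")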

Comprehensive Logging for Diagnosis

Add logging at every step to track where non-determinism enters:

import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# `client` is assumed to be an already-configured SDK client with Anthropic-style responses.

def stable_hash(text):
    """Short SHA-256 digest; built-in hash() is salted per process, so its values differ between runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def build_params(prompt, temperature, seed, model):
    """Assemble request parameters; seed is included only when set."""
    params = {
        "model": model,
        "max_tokens": 200,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
    if seed is not None:  # Pass seed=None if your provider's API has no seed parameter
        params["seed"] = seed
    return params

def llm_query_with_logging(prompt, temperature=0.7, seed=42, model="claude-3-5-sonnet-20241022"):
    """LLM query with comprehensive logging."""

    # Log input
    logger.debug(f"[LLM Query Start] {datetime.now(timezone.utc).isoformat()}")
    logger.debug(f"  model: {model}")
    logger.debug(f"  temperature: {temperature}")
    logger.debug(f"  seed: {seed}")
    logger.debug(f"  prompt_hash: {stable_hash(prompt)}")  # Summarize prompt
    logger.debug(f"  prompt_length: {len(prompt)}")

    try:
        response = client.messages.create(**build_params(prompt, temperature, seed, model))

        output = response.content[0].text

        # Log output and metadata
        logger.debug(f"  output_hash: {stable_hash(output)}")
        logger.debug(f"  output_length: {len(output)}")
        logger.debug(f"  tokens_used: {response.usage.input_tokens + response.usage.output_tokens}")
        logger.debug("[LLM Query End] Success")

        return output

    except Exception as e:
        logger.error(f"  error: {type(e).__name__}: {e}")
        logger.error("[LLM Query End] Failed")
        raise

# For debugging, dump full responses to a file
def llm_query_with_full_dump(prompt, temperature=0.7, seed=42, model="claude-3-5-sonnet-20241022"):
    """LLM query with full response dumped to file."""

    response = client.messages.create(**build_params(prompt, temperature, seed, model))

    # Dump entire response for inspection
    dump = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": {
            "model": model,
            "temperature": temperature,
            "seed": seed,
            "prompt": prompt[:500],  # First 500 chars
        },
        "output": {
            "text": response.content[0].text,
            "usage": {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            }
        }
    }

    # Append one JSON record per line
    with open("llm_responses.jsonl", "a") as f:
        f.write(json.dumps(dump) + "\n")

    return response.content[0].text

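Run your application and collect the logs, then analyze them for patterns. The commands below assume the debug log output was redirected to a file named app.log; llm_responses.jsonl holds the full dumps, one JSON record per line:

grep -c "LLM Query Start" app.log  # How many queries?
grep "output_hash:" app.log | sort | uniq -c  # How often does each distinct output occur?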

If you see the same prompt_hash + temperature + seed producing different output_hashes, that's a red flag: the API is not being deterministic, or something else changed between calls.
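
The same check can be run against the JSONL dump. The sketch below assumes records shaped like the dump written by llm_query_with_full_dump (an "input" object with model, temperature, seed, and prompt, and an "output" object with text); find_unstable_inputs is a hypothetical helper name, and the sample records are made up for illustration.

```python
import json
from collections import defaultdict


def find_unstable_inputs(lines):
    """Group records by input settings; return the groups with more than one distinct output."""
    outputs_by_input = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        inp = record["input"]
        key = (inp["model"], inp["temperature"], inp["seed"], inp["prompt"])
        outputs_by_input[key].add(record["output"]["text"])
    return {key: texts for key, texts in outputs_by_input.items() if len(texts) > 1}


def make_record(prompt, text):
    """Build an illustrative record in the assumed dump format."""
    settings = {"model": "m", "temperature": 0.7, "seed": 42, "prompt": prompt}
    return json.dumps({"input": settings, "output": {"text": text}})


# In-memory sample; with a real file, pass the open file object instead.
sample = [make_record("hi", "Hello!"), make_record("hi", "Hi there!"), make_record("bye", "Goodbye!")]

unstable = find_unstable_inputs(sample)
for key, texts in unstable.items():
    print(f"{len(texts)} distinct outputs for prompt {key[3]!r}")
```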

Diff Analysis: Comparing Outputs

When outputs differ unexpectedly, compare them side-by-side:

import difflib

def compare_responses(response1: str, response2: str) -> None:
    """Print diff between two LLM responses."""
    
    lines1 = response1.split('\n')
    lines2 = response2.split('\n')
    
    diff = difflib.unified_diff(
        lines1,
        lines2,
        lineterm='',
        fromfile='Response 1',
        tofile='Response 2'
    )
    
    print('\n'.join(diff))
    
    # Similarity score
    matcher = difflib.SequenceMatcher(None, response1, response2)
    similarity = matcher.ratio()
    print(f"\nSimilarity: {similarity:.1%}")

# Usage (llm_query_with_logging is defined in the logging section above)
resp1 = llm_query_with_logging("Explain Python decorators.", seed=42)
resp2 = llm_query_with_logging("Explain Python decorators.", seed=42)

if resp1 != resp2:
    print("Outputs differ!")
    compare_responses(resp1, resp2)
else:
    print("Outputs are identical. Determinism is working.")

Output might look like:

--- Response 1
+++ Response 2
@@ -1,3 +1,3 @@
 A decorator is a function that takes another function as input...
-and returns a modified version with enhanced functionality.
+and returns an enhanced, modified function.
 Use @property for computed attributes.

Similarity: 95.0%

A similarity around 95% with small wording changes usually means the inputs and settings matched and the difference comes from sampling itself (some APIs allow minor phrasing variation even with a seed). If similarity is below roughly 70%, something upstream probably changed: the prompt, the model version, or the parameters.
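
When a diff is noisy, locating the first character at which two strings diverge is often enough to see what changed. The following is a small sketch; first_difference is a hypothetical helper, and the two strings are illustrative.

```python
def first_difference(a: str, b: str, context: int = 20):
    """Return (index, snippet_a, snippet_b) at the first differing character, or None if equal."""
    if a == b:
        return None
    limit = min(len(a), len(b))
    # If one string is a prefix of the other, the difference starts where the shorter one ends.
    index = next((i for i in range(limit) if a[i] != b[i]), limit)
    start = max(0, index - context)
    return index, a[start:index + context], b[start:index + context]


result = first_difference("returns a modified version", "returns an enhanced function")
if result is not None:
    index, snippet_a, snippet_b = result
    print(f"First difference at character {index}")
    print(f"  A: {snippet_a!r}")
    print(f"  B: {snippet_b!r}")
```

The same helper works on prompts, which makes it useful for the character-by-character prompt comparison as well.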

Isolating the Source: Bisect Approach

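If you can't pinpoint the issue, isolate it by testing components one at a time. Here construct_prompt, preprocess, and my_app stand in for your own application code: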

def test_determinism_isolation():
    """Test each component separately."""
    
    # 1. Test LLM API directly (no app code)
    print("Testing raw API determinism...")
    resp1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,  # the Messages API requires max_tokens
        temperature=0.7,
        seed=42,  # drop this if your provider's API has no seed parameter
        messages=[{"role": "user", "content": "Say hello"}]
    )
    resp2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.7,
        seed=42,
        messages=[{"role": "user", "content": "Say hello"}]
    )
    assert resp1.content[0].text == resp2.content[0].text, "API not deterministic"
    print("✓ Raw API is deterministic")
    
    # 2. Test prompt construction
    print("Testing prompt construction...")
    prompt1 = construct_prompt("input1")
    prompt2 = construct_prompt("input1")
    assert prompt1 == prompt2, "Prompt construction non-deterministic"
    print("✓ Prompt construction is deterministic")
    
    # 3. Test preprocessing
    print("Testing preprocessing...")
    processed1 = preprocess("test input")
    processed2 = preprocess("test input")
    assert processed1 == processed2, "Preprocessing non-deterministic"
    print("✓ Preprocessing is deterministic")
    
    # 4. Test end-to-end
    print("Testing end-to-end application...")
    app_result1 = my_app.process("query")
    app_result2 = my_app.process("query")
    if app_result1 != app_result2:
        print("✗ End-to-end not deterministic")
        print(f"  Result 1: {app_result1[:100]}...")
        print(f"  Result 2: {app_result2[:100]}...")
    else:
        print("✓ End-to-end is deterministic")

test_determinism_isolation()
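
The repeated call-twice-and-compare pattern can be factored into one helper that works for any component. This is a minimal sketch; check_repeatable and toy_preprocess are hypothetical names, and toy_preprocess stands in for your own function.

```python
def check_repeatable(fn, *args, runs=3, **kwargs):
    """Call fn several times with identical arguments; return the list of distinct results."""
    distinct = []
    for _ in range(runs):
        result = fn(*args, **kwargs)
        if result not in distinct:
            distinct.append(result)
    return distinct


def toy_preprocess(text):
    """Stand-in preprocessing step: collapse whitespace."""
    return " ".join(text.split())


distinct = check_repeatable(toy_preprocess, "  test   input ", runs=5)
print("deterministic" if len(distinct) == 1 else f"{len(distinct)} distinct results")
```

Running more than two repetitions helps catch variation that shows up only occasionally.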

Quick Fixes

Once you identify the source:

Missing seed: Add a fixed seed (e.g., seed=42) to your LLM call, if your provider accepts one.

Temperature not pinned: Add temperature=0.7 (or your chosen value).

Non-deterministic preprocessing: Use str.strip(), re.sub(r'\s+', ' ', ...), or similar to normalize consistently.

Changing prompt: Build the prompt from a single fixed template in one function, and log a hash of the result to confirm it is identical on every call.

Model alias: Replace model="gpt-4" with model="gpt-4-0125-preview".

Retry with new seed: Use the same seed across retry attempts.

Key Takeaways

Non-determinism usually comes from: missing seed, unfixed temperature, model alias, changed prompt, or non-deterministic preprocessing.
Use a systematic checklist to rule out causes one by one.
Log comprehensively: every LLM query should log its inputs and outputs to a file for later analysis.
Use diff tools to compare actual outputs; similarity scores reveal how different they are.
Test components in isolation (API, prompt construction, preprocessing, end-to-end) to narrow down the issue.

Frequently Asked Questions

My seed works on one API but not another. Why?

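Not all APIs support seed, and those that do may honor it only on a best-effort basis. OpenAI's Chat Completions API accepts a seed parameter; Anthropic's Messages API does not document one. Check your provider's docs. Also, even with a seed, floating-point arithmetic differences across serving hardware can cause small variations, so in tests prefer tolerance-based assertions (for example, a minimum similarity score) over exact string equality.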
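
A tolerance assertion can be built on the same difflib similarity score used in the diff analysis. The sketch below is one possible shape; assert_similar and the 0.9 default threshold are illustrative assumptions, not recommended values.

```python
import difflib


def assert_similar(actual: str, expected: str, threshold: float = 0.9) -> float:
    """Raise AssertionError if the similarity ratio falls below threshold; return the ratio."""
    ratio = difflib.SequenceMatcher(None, actual, expected).ratio()
    if ratio < threshold:
        raise AssertionError(f"similarity {ratio:.1%} is below threshold {threshold:.0%}")
    return ratio


score = assert_similar("returns a modified function.", "returns a modified function!")
print(f"Similarity: {score:.1%}")
```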

My logs show the same seed and temperature, but outputs still differ. What's happening?

Possible causes: (1) the prompt is actually different (compare character by character), (2) the model was updated (version changed), (3) the API has a bug (rare, but it happens), (4) a network error triggered a different code path, such as a retry or fallback. Add a hash of the prompt to your logs to verify it is truly identical.

Is it okay to have minor variations even with seed set?

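With some APIs and models, a seed reduces but does not eliminate variation, especially at higher temperatures. If outputs differ by less than about 5% on a similarity score, that is usually acceptable. For the closest thing to exact determinism, use temperature = 0.0 (greedy decoding), though some providers note that even this does not guarantee identical outputs.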
