Reproducibility in Multi-Turn Conversations

Multi-turn conversations are harder to make reproducible than single-turn prompts. Each turn depends on the prior context, and small variations in early turns cascade into completely different later outputs. This article teaches you to build reproducible conversations: manage message history deterministically, test entire dialogues with snapshots, and handle context window constraints without losing state.

The core challenge: a conversation is a sequence of messages, and the order, content, and even whitespace of messages affect all downstream outputs. To achieve reproducibility, you must version the entire history, not just the latest prompt.
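One way to version an entire history is to fingerprint it. The sketch below is an illustration only: history_fingerprint is a hypothetical helper, not part of any SDK. It serializes the system prompt and messages canonically and hashes the result, so a change in order, content, or whitespace produces a different ID, while a mere reordering of dictionary keys does not.

```python
import hashlib
import json

def history_fingerprint(system_prompt, messages):
    """Return a stable SHA-256 hex digest for a system prompt plus message list.

    Hypothetical helper for illustration; not part of any SDK.
    """
    payload = {"system": system_prompt, "messages": messages}
    # sort_keys and fixed separators make the serialization canonical,
    # so equal histories always hash to the same value.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

base = [{"role": "user", "content": "What is a decorator?"}]
with_space = [{"role": "user", "content": "What is a decorator? "}]  # trailing space
reordered_keys = [{"content": "What is a decorator?", "role": "user"}]

fp_base = history_fingerprint("You are a helpful Python tutor.", base)
fp_space = history_fingerprint("You are a helpful Python tutor.", with_space)
fp_keys = history_fingerprint("You are a helpful Python tutor.", reordered_keys)
```

Storing this fingerprint next to each recorded response makes it cheap to check whether a replayed history is byte-for-byte the one that produced the original output.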

Understanding Conversation State

A multi-turn conversation is represented as a list of messages with roles (user, assistant, system):

messages = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A decorator is a function that modifies..."},
    {"role": "user", "content": "Can you show me an example?"},
    {"role": "assistant", "content": "@functools.lru_cache..."},
]

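Each new turn appends a user message and receives an assistant response. The full history, not just the new message, is sent to the LLM on every turn: the model keeps no state between calls and relies on that history to maintain context and produce coherent responses. The examples in this article use Anthropic's Messages API, which takes the system prompt as a separate system parameter rather than as an entry in the message list.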

For reproducibility, every message in the history must be deterministic:

System prompt: Pinned, versioned, deterministic.
User messages: Fixed test inputs (no randomization).
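Prior assistant responses: Recorded verbatim from prior turns and replayed, or generated at temperature 0 with a pinned model version (which reduces variation but does not guarantee identical output).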

If any message varies, the entire downstream conversation changes.
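When a rerun does diverge, the first differing message is usually the one worth inspecting, since everything after it is downstream noise. A minimal sketch, assuming histories are plain lists of role/content dictionaries (first_divergence is a hypothetical helper name):

```python
def first_divergence(expected, actual):
    """Return the index of the first differing message, or None if identical."""
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            return i
    if len(expected) != len(actual):
        # One history is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None

recorded = [
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A decorator is a function that modifies..."},
    {"role": "user", "content": "Can you show me an example?"},
]
rerun = [
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A decorator wraps another function..."},
    {"role": "user", "content": "Can you show me an example?"},
]

divergence_index = first_divergence(recorded, rerun)
```

Here the two histories differ at the first assistant response, so later differences can be attributed to that turn rather than investigated one by one.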

Snapshot Testing Full Conversations

The cleanest way to test conversations is to snapshot the entire dialogue as a JSON structure:

import json
from anthropic import Anthropic

def test_conversation_snapshot(snapshot):
    """Test a multi-turn conversation snapshot."""
    
    client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
    messages = []
    conversation = {}
    
    # Turn 1: Ask about decorators
    user_turn_1 = "What is a Python decorator?"
    messages.append({"role": "user", "content": user_turn_1})
    
    response_1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.0,  # Minimize sampling variance so the snapshot is stable
        system="You are a helpful Python expert.",
        messages=messages
    )
    
    assistant_turn_1 = response_1.content[0].text
    messages.append({"role": "assistant", "content": assistant_turn_1})
    conversation["turn_1"] = {
        "user": user_turn_1,
        "assistant": assistant_turn_1
    }
    
    # Turn 2: Ask for an example
    user_turn_2 = "Can you show me a simple example?"
    messages.append({"role": "user", "content": user_turn_2})
    
    response_2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,
        temperature=0.0,  # Minimize sampling variance so the snapshot is stable
        system="You are a helpful Python expert.",
        messages=messages
    )
    
    assistant_turn_2 = response_2.content[0].text
    messages.append({"role": "assistant", "content": assistant_turn_2})
    conversation["turn_2"] = {
        "user": user_turn_2,
        "assistant": assistant_turn_2
    }
    
    # Turn 3: Ask a follow-up
    user_turn_3 = "How does functools.wraps work?"
    messages.append({"role": "user", "content": user_turn_3})
    
    response_3 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=250,
        temperature=0.0,  # Minimize sampling variance so the snapshot is stable
        system="You are a helpful Python expert.",
        messages=messages
    )
    
    assistant_turn_3 = response_3.content[0].text
    messages.append({"role": "assistant", "content": assistant_turn_3})
    conversation["turn_3"] = {
        "user": user_turn_3,
        "assistant": assistant_turn_3
    }
    
    # Snapshot the entire conversation
    snapshot.assert_match(conversation)

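This snapshot captures all three turns. If any turn's output changes, the test fails; you then review the diff and update the stored snapshot if the new output is acceptable. Because even temperature 0 does not guarantee byte-identical output, expect an occasional diff that reflects sampling noise rather than a real regression.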
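The three turns above follow the same pattern, so the loop can be factored out and exercised without network access. The sketch below is one possible shape, not a prescribed one: run_dialogue and the send callable are hypothetical names, and fake_send stands in for a real API call so the snapshot logic itself can be checked offline.

```python
def run_dialogue(send, system_prompt, user_turns):
    """Run user_turns in order and return a snapshot-friendly dict.

    send is a hypothetical callable (system_prompt, messages) -> str.
    In a live test it would wrap the real API call; here it is a stub.
    """
    messages = []
    conversation = {}
    for turn_number, user_text in enumerate(user_turns, start=1):
        messages.append({"role": "user", "content": user_text})
        # Pass a copy so the callee cannot mutate the running history.
        reply = send(system_prompt, list(messages))
        messages.append({"role": "assistant", "content": reply})
        conversation[f"turn_{turn_number}"] = {"user": user_text, "assistant": reply}
    return conversation

def fake_send(system_prompt, messages):
    # Deterministic stand-in for a model: reports how much history it received.
    return f"reply to {messages[-1]['content']!r} after {len(messages)} messages"

result = run_dialogue(
    fake_send,
    "You are a helpful Python expert.",
    ["What is a Python decorator?", "Can you show me a simple example?"],
)
```

Injecting the call this way also makes it straightforward to swap in recorded responses later, since the test body no longer depends on a specific client.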

Deterministic Conversation State Management

For production conversations, manage history deterministically: treat the message list as append-only, tag each message with a turn ID, and persist it so the exact history can be reloaded:

from dataclasses import dataclass
from typing import List, Optional
import json
from datetime import datetime

@dataclass
class ConversationMessage:
    role: str  # "user", "assistant", "system"
    content: str
    timestamp: Optional[str] = None  # Metadata only; never sent to the API
    turn_id: Optional[int] = None

class DeterministicConversation:
    def __init__(self, system_prompt: str, conversation_id: str):
        self.system_prompt = system_prompt
        self.conversation_id = conversation_id
        self.messages: List[ConversationMessage] = []
        self.turn_count = 0
    
    def add_user_message(self, content: str) -> int:
        """Add user message, return turn ID."""
        self.turn_count += 1
        msg = ConversationMessage(
            role="user",
            content=content,
            timestamp=datetime.utcnow().isoformat(),
            turn_id=self.turn_count
        )
        self.messages.append(msg)
        return self.turn_count
    
    def add_assistant_message(self, content: str, turn_id: int = None):
        """Add assistant message with turn tracking."""
        if turn_id is None:
            turn_id = self.turn_count
        msg = ConversationMessage(
            role="assistant",
            content=content,
            timestamp=datetime.utcnow().isoformat(),
            turn_id=turn_id
        )
        self.messages.append(msg)
    
    def get_messages_for_api(self) -> List[dict]:
        """Return messages in API format (excluding system)."""
        return [
            {"role": msg.role, "content": msg.content}
            for msg in self.messages
        ]
    
    def save_to_db(self):
        """Persist conversation for reproducibility."""
        data = {
            "conversation_id": self.conversation_id,
            "system_prompt": self.system_prompt,
            "messages": [
                {
                    "role": msg.role,
                    "content": msg.content,
                    "timestamp": msg.timestamp,
                    "turn_id": msg.turn_id
                }
                for msg in self.messages
            ],
            "saved_at": datetime.utcnow().isoformat()
        }
        # Save to database or file (assumes the conversations/ directory already exists)
        with open(f"conversations/{self.conversation_id}.json", "w") as f:
            json.dump(data, f, indent=2)
    
    @classmethod
    def load_from_db(cls, conversation_id: str) -> "DeterministicConversation":
        """Reload conversation for reproducibility."""
        with open(f"conversations/{conversation_id}.json", "r") as f:
            data = json.load(f)
        
        conv = cls(data["system_prompt"], conversation_id)
        for msg_data in data["messages"]:
            msg = ConversationMessage(**msg_data)
            conv.messages.append(msg)
            if msg.role == "user":
                conv.turn_count = msg.turn_id
        
        return conv

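# Usage: deterministic conversation flow
from anthropic import Anthropic

client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

def run_conversation_deterministically():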
    conv = DeterministicConversation(
        system_prompt="You are a Socratic tutor in Python.",
        conversation_id="session_12345"
    )
    
    # Turn 1
    conv.add_user_message("What's the difference between a list and a tuple?")
    
    response_1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.0,  # Minimize sampling variance between runs
        system=conv.system_prompt,
        messages=conv.get_messages_for_api()
    )
    
    conv.add_assistant_message(response_1.content[0].text)
    
    # Turn 2
    conv.add_user_message("Are they the same performance-wise?")
    
    response_2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.0,  # Minimize sampling variance between runs
        system=conv.system_prompt,
        messages=conv.get_messages_for_api()
    )
    
    conv.add_assistant_message(response_2.content[0].text)
    
    # Save for reproducibility
    conv.save_to_db()
    
    return conv

# Later: replay conversation
def replay_conversation(conversation_id: str):
    """Load and inspect a past conversation."""
    conv = DeterministicConversation.load_from_db(conversation_id)
    for msg in conv.messages:
        print(f"{msg.role}: {msg.content[:100]}...")

Handling Context Window Limits

Long conversations may exceed the LLM's context window (e.g., 200K tokens for Claude 3.5 Sonnet). When truncating history, do so deterministically, so that the same input history always yields the same truncated history:

def truncate_conversation_for_context(
    messages: List[dict],
    max_tokens: int = 100000,
    strategy: str = "keep_recent"
) -> List[dict]:
    """Truncate conversation to fit context window."""
    
    from anthropic import Anthropic
    client = Anthropic()
    
    # Count tokens in current messages
    def count_tokens(msgs):
        # Rough estimate: whitespace-separated words usually undercount real tokens.
        # For accuracy, use the provider's token counting endpoint instead.
        return len(str(msgs).split())
    
    current_tokens = count_tokens(messages)
    
    if current_tokens <= max_tokens:
        return messages  # Fits; no truncation needed
    
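    if strategy == "keep_recent":
        # Keep newest messages; drop the oldest until the history fits
        truncated = list(messages)
        while len(truncated) > 1 and count_tokens(truncated) > max_tokens:
            truncated.pop(0)
        
        # The Messages API expects the first message to come from the user
        while len(truncated) > 1 and truncated[0]["role"] != "user":
            truncated.pop(0)
        
        return truncated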
    
    elif strategy == "summarize_old":
        # Summarize early messages, keep recent verbatim
        old_messages = messages[:-10]  # All but last 10
        recent_messages = messages[-10:]
        
        if len(old_messages) > 0:
            old_text = "\n".join([m.get("content", "") for m in old_messages])
            
            # Summarize using LLM
            summary_response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=300,
                temperature=0.0,  # Keep the summary as stable as possible across runs
                messages=[
                    {
                        "role": "user",
                        "content": f"Summarize this conversation in 100 words:\n\n{old_text}"
                    }
                ]
            )
            
            summary = summary_response.content[0].text
            
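            # Return summarized old + recent verbatim. The Messages API does not
            # accept a "system" role inside messages, so carry the summary in a
            # user message (or append it to the system parameter instead).
            return [
                {"role": "user", "content": f"Previous conversation summary:\n{summary}"},
                *recent_messages
            ]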
        
        return messages
    
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

# Usage: truncate before API call if needed
def chat_with_truncation(conversation: DeterministicConversation, user_input: str):
    messages = conversation.get_messages_for_api()
    
    # Ensure we fit in context window
    truncated = truncate_conversation_for_context(messages, max_tokens=100000)
    
    # If truncated, log it and update state. Slicing is only valid for
    # "keep_recent", where the result is a suffix of the original history.
    if len(truncated) < len(messages):
        print(f"Truncated conversation: {len(messages)} -> {len(truncated)} messages")
        conversation.messages = conversation.messages[-len(truncated):]  # Update state
    
    # Continue conversation
    conversation.add_user_message(user_input)
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=0.6,
        system=conversation.system_prompt,
        messages=conversation.get_messages_for_api()
    )
    
    conversation.add_assistant_message(response.content[0].text)
    return response.content[0].text

Testing Conversation Coherence

Beyond snapshots, test that conversations remain coherent across turns:

def test_conversation_coherence():
    """Ensure later responses refer back to earlier context correctly."""
    
    conv = DeterministicConversation(
        system_prompt="You are a helpful assistant.",
        conversation_id="test_coherence"
    )
    
    # Turn 1: Establish context
    conv.add_user_message("My name is Alice and I work in machine learning.")
    response_1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        temperature=0.0,  # Minimize sampling variance so the test is stable
        system=conv.system_prompt,
        messages=conv.get_messages_for_api()
    )
    conv.add_assistant_message(response_1.content[0].text)
    
    # Turn 2: Ask follow-up (should reference "Alice" or "ML")
    conv.add_user_message("What should I learn next?")
    response_2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        temperature=0.0,  # Minimize sampling variance so the test is stable
        system=conv.system_prompt,
        messages=conv.get_messages_for_api()
    )
    conv.add_assistant_message(response_2.content[0].text)
    
    assistant_turn_2 = response_2.content[0].text.lower()
    
    # Verify coherence: response should reference earlier context.
    # Match the full phrase; "learning" alone would also match generic advice.
    assert "machine learning" in assistant_turn_2 or "alice" in assistant_turn_2, \
        f"Response doesn't reference earlier context: {assistant_turn_2}"
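Substring checks like the one above are a heuristic and can match by accident. A slightly stricter sketch, assuming whole-word matching is good enough for the terms you care about (mentions_any is a hypothetical helper):

```python
import re

def mentions_any(text, terms):
    """Return True if any term appears in text as a whole word or phrase."""
    for term in terms:
        # \b anchors prevent "ML" from matching inside words like "HTML".
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            return True
    return False

context_terms = ["machine learning", "ML"]
on_topic = mentions_any(
    "Since you work in ML, try a course on transformers.", context_terms
)
off_topic = mentions_any("You could read about HTML templates.", context_terms)
```

This still only tests for surface mentions; it cannot tell whether the response uses the earlier context correctly, so treat it as a smoke test rather than a full coherence check.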

Key Takeaways

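Multi-turn conversations are reproducible when every message in the history is fixed: a pinned system prompt, fixed user inputs, and assistant outputs that are recorded and replayed (or generated at temperature 0 with a pinned model version, which reduces but does not eliminate variation).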
Snapshot test entire conversations (all turns) as JSON structures to catch regressions across the dialogue.
Store conversation state in a database with turn IDs, timestamps, and version tracking for reproducibility and debugging.
Truncate conversations deterministically when exceeding context windows: preserve recent messages or summarize old ones.
Test conversation coherence to ensure later responses reference earlier context correctly.

Frequently Asked Questions

If I change the system prompt, do all saved conversations break?

No, saved conversations persist their original system prompt and messages. They're immutable snapshots. But future conversations use the new system prompt, which may produce different outputs.

Should I snapshot every turn or the entire conversation at the end?

Entire conversation at the end. This catches subtle regressions where earlier turns are fine but later turns depend on fragile context. If you snapshot turn-by-turn, you'll miss issues like "Turn 2 output was slightly different, which caused Turn 3 to diverge."

What if the context window is too small to fit the full history?

Use truncate_conversation_for_context() to drop old messages deterministically (keep recent, or summarize old). Document which strategy you use in logs. For reproducibility, save the truncated history separately so you can replay it later.
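One way to document a truncation is to write a small record each time it happens. This is a sketch under assumptions: truncation_record is a hypothetical helper and the field names are illustrative, not a standard format.

```python
import json

def truncation_record(original, truncated, strategy):
    """Describe a truncation so the exact request can be reconstructed later.

    Hypothetical helper; the field names are illustrative.
    """
    return {
        "strategy": strategy,
        "original_count": len(original),
        "sent_count": len(truncated),
        "sent_messages": truncated,  # Stored verbatim for replay
    }

history = [
    {"role": "user", "content": "Turn 1 question"},
    {"role": "assistant", "content": "Turn 1 answer"},
    {"role": "user", "content": "Turn 2 question"},
]
record = truncation_record(history, history[-1:], "keep_recent")

# One JSON object per line is easy to append to a log file and parse back.
log_line = json.dumps(record, sort_keys=True)
```

Replaying from sent_messages, rather than re-running the truncation, avoids depending on the truncation logic staying unchanged between versions.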

Can I test multi-turn conversations without replaying every turn?

Not fully. Each turn needs the full history before it, so a multi-turn test has to reconstruct every earlier turn. However, you can cache responses from earlier turns and reuse them instead of regenerating them, which reduces API calls:

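# Save turn outputs to a file
turn_responses = {
    1: "A decorator is...",
    2: "@functools.lru_cache..."
}

# Reuse in tests: rebuild the history up to the turn under test
# (the system prompt is passed separately via the system parameter)
messages = [
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": turn_responses[1]},
    {"role": "user", "content": "Can you show me an example?"},
]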
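The same idea can be automated with a record/replay wrapper that keys each response by the exact request that produced it. The sketch below assumes an in-memory store and a generic generate callable; ReplayCache and stub_model are hypothetical names, and a real setup would persist the store to disk.

```python
import hashlib
import json

class ReplayCache:
    """Hypothetical record/replay wrapper around a model call.

    generate is any callable (system_prompt, messages) -> str. The first
    request with a given history calls it; identical requests are replayed.
    """

    def __init__(self, generate):
        self._generate = generate
        self._store = {}
        self.live_calls = 0

    def _key(self, system_prompt, messages):
        blob = json.dumps([system_prompt, messages], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def reply(self, system_prompt, messages):
        key = self._key(system_prompt, messages)
        if key not in self._store:
            self.live_calls += 1
            self._store[key] = self._generate(system_prompt, messages)
        return self._store[key]

def stub_model(system_prompt, messages):
    # Stand-in for a real API call so the sketch runs offline.
    return f"answer #{len(messages)}"

cache = ReplayCache(stub_model)
history = [{"role": "user", "content": "What is a decorator?"}]
first = cache.reply("You are a helpful Python tutor.", history)
second = cache.reply("You are a helpful Python tutor.", history)
```

Because the key covers the system prompt and the full history, any change to an earlier turn misses the cache and triggers a fresh call, which is the behavior a reproducibility test needs.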
