
Chat format datasets: Structuring conversations for training

Chat format datasets represent conversations with multiple turns and explicit role labels (user, assistant, system). Unlike instruction-response pairs, chat data preserves context across turns: earlier messages inform later responses. This article covers structuring multi-turn conversations, handling system prompts, managing context windows, and common pitfalls in dialogue datasets.

Core Chat Format Structure

Chat format uses a messages array where each message has a role and content:

{
"messages": [
{"role": "user", "content": "I need help with my billing issue."},
{"role": "assistant", "content": "I'd be happy to help. Can you tell me what the issue is?"},
{"role": "user", "content": "I was charged twice for my subscription."},
{"role": "assistant", "content": "I apologize for that. I've found your account..."}
]
}

Roles:

  • user — The human asking the question or providing input.
  • assistant — The model's response (what you're teaching it to say).
  • system — Instructions for the assistant's behavior (optional but powerful).

The system role is not a user or assistant turn; it's a meta-instruction that frames the conversation:

{
"messages": [
{"role": "system", "content": "You are a helpful customer support agent for a SaaS company. Be empathetic, clear, and accurate. If unsure, ask clarifying questions rather than guessing."},
{"role": "user", "content": "How do I cancel my subscription?"},
{"role": "assistant", "content": "To cancel, log into your account, go to Settings > Billing, and click 'Cancel Subscription'. You'll have access until the end of your billing cycle."}
]
}
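The structure above can be checked mechanically before training. The following is a minimal sketch, assuming each record is a dict with a `messages` list; the function name `check_message_schema` and the allowed-role tuple are illustrative, and your provider may accept additional roles.

```python
# Assumption: only the three roles described above are allowed.
ALLOWED_ROLES = ("system", "user", "assistant")

def check_message_schema(record):
    """Return a list of problems found in one chat-format record (empty list = OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: content must be a non-empty string")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message should come first")
    return problems
```

Running this over every record and logging the non-empty results is usually enough to catch malformed rows early.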

Building Multi-Turn Conversations

A multi-turn conversation teaches the model to maintain coherence across several exchanges. Here's a realistic example:

{
"messages": [
{"role": "system", "content": "You are a Python tutor. Explain concepts clearly, provide examples, and encourage the student to try code themselves."},
{"role": "user", "content": "What are decorators in Python?"},
{"role": "assistant", "content": "Decorators are functions that modify the behavior of another function or class without changing the original function's code. They use the `@` symbol syntax."},
{"role": "user", "content": "Can you give an example?"},
{"role": "assistant", "content": "Sure! Here's a simple example:\n\n```python\ndef my_decorator(func):\n    def wrapper():\n        print('Before function')\n        func()\n        print('After function')\n    return wrapper\n\n@my_decorator\ndef say_hello():\n    print('Hello!')\n\nsay_hello()\n```\n\nThis will print:\n```\nBefore function\nHello!\nAfter function\n```"},
{"role": "user", "content": "How do decorators with arguments work?"},
{"role": "assistant", "content": "Decorators with arguments use an extra layer of nesting..."}
]
}

This single example teaches the model to:

  1. Follow the system instruction (be a Python tutor).
  2. Answer the initial question.
  3. Provide a concrete example when asked.
  4. Handle follow-up questions with reference to prior context.

Multi-turn conversations teach the model more than single-turn pairs do, but they are also longer and consume more tokens during training. A 4-turn conversation might use 600 tokens, compared with 100 tokens for a simple instruction-response pair.

System Prompts in Fine-tuning

System prompts define the assistant's persona and constraints. Including them in your training data is critical if you want the fine-tuned model to internalize that behavior.

If you fine-tune on conversations with system prompts, the model learns to follow the framing. For example, a model fine-tuned on customer support conversations with a "be empathetic" system prompt will tend to produce empathetic responses even when deployed without the system prompt.

If you fine-tune on conversations without system prompts, the model learns the assistant's behavior in isolation, and you'll need to provide the system prompt at inference time.

Here's an example of a system prompt effective for fine-tuning:

{
"messages": [
{"role": "system", "content": "You are a technical documentation writer. Write clear, concise explanations. Use examples. Organize with headers and bullet points. Assume the reader is a beginner."},
{"role": "user", "content": "How do I set up a Python virtual environment?"},
{"role": "assistant", "content": "## Setting Up a Python Virtual Environment\n\nA virtual environment is an isolated Python workspace...\n\n### Steps\n\n1. Open your terminal.\n2. Run `python3 -m venv myenv`...\n"}
]
}

Best practices for system prompts:

  • Be specific. "You are a helpful assistant" is too vague. "You are a billing support agent who clarifies charges and processes refunds" is actionable.
  • Include constraints. "Never disclose sensitive customer data." "Keep responses under 200 words." "Avoid speculation; cite internal documentation."
  • Provide style guidance. "Use simple language." "Be empathetic but professional." "Include code examples when relevant."
  • Avoid meta-instructions. Don't say "You will now role-play as...". Just state the role directly.

Token Management for Chat Conversations

Multi-turn conversations consume tokens quickly. Here's how to account for them:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

conversation = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain machine learning."},
{"role": "assistant", "content": "Machine learning is a subset of AI where systems..."},
{"role": "user", "content": "What are the types?"},
{"role": "assistant", "content": "There are three main types..."}
]

total_tokens = 0
for msg in conversation:
    # Account for role and formatting overhead (~4 tokens per message)
    tokens = len(enc.encode(msg["content"])) + 4
    total_tokens += tokens
    print(f"{msg['role']}: {tokens} tokens")

print(f"Total: {total_tokens} tokens")

A typical 5-message conversation (system + 2 user turns + 2 assistant turns) uses 400–800 tokens. If you have 1,000 such conversations, that's 400K–800K tokens for training. Compare this to instruction-response pairs at 100–300 tokens each.

Strategy for managing length:

  • Include 2–4 turns per conversation maximum.
  • If a conversation is longer, split it into multiple training examples (each preserving context from prior turns).
  • Prioritize quality of examples over quantity; a 3-turn conversation with perfect dialogue is more valuable than a 10-turn conversation with noise.
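One way to split a long conversation, as the list above suggests, is to cut it into overlapping windows so that each training example carries some of the preceding turns. This is a sketch rather than a prescribed method: `split_conversation`, `max_messages`, and `stride` are hypothetical names, and the code assumes roles alternate user/assistant after an optional system message.

```python
def split_conversation(messages, max_messages=4, stride=2):
    """Split one long conversation into overlapping windows.

    Each window keeps the system message (if any), starts on a user turn,
    and ends on an assistant turn. Windows overlap by max_messages - stride
    messages, which is how earlier context is carried into later examples.
    """
    if max_messages % 2 or stride % 2 or not 0 < stride <= max_messages:
        raise ValueError("max_messages and stride must be even, with 0 < stride <= max_messages")

    system = [m for m in messages[:1] if m["role"] == "system"]
    turns = messages[len(system):]
    # Drop a trailing user message: it has no assistant response to learn from.
    if len(turns) % 2:
        turns = turns[:-1]

    if len(turns) <= max_messages:
        return [{"messages": system + turns}] if turns else []

    examples = []
    start = 0
    while True:
        examples.append({"messages": system + turns[start:start + max_messages]})
        if start + max_messages >= len(turns):
            break
        start += stride
    return examples
```

With the defaults, an 8-message dialogue yields three examples covering messages 1–4, 3–6, and 5–8. A smaller `stride` gives more overlap (more context reuse, more tokens); setting `stride` equal to `max_messages` gives no overlap.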

Extracting Conversations from Production Logs

Production conversations are gold: they're real, diverse, and reflect actual use cases. Here's how to extract them:

import json

def extract_conversation_from_log(ticket):
    """
    Extract a chat-format conversation from a customer support ticket.
    Assumes the ticket has a "messages" list whose items have "sender" and "text" fields.
    """
    conversation = {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer support agent. Be helpful, accurate, and friendly."
            }
        ]
    }

    for msg in ticket["messages"]:
        # Customer messages become user turns; any other sender is treated as the assistant
        role = "user" if msg["sender"] == "customer" else "assistant"
        conversation["messages"].append({
            "role": role,
            "content": msg["text"]
        })

    return conversation

# Example log entry
ticket = {
"id": "TICKET-123",
"created_at": "2026-01-15T10:30:00Z",
"messages": [
{"sender": "customer", "text": "Can I return this product?"},
{"sender": "agent", "text": "Of course! We accept returns within 30 days."},
{"sender": "customer", "text": "What about the return shipping cost?"},
{"sender": "agent", "text": "We provide a prepaid shipping label."}
]
}

conversation = extract_conversation_from_log(ticket)
print(json.dumps(conversation, indent=2))
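Logs can contain several consecutive messages from the same sender (for example, a customer sending three short lines in a row), which would break strict user/assistant alternation after conversion. A possible preprocessing step, assuming the same hypothetical `sender` and `text` fields as the ticket above, is to merge such runs first; `merge_consecutive_messages` is an illustrative helper, not part of any library.

```python
def merge_consecutive_messages(ticket_messages, separator="\n"):
    """Collapse runs of messages from the same sender into a single message."""
    merged = []
    for msg in ticket_messages:
        if merged and merged[-1]["sender"] == msg["sender"]:
            # Same sender as the previous entry: append the text to it
            merged[-1]["text"] += separator + msg["text"]
        else:
            # New sender: start a fresh entry (copied, so the input is not mutated)
            merged.append({"sender": msg["sender"], "text": msg["text"]})
    return merged
```

Applying this to `ticket["messages"]` before the extraction step keeps the resulting roles alternating.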

Common Pitfalls in Chat Datasets

Pitfall 1: Unclear role alternation. If user and assistant roles are swapped or ambiguous, the model learns incorrect behavior. Always validate role sequences:

def validate_roles(messages):
    """Ensure roles alternate: system (optional) then user, assistant, user, assistant, ..."""
    roles = [m["role"] for m in messages]

    # System should be first (optional); the guard avoids an IndexError on an empty list
    if roles and roles[0] == "system":
        roles = roles[1:]

    # Then must alternate user, assistant, user, assistant, ...
    expected = ["user" if i % 2 == 0 else "assistant" for i in range(len(roles))]
    if roles != expected:
        return False, f"Invalid sequence: {roles}"

    return True, "Valid"

messages = [
{"role": "system", "content": "..."},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi"},
{"role": "user", "content": "Help me"}
]

is_valid, msg = validate_roles(messages)
print(msg)

Pitfall 2: Incomplete conversations. A conversation that ends on a user turn (the assistant's final response is missing) gives the model no target for that last message, and a response cut off mid-sentence teaches the model to stop mid-response. Always end conversations with a complete assistant turn:

Bad:

{
"messages": [
{"role": "user", "content": "How do I...?"},
{"role": "assistant", "content": "You can..."},
{"role": "user", "content": "What about edge case X?"}
]
}

Good:

{
"messages": [
{"role": "user", "content": "How do I...?"},
{"role": "assistant", "content": "You can..."},
{"role": "user", "content": "What about edge case X?"},
{"role": "assistant", "content": "For edge case X, you should..."}
]
}
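A simple guard for this pitfall is to trim trailing non-assistant messages before writing an example. The helper below is a sketch; `trim_to_last_assistant` is a hypothetical name, and it only looks at roles, so it will not detect an assistant message whose text is cut off.

```python
def trim_to_last_assistant(messages):
    """Drop trailing messages until the conversation ends on an assistant turn.

    Returns an empty list if there is no assistant message at all, so the
    caller can discard the example.
    """
    end = len(messages)
    while end > 0 and messages[end - 1]["role"] != "assistant":
        end -= 1
    return messages[:end]
```

If the function returns an empty list, the conversation has nothing for the model to learn from and can be skipped.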

Pitfall 3: Leaking future context. If the assistant's response references information the user hasn't provided yet, the model learns impossible behavior. Always ensure responses logically follow from prior messages.

Pitfall 4: System prompt over-tuning. If all your training conversations have highly specific system prompts, the model becomes dependent on those exact prompts. Vary system prompts or include conversations without them.
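One way to follow this advice is to drop the system message from a random share of examples, so the model sees the target behavior both with and without the prompt. This is a sketch under assumptions: `drop_system_prompts` and `keep_fraction` are hypothetical names, and the 0.6 default is only an illustrative value.

```python
import random

def drop_system_prompts(dataset, keep_fraction=0.6, seed=0):
    """Return a copy of the dataset in which roughly keep_fraction of the
    examples that have a leading system message keep it."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    result = []
    for example in dataset:
        messages = example["messages"]
        has_system = bool(messages) and messages[0]["role"] == "system"
        if has_system and rng.random() >= keep_fraction:
            messages = messages[1:]  # strip the leading system message
        result.append({"messages": list(messages)})
    return result
```

Because the choice is random per example, the kept share only approximates `keep_fraction` on small datasets.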

Key Takeaways

  • Chat format is essential for multi-turn dialogue fine-tuning; it preserves context across turns.
  • System prompts define behavior and persona; include them in training data if you want the model to internalize them.
  • Limit conversations to 2–4 turns to manage token usage; prioritize quality over quantity.
  • Extract conversations from production logs, validate role alternation, and ensure conversations end with assistant turns.
  • Multi-turn datasets are more powerful but more expensive than instruction-response pairs; use both strategically.

Frequently Asked Questions

Do I need system prompts in my fine-tuning data?

Not strictly, but it helps. If you include system prompts in training data, the model internalizes the behavior. If you don't, you'll need to provide system prompts at inference time. For maximum flexibility, include 50–70% of examples with system prompts and 30–50% without.

How many turns should a conversation have?

Aim for 2–4 turns (1–2 user turns, each followed by an assistant turn, plus an optional system message). Longer conversations teach context handling but consume more tokens. If you have longer conversations, split them into multiple examples, preserving context.

Can I fine-tune on one-turn conversations in chat format?

Yes. A single user-assistant exchange is valid:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

This is equivalent to instruction-response format but in chat schema. Use this if your provider requires chat format for consistency.

How do I handle conversations with multiple participants (e.g., customer, agent, supervisor)?

Map them to user and assistant roles. If a supervisor intervenes, it's still an assistant turn. Keep the schema simple: user (customer, external) and assistant (all internal responses).
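This mapping can be expressed as a small lookup table. The sender labels (`customer`, `agent`, `supervisor`) and the `sender`/`text` field names are assumed for illustration; adapt them to whatever your logs record.

```python
# Assumption: these sender labels are illustrative, not a standard.
SENDER_TO_ROLE = {
    "customer": "user",
    "agent": "assistant",
    "supervisor": "assistant",
}

def to_chat_messages(log_messages):
    """Convert log entries with 'sender' and 'text' fields into chat-format messages."""
    messages = []
    for entry in log_messages:
        role = SENDER_TO_ROLE.get(entry["sender"])
        if role is None:
            # Fail loudly rather than guessing a role for an unknown participant
            raise ValueError(f"Unmapped sender: {entry['sender']!r}")
        messages.append({"role": role, "content": entry["text"]})
    return messages
```

Note that an agent message followed directly by a supervisor message produces two consecutive assistant turns, which a strict alternation check such as `validate_roles` will reject; merging the two into one assistant message is one option.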

Should I anonymize PII in chat conversations?

Yes, always. Replace real names, emails, phone numbers, and account IDs with placeholders like [CUSTOMER_NAME], [EMAIL], [ACCOUNT_ID]. This protects privacy and prevents the model from memorizing sensitive data.
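A regex pass can catch the most regular patterns. This is a rough sketch only: the patterns below are simplified assumptions, the account-ID format (`ACC-` followed by digits) is invented for illustration, and the `[PHONE]` placeholder is an added example. Personal names generally need a dedicated PII-detection tool and human review rather than regexes.

```python
import re

# Simplified, illustrative patterns; real data needs broader coverage and review.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\bACC-\d+\b"), "[ACCOUNT_ID]"),  # hypothetical account-ID format
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def anonymize(text):
    """Replace every match of each pattern with its placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(anonymize("Email jane.doe@example.com or call +1 (555) 123-4567 about ACC-98765."))
```

Apply the function to the `content` of every message, including system prompts, before writing the dataset.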
