Instruction formatting for LLM fine-tuning explained
Instruction formatting determines how each training example is structured for fine-tuning. The same 500 examples can train a noticeably better model when they are formatted consistently than when they are formatted poorly. Format defines the contract between input and output: how the model learns to parse instructions, context, and expected responses. This article covers three dominant formats (instruction-response, chat, and task-specific), token accounting, and anti-patterns to avoid.
Three Standard Formats
Format 1: Instruction-Response (Simplest)
The instruction-response format is the simplest and most common for instruction-following fine-tuning:
{
"instruction": "Explain quantum entanglement in simple terms.",
"response": "Quantum entanglement happens when two particles become linked so that they instantly affect each other..."
}
This format works well for:
- Q&A tasks
- Summarization (instruction: article, response: summary)
- Translation (instruction: English text, response: translated text)
- Code generation (instruction: description, response: code)
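Use this format when each example is self-contained and doesn't require conversation history. Most training pipelines do not consume these field names directly: OpenAI's fine-tuning API expects JSONL with a chat-style messages array for its current chat models (older completion models used prompt/completion pairs), and Hugging Face's TRL library accepts prompt/completion or messages records. In practice, you map instruction and response onto whatever schema your trainer requires.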
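As an illustration of mapping these pairs onto a trainer-specific schema, the sketch below converts an instruction-response pair into a chat-style messages record and serializes the result as JSONL. The function names and the optional system prompt are hypothetical, and the target layout is assumed to be a role-based messages array; check your trainer's documentation for the exact schema it expects.

```python
import json

def to_chat_record(example, system_prompt=None):
    """Map an instruction-response pair onto a chat-style messages record."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": example["instruction"]})
    messages.append({"role": "assistant", "content": example["response"]})
    return {"messages": messages}

def to_jsonl(examples):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(to_chat_record(ex), ensure_ascii=False) for ex in examples
    )

pair = {
    "instruction": "Explain quantum entanglement in simple terms.",
    "response": "Quantum entanglement happens when two particles become linked...",
}
print(to_jsonl([pair]))
```

Keeping the source data in the simple two-field form and converting at export time means the same dataset can be reused with a different trainer later.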
Format 2: Chat Format (Multi-turn Conversations)
The chat format mirrors real conversations with role-based turns:
{
"messages": [
{"role": "user", "content": "What is your refund policy?"},
{"role": "assistant", "content": "We offer full refunds within 30 days of purchase..."},
{"role": "user", "content": "What if the item is damaged?"},
{"role": "assistant", "content": "In that case, we replace it at no cost..."}
]
}
Chat format is ideal for:
- Customer support conversations.
- Tutoring or Socratic dialogue.
- Systems that maintain context across turns.
- Multi-turn reasoning tasks.
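OpenAI's fine-tuning endpoints for chat models use this messages format, and many open-source trainers accept it as well; check your provider's documentation for the exact schema. Each assistant turn teaches the model to respond appropriately given the conversation history that precedes it.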
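To make the "each turn given prior history" idea concrete, here is a minimal sketch that expands one conversation into (history, target) pairs, one per assistant turn. The helper name expand_turns is hypothetical, and whether a given trainer actually computes loss on every assistant turn or only the last one depends on that trainer; this only illustrates the concept.

```python
def expand_turns(record):
    """Return (history, target) pairs, one for each assistant turn."""
    messages = record["messages"]
    pairs = []
    for i, message in enumerate(messages):
        if message["role"] == "assistant":
            # Everything before this turn is the context the model sees.
            pairs.append((messages[:i], message["content"]))
    return pairs

conversation = {
    "messages": [
        {"role": "user", "content": "What is your refund policy?"},
        {"role": "assistant", "content": "We offer full refunds within 30 days of purchase..."},
        {"role": "user", "content": "What if the item is damaged?"},
        {"role": "assistant", "content": "In that case, we replace it at no cost..."},
    ]
}

for history, target in expand_turns(conversation):
    print(len(history), "prior messages ->", target)
```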
Format 3: Task-Specific Structures
Some tasks require custom structures. Here are common variants:
For classification (labeling text with a category):
{
"text": "The product broke after one day.",
"label": "negative",
"instruction": "Classify this review as positive, negative, or neutral."
}
For structured output (generating JSON):
{
"description": "John Smith, age 35, from New York, works in tech",
"instruction": "Extract the person's name, age, location, and profession into JSON.",
"response": "{\"name\": \"John Smith\", \"age\": 35, \"location\": \"New York\", \"profession\": \"tech\"}"
}
For conditional responses (the same instruction answered in different styles):
{
"instruction": "Write a poem about autumn.",
"style": "haiku",
"response": "Leaves turn golden-red,\nChill winds whisper secrets,\nEarth rests, dreams anew."
}
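Custom structures like these usually have to be rendered into a plain input/output pair before training. The sketch below shows one possible rendering for the classification and styled records; the function names and the template wording ("Text:", "Style:") are assumptions for illustration, not a standard.

```python
def render_classification(record):
    """Fold the text into the instruction; the label becomes the response."""
    return {
        "instruction": f"{record['instruction']}\n\nText: {record['text']}",
        "response": record["label"],
    }

def render_styled(record):
    """Append the requested style to the instruction."""
    return {
        "instruction": f"{record['instruction']} Style: {record['style']}.",
        "response": record["response"],
    }

review = {
    "text": "The product broke after one day.",
    "label": "negative",
    "instruction": "Classify this review as positive, negative, or neutral.",
}
print(render_classification(review))
```

Whatever template you choose, apply the same one at inference time so the model sees inputs shaped like its training data.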
Token Accounting and Length Constraints
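Every model has a context window: GPT-4 Turbo has 128K tokens, Claude 3.5 Sonnet has 200K, and the original Llama 3 has 8K. Fine-tuning examples must fit within your model's context window, and many fine-tuning services and trainers enforce a smaller maximum sequence length. An overly long example wastes compute, and one that exceeds the maximum length is typically truncated or rejected.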
Token counting:
Use the model's official tokenizer to count tokens. For OpenAI:
import tiktoken
# Load the tokenizer that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4")
instruction = "Explain quantum entanglement"
response = "Quantum entanglement is a phenomenon in physics..."
# Count each field separately, then sum.
# Chat-formatted examples add a few extra tokens per message for role markers.
instruction_tokens = len(enc.encode(instruction))
response_tokens = len(enc.encode(response))
total = instruction_tokens + response_tokens
print(f"Instruction: {instruction_tokens} tokens")
print(f"Response: {response_tokens} tokens")
print(f"Total: {total} tokens")
Typical token counts:
- Short instruction (e.g., "What is X?"): 5–10 tokens.
- Medium instruction (e.g., multi-sentence question): 20–50 tokens.
- Long instruction (e.g., a full paragraph of context): 100–200 tokens.
- Short response (one sentence): 15–30 tokens.
- Long response (multi-paragraph): 200–500 tokens.
For fine-tuning, aim for examples with 50–500 total tokens as a rule of thumb. Examples under 50 tokens are often too trivial to teach much; examples over 1,000 tokens become expensive and may approach the maximum sequence length. A typical fine-tuning dataset might have 500 examples averaging 150 tokens each, totaling 75K tokens.
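The length guideline and the budget arithmetic can be checked with a short script. This is a sketch under stated assumptions: whitespace_count is a crude stand-in that counts words, not real tokens, so pass your model's tokenizer as count_tokens for accurate numbers. The function names and the 50/500 defaults simply mirror the rule of thumb above.

```python
def whitespace_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def length_report(examples, count_tokens=whitespace_count,
                  min_tokens=50, max_tokens=500):
    """Tally examples that fall below, inside, or above the target range."""
    report = {"kept": 0, "too_short": 0, "too_long": 0, "total_tokens": 0}
    for ex in examples:
        n = count_tokens(ex["instruction"]) + count_tokens(ex["response"])
        report["total_tokens"] += n
        if n < min_tokens:
            report["too_short"] += 1
        elif n > max_tokens:
            report["too_long"] += 1
        else:
            report["kept"] += 1
    return report

# Dataset budget from the text: 500 examples averaging 150 tokens each.
print(500 * 150)  # 75000
```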
Structuring Effective Instruction-Response Pairs
Make instructions specific and actionable. Vague instructions teach the model vague behavior.
Bad: "Explain the concept."
Good: "Explain quantum entanglement in 2–3 sentences, avoiding jargon."
The good version teaches the model how to respond (length, audience, style).
Include context when needed. If the response depends on prior information, embed it:
{
"instruction": "Summarize the following article in one paragraph: [article text]",
"response": "[summary]"
}
Provide diverse examples of the same task. Vary the phrasing rather than repeating the exact same instruction:
{"instruction": "What is machine learning?", "response": "..."}
{"instruction": "Define machine learning.", "response": "..."}
{"instruction": "Explain what machine learning means.", "response": "..."}
These variations teach the model to handle rephrasing.
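One way to spot a lack of variety is to count instructions that are identical once case and punctuation are ignored. The sketch below is a simple heuristic with hypothetical helper names; it catches exact repeats only and will not detect paraphrases that say the same thing with different words.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def repeated_instructions(examples):
    """Return normalized instructions that occur more than once, with counts."""
    counts = Counter(normalize(ex["instruction"]) for ex in examples)
    return {text: n for text, n in counts.items() if n > 1}

data = [
    {"instruction": "What is machine learning?", "response": "..."},
    {"instruction": "what is machine learning", "response": "..."},
    {"instruction": "Define machine learning.", "response": "..."},
]
print(repeated_instructions(data))
```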
Specify output format explicitly. If you need JSON, code, or structured output, include it in the instruction:
{
"instruction": "Extract the following fields from this customer review as JSON: sentiment (positive/negative), product_category, and suggested_rating (1-5). Review: [text]",
"response": "{\"sentiment\": \"positive\", \"product_category\": \"electronics\", \"suggested_rating\": 4}"
}
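When the response is supposed to be JSON, it is worth confirming that every training response actually parses and contains the fields the instruction asks for. A minimal sketch, assuming the three field names from the example above; the function name check_json_response is hypothetical.

```python
import json

REQUIRED_FIELDS = {"sentiment", "product_category", "suggested_rating"}

def check_json_response(example, required=REQUIRED_FIELDS):
    """Return a list of problems; an empty list means the response looks fine."""
    try:
        parsed = json.loads(example["response"])
    except json.JSONDecodeError as e:
        return [f"response is not valid JSON: {e.msg}"]
    if not isinstance(parsed, dict):
        return ["response is not a JSON object"]
    missing = sorted(set(required) - set(parsed))
    return [f"missing field: {name}" for name in missing]

good = {
    "instruction": "Extract the fields as JSON. Review: [text]",
    "response": "{\"sentiment\": \"positive\", \"product_category\": \"electronics\", \"suggested_rating\": 4}",
}
print(check_json_response(good))
```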
Common Anti-Patterns
Anti-pattern 1: Inconsistent formatting. Some examples use instruction: ... and response: ..., others use prompt: ... and completion: .... Mixed field names either break the data loader or produce examples rendered with different templates, which teaches the model inconsistent patterns. Pick one schema and validate it programmatically:
import json
from jsonschema import validate, ValidationError
# Every example must provide both fields as strings.
schema = {
    "type": "object",
    "properties": {
        "instruction": {"type": "string"},
        "response": {"type": "string"}
    },
    "required": ["instruction", "response"]
}
with open("dataset.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            example = json.loads(line)
            validate(example, schema)
        except (json.JSONDecodeError, ValidationError) as e:
            # Report the 1-based line number of the offending example.
            print(f"Line {i}: {e}")
Anti-pattern 2: Leaking answers in instructions. Stating the answer in the instruction teaches the model to copy from its input instead of producing the answer itself:
Bad: "What is 2+2? The answer is 4."
Good: "What is 2+2?"
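A crude way to flag possible leaks is to check whether the response text appears verbatim inside the instruction. The sketch below is a heuristic with hypothetical names: very short responses will produce false positives (a response of "2" also appears in "What is 2+2?"), so treat hits as candidates for manual review rather than as definite errors.

```python
def _squash(text):
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())

def leaks_answer(example, min_chars=1):
    """True if the response appears verbatim inside the instruction."""
    instruction = _squash(example["instruction"])
    response = _squash(example["response"]).rstrip(".!?")
    return len(response) >= min_chars and response in instruction

bad = {"instruction": "What is 2+2? The answer is 4.", "response": "4"}
good = {"instruction": "What is 2+2?", "response": "4"}
print(leaks_answer(bad), leaks_answer(good))
```

Raising min_chars reduces noise from one-character answers at the cost of missing short leaks.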
Anti-pattern 3: Mismatched complexity. Pairing a trivial instruction with a complex response (or vice versa) teaches inconsistent behavior:
Bad:
{"instruction": "Hi.", "response": "Hello! How can I assist you with your complex technical infrastructure question?"}
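Anti-pattern 4: Excessive length variance. If responses range from 10 tokens to 1,000 tokens, the long examples contribute most of the training tokens and can dominate the loss, so the model may drift toward long outputs even when a short answer is appropriate. Normalize response length or stratify examples by length during training.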
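Stratifying by length starts with grouping examples into length buckets so you can see the distribution and sample evenly from each group. A minimal sketch, assuming hypothetical bucket boundaries of 50, 200, and 500 and a word-count stand-in for a real tokenizer:

```python
from collections import defaultdict

def bucket_by_length(examples, boundaries=(50, 200, 500),
                     count_tokens=lambda text: len(text.split())):
    """Group examples by response length; labels name each bucket's upper bound."""
    buckets = defaultdict(list)
    for ex in examples:
        n = count_tokens(ex["response"])
        # Pick the first boundary the length fits under, else the overflow bucket.
        label = next((f"<= {b}" for b in boundaries if n <= b),
                     f"> {boundaries[-1]}")
        buckets[label].append(ex)
    return dict(buckets)

data = [
    {"instruction": "a", "response": " ".join(["w"] * 10)},
    {"instruction": "b", "response": " ".join(["w"] * 200)},
    {"instruction": "c", "response": " ".join(["w"] * 600)},
]
for label, group in bucket_by_length(data).items():
    print(label, len(group))
```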
Anti-pattern 5: PII and sensitive data. Don't include real customer names, email addresses, credit card numbers, or proprietary data without anonymization. Doing so risks violating privacy obligations, and the model can memorize such data and reproduce it at inference time.
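A first-pass scrub for the most pattern-like PII can be done with regular expressions. The sketch below is illustrative only: the two patterns and the placeholder strings are assumptions, they cover only email addresses and card-like digit runs, and names or free-form identifiers need a dedicated PII detection tool or manual review.

```python
import re

# Simplified patterns; real-world PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators

def redact(text):
    """Replace email addresses and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)

print(redact("Contact jane.doe@example.com, card 4111 1111 1111 1111 on file."))
```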
Validation and Testing
Before fine-tuning, run a quick validation pass over the dataset file:
import json
def validate_dataset(filepath):
    """Return the examples that parse as JSON and pass basic sanity checks."""
    examples = []
    with open(filepath) as f:
        for line in f:
            try:
                example = json.loads(line)
                assert isinstance(example, dict), "not a JSON object"
                assert "instruction" in example and "response" in example, "missing field"
                assert isinstance(example["instruction"], str), "instruction is not a string"
                assert isinstance(example["response"], str), "response is not a string"
                # Length checks are in characters, not tokens.
                assert len(example["instruction"]) > 5, "instruction too short"
                assert len(example["response"]) > 5, "response too short"
                examples.append(example)
            except (json.JSONDecodeError, AssertionError) as e:
                print(f"Invalid example: {line[:100]} - {e}")
    print(f"Validated {len(examples)} examples successfully")
    return examples
examples = validate_dataset("training_data.jsonl")
Key Takeaways
- Instruction-response format is simplest for single-turn tasks; chat format for multi-turn conversations; task-specific formats for classification, structured output, or conditional behavior.
- Aim for examples with 50–500 total tokens; shorter examples are trivial, longer ones waste compute.
- Make instructions specific, include context when needed, provide diverse rephrasing, and specify output format explicitly.
- Validate your dataset format with a schema before uploading to fine-tune.
- Avoid inconsistent formatting, leaking answers, mismatched complexity, excessive length variance, and PII.
Frequently Asked Questions
What's the difference between "instruction" and "prompt"?
In fine-tuning literature, "instruction" typically refers to a user query or task description, while "prompt" can include instruction plus context or examples. For consistency, use "instruction" in your schema and ensure all examples follow the same field names.
Should I include the instruction in the response?
No. The instruction is the input; the response is the output the model should produce. Repeating the instruction inside the response teaches the model to echo its input before answering and spends tokens on text that carries no new information.
How do I format code generation examples?
Put the task description (and any context the code depends on) in the instruction and the code in the response:
{
"instruction": "Write a Python function to compute factorial using recursion.",
"response": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
}
Encode newlines as \n and keep the indentation inside the response string so the code's formatting survives.
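Rather than escaping newlines by hand, you can let a JSON library do it. The sketch below shows that json.dumps turns real newlines into \n escapes and that json.loads restores the code exactly, indentation included. The variable names are illustrative, and exec is used here only to demonstrate that the round-tripped code still runs.

```python
import json

code = '''def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)'''

example = {
    "instruction": "Write a Python function to compute factorial using recursion.",
    "response": code,
}

line = json.dumps(example)               # one JSONL line; newlines become \n escapes
restored = json.loads(line)["response"]  # original code, indentation intact

namespace = {}
exec(restored, namespace)                # the restored code still runs
print(namespace["factorial"](5))         # 120
```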
Can I fine-tune on examples longer than 2,000 tokens?
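Technically yes, as long as each example fits within the model's maximum sequence length, but it is often inefficient. Examples over 2,000 tokens cost more compute per example and often do not improve learning proportionally. Break long documents into chunks or summarize context instead.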
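Chunking a long document can be sketched as a sliding window over its tokens. The function name, the 2,000-token window, and the 200-token overlap are assumptions for illustration; the overlap is a common way to avoid cutting context at chunk borders, not something this article prescribes. The input is any list of tokens, for example the output of your tokenizer's encode method.

```python
def chunk_tokens(tokens, max_len=2000, overlap=200):
    """Split a token list into windows of at most max_len with some overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks

document = list(range(5000))  # stand-in for 5,000 token ids
print([len(c) for c in chunk_tokens(document)])
```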
How do I handle multi-language datasets?
Use consistent formatting with a language tag if needed:
{
"language": "en",
"instruction": "What is machine learning?",
"response": "..."
}
Keep languages separate during fine-tuning, or mix them only if your base model is multilingual and you want the fine-tuned model to handle every language in the dataset.
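If you do tag examples by language, separating them is a simple grouping step. A minimal sketch with a hypothetical helper; the default tag "und" (the ISO 639 code for an undetermined language) is an assumption for records that lack a language field.

```python
from collections import defaultdict

def split_by_language(examples, default="und"):
    """Group examples by their language tag."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex.get("language", default)].append(ex)
    return dict(groups)

data = [
    {"language": "en", "instruction": "What is machine learning?", "response": "..."},
    {"language": "de", "instruction": "Was ist maschinelles Lernen?", "response": "..."},
    {"instruction": "Untagged example", "response": "..."},
]
for lang, group in split_by_language(data).items():
    print(lang, len(group))
```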