Build a Fine-Tuning Dataset: Step-by-Step
Building a fine-tuning dataset is the most labor-intensive part of the fine-tuning pipeline. A good dataset has 500–5,000 diverse, correctly labeled examples that represent your task's real-world distribution. This article teaches you to design examples, efficiently label data, validate quality, and estimate labeling time and cost.
What Makes a Good Fine-Tuning Example?
A fine-tuning example is a pair: an input (what the model receives) and an output (what you want the model to produce). Both must be representative of real use cases.
Input: "My credit card was charged twice for the same order."
Output: "billing" # Category: billing issue
Input: "The app keeps crashing when I try to upload a file."
Output: "technical_support" # Category: technical issue
A good example has these properties:
- Realistic: It reflects actual queries or data your system will encounter, not synthetic edge cases.
- Diverse: Across 1,000 examples, you cover different phrasings, lengths, and contexts.
- Correctly labeled: The output is the ground truth; ambiguous or incorrect labels poison the training data.
- Balanced: Every class has enough examples to be learned. Class proportions either follow the real-world distribution or are deliberately weighted toward the classes that matter most.
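These properties can be partly checked in code before any training run. The sketch below is a minimal audit, assuming examples are stored as (text, label) tuples; the function name `audit_examples` and the report keys are illustrative and not part of any library.

```python
from collections import Counter

def audit_examples(examples):
    """Summarize label balance, near-exact duplicates, and input length."""
    texts = [text for text, _ in examples]
    label_counts = Counter(label for _, label in examples)
    # Lowercase and collapse whitespace so trivially different copies match.
    normalized = Counter(" ".join(t.lower().split()) for t in texts)
    duplicates = sum(count - 1 for count in normalized.values() if count > 1)
    word_lengths = sorted(len(t.split()) for t in texts)
    return {
        "total": len(examples),
        "label_share": {lbl: n / len(examples) for lbl, n in label_counts.items()},
        "duplicates": duplicates,
        "min_words": word_lengths[0],
        # Upper median when the number of examples is even.
        "median_words": word_lengths[len(word_lengths) // 2],
        "max_words": word_lengths[-1],
    }

sample = [
    ("My credit card was charged twice for the same order.", "billing"),
    ("my credit card was charged twice for the same order.", "billing"),
    ("The app keeps crashing when I try to upload a file.", "technical_support"),
    ("I was charged twice.", "billing"),
]
report = audit_examples(sample)
print(report)
```

A high duplicate count or a narrow range of word lengths suggests the set is less diverse than its size implies.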
Step 1: Define Your Task and Examples
Write a precise definition of what the model should do. For example:
Task: Intent Classification for Customer Support
Input: Customer message (free text, 1–500 words)
Output: One of these 5 categories: billing, technical_support, refund_request, product_inquiry, complaint (plus "other" for out-of-scope messages)
Example pairs:
- Input: "I was charged twice." → Output: "billing"
- Input: "Your app is slow." → Output: "complaint"
- Input: "What tier am I on?" → Output: "product_inquiry"
Edge cases to handle:
- Ambiguous messages (pick the primary intent)
- Multi-intent messages (pick the dominant intent; mark secondary if needed)
- Out-of-scope questions (mark as "other")
This definition is your annotation guide. Share it with all labelers.
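It can help to encode the guide's hard rules so that every labeled row is checked automatically. The following is a rough sketch under the task definition above; `ALLOWED_LABELS`, `MAX_WORDS`, and `validate_annotation` are hypothetical names chosen for illustration.

```python
# Labels from the task definition, plus "other" for out-of-scope messages.
ALLOWED_LABELS = {
    "billing", "technical_support", "refund_request",
    "product_inquiry", "complaint", "other",
}
MAX_WORDS = 500  # upper bound on input length from the task definition

def validate_annotation(text, label, secondary=None):
    """Return a list of problems; an empty list means the row follows the guide."""
    problems = []
    n_words = len(text.split())
    if not 1 <= n_words <= MAX_WORDS:
        problems.append(f"input has {n_words} words; expected 1-{MAX_WORDS}")
    if label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {label!r}")
    if secondary is not None:
        if secondary not in ALLOWED_LABELS:
            problems.append(f"unknown secondary label: {secondary!r}")
        elif secondary == label:
            problems.append("secondary label duplicates the primary label")
    return problems

print(validate_annotation("I was charged twice.", "billing"))
print(validate_annotation("Your app is slow.", "Complaint"))
```

A check like this catches typos and casing drift in labels early, before they silently become extra classes.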
Step 2: Source Raw Data
Where does your training data come from? Options:
- Existing records: Customer tickets, chat logs, support emails, user feedback. Often the cheapest source (already collected). Drawback: historical data may be incomplete or biased.
- New collection: Solicit examples from your team or users. More expensive but often higher quality and fresher.
- Public datasets: Find similar tasks on Hugging Face, Kaggle, or academic sources. Good for validating your approach but may not match your exact domain.
- Synthetic generation: Use a strong base model (e.g., Claude Opus) to generate plausible examples. Fast and cheap but risks distribution mismatch; use primarily for augmentation, not core training.
Recommendation: Start with existing records (cheapest), augment with synthetic data (about 20% of the final set), and plan to add new real examples as they arrive.
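One possible way to enforce a cap on synthetic data is sketched below. It assumes the 20% figure means the synthetic share of the final mixed set, which is an interpretation rather than something the recommendation states; `mix_sources` and the `source` field are illustrative names.

```python
import random

def mix_sources(real, synthetic, synthetic_share=0.2, seed=0):
    """Combine real and synthetic texts so synthetic rows are at most
    `synthetic_share` of the result. Each row records where it came from."""
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    # Solve s / (len(real) + s) <= share for s; the epsilon guards float rounding.
    max_synthetic = int(len(real) * synthetic_share / (1 - synthetic_share) + 1e-9)
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(max_synthetic, len(synthetic)))
    mixed = [{"text": t, "source": "real"} for t in real]
    mixed += [{"text": t, "source": "synthetic"} for t in chosen]
    rng.shuffle(mixed)
    return mixed

real_msgs = [f"real message {i}" for i in range(8)]
synthetic_msgs = [f"synthetic message {i}" for i in range(10)]
mixed = mix_sources(real_msgs, synthetic_msgs)
n_synthetic = sum(row["source"] == "synthetic" for row in mixed)
print(f"{len(mixed)} rows, {n_synthetic} synthetic")
```

Keeping the `source` field makes it possible to evaluate on real examples only, which is where a distribution mismatch would show up.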
Step 3: Annotate Examples
Labeling is typically the bottleneck. Three approaches:
In-House Labeling
You and your team label examples. Pros: cheap, high control. Cons: only realistic for 100–500 examples; prone to label drift (your definitions evolve over time).
Process: Create a shared spreadsheet or tool (e.g., Prodigy, Label Studio) with:
- Column 1: Raw example (customer message)
- Column 2: Annotation (category, label, or output)
- Column 3: Confidence (high/medium/low)
- Column 4: Notes (edge cases, ambiguities)
Have at least two people label the same 10% of examples independently, measure agreement (Cohen's kappa for each pair of annotators, ideally 0.75+), then resolve disagreements via discussion.
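The overlap step can be sketched as follows: pick a reproducible 10% subset for everyone to label, then list the items where two annotators differ so they can be discussed. The helper names are hypothetical, and labels are assumed to be stored as dicts keyed by example id.

```python
import random

def pick_overlap_ids(example_ids, fraction=0.10, seed=0):
    """Choose a reproducible random subset that every annotator labels."""
    ids = list(example_ids)
    k = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, k))

def find_disagreements(labels_a, labels_b):
    """Return ids that both annotators labeled, but with different labels."""
    shared = labels_a.keys() & labels_b.keys()
    return sorted(i for i in shared if labels_a[i] != labels_b[i])

overlap = pick_overlap_ids(range(500))
print(f"{len(overlap)} examples go to every annotator")

annotator_a = {1: "billing", 2: "complaint", 3: "other"}
annotator_b = {1: "billing", 2: "technical_support"}
print(find_disagreements(annotator_a, annotator_b))
```

Fixing the seed means the same subset can be regenerated later if a new annotator joins.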
Contract Annotation Services
Services like Scale AI, Labelbox, or Amazon Mechanical Turk handle labeling at scale. Costs vary: $0.50–$5 per example depending on complexity.
Process: Write a detailed annotation guide (examples, edge cases, common mistakes), upload your examples, and let crowdworkers label them. Expect individual crowd labels to be wrong 10–20% of the time, so require 3 independent labels per example and take the majority vote.
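Voting over independent labels can be sketched like this. It is a minimal version that accepts a label only when a strict majority agrees and otherwise returns `None` so the item can go to manual review; the function name is illustrative.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

crowd_labels = {
    "msg-1": ["billing", "billing", "complaint"],
    "msg-2": ["billing", "complaint", "technical_support"],
}
resolved = {key: majority_vote(votes) for key, votes in crowd_labels.items()}
needs_review = [key for key, label in resolved.items() if label is None]
print(resolved)
print(needs_review)
```

Items with no majority are often the ambiguous ones, so they are also useful feedback on the annotation guide.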
Cost estimate: 1,000 examples at $2/example = $2,000. Add 20% for QA, for a total of about $2,400.
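The arithmetic is easy to parameterize. In the sketch below, `labels_per_example` is an added assumption: some vendors charge per label rather than per example, in which case requesting 3 independent labels multiplies the base cost. Check your vendor's pricing before relying on either figure.

```python
def estimate_labeling_cost(n_examples, price, labels_per_example=1, qa_overhead=0.20):
    """Estimate total cost in dollars: base labeling cost plus a QA surcharge."""
    base = n_examples * price * labels_per_example
    return round(base * (1 + qa_overhead), 2)

# The estimate from the text: 1,000 examples at $2 each, plus 20% for QA.
per_example_pricing = estimate_labeling_cost(1000, 2.0)
# Same job if the $2 price applied to each of 3 independent labels.
per_label_pricing = estimate_labeling_cost(1000, 2.0, labels_per_example=3)
print(per_example_pricing, per_label_pricing)
```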
Active Learning (Recommended for High Efficiency)
Label 100 examples manually, train a model, then ask the model to flag uncertain predictions. Label only the uncertain examples, retrain, repeat. This typically reduces total labels needed by 30–50%.
def active_learning_loop(unlabeled_pool: list, label_fn, train_fn, uncertainty_fn,
                         batch_size: int = 50, iterations: int = 5):
    """
    Iteratively label examples by focusing on model uncertainty.

    Args:
        unlabeled_pool: List of unlabeled examples
        label_fn: Callable that returns a human label for one example
        train_fn: Callable that trains a model on (example, label) pairs
        uncertainty_fn: Callable (model, example) -> score; higher = less certain
        batch_size: Number to label per iteration
        iterations: Number of rounds
    """
    pool = list(unlabeled_pool)  # copy so the caller's list is not modified
    labeled = []
    model = None
    for i in range(iterations):
        if not pool:
            break
        # Step 1: Have a human label the batch at the front of the pool
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend((example, label_fn(example)) for example in batch)

        # Step 2: Train a model on all labeled data so far
        model = train_fn(labeled)

        # Step 3: Move the most uncertain remaining examples to the front,
        # so the next iteration labels them first
        pool.sort(key=lambda example: uncertainty_fn(model, example), reverse=True)
        print(f"After iteration {i + 1}: {len(labeled)} labeled, {len(pool)} unlabeled")
    return labeled, model

# label_fn, train_fn, and uncertainty_fn are placeholders for your own labeling
# step, training code, and uncertainty measure.
# With the defaults, the labeled set grows 50 → 100 → 150 → 200 → 250.
# Because each round targets the examples the model is least sure about, the
# loop typically reaches a given accuracy with 30–50% fewer labels than
# labeling the same pool in random order.
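The loop above depends on a per-example uncertainty score. One common choice is the entropy of the model's predicted class probabilities; the sketch below assumes your model can return such probabilities through a `predict_proba`-style callable, which is a stand-in here and not a specific library's API.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool, predict_proba, k):
    """Return the k pool items whose predictions have the highest entropy."""
    ranked = sorted(pool, key=lambda item: entropy(predict_proba(item)), reverse=True)
    return ranked[:k]

# Toy predictions standing in for a real model's output.
toy_predictions = {
    "I was charged twice.": [0.98, 0.01, 0.01],           # confident
    "It doesn't work and I want out.": [0.40, 0.35, 0.25],  # very unsure
    "Your app is slow.": [0.70, 0.20, 0.10],              # somewhat unsure
}
next_batch = select_most_uncertain(list(toy_predictions), toy_predictions.get, k=2)
print(next_batch)
```

Entropy is zero when the model puts all probability on one class and largest when the classes are equally likely, so sorting by it surfaces the examples a human label is most likely to help with.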
Step 4: Validate Label Quality
Labeling errors are the #1 cause of fine-tuning failure. Validate with these techniques:
Measure Annotator Agreement
If two people label the same 100 examples, compare their labels (with three or more annotators, compute kappa for each pair):
from sklearn.metrics import cohen_kappa_score

# Toy data for illustration; in practice each list holds the ~100 shared labels
labels_annotator_1 = [0, 1, 0, 2, 1, 0, 2, 1, 1, 0]
labels_annotator_2 = [0, 1, 0, 2, 1, 0, 1, 1, 2, 0]

kappa = cohen_kappa_score(labels_annotator_1, labels_annotator_2)
print(f"Agreement (Cohen's kappa): {kappa:.2f}")
# 0.75–1.0: Excellent agreement
# 0.50–0.75: Moderate
# < 0.50: Poor; revisit definitions
Aim for kappa 0.75+. If lower, your task definition is ambiguous; clarify with examples.
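When kappa is low, it helps to know which labels are being confused with each other. The sketch below counts disagreeing label pairs between two annotators; the function name and sample data are illustrative.

```python
from collections import Counter

def disagreement_pairs(labels_a, labels_b):
    """Count how often each unordered pair of different labels was given
    to the same item, most frequent pair first."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common()

annotator_1 = ["billing", "complaint", "complaint", "billing", "technical_support"]
annotator_2 = ["billing", "technical_support", "technical_support", "refund_request", "technical_support"]
confused = disagreement_pairs(annotator_1, annotator_2)
print(confused)
```

The most frequent pair points at the boundary in the annotation guide that needs clearer wording or more examples.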
Sanity Check: Hold-Out Test
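Reserve 10% of labeled data (for example, 100 of 1,000 examples). Train a model on the remaining 90%, then evaluate on the held-out 10%. If held-out accuracy is within 5 percentage points of training accuracy, label quality is probably good. If it is 15 or more points lower, your labels may contain errors, although a gap that large can also come from overfitting, so inspect the misclassified examples before relabeling.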
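The comparison itself is simple arithmetic once you have predictions for both sets. The sketch below is model-agnostic and assumes you already have true and predicted labels as lists; the function names are hypothetical, and the thresholds are the 5 and 15 point rules of thumb from the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def generalization_gap(train_true, train_pred, heldout_true, heldout_pred):
    """Return (train accuracy, held-out accuracy, gap in percentage points)."""
    train_acc = accuracy(train_true, train_pred)
    heldout_acc = accuracy(heldout_true, heldout_pred)
    return train_acc, heldout_acc, (train_acc - heldout_acc) * 100

def interpret_gap(gap_points, ok_below=5.0, warn_from=15.0):
    """Map a gap in percentage points to a coarse verdict."""
    if gap_points < ok_below:
        return "ok"
    if gap_points >= warn_from:
        return "investigate labels"
    return "inconclusive"

result = generalization_gap(
    ["a", "b", "a", "b"], ["a", "b", "a", "b"],   # training set: all correct
    ["a", "b", "a", "b"], ["a", "b", "b", "a"],   # held-out set: half correct
)
print(result, interpret_gap(result[2]))
```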
Spot-Check Manually
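Randomly sample 50 labeled examples and review each one. Incorrect labels should be under 2%. In a sample of 50 that means essentially zero errors (one wrong label is already 2%), so if you find more than one, investigate and fix.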
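A reproducible way to draw the sample and compute the error rate might look like this. The helper names are illustrative, and the review itself is still done by a person who marks each sampled label as wrong or not.

```python
import random

def draw_spot_check(examples, sample_size=50, seed=0):
    """Draw a reproducible random sample of labeled examples for review."""
    rng = random.Random(seed)
    return rng.sample(examples, min(sample_size, len(examples)))

def error_rate(review_results):
    """review_results: one boolean per reviewed example, True if the label was wrong."""
    return sum(review_results) / len(review_results)

labeled_examples = [(f"message {i}", "billing") for i in range(1000)]
to_review = draw_spot_check(labeled_examples)
# Suppose the reviewer found exactly one wrong label among the 50.
rate = error_rate([True] + [False] * 49)
print(len(to_review), f"{rate:.0%}")
```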
Step 5: Structure the Final Dataset
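Fine-tuning requires a specific format. Many fine-tuning APIs accept JSONL (JSON Lines) with chat-style messages like the following; check your provider's documentation for the exact schema: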
{"messages": [{"role": "user", "content": "My credit card was charged twice."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "user", "content": "The app crashes on startup."}, {"role": "assistant", "content": "technical_support"}]}
{"messages": [{"role": "user", "content": "What is your refund policy?"}, {"role": "assistant", "content": "product_inquiry"}]}
Each line is a complete example. Split into:
- Training set: 80% of data.
- Validation set: 10% (used to evaluate during training and catch overfitting).
- Test set: 10% (held out; used only for final evaluation).
import json
import random
from pathlib import Path

def split_and_save_dataset(examples: list, output_dir: str, train_ratio: float = 0.8, seed: int = 42):
    """Split examples into train/val/test and save as JSONL."""
    shuffled = list(examples)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n = len(shuffled)
    n_train = int(n * train_ratio)
    # Integer math: int(n * (1 - train_ratio) / 2) can round down (99, not 100, for n=1000)
    n_val = (n - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    for name, subset in [("train", train), ("val", val), ("test", test)]:
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for example in subset:
                f.write(json.dumps(example) + "\n")
    print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")

# Example
examples = [
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]},
    # ... more examples
]
split_and_save_dataset(examples, "datasets/")
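Before uploading, it is worth checking that every line parses and follows the expected shape. The validator below is a sketch that assumes the two-message user/assistant format shown above; `validate_jsonl_lines` is a hypothetical helper, and real providers may accept additional roles or fields.

```python
import json

def validate_jsonl_lines(lines, allowed_labels=None):
    """Return (line_number, problem) tuples for lines that break the chat format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if (not isinstance(messages, list) or len(messages) != 2
                or not all(isinstance(m, dict) for m in messages)):
            problems.append((number, "expected a 'messages' list with exactly two objects"))
            continue
        if [m.get("role") for m in messages] != ["user", "assistant"]:
            problems.append((number, "expected roles user then assistant"))
            continue
        if not all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages):
            problems.append((number, "every message needs non-empty string content"))
            continue
        if allowed_labels is not None and messages[1]["content"] not in allowed_labels:
            problems.append((number, f"unknown label: {messages[1]['content']!r}"))
    return problems

lines = [
    '{"messages": [{"role": "user", "content": "I was charged twice."}, {"role": "assistant", "content": "billing"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    'not json',
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "greeting"}]}',
]
problems = validate_jsonl_lines(lines, allowed_labels={"billing"})
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check a saved file, pass an open file handle as `lines`, since iterating over it yields one line at a time.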
Practical Checklist
- Define task and annotation guide (shared with all labelers).
- Source 500+ raw examples (existing records + synthetic augmentation).
- Annotate at least 100 examples; validate agreement (kappa 0.75+).
- Use active learning to reduce total labeling effort by 30–50%.
- Spot-check 50 labeled examples for correctness.
- Split into train/val/test (80/10/10).
- Save as JSONL with correct message format.
- Estimate cost: roughly $1–3 per example after active learning savings.
Cost and Time Estimation
| Approach | Examples | Labeling Time | Cost | Quality |
|---|---|---|---|---|
| In-house only | 100–500 | 2–4 weeks | $200–1K | High |
| Crowdsourced | 1,000–5,000 | 1–2 weeks | $1K–5K | Moderate (needs QA) |
| Active learning | 1,000+ | 3–4 weeks | $1K–2K | High (fewer labels) |
| Hybrid (in-house + crowdsource) | 1,000+ | 2–3 weeks | $1.5K–3K | High |
Key Takeaways
- A good dataset has 500–5,000 diverse, correctly labeled examples representing your task's real-world distribution.
- Labeling is the bottleneck; use active learning to reduce required labels by 30–50%.
- Validate label quality via annotator agreement (Cohen's kappa 0.75+), hold-out test sets, and manual spot-checks.
- Structure data as JSONL with user and assistant messages; split into train/val/test (80/10/10).
- Budget $1–3 per label after active learning savings; total dataset cost is $1,000–$5,000.
Frequently Asked Questions
How many examples do I absolutely need?
Minimum: 50–100 for experimentation. Realistic: 500–1,000 for measurable improvements. Ideal: 2,000–5,000 for robust accuracy gains. Beyond 10,000, diminishing returns kick in.
Can I use data I labeled with a different model?
Yes. Labels are task-specific, not model-specific. As long as the ground truth is correct, data from any source is usable; spot-check model-generated labels the same way you would check human ones.
How do I handle ambiguous examples during labeling?
If a message has a clear primary intent, label that and record the other intent in a "secondary_intent" field or note. If it is truly ambiguous, drop it: ambiguous examples confuse the model, and good datasets are built from clear ones.
Should I balance my classes?
Only as far as the real-world distribution allows. If 70% of your tickets are billing-related, your training set should reflect that. Don't artificially oversample rare classes; instead, weight them higher during training.
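One common weighting scheme is inverse class frequency, where each class weight is n_samples / (n_classes × class_count). The sketch below computes it in plain Python; whether and how weights can be passed to training depends on your framework or provider, so treat this as an illustration of the arithmetic only.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

labels = ["billing"] * 70 + ["complaint"] * 30
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, each class contributes equally to the total weighted count even though the data keeps its real-world proportions.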
Can I use pre-labeled public datasets?
Yes, as a starting point. But validate that the labels match your task definition. Public datasets are often noisy; use them to bootstrap, then add high-quality in-domain examples.