Train/validation/test split strategy for ML
The train-validation-test split is the foundation of unbiased model evaluation. A careless split can inflate measured performance by 20–40% and waste fine-tuning budget on a model that then underperforms in production. This article covers split strategies (random, stratified, temporal), sizing guidelines, overfitting detection, and how to avoid data leakage.
Why Splits Matter
Fine-tuning optimizes on training data. Without a separate validation set, you can't detect overfitting. Without a held-out test set, you can't measure true generalization. Here's what each set does:
Training set (typically 60–70%): Used during fine-tuning. The model learns from these examples.
Validation set (typically 15–20%): Used to monitor training and detect overfitting. You check validation loss after each epoch; if it increases while training loss decreases, you're overfitting.
Test set (typically 10–15%): Held out completely. Used once at the end to report final performance. You never use test results to adjust hyperparameters or the model.
Mixing these purposes (e.g., tuning hyperparameters using test results) causes optimistic bias: your reported accuracy is higher than what the model will achieve on truly unseen data.
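To make the optimistic bias concrete, here is a minimal simulation sketch. It assumes a hypothetical pool of 50 "models" that all guess labels at random (true accuracy 50%), picks the one that scores best on a 100-example test set, and then scores that same kind of guesser on fresh data. The function name and all parameters are illustrative, not part of any library.

```python
import random

def simulate_selection_bias(n_models=50, n_test=100, n_fresh=10_000, seed=0):
    """Pick the best of several coin-flip 'models' on a small test set.

    Every model guesses at random, so its true accuracy is 50%.
    Returns (best accuracy on the reused test set, accuracy on fresh data).
    """
    rng = random.Random(seed)

    def accuracy(n):
        # A random guesser is right on each example with probability 0.5
        return sum(rng.random() < 0.5 for _ in range(n)) / n

    # "Tuning on the test set": keep whichever model scored best on it
    best_test_acc = max(accuracy(n_test) for _ in range(n_models))
    # The selected model is still a coin flip on data it was not selected on
    fresh_acc = accuracy(n_fresh)
    return best_test_acc, fresh_acc

best_test_acc, fresh_acc = simulate_selection_bias()
print(f"Best accuracy on the reused test set: {best_test_acc:.2f}")
print(f"Same model on fresh data:            {fresh_acc:.2f}")
```

The selected score lands above 50% purely because it was chosen for being the highest, which is the same mechanism that inflates a test score used for tuning.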
Split Sizes
| Dataset Size | Train | Val | Test |
|---|---|---|---|
| 100 examples | 60 | 20 | 20 |
| 500 examples | 300 | 100 | 100 |
| 1,000 examples | 700 | 150 | 150 |
| 5,000 examples | 3,500 | 750 | 750 |
| 10,000+ examples | 70% | 15% | 15% |
For small datasets (< 1,000 examples): Use larger val/test splits (20% each) to ensure stable estimates. On 500 examples, a 60-20-20 split gives you 100 test examples, which keeps the 95% confidence interval on accuracy to roughly ±8–10 percentage points.
For large datasets (> 10,000 examples): Use smaller val/test splits (10–15% each). A 70-15-15 split is standard.
Minimum test set size: Aim for at least 100 examples. With fewer, estimates have high variance and confidence intervals are wide.
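The effect of test set size on interval width can be sketched with the normal approximation to a binomial proportion. This is a rough estimate under that approximation (it is less reliable for very small n or accuracies near 0 or 1); the function name is illustrative.

```python
import math

def accuracy_ci_half_width(accuracy, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for accuracy.

    z=1.96 corresponds to a 95% interval.
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# How wide is the interval around a measured accuracy of 0.80?
for n in (20, 100, 1000):
    half_width = accuracy_ci_half_width(0.80, n)
    print(f"n={n:>5}: 0.80 +/- {half_width:.3f}")
```

At a measured accuracy of 0.80, the half-width is about 0.175 with 20 test examples, 0.078 with 100, and 0.025 with 1,000.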
Strategy 1: Random Split
The simplest approach: shuffle and split.
import json
import random

def random_split(filepath, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, random_seed=42):
    """Randomly split a JSONL dataset into train/val/test files."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train_ratio + val_ratio + test_ratio must equal 1")
    # Load all examples (one JSON object per line)
    with open(filepath) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    # Shuffle with a local, seeded RNG so the split is reproducible
    random.Random(random_seed).shuffle(examples)
    # Calculate indices; the test set takes whatever remains after rounding
    n = len(examples)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    # Split
    train = examples[:train_end]
    val = examples[train_end:val_end]
    test = examples[val_end:]
    # Write
    for name, data in [("train", train), ("val", val), ("test", test)]:
        with open(f"{name}.jsonl", "w") as f:
            for ex in data:
                f.write(json.dumps(ex) + "\n")
    print(f"Train: {len(train)} ({100*len(train)/n:.1f}%)")
    print(f"Val: {len(val)} ({100*len(val)/n:.1f}%)")
    print(f"Test: {len(test)} ({100*len(test)/n:.1f}%)")
When to use: For balanced, unstructured datasets with no temporal or categorical structure.
Strategy 2: Stratified Split
Stratified splitting ensures each set has the same class distribution as the full dataset.
import json
import random
from collections import defaultdict
from sklearn.model_selection import train_test_split

def stratified_split(filepath, class_field="category", train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, random_seed=42):
    """Split a JSONL dataset while preserving class distribution."""
    with open(filepath) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    # Group by class
    by_class = defaultdict(list)
    for ex in examples:
        class_label = ex.get(class_field, "unknown")
        by_class[class_label].append(ex)
    train, val, test = [], [], []
    # Split each class proportionally
    for class_label, class_examples in by_class.items():
        if len(class_examples) < 3:
            # Too few examples to appear in all three splits; keep them in train
            train.extend(class_examples)
            print(f"{class_label}: only {len(class_examples)} example(s), all kept in train")
            continue
        # First split: train+val vs test
        temp, test_split = train_test_split(
            class_examples,
            test_size=test_ratio,
            random_state=random_seed
        )
        # Second split: train vs val (val_ratio rescaled to the remaining data)
        train_split, val_split = train_test_split(
            temp,
            test_size=val_ratio / (train_ratio + val_ratio),
            random_state=random_seed
        )
        train.extend(train_split)
        val.extend(val_split)
        test.extend(test_split)
        print(f"{class_label}: train={len(train_split)}, val={len(val_split)}, test={len(test_split)}")
    # Shuffle so the output files are not grouped by class
    rng = random.Random(random_seed)
    for data in (train, val, test):
        rng.shuffle(data)
    # Write
    for name, data in [("train", train), ("val", val), ("test", test)]:
        with open(f"{name}.jsonl", "w") as f:
            for ex in data:
                f.write(json.dumps(ex) + "\n")
When to use: For imbalanced datasets, or whenever every split must preserve the class proportions of the full dataset.
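One way to check that a split preserved the class distribution is to compare class shares between the split and the full dataset. The sketch below assumes each example is a dict with a "category" field, as in the code above; the helper names and toy data are illustrative.

```python
from collections import Counter

def class_proportions(examples, class_field="category"):
    """Return {label: share of examples with that label}."""
    counts = Counter(ex.get(class_field, "unknown") for ex in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def max_proportion_drift(full, split, class_field="category"):
    """Largest absolute difference in class share between a split and the full set."""
    full_p = class_proportions(full, class_field)
    split_p = class_proportions(split, class_field)
    return max(abs(full_p[label] - split_p.get(label, 0.0)) for label in full_p)

# Toy data: 90% "billing", 10% "refund"
full = [{"category": "billing"}] * 90 + [{"category": "refund"}] * 10
good_split = [{"category": "billing"}] * 18 + [{"category": "refund"}] * 2
bad_split = [{"category": "billing"}] * 20  # minority class missing entirely

print(f"Stratified-looking split drift: {max_proportion_drift(full, good_split):.3f}")
print(f"Skewed split drift:             {max_proportion_drift(full, bad_split):.3f}")
```

A drift near zero means the split mirrors the full dataset; what counts as too much drift is a judgment call that depends on how rare the smallest class is.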
Strategy 3: Temporal Split
For time-series or sequential data, split chronologically: train on older examples, test on newer.
import json
from datetime import datetime

def temporal_split(filepath, date_field="timestamp", train_ratio=0.7, val_ratio=0.15, test_ratio=0.15):
    """Split a JSONL dataset chronologically (older = train, newer = test)."""
    with open(filepath) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    # Parse ISO 8601 timestamps on the fly; the examples themselves are not modified
    def parsed_date(ex):
        return datetime.fromisoformat(ex[date_field])
    examples.sort(key=parsed_date)
    # Calculate cutoffs; the test set takes whatever remains after rounding
    n = len(examples)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    train = examples[:train_end]
    val = examples[train_end:val_end]
    test = examples[val_end:]
    # Write
    for name, data in [("train", train), ("val", val), ("test", test)]:
        with open(f"{name}.jsonl", "w") as f:
            for ex in data:
                f.write(json.dumps(ex) + "\n")
    # Report the date range covered by each non-empty split
    print("Temporal split:")
    for name, data in [("Train", train), ("Val", val), ("Test", test)]:
        if data:
            print(f"  {name}: {parsed_date(data[0])} to {parsed_date(data[-1])}")
When to use: For customer support conversations, news articles, code commits, or any data where recency matters. Temporal splits prevent data leakage: the model never trains on future data.
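After a temporal split, it is worth confirming that the date ranges do not interleave. The sketch below assumes each example carries an ISO 8601 string in a "timestamp" field and that all three splits are non-empty; the function name and sample records are illustrative.

```python
from datetime import datetime

def check_chronological(train, val, test, date_field="timestamp"):
    """Return True if every train date <= every val date <= every test date."""
    def date_range(examples):
        dates = [datetime.fromisoformat(ex[date_field]) for ex in examples]
        return min(dates), max(dates)

    _, train_max = date_range(train)
    val_min, val_max = date_range(val)
    test_min, _ = date_range(test)
    return train_max <= val_min and val_max <= test_min

train = [{"timestamp": "2024-01-05"}, {"timestamp": "2024-02-10"}]
val = [{"timestamp": "2024-03-01"}]
test = [{"timestamp": "2024-04-15"}]
print(check_chronological(train, val, test))

# A "future" example in train breaks the ordering
leaky_train = train + [{"timestamp": "2024-05-01"}]
print(check_chronological(leaky_train, val, test))
```

The first call reports a clean ordering; the second flags the split because a training example is newer than the validation and test data.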
Detecting Overfitting
Monitor the generalization gap: the difference between training and validation loss.
import matplotlib.pyplot as plt

def detect_overfitting(train_loss_per_epoch, val_loss_per_epoch, patience=3):
    """Detect overfitting via the generalization gap and rising validation loss."""
    gaps = [val - train for train, val in zip(train_loss_per_epoch, val_loss_per_epoch)]
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(train_loss_per_epoch, label="Training Loss", marker="o")
    plt.plot(val_loss_per_epoch, label="Validation Loss", marker="s")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.title("Training vs Validation Loss (Overfitting Detection)")
    plt.savefig("overfitting_curve.png")
    plt.close()
    print("Loss by epoch:")
    for i, (train, val, gap) in enumerate(zip(train_loss_per_epoch, val_loss_per_epoch, gaps)):
        # The 0.1 and 0.3 thresholds are rough heuristics; adjust them to your loss scale
        status = "OK" if gap < 0.1 else "OVERFITTING" if gap > 0.3 else "WARNING"
        print(f"  Epoch {i}: train={train:.4f}, val={val:.4f}, gap={gap:.4f} [{status}]")
    # Detect overfitting: val loss increases for 'patience' consecutive epochs
    increasing_count = 0
    for i in range(1, len(val_loss_per_epoch)):
        if val_loss_per_epoch[i] > val_loss_per_epoch[i-1]:
            increasing_count += 1
        else:
            increasing_count = 0  # streak broken; only consecutive increases count
        if increasing_count >= patience:
            print(f"\nOVERFITTING DETECTED at epoch {i}")
            print(f"Stop training and use epoch {i - patience} weights")
            return i - patience
    return None
# Example: from OpenAI fine-tuning logs
train_losses = [2.5, 2.1, 1.8, 1.6, 1.5, 1.4, 1.35, 1.33]
val_losses = [2.4, 2.0, 1.9, 1.8, 1.8, 1.85, 1.92, 2.0]
best_epoch = detect_overfitting(train_losses, val_losses, patience=2)
Signs of overfitting:
- Training loss decreases, validation loss increases or plateaus.
- Validation accuracy drops while training accuracy increases.
- Generalization gap (val loss - train loss) widens over epochs.
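These signs can be turned into a stopping rule that runs during training. The following is a minimal early-stopping sketch, assuming you can read the validation loss after each epoch. It tracks the best loss seen so far rather than consecutive increases, so a plateau also counts against patience. The class name and the `min_delta` parameter are illustrative choices, not a specific framework's API.

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

val_losses = [2.4, 2.0, 1.9, 1.8, 1.8, 1.85, 1.92, 2.0]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}; best epoch was {stopper.best_epoch}")
```

On these losses the rule stops at epoch 5 and points to epoch 3 as the checkpoint to keep, since epoch 4 only matched the best loss without improving on it.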
Avoiding Data Leakage
Data leakage occurs when information from test/val sets influences training.
Leakage 1: Test set in training
Bad:
examples = load_all_examples()
random.shuffle(examples)  # Unseeded: every run produces a different split
train = examples[:700]  # Oops, may contain examples an earlier run held out for testing
test = examples[900:]
fine_tune(train)
evaluate(test)
Good:
examples = load_all_examples()
rng = random.Random(42)  # Fixed seed: the same examples are held out on every run
test_ids = set(rng.sample(range(len(examples)), 100))
train = [ex for i, ex in enumerate(examples) if i not in test_ids]
test = [ex for i, ex in enumerate(examples) if i in test_ids]
fine_tune(train)
evaluate(test)
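Another way to keep the held-out set stable, even as the dataset grows or is reordered, is to derive each example's split from a hash of a stable identifier. This is a sketch of that alternative technique, assuming every example has a unique, unchanging ID; the function name and the percentage parameters are illustrative.

```python
import hashlib

def assign_split(example_id, val_pct=15, test_pct=15):
    """Map a stable example ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(str(example_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# The same ID always lands in the same split, regardless of load order
ids = [f"ticket-{i}" for i in range(10_000)]
assignments = [assign_split(i) for i in ids]
for name in ("train", "val", "test"):
    share = assignments.count(name) / len(assignments)
    print(f"{name}: {share:.1%}")
```

The shares are only approximately 70/15/15 because they depend on how the hashes fall, which is the trade-off for never moving an example between splits.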
Leakage 2: Preprocessing statistics from full data
Bad:
# Compute mean/std from ALL data
mean = compute_mean(all_examples)
std = compute_std(all_examples)
# Then split
train = all_examples[:700]
test = all_examples[900:]
# Normalize using full-data statistics (leaked!)
train = normalize(train, mean, std)
test = normalize(test, mean, std)
Good:
# Split first
train = all_examples[:700]
test = all_examples[900:]
# Compute statistics from train only
mean = compute_mean(train)
std = compute_std(train)
# Normalize both using train statistics
train = normalize(train, mean, std)
test = normalize(test, mean, std) # uses train's mean/std, not test's
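As a runnable illustration of the same pattern, the sketch below normalizes a single numeric feature using only the standard library. The feature values are made up for the example, and `fit_scaler` and `transform` are hypothetical helper names.

```python
import statistics

def fit_scaler(values):
    """Compute mean and (population) standard deviation from training values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply previously fitted statistics; never refit on val or test data."""
    return [(v - mean) / std for v in values]

train_lengths = [10.0, 12.0, 14.0, 16.0, 18.0]
test_lengths = [30.0, 40.0]

mean, std = fit_scaler(train_lengths)          # fitted on train only
train_scaled = transform(train_lengths, mean, std)
test_scaled = transform(test_lengths, mean, std)

print(f"train mean={mean:.2f}, std={std:.2f}")
print("test scaled:", [round(v, 2) for v in test_scaled])
```

The test values come out far from zero (about 5.66 and 9.19) because they are measured against the training distribution, which is exactly the signal that would be hidden if the statistics had been computed on all the data.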
Leakage 3: Hyperparameter tuning on test set
Bad:
# Try different learning rates, evaluate on test set
for lr in [0.001, 0.01, 0.1]:
    fine_tune(train, lr=lr)
    accuracy = evaluate(test)
    print(f"LR={lr}: accuracy={accuracy}")
# Pick best LR and report test accuracy (biased!)
Good:
# Tune on validation set
best_lr, best_val_accuracy = None, float("-inf")
for lr in [0.001, 0.01, 0.1]:
    fine_tune(train, lr=lr)
    val_accuracy = evaluate(val)
    print(f"LR={lr}: val_accuracy={val_accuracy}")
    # Keep the learning rate with the best validation accuracy
    if val_accuracy > best_val_accuracy:
        best_lr, best_val_accuracy = lr, val_accuracy
# Fine-tune once more with best LR on train
fine_tune(train, lr=best_lr)
# Evaluate on test set once (never tuned on it)
test_accuracy = evaluate(test)
print(f"Final test accuracy: {test_accuracy}")
Validating Splits
Before training, verify splits are clean:
import json
from hashlib import sha256

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def validate_splits(train_path, val_path, test_path):
    """Ensure no overlap between splits."""
    train_examples = load_jsonl(train_path)
    val_examples = load_jsonl(val_path)
    test_examples = load_jsonl(test_path)
    # Content-based hashes catch exact duplicates (not near-duplicates)
    def hash_example(ex):
        content = json.dumps([ex.get("instruction"), ex.get("response")])
        return sha256(content.encode()).hexdigest()
    train_hashes = set(hash_example(ex) for ex in train_examples)
    val_hashes = set(hash_example(ex) for ex in val_examples)
    test_hashes = set(hash_example(ex) for ex in test_examples)
    # Check for overlaps
    train_val_overlap = train_hashes & val_hashes
    train_test_overlap = train_hashes & test_hashes
    val_test_overlap = val_hashes & test_hashes
    print(f"Train: {len(train_examples)}, Val: {len(val_examples)}, Test: {len(test_examples)}")
    print(f"Train-Val overlap: {len(train_val_overlap)} (should be 0)")
    print(f"Train-Test overlap: {len(train_test_overlap)} (should be 0)")
    print(f"Val-Test overlap: {len(val_test_overlap)} (should be 0)")
    if train_val_overlap or train_test_overlap or val_test_overlap:
        print("ERROR: Splits are not disjoint!")
        return False
    return True
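An exact content hash misses examples that differ only in case, punctuation, or spacing. One simple way to catch those is to normalize text before hashing. This is a sketch of that idea, not a full near-duplicate detector (it will not catch paraphrases); the function names and sample strings are illustrative.

```python
import hashlib
import re

def normalized_hash(text):
    """Hash text after lowercasing, stripping punctuation, and collapsing whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_near_duplicates(train_texts, test_texts):
    """Return test texts whose normalized form also appears in train."""
    train_hashes = {normalized_hash(t) for t in train_texts}
    return [t for t in test_texts if normalized_hash(t) in train_hashes]

train_texts = ["How do I reset my password?", "Where is my invoice?"]
test_texts = ["how do I  reset my password", "How do I close my account?"]
print(find_near_duplicates(train_texts, test_texts))
```

Here the first test question is flagged because it matches a training question once case, punctuation, and spacing are ignored.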
Key Takeaways
- Split into train (60–70%), val (15–20%), test (10–15%) with at least 100 test examples.
- Use random split for unstructured data, stratified for imbalanced data, temporal for time-series.
- Monitor generalization gap; if validation loss increases while training loss decreases, you're overfitting.
- Avoid data leakage: split before computing preprocessing statistics, tune on validation, evaluate once on test.
- Validate splits are disjoint; check for overlaps before training.
Frequently Asked Questions
Should I use k-fold cross-validation for fine-tuning?
Not typically. K-fold is expensive for fine-tuning (requires k training runs). Use a single holdout test set. If you have very small data (< 500 examples), consider stratified k-fold to maximize training data per fold, but report results as average ± std across folds.
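For the small-data case, fold assignment can be sketched with the standard library by dealing each class's examples round-robin across folds. This is a simplified illustration of the idea with hypothetical names and toy labels; scikit-learn's `StratifiedKFold` provides a maintained implementation.

```python
import random
from collections import defaultdict

def stratified_kfold_indices(labels, k=5, seed=42):
    """Yield (train_indices, val_indices) for k folds with similar class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx

labels = ["pos"] * 20 + ["neg"] * 80
for fold, (train_idx, val_idx) in enumerate(stratified_kfold_indices(labels, k=5)):
    n_pos = sum(labels[i] == "pos" for i in val_idx)
    print(f"Fold {fold}: {len(train_idx)} train, {len(val_idx)} val ({n_pos} pos)")
```

With 20 positive and 80 negative labels, each of the 5 folds validates on 20 examples, 4 of them positive, and trains on the other 80.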
Can I use the same validation set for both early stopping and hyperparameter tuning?
Not ideal. If you use validation results to adjust hyperparameters, you're overfitting to that validation set. Ideally, use a separate validation set for early stopping and another for hyperparameter tuning. If data is limited, do one or the other, not both.
What if my test set has different characteristics than production data?
This is distribution shift, and it's a real problem. Periodically collect new production data and evaluate the model on it. If performance drops significantly, retrain on a mix of old training data and new production examples.
How do I choose a random seed for reproducibility?
Pick any integer (e.g., 42, 123). Document it in your code and use it consistently. This ensures anyone can reproduce your splits.
Can I report both accuracy and F1 on the test set?
Yes, report multiple metrics (accuracy, precision, recall, F1, AUC) depending on the task. For imbalanced classification, F1 is more informative than accuracy.
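A small worked example shows why accuracy alone can mislead on imbalanced data. The sketch below computes the metrics by hand for a binary task where 5% of labels are positive and a degenerate model predicts the negative class every time. The labels are made up for illustration, and the helper returns 0.0 when a metric's denominator is zero, which is one common convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1] * 5 + [0] * 95   # 5% positive class
y_pred = [0] * 100            # model always predicts the majority class
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The model scores 0.95 accuracy while its F1 is 0.0, because it never finds a single positive example.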