Training Teacher Models: Foundation Prep Guide
A teacher model is the foundation of successful distillation. No matter how well you design the student architecture or optimize the distillation loss, a weak teacher produces a weak student. The teacher must be a high-capacity model that has learned rich, task-specific knowledge, whether through pre-training on massive corpora or fine-tuning on labeled examples. This article covers the practical steps to prepare a teacher model that is robust, well-calibrated, and suitable for knowledge transfer.
What Makes a Good Teacher Model?
A good teacher model combines three properties: high accuracy on the task, calibrated uncertainty, and generalization beyond the training distribution. A model that achieves 99% accuracy on training data but only 85% on held-out data is a poor teacher: its outputs reflect memorized training examples, and the student inherits that overfitting along with the accuracy. A well-generalized teacher, even one with slightly lower training accuracy (e.g., 97% vs. 99%), tends to produce better students because its learned representations are more robust.
Calibration is often overlooked but critical. A well-calibrated model's confidence matches its empirical accuracy: among predictions made with confidence 0.8, about 80% are correct. A poorly calibrated model might assign confidence 0.9 to answers that are frequently wrong, misleading the student. Calibration directly affects the softness of the knowledge transferred: high-confidence wrong predictions waste student capacity, while appropriately uncertain teachers encode useful negative information about which alternatives are plausible and which are not.
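To make the point about softness concrete, here is a minimal sketch comparing the soft targets produced by two hypothetical teachers for the same input. The logit values are invented for illustration; only the standard library is used.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for one input over three classes (illustrative values)
overconfident_logits = [12.0, 2.0, 1.0]
moderate_logits = [3.0, 2.0, 0.5]

overconfident = softmax(overconfident_logits)
moderate = softmax(moderate_logits)

print("overconfident:", [round(p, 4) for p in overconfident])
print("moderate:     ", [round(p, 4) for p in moderate])
print("entropy:", round(entropy(overconfident), 4), "vs", round(entropy(moderate), 4))
```

The overconfident teacher's target is nearly one-hot, so it tells the student almost nothing about how the remaining classes relate to each other. The moderate teacher still ranks the same class first but also conveys that the second class is more plausible than the third.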
Step 1: Select and Validate a Pre-trained Base
Most teacher models start from a pre-trained foundation: a language model (Llama, Mistral, Qwen), vision model (ViT, ResNet, EfficientNet), or multimodal model (CLIP, LLaVA). Validate that the base model aligns with your task:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load a pre-trained teacher (e.g., Llama 2 7B)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use half-precision for memory efficiency
    device_map="auto"           # Auto-shard across GPUs if available
)

# Verify the model loads and produces reasonable outputs
inputs = tokenizer("What is knowledge distillation?",
                   return_tensors="pt").to(teacher.device)
with torch.no_grad():
    outputs = teacher.generate(
        **inputs,               # Pass input_ids and attention_mask together
        max_new_tokens=50,
        do_sample=True,         # temperature/top_p only take effect when sampling
        temperature=0.7,
        top_p=0.9
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
This smoke test confirms that the model loads, that the tokenizer matches it, and that your device setup works; judging task fit still requires running the base model on a few representative task inputs. Choose a pre-trained model whose weights are publicly available and compatible with your infrastructure (check VRAM and inference framework support).
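For the VRAM check, a back-of-the-envelope estimate of weight memory is often enough to rule a model in or out. The sketch below assumes a round 7 billion parameters and counts weights only; the function name is illustrative.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

# Assumed parameter count: a round 7e9 for a "7B" model
fp16_gib = weight_memory_gib(7e9, bytes_per_param=2)
fp32_gib = weight_memory_gib(7e9, bytes_per_param=4)

print(f"7B weights in fp16: ~{fp16_gib:.1f} GiB")
print(f"7B weights in fp32: ~{fp32_gib:.1f} GiB")
```

This figure excludes activations, the KV cache, gradients, and optimizer state. Full fine-tuning with an Adam-style optimizer needs several times the weight memory, so treat the result as a lower bound.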
Step 2: Fine-Tune the Teacher on Task Data
Fine-tuning adapts the pre-trained model to your specific task. Use a supervised fine-tuning (SFT) approach: train on (input, target output) pairs from your task. For language tasks, this is instruction-following data. For classification, it is labeled examples. For generation, it is reference outputs.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import Trainer, TrainingArguments

# Llama-style tokenizers ship without a pad token; reuse EOS so padding works.
# Pad on the right so the prompt tokens sit at the start of each sequence.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

class TaskDataset(Dataset):
    """Custom dataset for supervised fine-tuning."""
    def __init__(self, examples, tokenizer, max_length=512):
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        # Combine prompt and target into a single sequence
        text = f"{example['input']}\n{example['output']}"
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoding['input_ids'].squeeze(0)
        attention_mask = encoding['attention_mask'].squeeze(0)
        # Approximate where the output begins by counting input tokens
        # (the boundary can be off by a token when tokens merge across it)
        input_len = min(len(self.tokenizer.encode(example['input'])), self.max_length)
        # For causal LM, mask out the input portion (optimize only target)
        labels = input_ids.clone()
        labels[:input_len] = -100           # Ignore input tokens in loss
        labels[attention_mask == 0] = -100  # Ignore padding tokens in loss
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

# Prepare training data
train_examples = [
    {"input": "Explain distillation.", "output": "Knowledge distillation is..."},
    # ... more examples
]
# Held-out examples in the same format (placeholder; fill with your own data)
val_examples = [
    # ... held-out examples
]
train_dataset = TaskDataset(train_examples, tokenizer)
val_dataset = TaskDataset(val_examples, tokenizer)

# Fine-tune using Hugging Face Trainer (it builds its own DataLoader)
training_args = TrainingArguments(
    output_dir="./teacher_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
    save_steps=500,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=500,
    save_total_limit=3,
)
# Note: the teacher above was loaded in float16 for inference. For full
# fine-tuning, bfloat16 or float32 weights are generally more stable.
trainer = Trainer(
    model=teacher,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # required because eval_strategy="steps"
)
trainer.train()
Key hyperparameters for fine-tuning:
- Learning rate: Start with 2e-5 (for LLMs); 1e-4 for smaller models. Use a cosine learning rate schedule with warm-up.
- Batch size: 8-16 per GPU; accumulate gradients if memory is tight.
- Epochs: 1-5, depending on dataset size. Monitor validation loss and stop when it plateaus.
- Regularization: Add weight decay (0.01) and dropout to prevent overfitting.
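The learning-rate and batch-size advice above can be sketched in a few lines. The function names are illustrative, and the schedule is a hand-written approximation of a warm-up plus cosine decay rather than any library's exact implementation.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Linear warm-up to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

# Print the schedule at a few points of a hypothetical 1,000-step run
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")

# Batch of 8 per GPU, 4 accumulation steps, 2 GPUs
print(effective_batch_size(8, 4, num_devices=2))
```

When memory is tight, raising the accumulation steps keeps the effective batch size constant while shrinking the per-device batch.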
Step 3: Evaluate and Calibrate the Teacher
After fine-tuning, measure performance on a held-out test set. For classification, use accuracy; for generation, use BLEU, ROUGE, or human evaluation. Also measure calibration using Expected Calibration Error (ECE):
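import numpy as np
import torch
from torch.utils.data import DataLoader

def expected_calibration_error(model, val_loader, num_bins=10):
    """
    Compute token-level ECE for a causal LM: how well the model's confidence
    in its top next-token prediction aligns with how often that token is right.
    Lower ECE (closer to 0) means the model is better calibrated.
    Expects batches with input_ids, attention_mask, and labels (-100 = ignored).
    For a classifier, apply the same binning to its class logits instead.
    """
    model.eval()
    all_confidences = []
    all_correct = []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            labels = batch['labels'].to(model.device)
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
            # Logits at position t predict token t+1, so shift by one
            probs = torch.softmax(logits[:, :-1].float(), dim=-1)
            targets = labels[:, 1:]
            scored = targets != -100  # Skip positions excluded from the loss
            # Confidence = max probability
            confidences, predictions = torch.max(probs, dim=-1)
            correct = (predictions == targets).float()
            all_confidences.append(confidences[scored].cpu().numpy())
            all_correct.append(correct[scored].cpu().numpy())
    confidences = np.concatenate(all_confidences)
    correct = np.concatenate(all_correct)
    # Bin confidences and compute calibration
    bins = np.linspace(0, 1, num_bins + 1)
    ece = 0.0
    for i in range(num_bins):
        # Bins are (lo, hi] so that confidence == 1.0 lands in the last bin
        mask = (confidences > bins[i]) & (confidences <= bins[i+1])
        if mask.sum() > 0:
            bin_acc = correct[mask].mean()
            bin_conf = confidences[mask].mean()
            ece += np.abs(bin_acc - bin_conf) * mask.sum() / len(correct)
    return float(ece)

# Evaluate on the validation set (val_dataset: held-out data in the TaskDataset format)
# A small batch keeps the full-vocabulary softmax within memory
val_loader = DataLoader(val_dataset, batch_size=8)
ece = expected_calibration_error(teacher, val_loader)
print(f"Expected Calibration Error: {ece:.4f}")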
As a rule of thumb, an ECE below 0.05 indicates good calibration. If ECE is high, apply calibration techniques:
- Temperature scaling: Divide logits by a learned temperature T before softmax. This does not change predictions but adjusts confidence.
- Focal loss during training: Emphasizes hard examples, improving calibration on tricky samples.
- Label smoothing: Smooth hard labels (e.g., change one-hot [1, 0, 0] to [0.9, 0.05, 0.05]) to reduce overconfidence.
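Of the techniques above, temperature scaling is the simplest to try after training. The sketch below picks T by grid search over held-out logits to minimize negative log-likelihood. The grid, the helper names, and the toy logits are all assumptions made for illustration; in practice T is usually fit with a gradient-based optimizer on a real validation set.

```python
import math

def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logits_list, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)

def fit_temperature(logits_list, labels, candidates=None):
    """Pick the temperature from a grid that minimizes validation NLL."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # 0.5 through 5.0
    return min(candidates, key=lambda t: nll(logits_list, labels, t))

# Hypothetical held-out logits: every prediction has a large margin,
# but only three of the four are correct (the last one is wrong).
val_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]

T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.1f}")
print(f"NLL at T=1: {nll(val_logits, val_labels, 1.0):.4f}")
print(f"NLL at fitted T: {nll(val_logits, val_labels, T):.4f}")
```

A fitted T above 1 signals an overconfident model. Dividing every logit by the same positive T preserves their ordering, so accuracy is unchanged.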
Step 4: Ensure Reproducibility and Document
Save the teacher model, tokenizer, and training configuration for reproducibility:
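import json
from datetime import date

# Save the fine-tuned teacher
teacher.save_pretrained("./teacher_model_final")
tokenizer.save_pretrained("./teacher_model_final")

# Save training metadata; read values from the run rather than retyping them
metadata = {
    "base_model": model_name,
    "task": "instruction-following",
    "num_train_examples": len(train_examples),
    "num_epochs": training_args.num_train_epochs,
    "learning_rate": training_args.learning_rate,
    "batch_size": training_args.per_device_train_batch_size,
    "validation_accuracy": 0.96,  # example value; record your measured accuracy
    "expected_calibration_error": float(ece),  # computed in the calibration step
    "timestamp": date.today().isoformat()
}
with open("./teacher_model_final/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)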
Document the teacher's strengths and weaknesses. If the teacher struggles on certain classes or domains, the student will inherit these gaps. Identify blind spots so you can mitigate them during distillation (e.g., via sampling strategies).
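One simple way to surface blind spots is to break accuracy down by class or domain and flag anything below a threshold. The sketch below uses made-up labels and predictions, and the 0.8 threshold is an arbitrary choice for illustration.

```python
from collections import defaultdict

def per_class_accuracy(labels, predictions):
    """Accuracy computed separately for each true class."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for y, p in zip(labels, predictions):
        totals[y] += 1
        hits[y] += int(y == p)
    return {c: hits[c] / totals[c] for c in totals}

def blind_spots(labels, predictions, threshold=0.8):
    """Classes whose accuracy falls below the threshold, sorted by name."""
    acc = per_class_accuracy(labels, predictions)
    return sorted(c for c, a in acc.items() if a < threshold)

# Hypothetical evaluation results (illustrative only)
labels = ["billing", "billing", "shipping", "shipping",
          "returns", "returns", "returns", "returns"]
preds = ["billing", "billing", "shipping", "billing",
         "returns", "returns", "shipping", "shipping"]

print(per_class_accuracy(labels, preds))
print("blind spots:", blind_spots(labels, preds))
```

Classes flagged here are candidates for extra sampling or additional labeled data during distillation.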
Teacher Model Characteristics for Successful Distillation
| Property | Target Range | Impact on Student |
|---|---|---|
| Validation accuracy | 90-99% | Higher = more knowledge to transfer |
| Expected calibration error (ECE) | <0.05 | Lower = richer soft targets |
| Validation/training loss ratio | 1.0-1.2 | Lower = less overfitting, better generalization |
| Inference latency (on task data) | Varies | Affects practical deployment constraints |
| Model size (parameters) | 5B-70B (LLM teachers) | Larger = more knowledge; balance with distillation cost |
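The thresholds in the table can be turned into a quick pre-distillation check. This is a minimal sketch: the function name is hypothetical, and the loss ratio is computed as validation loss divided by training loss, which is an assumption about how the table's ratio is meant to be read.

```python
def teacher_readiness(val_accuracy, ece, train_loss, val_loss):
    """Compare teacher metrics against the target ranges in the table."""
    loss_ratio = val_loss / train_loss  # assumed definition: validation / training
    return {
        "accuracy_ok": val_accuracy >= 0.90,
        "calibration_ok": ece < 0.05,
        "generalization_ok": loss_ratio <= 1.2,
    }

# Hypothetical metrics for two teachers (illustrative values)
good = teacher_readiness(val_accuracy=0.96, ece=0.032, train_loss=0.50, val_loss=0.55)
weak = teacher_readiness(val_accuracy=0.85, ece=0.08, train_loss=0.20, val_loss=0.50)

print(good)
print(weak)
```

A failed check is a prompt to investigate, not a hard gate; the ranges are guidelines and depend on the task.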
Key Takeaways
- A good teacher model is high-accuracy, well-calibrated, and generalizes beyond training data.
- Fine-tune a pre-trained base on task-specific labeled data using supervised learning with task-appropriate loss.
- Validate the teacher's accuracy, calibration (ECE), and generalization before proceeding to distillation.
- Save the teacher, tokenizer, and metadata for reproducible distillation workflows.
- Monitor ECE and use temperature scaling or label smoothing if the teacher is overconfident.
Frequently Asked Questions
Can I use a teacher that is already fine-tuned on a similar task?
Yes, transfer learning applies to teacher models too. A teacher fine-tuned on a related task often outperforms one trained from scratch on a small dataset. Validate that the transferred knowledge is relevant; if not, fine-tune further on your task-specific data.
How much training data do I need to fine-tune a teacher?
For a pre-trained model, 1,000-10,000 labeled examples usually suffice to fine-tune to good performance. Smaller datasets (100-1,000) can work if paired with strong regularization; larger datasets (>100,000) yield only marginal improvements for most tasks. Use stratified sampling to keep classes balanced across your training and validation splits.
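A stratified split can be written with the standard library alone. The sketch below assumes each example is a dict with a `label` key; the function name and the toy data are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(examples, label_key="label", val_fraction=0.2, seed=0):
    """Split examples so each class keeps roughly the same share in both sets."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    rng = random.Random(seed)
    train, val = [], []
    for items in by_label.values():
        items = items[:]  # copy so the caller's list order is untouched
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val

# Hypothetical imbalanced dataset: 80 examples of class "a", 20 of class "b"
examples = ([{"text": f"a{i}", "label": "a"} for i in range(80)]
            + [{"text": f"b{i}", "label": "b"} for i in range(20)])

train, val = stratified_split(examples, val_fraction=0.2)
print(len(train), len(val))
print(sum(1 for ex in val if ex["label"] == "b"), "minority examples in validation")
```

Splitting per class guarantees that a rare class appears in the validation set, which a purely random split on a small dataset does not.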
What is the relationship between teacher accuracy and student accuracy?
Student accuracy is typically around 90-98% of teacher accuracy, depending on the compression ratio. Under that rule of thumb, a 99% accurate teacher yields a student at roughly 89-97% accuracy. If the teacher is weak (75% accuracy), the student will also be weak, so do not skimp on teacher quality.
Should the teacher and student have the same architecture?
No. They can differ significantly. You might distill a Transformer teacher into a CNN student, or vice versa. The main requirement is that both handle the same input and output format; for logit-based distillation, that means the same label set or vocabulary. Cross-architecture distillation sometimes outperforms same-architecture distillation, possibly because the two architectures learn complementary features.
How do I know when the teacher has finished training?
Monitor validation loss and accuracy. Stop when validation loss has not improved for a set number of epochs (early stopping). For most tasks, 3-5 epochs suffice; more epochs risk overfitting. Use a patience parameter (e.g., stop if no improvement for 2 epochs) to automate this decision.
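The patience rule can be sketched as a small helper. The class name and the validation losses are made up for illustration; if you train with the Hugging Face Trainer, its built-in `EarlyStoppingCallback` serves the same purpose.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical per-epoch validation losses
val_losses = [0.92, 0.71, 0.64, 0.65, 0.66, 0.60]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at, "best loss:", stopper.best)
```

With a patience of 2, training halts before the later improvement at epoch 6 is ever seen. That is the trade-off of a small patience value, so keep the checkpoint from the best epoch rather than the last one.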