
Knowledge Transfer: Training Student Models Effectively

Training the student model is where knowledge transfer actually happens. Unlike traditional supervised learning, where you minimize cross-entropy loss on ground-truth labels, student training balances two objectives: matching the teacher's soft targets (knowledge transfer) and fitting ground-truth labels (task anchoring). The interplay between these two losses, controlled by the temperature and alpha hyperparameters, determines how much knowledge the student absorbs and how well it generalizes. This article covers the practical techniques for effective student training in 2026.

The Distillation Training Loop Revisited

Recall the combined loss from Article 1:

loss = alpha * hard_loss + (1 - alpha) * soft_loss * (temperature ^ 2)

The temperature ^ 2 scaling matters because dividing the logits by T shrinks the gradients of the soft loss by roughly 1/T^2. Without the correction, the soft term's contribution relative to the hard loss would change every time you change the temperature, which makes tuning alpha and temperature fragile. Multiplying by T^2 keeps the two gradient contributions on a comparable scale across temperature choices. This detail often goes unmentioned, but it has a noticeable effect on convergence.
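The effect of the T^2 factor can be checked numerically. The sketch below is a minimal pure-Python illustration, not part of the training code: it uses the standard result that the gradient of KL(teacher || student) with respect to student logit i is (q_i - p_i) / T, and the logits are made-up values chosen only for demonstration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss_grad_norm(student_logits, teacher_logits, temperature, scale_by_t2):
    """L2 norm of d(soft_loss)/d(student_logits) for a single example.

    With p = softmax(teacher / T) and q = softmax(student / T), the gradient
    of KL(p || q) with respect to student logit i is (q_i - p_i) / T.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    factor = temperature ** 2 if scale_by_t2 else 1.0
    grad = [factor * (qi - pi) / temperature for qi, pi in zip(q, p)]
    return math.sqrt(sum(g * g for g in grad))

# Illustrative logits (assumed values, not taken from a real model)
teacher = [4.0, 1.0, -1.0, -4.0]
student = [1.0, 0.5, -0.5, -1.0]

for T in (1.0, 2.0, 4.0, 8.0, 16.0):
    raw = soft_loss_grad_norm(student, teacher, T, scale_by_t2=False)
    scaled = soft_loss_grad_norm(student, teacher, T, scale_by_t2=True)
    print(f"T={T:>4}: unscaled grad norm={raw:.4f}, T^2-scaled grad norm={scaled:.4f}")
```

Running it should show the unscaled gradient norm shrinking quickly as T grows, while the scaled norm settles toward a roughly constant value at higher temperatures.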

Here is the complete training loop:

import math

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def distillation_loss(student_logits, teacher_logits, true_labels,
                      temperature=4.0, alpha=0.7):
    """
    Compute combined distillation loss.

    Args:
        student_logits: [batch_size, num_classes]
        teacher_logits: [batch_size, num_classes]
        true_labels: [batch_size]
        temperature: Softening factor
        alpha: Weight for hard loss (0 to 1)

    Returns:
        Scalar loss value
    """
    # Soft loss: KL divergence between temperature-softened distributions
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    soft_loss = F.kl_div(soft_log_probs, soft_targets, reduction='batchmean')

    # Hard loss: cross-entropy with true labels (no temperature)
    hard_loss = F.cross_entropy(student_logits, true_labels)

    # Combined loss; T^2 compensates for the 1/T^2 gradient shrinkage
    return alpha * hard_loss + (1 - alpha) * soft_loss * (temperature ** 2)

def train_student_with_distillation(
    student,
    teacher,
    train_loader,
    val_loader,
    num_epochs=10,
    learning_rate=1e-3,
    temperature=4.0,
    alpha=0.7,
    device='cuda'
):
    """
    Full training loop for student model with distillation.

    Args:
        student: Student model to train
        teacher: Pre-trained teacher model (frozen)
        train_loader: DataLoader for training
        val_loader: DataLoader for validation
        num_epochs: Number of training epochs
        learning_rate: Initial learning rate
        temperature: Distillation temperature
        alpha: Hard loss weight (1 - alpha weights soft loss)
        device: 'cuda' or 'cpu'
    """
    student.to(device)
    teacher.to(device)
    teacher.eval()  # Teacher is frozen

    # Optimizer: use AdamW with weight decay for regularization
    optimizer = AdamW(
        student.parameters(),
        lr=learning_rate,
        weight_decay=0.01  # Decoupled weight decay
    )

    # Learning rate scheduler: linear warm-up, then cosine annealing
    total_steps = len(train_loader) * num_epochs
    warmup_steps = len(train_loader)  # Warm up for 1 epoch

    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))
        progress = float(current_step - warmup_steps) / float(
            max(1, total_steps - warmup_steps))
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    scheduler = LambdaLR(optimizer, lr_lambda)

    best_val_loss = float('inf')
    patience = 3  # Early stopping: stop if val loss doesn't improve for 3 epochs
    patience_counter = 0

    for epoch in range(num_epochs):
        # Training phase
        student.train()
        train_loss = 0.0

        for batch_idx, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            # Forward pass (assumes model outputs expose a .logits attribute)
            student_logits = student(inputs).logits
            with torch.no_grad():
                teacher_logits = teacher(inputs).logits

            # Compute loss
            loss = distillation_loss(
                student_logits, teacher_logits, labels,
                temperature=temperature, alpha=alpha
            )

            # Backward pass
            optimizer.zero_grad()
            loss.backward()

            # Gradient clipping to prevent instability
            torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=1.0)

            optimizer.step()
            scheduler.step()

            train_loss += loss.item()

            if (batch_idx + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx+1} - "
                      f"Loss: {loss.item():.4f}, LR: {scheduler.get_last_lr()[0]:.2e}")

        avg_train_loss = train_loss / len(train_loader)

        # Validation phase
        student.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                student_logits = student(inputs).logits
                teacher_logits = teacher(inputs).logits

                loss = distillation_loss(
                    student_logits, teacher_logits, labels,
                    temperature=temperature, alpha=alpha
                )
                val_loss += loss.item()

                # Compute accuracy
                predictions = torch.argmax(student_logits, dim=1)
                correct += (predictions == labels).sum().item()
                total += labels.size(0)

        avg_val_loss = val_loss / len(val_loader)
        val_accuracy = correct / total

        print(f"Epoch {epoch+1}/{num_epochs} - "
              f"Train Loss: {avg_train_loss:.4f}, "
              f"Val Loss: {avg_val_loss:.4f}, "
              f"Val Acc: {val_accuracy:.4f}")

        # Early stopping
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
            # Save best checkpoint
            torch.save(student.state_dict(), "student_best.pt")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

    # Load best checkpoint onto the training device
    student.load_state_dict(torch.load("student_best.pt", map_location=device))
    return student

This loop incorporates several best practices: AdamW optimizer with weight decay, cosine annealing with warm-up, gradient clipping, and early stopping.
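To make the warm-up plus cosine schedule concrete, here is a minimal standalone sketch of the learning-rate multiplier as a plain function. The function name and the step counts are assumptions chosen for illustration; the multiplier is meant to be applied to the base learning rate.

```python
import math

def warmup_cosine_multiplier(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear warm-up from 0 to 1
        return step / max(1, warmup_steps)
    # Cosine decay from 1 to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumed sizes for illustration: 100 steps per epoch, 10 epochs, 1 warm-up epoch
warmup_steps, total_steps = 100, 1000
for step in (0, 50, 100, 550, 1000):
    mult = warmup_cosine_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr multiplier = {mult:.3f}")
```

A function of this shape can be wrapped in a lambda and handed to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned value at each step.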

Hyperparameter Tuning: Temperature and Alpha

The two most critical hyperparameters are temperature (T) and alpha (α). Their interaction is non-trivial.

Temperature Effects:

# Experiment with different temperatures.
# initialize_student() and evaluate() are placeholder helpers that you supply:
# one builds a fresh student, the other returns validation accuracy.
temperatures = [2.0, 4.0, 8.0, 16.0]
results = {}

for T in temperatures:
    student = initialize_student()
    trained_student = train_student_with_distillation(
        student, teacher, train_loader, val_loader,
        temperature=T, alpha=0.7
    )
    val_acc = evaluate(trained_student, val_loader)
    results[T] = val_acc
    print(f"Temperature {T}: Validation Accuracy = {val_acc:.4f}")

# Example output (typical patterns):
# Temperature 2.0: Validation Accuracy = 0.9450
# Temperature 4.0: Validation Accuracy = 0.9520 <- Often optimal
# Temperature 8.0: Validation Accuracy = 0.9480
# Temperature 16.0: Validation Accuracy = 0.9350

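Higher temperatures soften the teacher's distribution, exposing the relative probabilities it assigns to non-target classes, but they also amplify noise from classes the teacher considers very unlikely. Too low (T < 2) and the soft targets of a confident teacher stay close to one-hot, so the student essentially learns from hard targets; too high (T > 20) and the distribution approaches uniform, so the teacher's signal becomes too diffuse. The Goldilocks zone is usually T=4-8.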
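The softening effect is easy to see on a single set of logits. The following sketch uses made-up teacher logits (an assumption for illustration) and reports the temperature-scaled probabilities along with their entropy, which rises as the distribution flattens.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [6.0, 2.0, 1.0, -1.0]  # assumed example logits

for T in (1.0, 2.0, 4.0, 8.0, 20.0):
    probs = softmax(teacher_logits, T)
    formatted = ", ".join(f"{p:.3f}" for p in probs)
    print(f"T={T:>4}: probs=[{formatted}], entropy={entropy(probs):.3f}")
```

At T=1 nearly all the mass sits on the first class; as T grows, the other classes become visible, and at very high T the probabilities approach the uniform value of 0.25 each.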

The optimal temperature depends on task characteristics:

| Task Characteristic | Recommended Temperature | Reasoning |
| --- | --- | --- |
| Teacher very confident, few hard negatives | T=2-4 | Soft targets add little; focus on hard labels |
| Teacher uncertain, many plausible classes | T=8-12 | Rich soft signal; higher T needed |
| Small dataset (< 10K examples) | T=6-10 | Higher T reduces overfitting to limited data |
| Large dataset (> 100K examples) | T=4-6 | Large data provides signal; lower T works |
| Classification task (discrete outputs) | T=4-6 | Standard; soft targets are informative |
| Generation task (open-ended outputs) | T=8-16 | Less structured; higher T provides regularization |

Alpha Effects:

Alpha controls the balance between hard and soft losses. α=1.0 ignores the teacher entirely (pure supervised learning); α=0.0 ignores ground truth (pure distillation).
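The two endpoints can be checked on a single example. This is a minimal pure-Python sketch of the combined loss for one sample; the logits, label, and function names are assumptions for illustration, not part of the training code.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def combined_loss(student_logits, teacher_logits, label, temperature, alpha):
    """Single-example combined loss; returns (total, hard, soft)."""
    # Hard loss: cross-entropy against the true label at T = 1
    hard = -log_softmax(student_logits)[label]
    # Soft loss: KL(teacher || student) at the given temperature
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    total = alpha * hard + (1 - alpha) * soft * temperature ** 2
    return total, hard, soft

student_logits = [2.0, 0.5, -1.0]   # assumed values
teacher_logits = [3.0, 1.0, -2.0]   # assumed values

for alpha in (0.0, 0.7, 1.0):
    total, hard, soft = combined_loss(student_logits, teacher_logits,
                                      label=0, temperature=4.0, alpha=alpha)
    print(f"alpha={alpha}: total={total:.4f} (hard={hard:.4f}, soft={soft:.4f})")
```

With alpha=1.0 the total equals the hard loss, and with alpha=0.0 it equals the soft loss multiplied by T^2, matching the description above.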

# Experiment with different alphas
alphas = [0.3, 0.5, 0.7, 0.9]
results = {}

for a in alphas:
    student = initialize_student()
    trained_student = train_student_with_distillation(
        student, teacher, train_loader, val_loader,
        temperature=4.0, alpha=a
    )
    val_acc = evaluate(trained_student, val_loader)
    results[a] = val_acc

# Example patterns:
# alpha=0.3: 0.9520 (too much teacher, can overfit to teacher quirks)
# alpha=0.5: 0.9560 (balanced)
# alpha=0.7: 0.9580 <- Often optimal
# alpha=0.9: 0.9550 (too much ground truth, discards teacher signal)

The optimal alpha is typically 0.5-0.8. Below about 0.3 the student becomes too dependent on the teacher and can inherit its errors; above 0.9 the teacher's signal is effectively discarded. In practice, α=0.7 is a strong default that often needs little further tuning.
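Because temperature and alpha interact, it can help to sweep them jointly rather than one at a time. The sketch below is one possible way to organize such a sweep. `train_and_eval` is a hypothetical caller-supplied function, and `fake_train_and_eval` is a stand-in scorer with invented behavior so the example runs without training anything.

```python
from itertools import product

def grid_search(train_and_eval, temperatures, alphas):
    """Evaluate every (temperature, alpha) pair and return the best one.

    train_and_eval is a caller-supplied (hypothetical) function that trains
    a fresh student and returns its validation accuracy.
    """
    results = {}
    for T, a in product(temperatures, alphas):
        results[(T, a)] = train_and_eval(temperature=T, alpha=a)
    best = max(results, key=results.get)
    return best, results

def fake_train_and_eval(temperature, alpha):
    """Stand-in scorer so the sketch runs without a model.

    It peaks at temperature=4, alpha=0.7 by construction; it is NOT real data.
    """
    return 0.95 - 0.001 * abs(temperature - 4.0) - 0.02 * abs(alpha - 0.7)

best, results = grid_search(fake_train_and_eval, [2.0, 4.0, 8.0], [0.5, 0.7, 0.9])
print(f"Best (temperature, alpha): {best}, score={results[best]:.4f}")
```

In a real sweep, the stand-in would be replaced by a function that builds a fresh student, calls the training loop, and returns validation accuracy. A full grid multiplies training cost by the number of pairs, so a coarse grid is a reasonable starting point.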

Curriculum Learning and Progressive Distillation

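A more advanced technique borrows from curriculum learning: instead of fixing the loss weighting for the whole run, change it on a schedule. The variant below starts with the loss dominated by the teacher's soft targets and gradually shifts weight toward the ground-truth labels.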

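def progressive_distillation_training(
    student, teacher, train_loader, val_loader,
    num_epochs=10, initial_alpha=0.3, final_alpha=0.9,
    temperature=4.0, learning_rate=1e-3, device='cuda'
):
    """
    Progressively increase the weight of hard labels (ground truth).
    Early training leans on the teacher's soft targets; later training
    shifts toward fitting the ground-truth labels.
    The temperature, learning_rate, and device arguments mirror the main
    training loop; validation (val_loader) is omitted here for brevity.
    """
    student.to(device)
    teacher.to(device)
    teacher.eval()
    optimizer = torch.optim.AdamW(student.parameters(), lr=learning_rate)

    total_batches = len(train_loader) * num_epochs
    batch_count = 0

    for epoch in range(num_epochs):
        student.train()
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Interpolate alpha between initial and final
            progress = batch_count / max(1, total_batches - 1)
            alpha = initial_alpha + (final_alpha - initial_alpha) * progress

            student_logits = student(inputs).logits
            with torch.no_grad():
                teacher_logits = teacher(inputs).logits

            # Training step with current alpha
            loss = distillation_loss(
                student_logits, teacher_logits, labels,
                temperature=temperature, alpha=alpha
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            batch_count += 1

    return student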

This technique can improve final accuracy by roughly 0.5-1.5%, depending on the task, by guiding the student to learn incrementally. The trade-off is extra scheduling logic and two more hyperparameters to choose (the initial and final alpha).

Convergence Diagnostics

Monitor these signals to diagnose training issues:

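# Plot training curves.
# train_losses, val_losses, and val_accuracies are per-epoch lists. The training
# loop above prints these values but does not store them, so append them to
# lists inside the loop before plotting.
import matplotlib.pyplot as plt

# Use the recorded length, since early stopping may end training before num_epochs
epochs = range(1, len(train_losses) + 1)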
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses, label='Train Loss')
plt.plot(epochs, val_losses, label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Distillation Loss Over Time')

plt.subplot(1, 2, 2)
plt.plot(epochs, val_accuracies, label='Val Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Student Validation Accuracy Over Time')

plt.tight_layout()
plt.savefig('training_curves.png')

Signs of healthy training:

  • Training loss decreases smoothly.
  • Validation loss decreases and plateaus (not diverging).
  • Validation accuracy increases and plateaus.

Signs of problems:

  • Training loss oscillates wildly → learning rate too high.
  • Validation loss increases while training loss decreases → overfitting → use regularization.
  • Both losses stagnant → student too small, learning rate too low, or alpha/temperature suboptimal.
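The checks in the two lists above can be roughed out as a small helper that labels the most recent epochs. This is only a heuristic sketch: the function name, the window size, and the tolerance are assumptions, and the thresholds would need tuning for a real run.

```python
def diagnose_curves(train_losses, val_losses, window=3, tol=1e-3):
    """Return a coarse label for the last `window` epochs of training.

    Heuristics only: thresholds are assumptions and should be tuned.
    """
    if len(train_losses) <= window or len(val_losses) <= window:
        return "not enough epochs"

    train_delta = train_losses[-1] - train_losses[-1 - window]
    val_delta = val_losses[-1] - val_losses[-1 - window]

    # Count direction changes in recent training loss as a sign of oscillation
    recent = train_losses[-1 - window:]
    diffs = [b - a for a, b in zip(recent, recent[1:])]
    sign_flips = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)

    if sign_flips >= window - 1 and max(abs(d) for d in diffs) > tol:
        return "oscillating: try a lower learning rate"
    if train_delta < -tol and val_delta > tol:
        return "overfitting: add regularization or stop earlier"
    if abs(train_delta) <= tol and abs(val_delta) <= tol:
        return "stagnant: check capacity, learning rate, alpha, temperature"
    return "healthy"

# Made-up loss curves for illustration
print(diagnose_curves([1.0, 0.8, 0.6, 0.5, 0.45], [1.1, 0.9, 0.7, 0.62, 0.6]))
print(diagnose_curves([1.0, 0.6, 0.4, 0.3, 0.2], [1.0, 0.7, 0.75, 0.8, 0.9]))
```

The first call describes curves where both losses keep falling, and the second describes validation loss rising while training loss falls. A helper like this complements the plots rather than replacing them.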

Key Takeaways

  • Distillation loss combines hard and soft targets, balanced by alpha and temperature.
  • Temperature scales soft target softness (higher T = softer, more diverse signals); optimal range is 4-8.
  • Alpha sets the hard-loss weight; 0.5-0.8 is typically optimal, and values outside 0.2-0.95 rarely help.
  • Use AdamW, cosine annealing with warm-up, gradient clipping, and early stopping for stable training.
  • Monitor training and validation curves to diagnose convergence issues and adjust hyperparameters.

Frequently Asked Questions

How long does it take to train a student model?

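It depends on student size, sequence length, and hardware. As a rough guide, expect 1-5 days on a single GPU (V100, A100, or H100) for 100K-500K synthetic examples, and 6-12 hours on TPUs or multi-GPU setups. Either way, it is much faster than training a large model from scratch, which takes weeks.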

Should I train on 100% synthetic data or mix with real labels?

If you have real labeled data, mix roughly 70% synthetic with 30% real. This grounds the student in the true task distribution. If you only have synthetic data, training on 100% synthetic is acceptable, provided the synthetic distribution matches what the student will see in deployment.
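One simple way to build such a mix is to sample from each source at the target ratio and shuffle. The sketch below assumes both datasets are in-memory lists; the function name, the `real_fraction` parameter, and the toy data are illustrative assumptions.

```python
import random

def mix_datasets(synthetic, real, real_fraction, total, seed=0):
    """Sample `total` examples with about `real_fraction` of them real.

    Samples without replacement, so each source must hold enough examples.
    """
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be strictly between 0 and 1")
    n_real = round(total * real_fraction)
    n_synth = total - n_real
    if n_real > len(real) or n_synth > len(synthetic):
        raise ValueError("not enough examples for the requested mix")
    rng = random.Random(seed)  # fixed seed for a reproducible mix
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed

# Toy data: each example is tagged with its source for easy counting
synthetic = [("synthetic", i) for i in range(1000)]
real = [("real", i) for i in range(300)]

mixed = mix_datasets(synthetic, real, real_fraction=0.3, total=100)
n_real = sum(1 for source, _ in mixed if source == "real")
print(f"{len(mixed)} examples, {n_real} real, {len(mixed) - n_real} synthetic")
```

For datasets too large to hold in memory, the same ratio could instead be enforced per batch with a weighted sampler; that variant is not shown here.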

What learning rate should I use for the student?

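Start in the 1e-4 to 1e-3 range for small students. Larger students generally need lower learning rates, not higher ones: at the 1B-7B parameter scale, 1e-3 is likely to be unstable, so start at 1e-4 or below and reduce it further if the loss diverges. Use a scheduler (cosine or exponential decay) rather than a fixed rate, since this usually improves convergence.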

Can I use batch normalization or layer normalization in the student?

Yes, though you rarely need to add any yourself. Transformers and modern LLMs already include layer normalization. If you are using a custom student architecture, prefer layer normalization over batch normalization, since it is more stable across batch sizes.

How do I know when to stop training (early stopping patience)?

Use a patience of 2-3 epochs (stop if validation loss does not improve for 2-3 epochs). For larger datasets, patience can be higher (5 epochs). Monitor validation accuracy; if it plateaus, training is done.
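The patience rule can be packaged as a small helper that is independent of any framework. This is a minimal sketch: the class name and the `min_delta` option are assumptions added for illustration, and the validation losses are made up.

```python
class EarlyStopping:
    """Track validation loss and signal when training should stop."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumed knob: smallest improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_losses = [0.90, 0.80, 0.82, 0.81, 0.79]  # made-up values
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        print(f"Stopping after epoch {epoch}; best val loss {stopper.best:.2f}")
        break
```

With a patience of 2, this run stops after epoch 4 and never sees the better loss at epoch 5, which illustrates the cost of a short patience on noisy validation curves.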

Further Reading