Reward Models: Training LLMs to Judge Quality
A reward model is a neural network (typically a fine-tuned language model) that learns from human preference data to score completions on a quality dimension. Instead of generating text, it takes a prompt and a completion as input and outputs a single scalar reward. With a linear output head this value is unbounded, although some setups rescale it to a fixed range such as -1 to +1 or 0 to 100. Reward models are the bridge between human preferences (preference pairs) and policy optimization (RLHF training). A well-trained reward model generalizes beyond its training data, assigning high scores to genuinely good completions and low scores to poor ones, even on novel prompts and domains.
Reward model quality directly affects the final aligned LLM. A reward model that has overfit to its training data will steer policy optimization toward unnatural, adversarial completions that score well without being good. A poorly calibrated reward model (one whose scores are not comparable across domains) leads to misalignment. Training a robust reward model requires careful architecture choices, loss function selection, handling of ties and uncertainty, and extensive evaluation on held-out data and adversarial examples.
Architecture: Adapting Language Models as Reward Models
The standard architecture for a modern reward model is a language model (LLaMA, Mistral, or similar) fine-tuned on preference pairs. The base language model already understands language and context; the reward model head simply adds a linear layer on top of the final hidden state to predict a scalar score.
Architecturally, a reward model takes the form: Prompt + Completion → [LM hidden states] → [Linear head] → Scalar reward. The hidden state is typically the last token's representation or a pooled representation of the entire completion. The linear head (a single linear layer) maps this representation to a reward. Some designs use a small MLP (multi-layer perceptron) instead of a single linear layer for slightly more flexibility, but the difference is marginal.
A critical design choice: how to handle the prompt and completion together. One approach concatenates them: [prompt tokens] + [completion tokens] → model → reward. Another approach processes them separately and combines their representations. Empirically, concatenation is simpler and performs well; most production reward models use this approach.
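To make the pooling step concrete, here is a minimal, framework-free sketch of the two options described above: taking the hidden state of the last non-padding token, or mean-pooling over non-padding tokens, and then applying a linear head. The helper names (`last_token_pool`, `mean_pool`, `linear_head`) and the toy numbers are assumptions made for illustration; a real model would do the same with tensors.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Return the hidden state of the last non-padding token."""
    real_positions = [i for i, m in enumerate(mask) if m == 1]
    if not real_positions:
        raise ValueError("sequence contains only padding")
    return hidden[real_positions[-1]]


def mean_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Average the hidden states of non-padding tokens only."""
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("sequence contains only padding")
    return [sum(column) / len(kept) for column in zip(*kept)]


def linear_head(pooled: Vector, weight: Vector, bias: float = 0.0) -> float:
    """Map a pooled representation to a scalar reward."""
    return sum(w * x for w, x in zip(weight, pooled)) + bias


# Toy hidden states for one concatenated [prompt + completion] sequence:
# three real tokens followed by one padding token, hidden size 2.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
weight = [0.5, 0.25]  # hypothetical head weights

reward_last = linear_head(last_token_pool(hidden, mask), weight)
reward_mean = linear_head(mean_pool(hidden, mask), weight)
print(reward_last, reward_mean)
```

Note that both pooling functions consult the attention mask. Reading position -1 directly would pick up a padding token whenever sequences in a batch are right-padded.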
Loss Functions: Preference-Based Learning
Since preference pairs are the training signal (not absolute scores), the loss function must reflect ranking, not regression. The standard loss is ranking loss, specifically the Bradley-Terry loss (used by Anthropic and others):
loss = -log(sigmoid(reward_A - reward_B))
Here, reward_A and reward_B are the model's predicted rewards for completions A and B from the same prompt, and we're training on pairs where A is preferred. The sigmoid function maps the difference to a probability, and cross-entropy (the negative log) penalizes errors. Intuitively: if reward_A is higher than reward_B, the model is correct and loss is low; if reward_B is higher, the model is wrong and loss is high.
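As a quick numeric check of that intuition, the sketch below evaluates the loss for a correctly ranked pair, an undecided pair, and a wrongly ranked pair. It uses only the standard library; the function name and the softplus rewrite (which avoids overflow for large reward gaps) are choices made for this illustration.

```python
import math


def bradley_terry_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(d)) with d = reward_preferred - reward_rejected.

    -log(sigmoid(d)) equals softplus(-d) = log(1 + exp(-d)); the form below
    is algebraically identical but does not overflow when |d| is large.
    """
    d = reward_preferred - reward_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


for reward_a, reward_b in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    loss = bradley_terry_loss(reward_a, reward_b)
    print(f"reward_A={reward_a:+.1f} reward_B={reward_b:+.1f} loss={loss:.4f}")
# Correct ranking gives about 0.1269, a tie in scores gives log(2), about 0.6931,
# and a reversed ranking gives about 2.1269.
```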
An alternative is mean-squared error (MSE) loss, treating preferences as implicit scores (preferred completion = 1, non-preferred = 0, tie = 0.5):
loss = (predicted_score_A - 1)^2 + (predicted_score_B - 0)^2
MSE is simpler to implement and works reasonably well, but ranking losses are theoretically better aligned with the preference signal and often empirically outperform MSE in production systems (Anthropic, 2023).
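For comparison, the MSE variant with the implicit targets described above (1 for the preferred completion, 0 for the other, 0.5 each for a tie) can be sketched as follows. The label strings `"a"`, `"b"`, and `"tie"` are a hypothetical encoding chosen for this example.

```python
def mse_preference_loss(score_a: float, score_b: float, label: str) -> float:
    """Squared error against implicit targets derived from a preference label.

    label is "a" if A is preferred, "b" if B is preferred, or "tie".
    """
    targets = {"a": (1.0, 0.0), "b": (0.0, 1.0), "tie": (0.5, 0.5)}
    target_a, target_b = targets[label]
    return (score_a - target_a) ** 2 + (score_b - target_b) ** 2


print(mse_preference_loss(0.9, 0.2, "a"))    # small: scores agree with the label
print(mse_preference_loss(0.2, 0.9, "a"))    # large: scores contradict the label
print(mse_preference_loss(0.5, 0.5, "tie"))  # zero: both scores sit on the tie target
```

One visible difference from the ranking loss: MSE pins each score to an absolute target, whereas the ranking loss only constrains the difference between the two scores.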
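Some projects use a classification loss, treating the problem as binary (is A better than B?): the reward model outputs a logit, and cross-entropy loss is applied. When that logit is the reward difference reward_A - reward_B and the label is 1 for pairs where A is preferred, this is mathematically identical to the Bradley-Terry loss, just framed as classification.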
Handling ties is important: some projects ignore ties during training, others give them a soft label (0.5 in MSE or a custom interpolation in ranking loss). Ignoring ties wastes data; soft labels are more principled. Ties often represent genuinely ambiguous cases and can improve reward model calibration if handled carefully.
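One possible way to give ties a soft label in the ranking loss is to treat the label as a target probability that A is preferred and apply cross-entropy to the reward difference. This is a sketch of that idea under the assumption that a tie maps to a target of 0.5; it is not the only interpolation in use. With a target of 1.0 it reduces to the Bradley-Terry loss above.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def soft_bradley_terry_loss(reward_a: float, reward_b: float,
                            p_a_preferred: float) -> float:
    """Cross-entropy between a soft label and sigmoid(reward_a - reward_b).

    p_a_preferred is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie
    (the tie value is an assumption of this sketch).
    """
    d = reward_a - reward_b
    return -(p_a_preferred * log_sigmoid(d)
             + (1.0 - p_a_preferred) * log_sigmoid(-d))


# For a tie, the loss is smallest when the two rewards are equal,
# so the model is pushed toward not separating the pair.
print(soft_bradley_terry_loss(1.0, 1.0, 0.5))
print(soft_bradley_terry_loss(3.0, 0.0, 0.5))
```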
Training Procedure and Hyperparameters
Reward model training is a standard supervised fine-tuning task, but with some specific considerations:
- Learning rate: typically 5e-6 to 5e-5, at the low end of fine-tuning learning rates, since the goal is to adapt a pretrained model without overwriting what it already knows. Start at 1e-5 and adjust based on validation loss.
- Batch size: 32–128 depending on model size and GPU memory. Larger batches stabilize the gradient, but require more compute.
- Epochs: 1–3 passes over the data. With high-quality data, even 1 epoch can work; with noisier data, 2–3 helps the model converge.
- Warmup: linear warmup over 5–10 percent of training steps to stabilize early training.
- Validation split: hold out 10–20 percent of data (stratified by domain if possible) for validation and early stopping. Monitor validation ranking accuracy (the fraction of preference pairs where the model ranks correctly).
- Regularization: L2 regularization (weight decay) at 0.01–0.1 helps prevent overfitting; dropout on the reward head is sometimes used.
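The warmup item in the list above can be sketched as a small schedule function. The list only specifies the linear warmup, so holding the rate constant afterwards is an assumption of this example (many setups decay it instead), and `lr_at_step` is a hypothetical helper name.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-5,
               warmup_fraction: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first warmup_fraction of training.

    After warmup the rate is held constant (an assumption of this sketch).
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


total_steps = 1000
for step in (0, 49, 99, 500):
    print(step, lr_at_step(step, total_steps))
```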
A typical training loop checks validation ranking accuracy every 500–1000 steps and stops if it plateaus for 2–3 checks. This early stopping prevents overfitting and saves compute.
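The plateau rule just described can be sketched as a small helper that is called once per validation check. The class name `EarlyStopper`, the `min_delta` parameter, and the accuracy history are illustrative assumptions.

```python
class EarlyStopper:
    """Signal a stop when validation accuracy has not improved for `patience` checks."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_checks = 0

    def update(self, val_accuracy: float) -> bool:
        """Record one validation check; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Hypothetical validation accuracies, one per check (e.g. every 500 steps).
history = [0.62, 0.68, 0.71, 0.71, 0.70, 0.71, 0.72]
stopper = EarlyStopper(patience=3)
stopped_at = None
for check, accuracy in enumerate(history):
    if stopper.update(accuracy):
        stopped_at = check
        break
print(stopped_at, stopper.best)
```

In this run, training stops at the sixth check (index 5), three checks after the best accuracy of 0.71 was first reached. The later 0.72 is never seen, which is the usual trade-off of a short patience window.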
Code Example: Training a Reward Model
Below is a PyTorch sketch for training a reward model:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
# AutoTokenizer is what you would use to build the tokenized batches fed to the loaders
from transformers import AutoModelForCausalLM, AutoTokenizer


class RewardModel(nn.Module):
    """Language model fine-tuned as a reward model."""

    def __init__(self, model_name: str, dropout: float = 0.1):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.model.config.hidden_size
        self.reward_head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),  # Output: scalar reward
        )
        # Freeze the backbone so only the reward head trains (or use LoRA for efficiency)
        for param in self.model.parameters():
            param.requires_grad = False
        # Unfreeze the last transformer block for domain adaptation.
        # `.model.layers` is the layout used by LLaMA- and Mistral-style checkpoints.
        for param in self.model.model.layers[-1].parameters():
            param.requires_grad = True

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """
        Args:
            input_ids: [batch_size, seq_len]
            attention_mask: [batch_size, seq_len]
        Returns:
            rewards: [batch_size]
        """
        # Get the last layer's hidden states
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask,
                             output_hidden_states=True)
        last_hidden = outputs.hidden_states[-1]  # [batch_size, seq_len, hidden_size]
        # Take the last non-padding token's representation (assumes right padding);
        # indexing position -1 directly would read a pad token for shorter sequences.
        last_index = attention_mask.long().sum(dim=1) - 1  # [batch_size]
        batch_index = torch.arange(last_hidden.size(0), device=last_hidden.device)
        last_token_hidden = last_hidden[batch_index, last_index]  # [batch_size, hidden_size]
        # Predict reward
        reward = self.reward_head(last_token_hidden)  # [batch_size, 1]
        return reward.squeeze(-1)  # [batch_size]


def bradley_terry_loss(reward_preferred: torch.Tensor,
                       reward_non_preferred: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss; logsigmoid is the numerically stable log(sigmoid(x))."""
    return -F.logsigmoid(reward_preferred - reward_non_preferred).mean()


def train_reward_model(model, train_loader, val_loader, num_epochs=2, lr=1e-5):
    """Train the reward model and keep the checkpoint with the best validation accuracy.

    Assumes completion A is the preferred one in every batch, and that the
    model and the batch tensors already live on the same device.
    """
    # Only pass parameters that are actually trainable to the optimizer
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable_params, lr=lr, weight_decay=0.01)
    best_val_acc = 0.0
    for epoch in range(num_epochs):
        # Training loop
        model.train()
        train_loss = 0.0
        num_batches = 0
        for batch in train_loader:
            # batch = {
            #     'prompt_comp_a_ids': [...], 'prompt_comp_a_mask': [...],
            #     'prompt_comp_b_ids': [...], 'prompt_comp_b_mask': [...]
            # }
            reward_a = model(batch['prompt_comp_a_ids'], batch['prompt_comp_a_mask'])
            reward_b = model(batch['prompt_comp_b_ids'], batch['prompt_comp_b_mask'])
            loss = bradley_terry_loss(reward_a, reward_b)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(trainable_params, max_norm=1.0)
            optimizer.step()
            train_loss += loss.item()
            num_batches += 1
        # Validation loop
        model.eval()
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for batch in val_loader:
                reward_a = model(batch['prompt_comp_a_ids'], batch['prompt_comp_a_mask'])
                reward_b = model(batch['prompt_comp_b_ids'], batch['prompt_comp_b_mask'])
                # Accuracy: did the model rank A > B?
                val_correct += (reward_a > reward_b).sum().item()
                val_total += reward_a.shape[0]
        val_acc = val_correct / max(val_total, 1)
        avg_train_loss = train_loss / max(num_batches, 1)
        print(f"Epoch {epoch}: train_loss={avg_train_loss:.4f}, val_acc={val_acc:.4f}")
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            # Save checkpoint
            torch.save(model.state_dict(), 'best_reward_model.pt')
    return model


# Example usage. train_loader and val_loader are assumed to be DataLoaders that
# yield the batch dictionaries described in the training loop above.
model = RewardModel('mistralai/Mistral-7B-v0.1')  # or 'meta-llama/Llama-2-7b-hf'
trained_model = train_reward_model(model, train_loader, val_loader, num_epochs=2)
This example shows how to structure a reward model, define the ranking loss, and train while keeping the checkpoint with the best validation accuracy, a simple form of validation-based early stopping.
Generalization and Overfitting
Reward models are prone to overfitting because preference pairs are scarce compared to pretraining data. A reward model that memorizes training pairs will score familiar completions highly and unfamiliar completions arbitrarily, so policy optimization ends up exploiting those arbitrary scores and drifting toward unnatural behaviors.
Techniques to improve generalization:
- Data augmentation: Paraphrase prompts and introduce minor variations in completions that leave the original preference unchanged, so the training set grows without additional human annotation.
- Domain-stratified validation: If your data spans multiple domains (coding, writing, QA), validate per-domain to ensure the reward model generalizes within each domain.
- Held-out adversarial evaluation: Before deploying the reward model, test it on hand-crafted adversarial completions designed to fool it. Does it correctly score a completion that is superficially good but factually wrong?
- Calibration: Check that the reward distribution matches expectations. If the model assigns only extreme rewards (all very high or very low), it is poorly calibrated and may fail to distinguish fine-grained differences in quality.
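Domain-stratified validation from the list above amounts to computing ranking accuracy separately for each domain. A minimal sketch, assuming each validation record carries a domain tag plus the two predicted rewards (the record layout and the numbers are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, float, float]  # (domain, reward_preferred, reward_rejected)


def ranking_accuracy_by_domain(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of pairs per domain where the preferred completion scores higher."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, reward_preferred, reward_rejected in records:
        total[domain] += 1
        correct[domain] += int(reward_preferred > reward_rejected)
    return {domain: correct[domain] / total[domain] for domain in total}


records = [
    ("coding", 1.2, 0.3), ("coding", 0.8, 0.9),
    ("coding", 2.0, -1.0), ("coding", 0.5, 0.1),
    ("writing", 0.4, 0.6), ("writing", 1.1, 0.2),
]
per_domain = ranking_accuracy_by_domain(records)
print(per_domain)
```

A pooled accuracy over these six pairs would hide that the smaller domain is doing noticeably worse than the larger one, which is the point of reporting the breakdown.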
Evaluation: Beyond Accuracy
Ranking accuracy on the training set (the fraction of preference pairs ranked correctly) is a necessary but insufficient metric. A reward model that achieves 95 percent training accuracy could still be a poor critic if it has overfit or if the test distribution differs from the training distribution.
Key evaluation metrics:
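- Validation ranking accuracy: the same metric as training accuracy, but computed on held-out pairs. Target 80–95 percent depending on task difficulty; with noisy labels, the achievable accuracy is capped by how often annotators agree with each other.
- Spillover accuracy: test on preference pairs from out-of-distribution or underrepresented domains and prompt types. If your training data is 80 percent coding and 20 percent writing, test spillover on unseen writing examples. Some drop relative to validation accuracy is expected; spillover accuracy far below validation accuracy indicates overfitting to the dominant training distribution.
- Calibration curves: plot the predicted preference probability, sigmoid(reward_A - reward_B), against the realized ranking accuracy in each probability bin. A well-calibrated model's curve is close to the 45-degree line.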
- Correlation with human evaluation: sample completions from your policy, have humans rate them, and check if the reward model's scores correlate with human ratings (Spearman correlation, ideally 0.7+).
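A calibration curve from the list above can be computed by binning pairs by the model's confidence and measuring accuracy within each bin. The sketch below assumes confidence is defined as the larger of sigmoid(reward_A - reward_B) and its complement, and that bins are equal-width over [0.5, 1.0]; both are choices made for this illustration.

```python
import math
from typing import Iterable, List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def calibration_curve(pairs: Iterable[Tuple[float, float, bool]],
                      num_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    pairs: (reward_a, reward_b, a_is_preferred) for each labeled pair.
    """
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for reward_a, reward_b, a_is_preferred in pairs:
        p_a = sigmoid(reward_a - reward_b)
        confidence = max(p_a, 1.0 - p_a)          # always in [0.5, 1.0]
        correct = (p_a >= 0.5) == a_is_preferred  # did the model pick the human's choice?
        index = min(int((confidence - 0.5) / 0.5 * num_bins), num_bins - 1)
        bins[index].append((confidence, correct))
    curve = []
    for items in bins:
        if items:
            mean_confidence = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            curve.append((mean_confidence, accuracy, len(items)))
    return curve


# Two pairs scored with high confidence, only one of them ranked correctly:
# confidence near 0.99 but accuracy 0.5, i.e. an overconfident bin.
print(calibration_curve([(5.0, 0.0, True), (5.0, 0.0, False)]))
```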
A typical production reward model achieves 85–90 percent validation ranking accuracy and maintains 75+ percent spillover accuracy.
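For the human-evaluation check in the list above, Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. Below is a dependency-free sketch; in practice a statistics library would normally be used, and the reward scores and human ratings here are made-up illustration data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation."""
    return pearson(average_ranks(x), average_ranks(y))


reward_scores = [0.2, 1.5, -0.3, 0.9, 2.1]  # hypothetical reward model outputs
human_ratings = [2, 4, 1, 4, 5]             # hypothetical 1-5 human ratings
rho = spearman(reward_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```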
Key Takeaways
- Reward models are fine-tuned language models that output scalar scores, trained on preference pairs using ranking losses to predict which completion humans prefer.
- Standard architecture: concatenate prompt and completion, pass through a language model, apply a linear head to the last token's hidden state to predict reward.
- Bradley-Terry ranking loss (negative log-sigmoid of the reward difference) is the standard objective and generally outperforms MSE for preference learning.
- Training is standard fine-tuning with low learning rates (5e-6 to 5e-5), 1–3 epochs, and validation-based early stopping.
- Overfitting is a major risk; monitor generalization via held-out validation, domain spillover, and adversarial evaluation to ensure the reward model doesn't memorize.
Frequently Asked Questions
Should I freeze the language model weights during reward model training?
Partially freezing is common: freeze the majority of the model and fine-tune only the last few layers plus the reward head. This preserves the pretrained knowledge while adapting to your domain. Full training (unfrozen) is slower and requires more data but can improve performance for very specialized domains. Start with partial freezing.
How do I know if my reward model is overfitting?
Compare validation ranking accuracy to spillover accuracy (test on out-of-distribution data). If validation is 92 percent but spillover is 70 percent, you're overfitting. Also check calibration: if the reward scores are extreme (all -10 or +10) rather than spread across the range, the model is likely overconfident. Add regularization (dropout, weight decay) and use more diverse training data.
Can I use a smaller model as a reward model?
Yes. A 7B or 13B model can serve as a reward model, though with some accuracy loss compared to a 70B model. Smaller models are faster to train and run at inference (important if you're doing many reward evaluations during RLHF). Use a smaller model if compute is constrained, but validate that it generalizes adequately.
What if my preference pairs have many ties?
Ties indicate ambiguous examples. Some projects exclude ties; others assign them a soft label (0.5 in MSE loss or handle them specially in ranking loss). Ties can improve reward model robustness by teaching it when not to be confident. Monitor tie frequency—if above 30 percent, your annotation guidelines may be unclear.
Further Reading
- Learning to summarize from human feedback - OpenAI's foundational work on training reward models from preference data.
- Scaling Laws for Reward Model Overoptimization - Gao et al. on failure modes of reward models under policy optimization.
- Llama 2: Open Foundation and Fine-Tuned Chat Models - Meta's detailed account of reward model training in production.
- Scaling Laws and Compute-Optimal Training - Insights into reward model scaling and data efficiency.