RLHF Step-by-Step: From Data to Optimized Policy
RLHF (Reinforcement Learning from Human Feedback) is the end-to-end process of aligning a language model to human preferences. The pipeline has four stages: supervised fine-tuning (SFT) on high-quality examples, collecting preference pairs, training a reward model, and running reinforcement learning (typically PPO, proximal policy optimization) to optimize the policy (language model) to maximize reward while staying close to the original model. This article walks through the complete workflow, explaining each stage, key hyperparameters, and practical pitfalls.
A production RLHF run typically takes 4–12 weeks from data collection to deployment, depending on dataset size and available compute. The process is iterative: you may discover mid-training that your reward model is overfitting, or that the policy has collapsed to an undesirable local optimum. Successful RLHF requires careful monitoring, frequent evaluation, and a willingness to adjust course.
Stage 1: Supervised Fine-Tuning (SFT) Setup
RLHF starts with an SFT model: a base language model fine-tuned on 1,000–10,000 high-quality prompt-response pairs. The SFT model serves as the initialization for RLHF and as a regularization anchor: the policy is trained to maximize reward while staying close to the SFT model via a KL divergence penalty.
Creating an SFT dataset requires careful curation. Focus on quality over quantity: 5,000 highly curated examples beat 50,000 mediocre ones. Best practice is to collect examples that demonstrate desired behaviors (accurate answers, helpful tone, appropriate refusal of harmful requests). Diversity matters: include multiple domains, styles, and difficulty levels to prevent the SFT model from overfitting to narrow patterns.
Fine-tune the base model on this data using standard supervised learning: cross-entropy loss on next-token prediction. Use a learning rate between 5e-5 and 1e-4, train for 1–3 epochs with early stopping, and validate on a held-out set. The resulting SFT model is your starting checkpoint for RLHF; it should be clearly better than the base model in human evaluations (for example, a win rate above 70 percent against the base model).
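As a minimal sketch of the loss described above, assume the per-token log-probabilities are already available and that prompt tokens are excluded from the loss (a common convention that the article does not specify). The function name `sft_loss` and the mask layout are illustrative only.

```python
import math


def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log p(token_t | tokens before t) for every position.
    is_response_token: True where the token belongs to the response.
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)


# Two prompt tokens (masked out) followed by two response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
```

Only the two response tokens contribute here, so the prompt's log-probabilities do not affect the loss.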
Stage 2: Preference Data Collection and Reward Model Training
Once the SFT model is ready, generate candidate completions for a large set of prompts (e.g., 50,000 prompts). Use the SFT model or base model to generate two or more completions per prompt (via sampling with temperature and diverse decoding strategies). Then annotate these completions: human raters compare pairs and label which is better, producing preference pairs.
This phase is the most labor-intensive. For 50,000 prompts with one pair per prompt, budget for 50,000–100,000 rater judgments, since pairs are often labeled by more than one rater (roughly $25,000–$50,000 at typical crowdsourcing rates). To manage cost, blend human and synthetic annotation: use heuristics or a strong model (e.g., GPT-4) to label 60–70 percent of pairs, and reserve human annotation for validation and edge cases.
Train the reward model on these preference pairs as described in the previous article. Aim for 85–90 percent validation accuracy; if accuracy is lower, either the preference data is noisy (improve annotation quality) or the reward model capacity is insufficient (use a larger base model). Once trained, validate the reward model on truly held-out human-annotated examples to ensure it generalizes.
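The pairwise objective commonly used for reward models (a Bradley-Terry style loss) and the validation accuracy mentioned above can be sketched as follows. This is an illustration under the assumption that the reward model emits one scalar score per completion; the function names are hypothetical and the scores are made up.

```python
import math


def pairwise_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(scored_pairs):
    """Fraction of pairs where the human-preferred completion scores higher."""
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


# (score of preferred completion, score of rejected completion)
pairs = [(2.0, 0.5), (1.0, 1.5), (0.3, -0.7), (0.0, -2.0)]
mean_loss = sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)
accuracy = pairwise_accuracy(pairs)
```

The loss shrinks as the preferred completion's score exceeds the rejected one by a wider margin, and the accuracy is the quantity compared against the 85–90 percent target.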
Stage 3: Policy Optimization With PPO
Policy optimization is where the model learns to generate better responses. The policy (language model) takes prompts and generates completions; the reward model evaluates these completions. An RL algorithm (PPO in most cases) updates the policy weights to maximize expected reward while incurring a penalty for deviating too far from the SFT baseline (measured by KL divergence).
Key concept: KL divergence regularization. Without regularization, the policy would optimize purely for reward, potentially generating adversarial or out-of-distribution completions that fool the reward model. The KL penalty prevents this: the loss function becomes loss = -reward + beta * KL(policy || sft_model). The parameter beta (typically 0.01–0.1) controls the tradeoff; higher beta enforces closer adherence to the SFT model, lower beta allows more exploration toward high reward.
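To make the tradeoff concrete, here is a small sketch of the penalized objective for a single sampled completion. It assumes the KL term is estimated from the sampled tokens as the summed difference of log-probabilities, which is one common estimator rather than the only option; the function names and numbers are illustrative.

```python
def sequence_kl_estimate(policy_token_log_probs, sft_token_log_probs):
    """Single-sample estimate of KL(policy || sft_model) for one completion:
    the sum over tokens of log pi(token) - log pi_sft(token)."""
    return sum(p - s for p, s in zip(policy_token_log_probs, sft_token_log_probs))


def penalized_loss(reward, kl, beta=0.05):
    """loss = -reward + beta * KL, as in the formula above."""
    return -reward + beta * kl


policy_lp = [-0.2, -0.1, -0.3]  # log-probs of the sampled tokens under the policy
sft_lp = [-0.5, -0.4, -0.3]     # log-probs of the same tokens under the SFT model
kl = sequence_kl_estimate(policy_lp, sft_lp)
loss_low_beta = penalized_loss(reward=1.0, kl=kl, beta=0.01)
loss_high_beta = penalized_loss(reward=1.0, kl=kl, beta=0.1)
```

For the same reward and drift, the higher beta yields the larger loss, so the optimizer is pushed harder back toward the SFT model.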
The PPO algorithm works in episodes: (1) the policy generates completions for a batch of prompts, (2) the reward model scores these completions, (3) the policy is updated to maximize reward minus the KL penalty, using a clipped probability ratio (the clip ratio, epsilon) to limit how far each update can move the policy. PPO is chosen over simpler policy gradient methods because it is more stable and sample-efficient, which is critical when compute is expensive.
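The update in step (3) relies on PPO's clipped surrogate objective. The sketch below evaluates it for a single sample, assuming sequence-level log-probabilities and a precomputed advantage; the function name is illustrative.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_ratio=0.2):
    """PPO objective for one sample (to be maximized).

    ratio = pi_new(y|x) / pi_old(y|x). Taking the minimum of the unclipped and
    clipped terms removes any incentive to move the ratio outside
    [1 - clip_ratio, 1 + clip_ratio] in the direction the advantage favors.
    """
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
capped_gain = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio of 1.5 with a negative advantage: the full penalty is kept.
full_penalty = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry is deliberate: improvements are capped, while moves that make a bad completion more likely are penalized in full.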
Key hyperparameters for PPO:
- Learning rate: 5e-6 to 5e-5, typically lower than supervised fine-tuning.
- Number of PPO epochs: 4–16 inner gradient updates per batch of generated completions.
- Mini-batch size: 32–128 completions per gradient update, drawn from the rollout batch. Larger mini-batches reduce gradient variance but require more memory and compute.
- Rollout batch size: How many prompts are used to generate completions in each episode. Typically 512–2048. Larger rollouts provide better reward signal but are slower.
- KL coefficient (beta): 0.01–0.1. Start at 0.05 and adjust based on KL divergence in logs. If KL is creeping up, increase beta; if the model is changing too slowly, decrease beta.
- Clip ratio (epsilon): 0.2 is the standard value. It bounds how far the probability ratio between the updated and pre-update policy can move within one PPO update.
A typical PPO run generates 1–2 million completions before convergence (e.g., 1000 episodes × 1500 prompts per episode). Training takes 1–7 days on modern hardware (8x A100 GPUs). Monitor KL divergence at each step; if it exceeds your threshold (typically 1.0–2.0 nats), the policy is drifting too far from the SFT model.
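The hyperparameters and the rollout arithmetic above can be collected in one place. The following is a sketch with hypothetical field names, not the configuration object of any particular library; it treats the batch size as the per-update mini-batch size, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    # Defaults are taken from the ranges discussed above.
    learning_rate: float = 5e-6
    ppo_epochs: int = 4
    minibatch_size: int = 64
    rollout_batch_size: int = 1500
    kl_coef: float = 0.05        # beta
    clip_ratio: float = 0.2      # epsilon
    num_episodes: int = 1000
    kl_alert_nats: float = 2.0   # drift threshold for monitoring

    def total_completions(self) -> int:
        """Completions generated over the whole run."""
        return self.num_episodes * self.rollout_batch_size

    def gradient_steps_per_episode(self) -> int:
        """Mini-batch updates made on each rollout batch."""
        minibatches = -(-self.rollout_batch_size // self.minibatch_size)  # ceiling division
        return self.ppo_epochs * minibatches


config = PPOConfig()
total = config.total_completions()
steps = config.gradient_steps_per_episode()
```

With these defaults, 1000 episodes of 1500 prompts each give 1.5 million completions, matching the example figure in the text.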
Stage 4: Evaluation and Iteration
Throughout RLHF training, continuously evaluate the policy on held-out prompts using the reward model. However, reward model scores can be gamed (reward hacking). A critical evaluation step is to have humans rate completions from the current policy and compare to your SFT baseline. Aim for the RLHF policy to achieve >70 percent win rate vs. the SFT model.
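When checking a win-rate target from a limited number of human comparisons, it helps to report an interval rather than a point estimate. The sketch below uses a Wilson score interval and excludes ties; both choices are assumptions, and the counts are invented for illustration.

```python
import math


def win_rate_interval(wins, losses, z=1.96):
    """Win rate (ties excluded) with an approximate 95% Wilson score interval."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one decided comparison")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


# Hypothetical result: the RLHF policy wins 290 of 400 decided comparisons.
rate, low, high = win_rate_interval(wins=290, losses=110)
meets_target = low > 0.70
```

In this example the observed win rate is 72.5 percent, but the lower bound of the interval falls below 0.70, so 400 comparisons are not enough to confirm the target with confidence.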
Also monitor safety. Periodically run adversarial evaluations: can the policy be tricked into generating harmful content? If the policy has converged to high reward but low safety, increase the KL penalty or add safety-specific preference data to the training set and retrain.
Common failure modes at this stage:
- Reward hacking: the policy learns to game the reward model (e.g., generating verbose but inaccurate answers if the reward model correlates length with quality). Detected by human evaluation. Fix: add adversarial examples to the preference data, retrain the reward model, or increase KL penalty.
- Mode collapse: the policy converges to a narrow set of completions (e.g., always generates "I don't know" to avoid errors). Detected by low diversity in generated completions. Fix: increase temperature during generation or adjust the reward model to encourage diversity.
- Capability regression: the policy forgets how to handle certain prompt types (e.g., stops generating code). Detected by per-domain evaluation. Fix: monitor per-domain performance and use domain-specific reward weighting if needed.
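One inexpensive screen for the length-based reward hacking described in the list above is to correlate completion length with reward on a sample of rollouts. This sketch assumes whitespace tokenization and a plain Pearson correlation; it flags a suspicious pattern and does not replace human evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(completions, rewards):
    """Correlation between word count and reward-model score."""
    lengths = [len(text.split()) for text in completions]
    return pearson(lengths, rewards)


completions = [
    "short answer",
    "a somewhat longer answer",
    "a much longer and wordier answer",
]
rewards = [0.1, 0.5, 0.9]
corr = length_reward_correlation(completions, rewards)
```

A correlation that stays near 1 as training progresses suggests the reward model is paying for length, which is worth confirming with human raters.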
Code Example: PPO Training Loop Pseudocode
Below is a simplified PyTorch sketch of a PPO training loop. It assumes Hugging Face causal language models for policy_model and sft_model, and a reward_model object with a score(prompts, completions) method that returns one scalar per completion; that interface is illustrative, not a library API.

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW


def ppo_training_loop(policy_model, sft_model, reward_model, tokenizer, prompts,
                      num_episodes=1000, batch_size=8, ppo_epochs=4,
                      beta=0.05, clip_ratio=0.2):
    """
    Simplified PPO training loop (sequence-level, no value function).

    Args:
        policy_model: language model to optimize
        sft_model: frozen SFT baseline for the KL penalty
        reward_model: object whose score(prompts, completions) returns one
            scalar reward per completion (assumed interface)
        tokenizer: tokenizer shared by policy_model and sft_model
        prompts: list of prompt strings
        num_episodes: number of rollout/update rounds
        batch_size: prompts sampled per episode (must be at least 2)
        ppo_epochs: gradient updates per batch of rollouts
        beta: KL divergence coefficient
        clip_ratio: PPO clipping range (epsilon)
    """
    optimizer = AdamW(policy_model.parameters(), lr=5e-6)
    sft_model.eval()

    for episode in range(num_episodes):
        # Step 1: Generate completions for the next batch, wrapping around the prompt list
        start = episode * batch_size
        batch_prompts = [prompts[(start + i) % len(prompts)] for i in range(batch_size)]
        completions, rollouts = generate_completions(
            policy_model, batch_prompts, tokenizer, sample=True, temperature=0.7
        )

        # Step 2: Score completions with the reward model
        rewards = torch.as_tensor(
            reward_model.score(batch_prompts, completions), dtype=torch.float32
        )

        # Step 3: Log probs under the pre-update policy and the SFT model (no gradients)
        with torch.no_grad():
            log_probs_old = compute_log_probs(policy_model, rollouts)
            log_probs_sft = compute_log_probs(sft_model, rollouts)

        # Per-sample KL estimate: log(policy) - log(sft)
        kl_divergence = log_probs_old - log_probs_sft
        # Fold the KL penalty into the reward, then normalize to get advantages
        penalized = rewards.to(kl_divergence.device) - beta * kl_divergence
        advantages = (penalized - penalized.mean()) / (penalized.std() + 1e-8)

        # Step 4: Several clipped-surrogate updates on the same rollouts
        for _ in range(ppo_epochs):
            log_probs_new = compute_log_probs(policy_model, rollouts)
            ratio = torch.exp(log_probs_new - log_probs_old)
            clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
            policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

            # Step 5: Update policy
            optimizer.zero_grad()
            policy_loss.backward()
            torch.nn.utils.clip_grad_norm_(policy_model.parameters(), max_norm=1.0)
            optimizer.step()

        if episode % 100 == 0:
            print(f"Episode {episode}: reward={rewards.mean():.2f}, kl={kl_divergence.mean():.4f}")
    return policy_model


def generate_completions(model, prompts, tokenizer, sample=True, temperature=0.7,
                         max_new_tokens=100):
    """Sample one completion per prompt.

    Returns the decoded completions and, for each one, a (token_ids, prompt_len)
    pair so that log probabilities can be recomputed later with gradients.
    """
    completions, rollouts = [], []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        prompt_len = inputs["input_ids"].shape[1]
        with torch.no_grad():
            sequence = model.generate(
                **inputs, max_new_tokens=max_new_tokens,
                do_sample=sample, temperature=temperature,
            )[0]
        # Decode only the newly generated tokens, not the prompt
        completions.append(tokenizer.decode(sequence[prompt_len:], skip_special_tokens=True))
        rollouts.append((sequence, prompt_len))
    return completions, rollouts


def compute_log_probs(model, rollouts):
    """Summed log probability of each completion's tokens under the model."""
    log_probs = []
    for sequence, prompt_len in rollouts:
        logits = model(input_ids=sequence.unsqueeze(0)).logits[0, :-1]
        # Logits at position t predict token t + 1, so targets are shifted by one
        token_log_probs = F.log_softmax(logits, dim=-1).gather(
            -1, sequence[1:].unsqueeze(-1)
        ).squeeze(-1)
        # Keep only the completion tokens
        log_probs.append(token_log_probs[prompt_len - 1:].sum())
    return torch.stack(log_probs)
```

This sketch shows the main loop: generate completions, score them, estimate KL divergence against the SFT model, and update the policy with the clipped objective. A real implementation adds a learned value function for per-token advantages, batched generation, gradient accumulation, and distributed training.
Monitoring and Hyperparameter Tuning
Set up comprehensive logging: track reward, KL divergence, policy loss, and diversity metrics (e.g., unique n-grams in generated completions) at each episode. Plot these in real-time (e.g., with TensorBoard or Weights and Biases) to detect collapse or hacking early.
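The unique n-gram diversity metric mentioned above is often computed as distinct-n: the number of unique n-grams divided by the total number of n-grams across a batch of completions. This sketch assumes whitespace tokenization, which is a simplification; a real pipeline would more likely use the model's tokenizer.

```python
def distinct_n(completions, n=2):
    """Unique n-grams divided by total n-grams across all completions."""
    total = 0
    unique = set()
    for text in completions:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


collapsed = ["I don't know", "I don't know", "I don't know"]
varied = ["The capital is Paris", "Use a for loop", "I don't know"]
collapsed_score = distinct_n(collapsed)
varied_score = distinct_n(varied)
```

A sharp drop in this score between episodes is an early sign that the policy is converging on a narrow set of completions.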
Hyperparameter tuning for PPO is an art. Start with conservative defaults (beta=0.05, clip_ratio=0.2, learning_rate=5e-6) and adjust based on observed behavior:
- If KL divergence is rising consistently (>2.0 nats), increase beta to 0.1 or 0.2.
- If reward is stagnant and KL is below target, decrease beta toward 0.01 or raise the learning rate.
- If completions are becoming repetitive, increase temperature during generation or add diversity bonuses to the reward function.
A full RLHF run typically requires 1–3 hyperparameter sweeps before finding stable settings.
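The manual beta adjustments described above can also be automated with a simple proportional controller, similar in spirit to the adaptive KL schemes used in some RLHF implementations. The gain and error clip below are assumed values for illustration, not recommendations from this article.

```python
def update_beta(beta, observed_kl, target_kl, gain=0.1, max_error=0.2):
    """Nudge beta up when KL is above target and down when it is below.

    The relative error is clipped so a single noisy KL reading cannot
    change beta by more than gain * max_error (2% with these defaults).
    """
    error = (observed_kl - target_kl) / target_kl
    error = max(-max_error, min(max_error, error))
    return beta * (1.0 + gain * error)


beta = 0.05
beta_after_drift = update_beta(beta, observed_kl=2.0, target_kl=1.0)  # KL too high
beta_after_stall = update_beta(beta, observed_kl=0.5, target_kl=1.0)  # KL too low
```

Called once per episode with the logged KL, this raises beta while the policy drifts and relaxes it when the policy is barely moving.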
Key Takeaways
- RLHF is a four-stage pipeline: SFT (1,000–10,000 curated examples), preference collection (50,000+ pairs), reward model training (85–90 percent validation accuracy), and PPO optimization (1–7 days).
- PPO maximizes reward while maintaining KL divergence regularization to prevent exploitation of the reward model.
- Critical hyperparameters: learning rate (5e-6 to 5e-5), KL coefficient (0.01–0.1), batch size (32–128), and clip ratio (0.2).
- Monitor reward, KL divergence, and diversity throughout training. Early detection of hacking or collapse is essential.
- Human evaluation is the gold standard; validate RLHF improvements on held-out human judges, not just reward model scores.
Frequently Asked Questions
How long does a full RLHF run take?
Data collection (preference pairs) is typically 4–8 weeks and is the main bottleneck. Reward model training takes 2–7 days (depending on model size and data volume). PPO training takes 1–7 days. Total: 4–12 weeks from start to production model. With parallelized annotation, this can be reduced.
What if the reward model overfits during RLHF?
If the reward model is overfit, the policy will exploit its blind spots and generate unnatural completions. Detect this via human evaluation: if the policy has high reward but low human preference, the reward model is likely overfit. Fix: retrain the reward model with more diverse data, use a larger base model, or add regularization (dropout, weight decay).
Should I use the same reward model throughout RLHF or update it?
Most production systems freeze the reward model during PPO to ensure stability. However, some advanced approaches periodically retrain the reward model on the policy's new completions (online reward learning) to improve calibration. This is more complex but can lead to better final models. Start with a frozen reward model; explore online learning once the basics work.
What is the ideal KL divergence range during PPO?
Typical target: 0.1–1.0 nats per prompt (average log probability difference). If KL is too high (>2.0), the policy has drifted significantly and may have lost SFT knowledge. If KL is too low (<0.05), the policy isn't exploring enough. Adjust beta based on observed KL.
Further Reading
- Fine-Tuning Language Models from Human Preferences (Ziegler et al.): the RLHF pipeline and PPO for language models.
- Training Language Models to Follow Instructions with Human Feedback (Ouyang et al.): the InstructGPT paper, OpenAI's production RLHF system at scale.
- Proximal Policy Optimization Algorithms (Schulman et al.): the foundational PPO paper.
- Safe Reinforcement Learning in Large Language Models: recent work on safety during RLHF optimization.