Building Preference Pairs: A Practical Guide to Data Collection
Preference pairs are the raw material of alignment training: pairs of completions for a single prompt, labeled by a human to indicate which is better. Building a high-quality preference dataset is the first and often most labor-intensive step in RLHF. Success depends on clear annotation guidelines, careful rater selection and training, inter-rater agreement metrics, and strategic blending of human and synthetic data to control cost.
Annotation quality directly impacts downstream reward model performance and final model behavior. A dataset with ambiguous, contradictory, or biased labels teaches the reward model inconsistent preferences, which leads to misaligned final outputs. Conversely, a carefully curated dataset of 20,000 pairs with inter-rater agreement above 80 percent can produce a stronger reward model than a noisy 100,000-pair dataset. As a rough planning figure, expect data collection and quality assurance to take 30–50 percent of a production RLHF budget.
Designing Annotation Guidelines: The Rulebook for Raters
Before collecting a single pair, invest time in writing clear, detailed annotation guidelines. These guidelines define what "better" means in your context and reduce ambiguity and disagreement between raters. A good guideline document includes: explicit criteria (accuracy, helpfulness, safety, tone), worked examples (here is a prompt, here are two completions, here is the preferred one and why), edge cases and tiebreakers, and explicit instructions for handling ties or unclear cases.
For example, if you're aligning a medical chatbot, your guidelines might prioritize: (1) factual accuracy (did the response state only known, evidence-based information?), (2) appropriate disclaimers (does it acknowledge uncertainty and recommend consulting a doctor?), (3) helpfulness (does it address the user's question?). A worked example: prompt "What are symptoms of appendicitis?" → two completions. Completion A lists accurate symptoms but omits any disclaimer. Completion B lists symptoms with a clear "consult a doctor immediately" disclaimer. Your guideline specifies that B is preferred because safety and disclaimers override raw helpfulness.
Annotation guidelines should also address disagreement: what happens if Rater 1 prefers completion A and Rater 2 prefers completion B? Many projects treat such split decisions as 50–50 data (either keeping both as separate single-preference samples or collapsing them into an explicit "tied" label); others collect an odd number of annotations per pair (e.g., three) and take the majority vote. Clear tiebreaker rules reduce downstream ambiguity and improve inter-rater agreement; the improvement can be on the order of moving from 70–75 percent to 80–85 percent.
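The majority-vote option can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `resolve_label` and the choice to fall back to a tie when no label has a strict majority are assumptions made for this example.

```python
from collections import Counter
from typing import List


def resolve_label(votes: List[str]) -> str:
    """Return the strict-majority label among 'a', 'b', 'tie', else 'tie'."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, top = Counter(votes).most_common(1)[0]
    # A label wins only with more than half of the votes; otherwise the
    # item is recorded as a tie so it can be down-weighted or reviewed.
    return label if top > len(votes) / 2 else "tie"


print(resolve_label(["a", "b", "a"]))  # a
print(resolve_label(["a", "b"]))       # tie
```

Using an odd number of raters per pair keeps the fallback rare, although a three-way split (one vote each for a, b, and tie) is still possible.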
Human Annotation: Scale and Cost
Scaling human annotation requires infrastructure. You have three main options: in-house annotation (your employees label data), open crowdsourcing and freelance marketplaces (Amazon Mechanical Turk, Upwork), or managed annotation vendors (Scale AI, Surge AI, Outlier AI, Labelbox) that recruit, train, and manage raters. Each has tradeoffs.
In-house annotation usually yields the highest quality (raters understand your domain) but is the slowest and most expensive option (fully loaded salaries, often only 1–5 people annotating). Crowdsourcing is cheapest (per-pair costs often $0.10–$0.50 USD, depending on complexity) but quality is variable: you must screen raters, include attention checks, pay bonuses for quality, and carefully validate work. Annotation vendors are the middle ground: moderate cost ($0.30–$2+ per pair for specialized tasks), higher quality (pre-screened, trained raters), and managed overhead.
A typical production RLHF project uses a hybrid: initial validation and high-stakes examples (disagreements, edge cases) are annotated in-house or by domain experts, while the bulk of straightforward pairs are crowdsourced. This balances cost and quality. For 50,000 pairs at $0.50 per pair via crowdsourcing, expect $25,000 in direct annotation costs, plus infrastructure, QA, and rater management.
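The cost arithmetic above is easy to parameterize. The sketch below is only a budgeting aid under stated assumptions: the `raters_per_pair` and `overhead_rate` parameters are hypothetical knobs for multi-rater annotation and for QA or management overhead, and the 20 percent overhead in the second call is an illustrative value, not a figure from this guide.

```python
def annotation_cost(num_pairs: int, cost_per_pair: float,
                    raters_per_pair: int = 1,
                    overhead_rate: float = 0.0) -> float:
    """Direct labeling cost plus a proportional overhead (QA, management)."""
    direct = num_pairs * cost_per_pair * raters_per_pair
    return direct * (1.0 + overhead_rate)


# The example from the text: 50,000 pairs at $0.50, one rater per pair.
print(annotation_cost(50_000, 0.50))  # 25000.0

# Same dataset with three raters per pair and an assumed 20 percent overhead.
print(round(annotation_cost(50_000, 0.50, raters_per_pair=3, overhead_rate=0.2)))
```

Note how quickly multi-rater annotation multiplies the direct cost, which is one reason overlap is often limited to a sample of pairs.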
Inter-Rater Agreement: Measuring Consistency
Inter-rater agreement (IRA) quantifies how often raters agree on which completion is better. The simplest metric is percent agreement: if 100 pairs are annotated independently by 2 raters, on how many pairs do both raters choose the same completion? Percent agreement often ranges from 70–90 percent in practice (humans are subjective!). A more rigorous metric is Cohen's kappa, which corrects for chance agreement: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. Kappa values above 0.80 are commonly interpreted as strong agreement. Kappa is lower than raw percent agreement whenever agreement is imperfect and some agreement is expected by chance, so the two numbers should not be judged against the same threshold.
Disagreement is not always bad. If two raters disagree on a pair, it often signals an ambiguous or genuinely tied example. Some projects treat disagreements as features: they mark which pairs have high uncertainty so that reward model training can down-weight or ignore them. Others use disagreement as a signal to revise guidelines: if 20+ pairs on a particular topic show 50–50 disagreement, your guidelines are probably unclear on that topic. Rewrite them, re-annotate the affected pairs, and move on.
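To make the difference between the two metrics concrete, here is a small sketch that computes both for two raters who labeled the same items. It uses the standard definition kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement and p_e the chance agreement implied by each rater's label frequencies. The function name and the toy labels are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def agreement_and_kappa(rater_1: List[str],
                        rater_2: List[str]) -> Tuple[float, float]:
    """Return (percent agreement, Cohen's kappa) for two aligned label lists."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_1)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_1, rater_2)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    c1, c2 = Counter(rater_1), Counter(rater_2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if p_e == 1.0:
        return p_o, 1.0  # both raters used one identical label throughout
    return p_o, (p_o - p_e) / (1.0 - p_e)


r1 = ["a", "a", "b", "b", "a", "b", "a", "b", "a", "a"]
r2 = ["a", "b", "b", "b", "a", "b", "a", "a", "a", "a"]
p_o, kappa = agreement_and_kappa(r1, r2)
print(f"agreement={p_o:.2f} kappa={kappa:.2f}")  # agreement=0.80 kappa=0.58
```

In this toy case 80 percent raw agreement corresponds to a kappa of only about 0.58, because with a 60/40 label split the raters would already agree 52 percent of the time by chance.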
Track IRA per-rater, per-category (safety vs. helpfulness vs. code quality), and per-prompt-type. A rater who agrees with the team 65 percent of the time may need retraining; one at 90 percent is reliable. A category (e.g., medical safety) with 60 percent IRA might need better guidelines. This granular analysis prevents systemic bias and ensures the dataset is reliable.
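One simple way to track agreement per rater is to compare each rater's labels with the majority label on every multiply-annotated item. The sketch below assumes annotations arrive as hypothetical `(item_id, rater_id, label)` tuples; the same function can be run separately on each category or prompt type. Because a rater's own vote counts toward the majority, the scores are slightly optimistic, especially with few raters per item.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def rater_agreement(annotations: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Share of each rater's labels that match the item's majority label."""
    by_item = defaultdict(list)
    for item_id, rater_id, label in annotations:
        by_item[item_id].append((rater_id, label))

    hits, totals = Counter(), Counter()
    for votes in by_item.values():
        majority, top = Counter(label for _, label in votes).most_common(1)[0]
        if top <= len(votes) / 2:
            continue  # no strict majority: skip rather than guess a consensus
        for rater_id, label in votes:
            totals[rater_id] += 1
            hits[rater_id] += int(label == majority)
    return {rater: hits[rater] / totals[rater] for rater in totals}


annotations = [
    ("p1", "r1", "a"), ("p1", "r2", "a"), ("p1", "r3", "b"),
    ("p2", "r1", "b"), ("p2", "r2", "b"), ("p2", "r3", "b"),
    ("p3", "r1", "a"), ("p3", "r2", "b"), ("p3", "r3", "b"),
]
for rater, score in sorted(rater_agreement(annotations).items()):
    print(rater, round(score, 2))
```

Raters whose score falls below the project's threshold can then be routed to retraining, and low-scoring categories point to guideline sections that need revision.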
Synthetic Preference Data: Cost-Effective Scaling
As alignment datasets have grown (modern projects target 100,000+ pairs), pure human annotation becomes prohibitively expensive. Synthetic preference generation uses rule-based classifiers, heuristics, or other LLMs to generate preference labels automatically. For example:
- Rule-based: "If completion A contains exact citations and completion B does not, prefer A" for a research assistant.
- Model-based: Use GPT-4 to score two completions and pick the higher-scoring one.
- Heuristic: "Prefer shorter responses unless the prompt asks for detail."
Synthetic data can be 100–1000x cheaper than human annotation, but it introduces new risks: the heuristic may be wrong (shorter responses aren't always better), biases propagate (if your LLM judge has an implicit bias, it taints the dataset), and overfitting is possible (the reward model learns the judge's patterns rather than true quality).
Best practice is to blend: use heuristics or a strong model (GPT-4) to label the bulk of the data (roughly 60–80 percent), then spend the human annotation budget on the remaining 20–40 percent and on validation. After training the reward model, evaluate it on a held-out human-annotated test set; if performance gaps emerge between human and synthetic regions, add more human data to those areas. A split of around 60–70 percent synthetic to 30–40 percent human is a common starting point.
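As an illustration of the rule-based option, the sketch below labels a pair by checking which completion contains a citation. The regular expression is a deliberately crude, hypothetical pattern (bracketed numbers such as "[1]" or author-year forms such as "(Smith, 2020)"); a real project would need a pattern matched to its own citation style. The rule abstains when it cannot separate the two completions, so those pairs can be routed to a human or to another labeler.

```python
import re
from typing import Optional

# Hypothetical citation pattern: "[12]" or "(Author, 2020)" / "(Author et al., 2020)".
CITATION = re.compile(r"\[\d+\]|\([A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\)")


def citation_rule(completion_a: str, completion_b: str) -> Optional[str]:
    """Prefer the completion that cites a source; return None to abstain."""
    has_a = bool(CITATION.search(completion_a))
    has_b = bool(CITATION.search(completion_b))
    if has_a == has_b:
        return None  # both or neither cite: the rule has no opinion
    return "a" if has_a else "b"


print(citation_rule("Water boils at 100 C at sea level [1].",
                    "Water boils at 100 C."))  # a
print(citation_rule("Short answer.", "Another answer."))  # None
```

Letting rules abstain, instead of forcing a label on every pair, limits how far a wrong heuristic can spread through the dataset.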
Prompt Diversity and Coverage
A preference dataset is only as good as its prompt diversity. If all prompts are customer-service queries, the reward model will learn customer-service preferences but fail on code generation, reasoning, or safety-critical scenarios. Best practice is to sample prompts from multiple domains and difficulty levels:
- Domains: customer support, coding, math, creative writing, open-ended QA, safety-sensitive queries (requests for harmful content, misinformation), reasoning tasks.
- Difficulty: easy (straightforward factual questions), medium (multi-step reasoning, rare domains), hard (adversarial, edge cases, tricky prompts designed to make models fail).
A balanced dataset might allocate an equal share to each domain you cover (for example, 20 percent to each of 5 domains), with a 30/40/30 split (easy/medium/hard) within each. This helps the reward model generalize across the model's full capability range.
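The allocation described above can be turned into per-cell quotas for prompt sampling. This is a sketch under simple assumptions: domains get equal shares, the difficulty split is the same in every domain, and the function and variable names are illustrative. Because each cell is rounded, the quotas may not sum exactly to the target for totals that do not divide evenly.

```python
from typing import Dict, List, Tuple


def allocate_quotas(total: int, domains: List[str],
                    difficulty_split: Dict[str, float]) -> Dict[Tuple[str, str], int]:
    """Equal share per domain, then a fixed difficulty split inside each domain."""
    if not domains:
        raise ValueError("at least one domain is required")
    if abs(sum(difficulty_split.values()) - 1.0) > 1e-9:
        raise ValueError("difficulty shares must sum to 1.0")
    per_domain = total // len(domains)
    return {
        (domain, level): round(per_domain * share)
        for domain in domains
        for level, share in difficulty_split.items()
    }


domains = ["customer support", "coding", "math", "creative writing", "safety"]
split = {"easy": 0.3, "medium": 0.4, "hard": 0.3}
quotas = allocate_quotas(50_000, domains, split)
print(quotas[("coding", "medium")])  # 4000
print(sum(quotas.values()))          # 50000
```

Comparing these quotas against the counts actually collected is a quick way to spot under-covered cells before annotation starts.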
Annotation Interface and Workflow
Invest in a good annotation interface. Raters need to see: the prompt, completion A, completion B (randomized left/right to prevent position bias), and a clear radio-button or button choice (A is better / B is better / tie). Optional: allow raters to give reasons (checkboxes: "more accurate," "better tone," "clearer," etc.) for offline analysis. Reasons improve quality, let you debug failures, and help other raters calibrate.
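Randomizing display order only helps if the rater's choice is mapped back to the canonical A/B labels before storage. A minimal sketch of that bookkeeping follows; the function names and the dictionary layout are assumptions for this example, not part of any particular annotation tool.

```python
import random
from typing import Dict


def present_pair(completion_a: str, completion_b: str,
                 rng: random.Random) -> Dict[str, str]:
    """Randomly pick which completion is shown on the left."""
    swapped = rng.random() < 0.5
    left, right = (completion_b, completion_a) if swapped else (completion_a, completion_b)
    # Record which canonical completion ended up on the left side.
    return {"left": left, "right": right, "left_is": "b" if swapped else "a"}


def to_canonical(choice: str, shown: Dict[str, str]) -> str:
    """Map a UI choice ('left', 'right', 'tie') back to 'a', 'b', or 'tie'."""
    if choice == "tie":
        return "tie"
    if choice not in ("left", "right"):
        raise ValueError("choice must be 'left', 'right', or 'tie'")
    left_is = shown["left_is"]
    if choice == "left":
        return left_is
    return "b" if left_is == "a" else "a"


rng = random.Random(0)  # seeded only to make the example repeatable
shown = present_pair("Completion A text", "Completion B text", rng)
print(shown["left_is"], to_canonical("left", shown))
```

Storing `left_is` alongside each annotation also makes it possible to audit for position bias later.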
A typical workflow: (1) raters annotate a pilot batch; (2) you check inter-rater agreement on a sample; (3) if IRA is below target (e.g., 75 percent), stop, revise guidelines, retrain raters, and retry; (4) once IRA stabilizes, annotate the full dataset; (5) perform final QA (check a random sample for quality, review flagged pairs); (6) deduplicate and export.
Code Example: Constructing Preference Data
Below is a Python example showing how to structure and validate preference pairs:
import json
from typing import List, Dict
from collections import defaultdict


class PreferencePair:
    """Represents a single preference annotation."""

    VALID_LABELS = {'a', 'b', 'tie'}

    def __init__(self, prompt: str, completion_a: str, completion_b: str,
                 preferred: str, annotator_id: str, confidence: float = 1.0):
        # Validate early so malformed labels never reach the training set.
        if preferred not in self.VALID_LABELS:
            raise ValueError("preferred must be 'a', 'b', or 'tie'")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        self.prompt = prompt
        self.completion_a = completion_a
        self.completion_b = completion_b
        self.preferred = preferred  # 'a', 'b', or 'tie'
        self.annotator_id = annotator_id
        self.confidence = confidence  # 0.0-1.0

    def to_dict(self) -> Dict:
        return {
            'prompt': self.prompt,
            'completions': [self.completion_a, self.completion_b],
            'preferred': self.preferred,
            'annotator_id': self.annotator_id,
            'confidence': self.confidence,
        }


class PreferenceDataset:
    """Manages and validates a preference pair dataset."""

    def __init__(self):
        self.pairs = []
        self.annotator_stats = defaultdict(lambda: {'count': 0, 'confidence_avg': 0.0})

    def add_pair(self, pair: PreferencePair):
        """Add an annotated pair."""
        self.pairs.append(pair)
        stats = self.annotator_stats[pair.annotator_id]
        stats['count'] += 1
        # Update running average confidence
        n = stats['count']
        stats['confidence_avg'] = (
            (stats['confidence_avg'] * (n - 1) + pair.confidence) / n
        )

    def inter_rater_agreement(self, pair_indices: List[int]) -> float:
        """
        Compute pairwise percent agreement for items annotated by multiple raters.
        Two annotations are compared only if they cover the same prompt and
        completions and come from different annotators.
        """
        agreements = 0
        comparisons = 0
        for i in range(len(pair_indices)):
            for j in range(i + 1, len(pair_indices)):
                p, q = self.pairs[pair_indices[i]], self.pairs[pair_indices[j]]
                same_item = (p.prompt == q.prompt
                             and p.completion_a == q.completion_a
                             and p.completion_b == q.completion_b)
                if same_item and p.annotator_id != q.annotator_id:
                    comparisons += 1
                    if p.preferred == q.preferred:
                        agreements += 1
        # With no overlapping annotations there is nothing to disagree on.
        return agreements / comparisons if comparisons > 0 else 1.0

    def save(self, path: str):
        """Save dataset to JSONL (one pair per line)."""
        with open(path, 'w', encoding='utf-8') as f:
            for pair in self.pairs:
                f.write(json.dumps(pair.to_dict()) + '\n')

    def summary_stats(self) -> Dict:
        """Return dataset statistics."""
        total = len(self.pairs)
        ties = sum(1 for p in self.pairs if p.preferred == 'tie')
        return {
            'total_pairs': total,
            'unique_prompts': len(set(p.prompt for p in self.pairs)),
            'num_annotators': len(self.annotator_stats),
            'annotator_stats': dict(self.annotator_stats),
            # Guard against division by zero on an empty dataset.
            'tie_percentage': ties / total * 100 if total else 0.0,
        }


# Example usage
dataset = PreferenceDataset()
dataset.add_pair(PreferencePair(
    prompt="Write a poem about rain.",
    completion_a="The rain falls down so wet and cold, drops on leaves of gold.",
    completion_b="Pitter-patter, pitter-patter, raindrops kiss the earth.",
    preferred='b',
    annotator_id='rater_001',
    confidence=0.9
))
dataset.add_pair(PreferencePair(
    prompt="Explain quantum entanglement.",
    completion_a="Two particles are entangled when measuring one instantly affects the other.",
    completion_b="Quantum entanglement is a property where particles share a quantum state, and measuring one instantaneously affects the other, violating classical locality assumptions.",
    preferred='b',
    annotator_id='rater_002',
    confidence=0.95
))
# A second rater labels the first pair, so agreement can be measured on it.
dataset.add_pair(PreferencePair(
    prompt="Write a poem about rain.",
    completion_a="The rain falls down so wet and cold, drops on leaves of gold.",
    completion_b="Pitter-patter, pitter-patter, raindrops kiss the earth.",
    preferred='a',
    annotator_id='rater_002',
    confidence=0.6
))
print(dataset.summary_stats())
print(dataset.inter_rater_agreement(list(range(len(dataset.pairs)))))
dataset.save('preference_pairs.jsonl')
This code structures preference pairs, tracks annotator statistics, and saves data in JSONL format suitable for training a reward model.
Key Takeaways
- Preference pairs are the atomic unit of alignment data: a prompt with two completions, labeled to indicate which is preferred.
- Clear annotation guidelines, skilled raters, and inter-rater agreement monitoring are critical for data quality; poor-quality data corrupts the reward model.
- Human annotation is expensive ($0.10–$2+ per pair) but highest quality; synthetic generation is cheap but introduces bias and requires validation.
- Hybrid approaches (60–70 percent synthetic, 30–40 percent human) balance cost and quality for production datasets.
- Prompt diversity across domains, difficulty levels, and styles ensures the reward model generalizes broadly.
- Inter-rater agreement targets of 75–85 percent indicate reliable labels; track agreement by rater, domain, and prompt category to debug issues.
Frequently Asked Questions
How many preference pairs do I need to train a good reward model?
A rule of thumb: 10,000–50,000 pairs for smaller models (7B) and straightforward tasks; 50,000–200,000+ for larger models (70B+) and complex objectives. Quality matters more than quantity: 20,000 high-quality pairs can outperform 100,000 noisy ones. Start with 10,000, evaluate the reward model's generalization, and add more if needed.
Should I use crowdsourcing or in-house annotation?
In-house (domain experts) is highest quality but slow and expensive. Crowdsourcing is fast and cheap but noisy. Best practice: use in-house or specialized vendors for initial setup and high-risk examples (safety-critical pairs, edge cases), then crowdsource the bulk. Managed vendors like Scale AI or Surge AI generally offer better screening and consistency than raw Mechanical Turk.
What should I do if two raters disagree?
Disagreement often signals an ambiguous or tied pair. Options: (1) mark it as a tie and down-weight or exclude it from training; (2) add a third rater and take the majority vote; (3) investigate the cause, since the guidelines may need clarification. Don't delete disagreement records: they are informative about where preferences are genuinely uncertain.
How do I handle position bias in annotation?
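Position bias occurs when raters favor the left or right completion because of its position rather than its quality. Randomize the left-right assignment for each pair (completion A is not always on the left). Also audit your data for position bias: with randomized placement, the left side should win close to 50 percent of non-tied comparisons. A 55 percent left preference over thousands of pairs is a red flag, while the same rate over a hundred pairs is within normal sampling noise. If the bias is real, recheck guidelines and retrain raters.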
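Whether a given left-side win rate is a red flag depends on how many comparisons it is based on. One way to check, sketched here with only the standard library, is an exact two-sided binomial test against a 50/50 null; the function name is illustrative, and the test assumes placement was randomized and ties were excluded.

```python
from math import comb


def position_bias_p_value(left_wins: int, total: int) -> float:
    """Exact two-sided binomial p-value for 'left wins half the time'."""
    if total <= 0 or not 0 <= left_wins <= total:
        raise ValueError("need 0 <= left_wins <= total and total > 0")
    # Under the 50/50 null the distribution is symmetric, so the two-sided
    # p-value is twice the upper tail from the more extreme side.
    k = max(left_wins, total - left_wins)
    tail = sum(comb(total, i) for i in range(k, total + 1))
    return min(1.0, 2 * tail / 2 ** total)


print(position_bias_p_value(55, 100))    # large: consistent with noise
print(position_bias_p_value(550, 1000))  # small: likely a real bias
```

The same 55 percent rate is unremarkable at 100 comparisons and hard to explain by chance at 1,000.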
Can I reuse preference data across projects?
Partially. Preference data is task- and domain-specific. A preference pair from a coding assistant task may not transfer to a safety-critical chatbot. You can reuse prompts (the questions are often generalizable) but should re-annotate pairs with new guidelines. Some labs share preference datasets publicly (e.g., Anthropic's HH-RLHF dataset), but most prefer proprietary, domain-tuned data.
Further Reading
- Scaling Laws for Reward Model Overoptimization: Gao et al. on how optimizing too hard against a learned reward model degrades true quality, and how this scales with reward model size and the amount of preference data.
- Learning to summarize from human feedback: OpenAI's early work on collecting and using preference pairs for summarization.
- Weak-to-strong generalization: OpenAI research on how well a strong model can learn from labels produced by a weaker supervisor, which is relevant to model-generated preference labels.
- Anthropic's Data Best Practices: industry guidance on annotation workflows and quality assurance.