Sampling for Human Review: Scale with Judgment
Human-in-the-loop (HITL) review is essential for LLM quality, but human time is expensive. You cannot label every output; you must sample strategically. This article covers sampling strategies that maximize the information gained per human label: active learning (sample uncertain outputs), stratified sampling (ensure coverage across cohorts), and diversity-based sampling (avoid redundant labels). Done right, strategic sampling lets you improve quality continuously without overwhelming your review budget.
Why sampling is critical for LLMOps
Suppose your production LLM generates a million outputs every month. Reviewing 1% (10k outputs) at 5 minutes per review takes 50k person-minutes, roughly 830 person-hours, or about five full-time reviewers. For most teams that is infeasible. But if you review 0.1% strategically (1k outputs selected by active learning, about 83 person-hours), you might capture 80% of the quality issues while staying within budget.
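The review-budget arithmetic is easy to check in a few lines. This is a minimal sketch; the function name `review_hours` is hypothetical, and the inputs are the figures used above (one million outputs per month, 5 minutes per review).

```python
def review_hours(monthly_outputs: int, sample_rate: float, minutes_per_review: float) -> float:
    """Return the person-hours needed to review a fraction of monthly traffic."""
    n_reviews = monthly_outputs * sample_rate
    return n_reviews * minutes_per_review / 60.0

# 1M outputs per month, 5 minutes per review
full = review_hours(1_000_000, 0.01, 5)    # 1% of traffic
lean = review_hours(1_000_000, 0.001, 5)   # 0.1% of traffic
print(f"1% sample: {full:.0f} h, 0.1% sample: {lean:.0f} h")
```

Plugging in your own traffic volume and review time gives a quick upper bound on the sample rate your team can sustain.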
Strategic sampling accomplishes several goals:
- Labeling efficiency: Get the most information per label (active learning).
- Coverage: Ensure you sample across user cohorts, domains, and output difficulty levels (stratification).
- Cost control: Keep human effort within budget while improving quality.
- Continuous learning: Use labeled samples to retrain evaluators and improve monitoring.
Sampling strategies
1. Random sampling (baseline)
Randomly select N outputs uniformly from production traffic. Simple, unbiased, but inefficient: many random samples are easy/obvious and provide little information.
# Random sampling
import random

def random_sample(outputs, sample_size=100):
    """
    Randomly sample up to sample_size outputs from a list.
    """
    return random.sample(outputs, min(sample_size, len(outputs)))

# Example
outputs = [f"output_{i}" for i in range(10000)]
sample = random_sample(outputs, 100)
print(f"Random sample size: {len(sample)}")
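Because a uniform random sample is unbiased, it is the natural subset for estimating the overall failure rate of production traffic. The sketch below puts a confidence interval around that estimate using the Wilson score interval. The function name `wilson_interval` and the counts (8 failures in 100 reviews) are illustrative assumptions, not figures from a real system.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one labeled sample")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Illustrative counts: 8 of 100 randomly sampled outputs were labeled "bad"
low, high = wilson_interval(failures=8, n=100)
print(f"Failure rate 8.0%, 95% interval {low:.1%} to {high:.1%}")
```

The width of the interval shows how much a random-sample budget buys you: with only 100 labels, the plausible failure rate still spans a wide range.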
2. Active learning: sample by uncertainty
Score each output with your evaluator, then sample the outputs where the evaluator is most uncertain. These outputs tend to sit on the border between good and bad, so labeling them provides the most information for improving thresholds and classifiers.
# Active learning: uncertainty sampling
import numpy as np

def uncertainty_sample(outputs, evaluator, sample_size=100):
    """
    Sample outputs where the evaluator is most uncertain.
    Uncertainty = how close the score is to the decision boundary (0.5):
    1.0 at the boundary, 0.0 at a score of 0 or 1.
    """
    scored = []
    for output in outputs:
        # The prompt is a placeholder; pass the real prompt if you store it
        scores_dict = evaluator.evaluate("dummy_prompt", output)
        score = scores_dict.get("overall", 0.5)
        uncertainty = 1.0 - abs(score - 0.5) * 2  # Closeness to the boundary
        scored.append((output, score, uncertainty))
    # Sort by uncertainty (descending) and take top N
    scored.sort(key=lambda x: x[2], reverse=True)
    return [s[0] for s in scored[:sample_size]]

# Example (OnlineEvaluator is assumed to be defined elsewhere, with an
# evaluate(prompt, output) method returning a dict with an "overall" score in [0, 1])
evaluator = OnlineEvaluator()
sample = uncertainty_sample(outputs, evaluator, 100)
print(f"Uncertainty sample size: {len(sample)}")
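The snippet above depends on an evaluator object that this article does not define. To see the ranking in isolation, the sketch below swaps in a hypothetical `StubEvaluator` that returns hard-coded scores; the class, the helper `uncertainty`, and the scores are all made up for illustration.

```python
class StubEvaluator:
    """Hypothetical stand-in: looks up a fixed score for each output."""
    def __init__(self, scores):
        self._scores = scores

    def evaluate(self, prompt, output):
        return {"overall": self._scores[output]}

def uncertainty(score: float, boundary: float = 0.5) -> float:
    """With the default boundary: 1.0 at 0.5, falling linearly to 0.0 at 0 or 1."""
    return 1.0 - abs(score - boundary) * 2

stub = StubEvaluator({"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45})
ranked = sorted(
    ["a", "b", "c", "d"],
    key=lambda o: uncertainty(stub.evaluate("", o)["overall"]),
    reverse=True,
)
print(ranked)  # ['b', 'd', 'c', 'a']
```

Outputs `b` and `d` sit closest to the 0.5 boundary, so they are sent to reviewers first, while the confidently scored `a` and `c` go last.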
3. Stratified sampling: ensure representation
Divide outputs into strata (e.g., by user cohort, domain, quality tier), then sample proportionally or uniformly from each stratum. This prevents bias toward popular cohorts and ensures you understand quality across all user groups.
# Stratified sampling
import random
from collections import defaultdict

def stratified_sample(outputs_with_metadata, strata_key, sample_size=100):
    """
    Sample uniformly (an equal quota) from each stratum.
    outputs_with_metadata: list of dicts with "output" and "metadata" keys.
    strata_key: key in metadata dict that defines strata (e.g., "user_tier").
    Items whose metadata lacks strata_key are grouped under "unknown".
    """
    # Group by stratum
    strata = defaultdict(list)
    for item in outputs_with_metadata:
        strata[item["metadata"].get(strata_key, "unknown")].append(item)
    if not strata:
        return []
    # Equal quota per stratum; small strata contribute everything they have,
    # so the result can be smaller than sample_size
    sample_per_stratum = sample_size // len(strata)
    sample = []
    for items in strata.values():
        sample.extend(random.sample(items, min(sample_per_stratum, len(items))))
    return sample

# Example
outputs_with_meta = [
    {"output": "...", "metadata": {"user_tier": "free"}},
    {"output": "...", "metadata": {"user_tier": "pro"}},
    # ... more
]
sample = stratified_sample(outputs_with_meta, "user_tier", 100)
print(f"Stratified sample size: {len(sample)}")
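The text mentions proportional as well as uniform allocation, while the code above only implements the uniform case. A minimal sketch of proportional allocation follows, using the largest-remainder method so the quotas add up to the requested total. The function name `proportional_allocation` and the tier sizes are illustrative assumptions.

```python
def proportional_allocation(strata_sizes: dict, sample_size: int) -> dict:
    """Split sample_size across strata in proportion to their size."""
    total = sum(strata_sizes.values())
    if total == 0:
        return {k: 0 for k in strata_sizes}
    exact = {k: sample_size * n / total for k, n in strata_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining slots to the strata with the largest fractional parts
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    # Never request more items than a stratum contains
    return {k: min(v, strata_sizes[k]) for k, v in alloc.items()}

# Illustrative tier sizes
tiers = {"free": 9000, "pro": 900, "enterprise": 100}
quotas = proportional_allocation(tiers, 100)
print(quotas)  # {'free': 90, 'pro': 9, 'enterprise': 1}
```

The example also shows the trade-off: proportional allocation mirrors real traffic but gives the smallest cohort a single label, which is why uniform quotas are often preferred when small cohorts matter.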
4. Diversity sampling: avoid redundant labels
If you have 1000 similar outputs (e.g., all asking "What is X?"), labeling 100 of them is redundant. Diversity sampling uses clustering or embedding distance to select diverse outputs. This avoids wasting labels on similar items.
# Diversity sampling using clustering
import random
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def diversity_sample(outputs, sample_size=100, n_clusters=None):
    """
    Sample diverse outputs using k-means clustering.
    Select the output closest to each cluster centroid.
    n_clusters defaults to sample_size so that every pick can come from its own cluster.
    """
    if not outputs or sample_size <= 0:
        return []
    n_clusters = min(n_clusters or sample_size, len(outputs))
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Embed outputs
    embeddings = model.encode(outputs)
    # Cluster
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(embeddings)
    # For each cluster, select the output closest to the centroid (tracked by index, skipping repeats)
    chosen = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        closest_idx = int(np.argmin(distances))
        if closest_idx not in chosen:
            chosen.append(closest_idx)
    # If the sample is smaller than desired, top up with random unchosen outputs
    if len(chosen) < sample_size:
        taken = set(chosen)
        remaining = [i for i in range(len(outputs)) if i not in taken]
        chosen.extend(random.sample(remaining, min(sample_size - len(chosen), len(remaining))))
    return [outputs[i] for i in chosen[:sample_size]]

# Example
sample = diversity_sample(outputs, 100)
print(f"Diversity sample size: {len(sample)}")
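Clustering is one option; the text also mentions selecting by embedding distance directly. One way to do that is greedy farthest-point (k-center) selection, sketched below in plain Python. It assumes embeddings have already been computed; the function name `farthest_point_sample` and the toy 2-D vectors are made up for illustration.

```python
import math

def farthest_point_sample(embeddings, k, start=0):
    """Greedy k-center selection: repeatedly pick the point farthest from
    everything already chosen. Returns indices into embeddings."""
    if not embeddings or k <= 0:
        return []
    chosen = [start]
    # Distance from each point to its nearest chosen point
    nearest = [math.dist(e, embeddings[start]) for e in embeddings]
    while len(chosen) < min(k, len(embeddings)):
        idx = max(range(len(embeddings)), key=nearest.__getitem__)
        if nearest[idx] == 0:
            break  # only exact duplicates of chosen points remain
        chosen.append(idx)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], math.dist(e, embeddings[idx]))
    return chosen

# Toy embeddings: two tight groups and one outlier
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
picked = farthest_point_sample(points, 3)
print(picked)  # [0, 5, 4]
```

With a budget of three labels, the selection takes one point from each region instead of three near-identical points from the first group.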
5. Composite sampling: combine strategies
In practice, use a weighted ensemble of strategies. For example:
- 40% active learning (sample high-uncertainty outputs)
- 30% stratified by user tier (ensure coverage)
- 20% diversity (avoid redundant labels)
- 10% random (catch tail cases)
# Composite sampling strategy
def composite_sample(outputs, evaluator, metadata, sample_size=100):
    """
    Combine multiple sampling strategies.
    metadata: dict mapping each output to its metadata dict.
    """
    n_uncertainty = int(0.4 * sample_size)
    n_stratified = int(0.3 * sample_size)
    n_diversity = int(0.2 * sample_size)
    n_random = sample_size - n_uncertainty - n_stratified - n_diversity
    # Get samples from each strategy
    uncertainty_sample_list = uncertainty_sample(outputs, evaluator, n_uncertainty)
    stratified_sample_list = stratified_sample(
        # Outputs without metadata fall into an "unknown" tier instead of raising KeyError
        [{"output": o, "metadata": metadata.get(o, {"user_tier": "unknown"})} for o in outputs],
        "user_tier",
        n_stratified,
    )
    diversity_sample_list = diversity_sample(outputs, n_diversity)
    random_sample_list = random_sample(outputs, n_random)
    # Combine and deduplicate while keeping strategy priority order
    # (set() would scramble the order, making the truncation arbitrary)
    combined = (
        uncertainty_sample_list
        + [s["output"] for s in stratified_sample_list]
        + diversity_sample_list
        + random_sample_list
    )
    # Overlap between strategies can leave fewer than sample_size items
    return list(dict.fromkeys(combined))[:sample_size]

# Example
sample = composite_sample(outputs, evaluator, {}, 100)
print(f"Composite sample size: {len(sample)}")
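When strategies overlap, deduplication can leave the combined sample short of its target. One possible remedy, sketched here, is to merge in priority order and then top up from the rest of the pool. The function name `merge_with_backfill` and the toy inputs are hypothetical.

```python
import random

def merge_with_backfill(strategy_samples, pool, sample_size, seed=0):
    """Merge per-strategy samples in priority order, drop duplicates,
    then top up from the remaining pool if the result is short."""
    flat = (o for sample in strategy_samples for o in sample)
    merged = list(dict.fromkeys(flat))[:sample_size]
    if len(merged) < sample_size:
        seen = set(merged)
        rest = [o for o in pool if o not in seen]
        rng = random.Random(seed)  # seeded so review batches are reproducible
        merged += rng.sample(rest, min(sample_size - len(merged), len(rest)))
    return merged

pool = [f"output_{i}" for i in range(50)]
picks = merge_with_backfill(
    [["output_1", "output_2"], ["output_2", "output_3"], ["output_1"]],
    pool,
    sample_size=5,
)
print(picks[:3], len(picks))  # ['output_1', 'output_2', 'output_3'] 5
```

Listing the strategies in priority order means that, if the budget is tight, the highest-value picks survive truncation.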
Implementing a human review workflow
Once you've selected samples, set up an efficient review process:
# Human review workflow
from collections import Counter, defaultdict
from datetime import datetime, timezone

class HumanReviewQueue:
    def __init__(self, db):
        # db is assumed to expose insert(), query(), and update() as used below
        self.db = db

    def create_review_batch(self, sample, reviewer_id, batch_name):
        """
        Create a batch of outputs for a human to review.
        """
        # Timezone-aware UTC; a naive utcnow() would make .timestamp() assume local time
        now = datetime.now(timezone.utc)
        batch = {
            "batch_id": f"{batch_name}_{now.timestamp()}",
            "reviewer_id": reviewer_id,
            "outputs": sample,
            "created_at": now.isoformat(),
            "status": "pending",
            "labels": {},
        }
        self.db.insert("review_batches", batch)
        return batch["batch_id"]

    def submit_labels(self, batch_id, labels):
        """
        Reviewer submits labels for a batch.
        labels: dict of {output_id: label}
        """
        batch = self.db.query("review_batches", filters={"batch_id": batch_id})[0]
        batch["labels"] = labels
        batch["status"] = "completed"
        batch["completed_at"] = datetime.now(timezone.utc).isoformat()
        self.db.update("review_batches", batch)

    def compute_inter_rater_agreement(self, batch_ids):
        """
        Measure agreement between reviewers (if double-labeled).
        Returns {output_id: share of reviewers who chose the most common label}
        for outputs that appear in more than one batch.
        """
        batches = self.db.query("review_batches", filters={"batch_id": {"$in": batch_ids}})
        labels_by_output = defaultdict(list)
        for batch in batches:
            for output_id, label in batch["labels"].items():
                labels_by_output[output_id].append(label)
        agreements = {}
        for output_id, labels in labels_by_output.items():
            if len(labels) > 1:
                top_count = Counter(labels).most_common(1)[0][1]
                agreements[output_id] = top_count / len(labels)
        return agreements

# Example (db is assumed to be an existing storage client)
review_queue = HumanReviewQueue(db)
batch_id = review_queue.create_review_batch(sample, reviewer_id="alice", batch_name="eval_batch_001")
# (Alice reviews and submits labels)
review_queue.submit_labels(batch_id, {"output_0": "good", "output_1": "bad"})
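Raw agreement rates can look high simply because one label dominates. A chance-corrected measure such as Cohen's kappa is a common complement for two reviewers. The sketch below is a plain-Python version; the function name `cohens_kappa` and the example labels are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both reviewers labeled at random with their own label rates
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)

alice = ["good", "good", "bad", "good", "bad", "good"]
bob = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohens_kappa(alice, bob)
print(f"kappa = {kappa:.2f}")  # kappa = 0.67
```

Here the reviewers agree on 5 of 6 items (83%), but after correcting for the 50% agreement expected by chance, kappa is about 0.67.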
Handling disagreement and ambiguity
When multiple reviewers label the same output differently, or when a single reviewer is uncertain, escalate for consensus:
# Disagreement resolution
from collections import defaultdict

def identify_disagreements(labeled_samples):
    """
    Find outputs labeled differently by multiple reviewers.
    """
    # Group labels by output in a single pass
    labels_by_output = defaultdict(list)
    for s in labeled_samples:
        labels_by_output[s["output_id"]].append(s["label"])
    return {
        output_id: labels
        for output_id, labels in labels_by_output.items()
        if len(set(labels)) > 1
    }

# Example
labeled = [
    {"output_id": "out_1", "reviewer": "alice", "label": "good"},
    {"output_id": "out_1", "reviewer": "bob", "label": "bad"},
    {"output_id": "out_2", "reviewer": "alice", "label": "good"},
]
disagreements = identify_disagreements(labeled)
print(f"Disagreements: {disagreements}")
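Finding disagreements is only the first half; the escalation step can be sketched as a majority vote that hands ties to an adjudicator. The function name `resolve_labels` and the `min_margin` parameter are assumptions made for this example.

```python
from collections import Counter, defaultdict

def resolve_labels(labeled_samples, min_margin=1):
    """Return (resolved, escalated): majority labels where one label leads by
    at least min_margin votes, and output_ids that need adjudication."""
    votes = defaultdict(list)
    for s in labeled_samples:
        votes[s["output_id"]].append(s["label"])
    resolved, escalated = {}, []
    for output_id, labels in votes.items():
        ranked = Counter(labels).most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= min_margin:
            resolved[output_id] = ranked[0][0]
        else:
            escalated.append(output_id)
    return resolved, escalated

votes_example = [
    {"output_id": "out_1", "reviewer": "alice", "label": "good"},
    {"output_id": "out_1", "reviewer": "bob", "label": "bad"},
    {"output_id": "out_2", "reviewer": "alice", "label": "good"},
    {"output_id": "out_3", "reviewer": "alice", "label": "bad"},
    {"output_id": "out_3", "reviewer": "bob", "label": "bad"},
    {"output_id": "out_3", "reviewer": "carol", "label": "good"},
]
resolved, escalated = resolve_labels(votes_example)
print(resolved, escalated)  # {'out_2': 'good', 'out_3': 'bad'} ['out_1']
```

Raising `min_margin` makes the process stricter: with `min_margin=2`, the 2-to-1 vote on `out_3` would be escalated as well.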
Feeding labels back into training
Labels from human review should improve your evaluators and models:
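# Use human labels to retrain evaluators
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def retrain_evaluator(labeled_samples, old_evaluator, model_type="reward"):
    """
    Retrain an evaluator using newly labeled samples.
    """
    prompts = [s["prompt"] for s in labeled_samples]
    outputs = [s["output"] for s in labeled_samples]
    labels = [1 if s["label"] == "good" else 0 for s in labeled_samples]
    if model_type != "reward":
        raise ValueError(f"Unsupported model_type: {model_type}")
    # Fine-tune a reward model on the new labels
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    # Assumption: output_dir is a placeholder path; set the training arguments for your setup
    training_args = TrainingArguments(output_dir="reward_model_checkpoints")
    # train_dataset is elided here: build it by tokenizing the prompts and outputs
    # together with labels, and replace the ... before enabling trainer.train()
    trainer = Trainer(model, training_args, train_dataset=...)
    # trainer.train()
    return model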
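Fine-tuning is not the only way to use new labels. A lighter-weight option is to recalibrate the evaluator's decision threshold so that it agrees with human reviewers as often as possible. This is a minimal sketch; the function name `best_threshold`, the candidate grid, and the scores are illustrative assumptions.

```python
def best_threshold(scores, human_labels, candidates=None):
    """Pick the score cutoff that agrees most often with human labels.
    scores: evaluator scores in [0, 1]; human_labels: 1 = good, 0 = bad."""
    if len(scores) != len(human_labels) or not scores:
        raise ValueError("need equal-length, non-empty inputs")
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

    def accuracy(threshold):
        hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, human_labels))
        return hits / len(scores)

    # max() returns the first (lowest) candidate among ties
    return max(candidates, key=accuracy)

# Made-up evaluator scores paired with human labels
eval_scores = [0.92, 0.81, 0.66, 0.58, 0.41, 0.30]
human = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(eval_scores, human)
print(f"Best threshold: {threshold:.2f}")  # Best threshold: 0.60
```

In this toy data the default cutoff of 0.5 would accept the output scored 0.58 that reviewers rejected; moving the threshold to 0.6 removes that error.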
Sampling cost and ROI
Track the ROI of your human review program:
# Sampling ROI analysis
def compute_sampling_roi(labeled_samples, quality_improvement):
    """
    Compute return on investment of human review.
    """
    review_cost_per_label = 5  # USD
    total_cost = len(labeled_samples) * review_cost_per_label
    if total_cost == 0:
        raise ValueError("No labeled samples: ROI is undefined")
    # Assume quality improvement translates to reduced user churn;
    # the 100000 multiplier is a placeholder to replace with your own estimate
    churn_reduction_value = quality_improvement * 100000  # USD value of reduced churn
    roi = (churn_reduction_value - total_cost) / total_cost * 100
    print(f"Review cost: ${total_cost}, Value gained: ${churn_reduction_value}, ROI: {roi:.1f}%")
    return roi
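A worked example makes the formula concrete. The sketch below separates the ROI calculation from the churn assumption and adds a break-even check; the function names and the dollar figures are illustrative only.

```python
def roi_percent(n_labels: int, cost_per_label: float, value_gained: float) -> float:
    """ROI as a percentage: (value - cost) / cost * 100."""
    total_cost = n_labels * cost_per_label
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (value_gained - total_cost) / total_cost * 100

def break_even_value(n_labels: int, cost_per_label: float) -> float:
    """Value the review program must generate for ROI to reach 0%."""
    return n_labels * cost_per_label

# Illustrative numbers: 1,000 labels at $5 each, $8,000 of estimated value
roi = roi_percent(1000, 5, 8000)
floor = break_even_value(1000, 5)
print(f"ROI: {roi:.1f}%, break-even value: ${floor:,.0f}")  # ROI: 60.0%, break-even value: $5,000
```

The break-even figure is usually the easier number to defend, since it only requires knowing your costs, not estimating the value of reduced churn.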
Key Takeaways
- Strategic sampling makes human review cost-effective: a well-chosen 0.1-0.5% of production data can surface most quality issues, though the exact share depends on your traffic and evaluator.
- Active learning (uncertainty sampling) prioritizes borderline outputs that provide maximum information.
- Stratified and diversity sampling ensure you learn across all user cohorts and avoid redundant labels.
- Combine strategies: active learning + stratification + diversity covers most use cases efficiently.
- Feed labeled data back into evaluator training and model fine-tuning for continuous improvement.
Frequently Asked Questions
How many samples should I review per month?
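It depends on your traffic volume, quality SLA, and budget. At high volume, 0.1-0.5% of traffic is a reasonable starting point (0.5% of 100k monthly requests is 500 outputs); at lower volume or for high-stakes products, 1-5% may be affordable. Scale up if reviews keep surfacing new issues.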
Should I use consensus labels from multiple reviewers?
For high-stakes decisions (safety, fairness), yes. For routine quality assessment, one reviewer per output is sufficient if they're well-trained. Double-label 10-20% to measure agreement.
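One way to pick the 10-20% overlap is to hash each output ID, so the same outputs are always flagged for a second reviewer no matter who runs the selection. This is a sketch under assumptions: the function name `needs_second_review`, the 15% rate, and the ID format are all hypothetical.

```python
import hashlib

def needs_second_review(output_id: str, overlap_rate: float = 0.15) -> bool:
    """Deterministically flag roughly overlap_rate of outputs for a second reviewer."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < overlap_rate

ids = [f"out_{i}" for i in range(10_000)]
share = sum(needs_second_review(i) for i in ids) / len(ids)
print(f"Double-labeled share: {share:.1%}")
```

Because the decision depends only on the ID, it is reproducible and needs no stored state, unlike a call to a random number generator.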
How do I reduce labeling cost?
Use weak labels (binary yes/no instead of detailed feedback), outsource to domain experts who review faster, or use semi-automated labels (auto-label easy cases, human review hard ones).
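The semi-automated option can be sketched as a simple router: confident evaluator scores get an automatic label, and everything in between goes to a human. The function name `route_for_labeling` and the 0.2 and 0.8 cutoffs are assumptions to tune against your own agreement data.

```python
def route_for_labeling(scored_outputs, low=0.2, high=0.8):
    """Auto-label confident cases; send the rest to human review.
    scored_outputs: iterable of (output_id, evaluator score in [0, 1])."""
    auto, human = {}, []
    for output_id, score in scored_outputs:
        if score >= high:
            auto[output_id] = "good"
        elif score <= low:
            auto[output_id] = "bad"
        else:
            human.append(output_id)
    return auto, human

scored = [("out_1", 0.95), ("out_2", 0.55), ("out_3", 0.05), ("out_4", 0.35)]
auto_labels, human_queue = route_for_labeling(scored)
print(auto_labels)   # {'out_1': 'good', 'out_3': 'bad'}
print(human_queue)   # ['out_2', 'out_4']
```

It is worth spot-checking a random slice of the auto-labeled cases as well, since the evaluator can be confidently wrong.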
Can active learning hurt by creating biased samples?
Yes; uncertainty samples are often edge cases or out-of-distribution. Balance with stratification and random sampling to ensure your labeled set represents real production traffic.
How often should I retrain my evaluators?
Monthly at minimum. Move to weekly if you discover systematic quality issues or need to adapt faster to changes in user behavior.