A/B Testing Prompts: Compare and Measure Improvements
A/B testing (also called split testing or controlled experiments) is the practice of running two or more prompt variants simultaneously against real or realistic traffic, measuring outcomes, and determining which variant performs best. Without A/B testing, you cannot know if a prompt change actually improves performance—your intuition might be wrong, or improvement might come from confounding factors (time of day, user segment, seasonal trends).
A/B testing is the gold standard for prompt validation because it isolates the effect of a single prompt change. You split traffic 50/50 (or some other ratio) between a baseline (existing prompt) and a variant (new prompt), measure outcomes, and use statistics to determine if differences are real or due to noise.
A/B Test Design: Five Steps
Step 1: Define the baseline and variant. Specify exactly which prompt goes into each arm.
# experiment: customer-support-v2.0.0-test
experiment_id: "cust-support-2.0.0-001"
created_by: "[email protected]"
created_at: "2026-06-01T10:00:00Z"
baseline:
  prompt_name: "customer-support"
  version: "1.9.0"
  description: "Current production prompt"
variant:
  prompt_name: "customer-support"
  version: "2.0.0-rc.1"
  description: "Added step-by-step reasoning for refund decisions"
Step 2: Choose outcome metrics. Define what "success" looks like. Examples: customer satisfaction rating, refund approval accuracy, response time, thumbs-up/down votes.
class ExperimentMetrics:
    """Define metrics for an A/B test."""

    metrics = {
        "customer_satisfaction": {
            "definition": "1-5 rating after interaction",
            "aggregation": "mean",
            "threshold": 4.0  # mean rating of 4.0 or higher is good
        },
        "false_refund_rate": {
            "definition": "% of refunds later disputed",
            "aggregation": "mean",  # a rate is a proportion, not a sum
            "threshold": 0.05  # < 5% is good
        },
        "response_time_ms": {
            "definition": "Inference time in milliseconds",
            "aggregation": "median",
            "threshold": 2000  # median under 2000 ms is good
        },
        "user_votes": {
            "definition": "Thumbs-up / thumbs-down after interaction",
            "aggregation": "binomial",
            "threshold": 0.65  # > 65% thumbs-up
        }
    }
Step 3: Determine sample size. How many users/queries do you need to detect a meaningful difference with confidence?
import math

from scipy import stats

def calculate_sample_size(
    baseline_rate: float,
    effect_size: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Calculate the sample size needed to detect a change in a success rate.

    Args:
        baseline_rate: current success rate (e.g., 0.85 for 85%)
        effect_size: minimum meaningful change in the rate
            (e.g., 0.05 for 5 percentage points)
        alpha: type I error rate (false positive; default 0.05 = 5%)
        power: 1 - type II error rate, i.e. the probability of detecting
            a real effect (default 0.80 = 80%)

    Returns:
        Sample size per arm (total = n * 2)
    """
    # Cohen's h effect size for proportions
    h = 2 * (math.asin(math.sqrt(baseline_rate + effect_size)) -
             math.asin(math.sqrt(baseline_rate)))
    # Normal approximation for a two-sided test
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # The factor of 2 accounts for comparing two independent arms
    n = 2 * ((z_alpha + z_beta) / h) ** 2
    return int(math.ceil(n))

# Example: customer satisfaction
# Baseline: 4.1/5 (mean rating)
# Desired improvement: 0.2 points (4.1 -> 4.3)
# We need ~318 users per arm, ~636 total
baseline_mean = 4.1
desired_mean = 4.3
std_dev = 0.9

# Use effect size (Cohen's d)
cohens_d = (desired_mean - baseline_mean) / std_dev
print(f"Effect size: {cohens_d:.2f}")

# Sample size for a continuous metric
# (1.96 and 0.84 are the z-values for alpha = 0.05 two-sided and power = 0.80)
n = 2 * ((1.96 + 0.84) / cohens_d) ** 2
print(f"Sample size: {math.ceil(n)} per arm")
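Because the required sample size scales with the inverse square of the effect size, halving the minimum detectable difference roughly quadruples the traffic you need. The sketch below illustrates this with the same normal-approximation formula, using only the standard library. The helper name `sample_size_for_mean_difference` is hypothetical, and the standard deviation of 0.9 is the assumed value from the example above.

```python
import math
from statistics import NormalDist

def sample_size_for_mean_difference(delta, std_dev, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided, two-sample comparison of means
    (normal approximation, equal-sized arms)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = delta / std_dev  # Cohen's d
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Assumed standard deviation of 0.9 rating points
for delta in (0.4, 0.2, 0.1):
    n_per_arm = sample_size_for_mean_difference(delta, std_dev=0.9)
    print(f"detect {delta:.1f} points -> {n_per_arm} users per arm")
```

If the traffic needed for a small effect is out of reach, it is usually better to decide upfront that only larger effects matter than to run an underpowered test.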
Step 4: Randomly assign users to arms. Hash the user ID together with the experiment ID so that assignment is deterministic (the same user always sees the same variant).
import hashlib

def get_variant_for_user(user_id: str, experiment_id: str,
                         control_rate: float = 0.5) -> str:
    """
    Deterministically assign a user to control or variant.
    control_rate = 0.5 means 50/50 split.
    """
    # Hash user + experiment to get a stable number in [0, 1).
    # MD5 is used only for bucketing here, not for security.
    hash_input = f"{user_id}:{experiment_id}".encode()
    hash_value = hashlib.md5(hash_input).hexdigest()
    user_hash = int(hash_value, 16) % 10000 / 10000
    if user_hash < control_rate:
        return "control"
    else:
        return "variant"

# Test: same user always gets same variant
user_id = "user_12345"
exp_id = "cust-support-2.0.0-001"
first = get_variant_for_user(user_id, exp_id)
second = get_variant_for_user(user_id, exp_id)
print(first)   # "control" or "variant", depending on the hash
print(second)  # same value as the first call (consistent)
assert first == second
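Before sending real traffic through a hash-based split, it can be worth checking that the split is close to the intended ratio and that it is reproducible. The sketch below does this over synthetic user IDs. `assign_arm` is a hypothetical stand-in that follows the same hashing scheme as the function above, and the ID format is made up for illustration.

```python
import hashlib
from collections import Counter

def assign_arm(user_id: str, experiment_id: str, control_rate: float = 0.5) -> str:
    """Hypothetical stand-in for the assignment function: hash to a bucket in [0, 1)."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 10000
    return "control" if bucket < control_rate else "variant"

# Synthetic user IDs, for illustration only
user_ids = [f"user_{i}" for i in range(10_000)]
counts = Counter(assign_arm(uid, "cust-support-2.0.0-001") for uid in user_ids)
control_share = counts["control"] / len(user_ids)
print(f"control share: {control_share:.3f}")

# Re-running the assignment should reproduce exactly the same arms
repeat = Counter(assign_arm(uid, "cust-support-2.0.0-001") for uid in user_ids)
print("stable:", counts == repeat)
```

A share that drifts far from the configured ratio in logged production assignments usually points to a bug in assignment or logging rather than to chance.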
Step 5: Analyze results using statistics. After collecting enough data, compute p-values and confidence intervals.
import numpy as np
from scipy import stats

def analyze_ab_test(
    control_outcomes: list,
    variant_outcomes: list,
    alpha: float = 0.05
) -> dict:
    """
    Analyze A/B test results using Welch's two-sample t-test.

    Args:
        control_outcomes: list of outcome values for control arm
        variant_outcomes: list of outcome values for variant arm
        alpha: significance level (default 0.05 = 95% confidence)

    Returns:
        dict with t-stat, p-value, confidence interval, and recommendation
    """
    control = np.asarray(control_outcomes, dtype=float)
    variant = np.asarray(variant_outcomes, dtype=float)
    control_mean = control.mean()
    variant_mean = variant.mean()
    diff = variant_mean - control_mean

    # Welch's t-test (does not assume equal variances).
    # Passing variant first makes a positive t mean "variant is higher".
    t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)

    # Confidence interval for the difference, using sample variances (ddof=1)
    # and the Welch-Satterthwaite degrees of freedom
    var_c = control.var(ddof=1) / len(control)
    var_v = variant.var(ddof=1) / len(variant)
    se = np.sqrt(var_c + var_v)
    df = (var_c + var_v) ** 2 / (
        var_c ** 2 / (len(control) - 1) + var_v ** 2 / (len(variant) - 1)
    )
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ci_lower = diff - t_crit * se
    ci_upper = diff + t_crit * se

    # Recommendation: promote only on a significant improvement;
    # otherwise keep the baseline ("revert")
    is_significant = bool(p_value < alpha)
    recommendation = "promote" if is_significant and diff > 0 else "revert"

    return {
        "control_mean": control_mean,
        "variant_mean": variant_mean,
        "improvement": diff,
        "improvement_pct": (diff / control_mean) * 100,
        "t_statistic": t_stat,
        "p_value": p_value,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "is_significant": is_significant,
        "recommendation": recommendation
    }

# Example: satisfaction ratings (tiny sample, for illustration only)
control_ratings = [4.1, 4.3, 3.9, 4.5, 4.0, 4.2, 3.8, 4.4]
variant_ratings = [4.4, 4.6, 4.2, 4.7, 4.3, 4.5, 4.1, 4.8]
results = analyze_ab_test(control_ratings, variant_ratings)
print(f"Improvement: {results['improvement_pct']:.1f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Recommendation: {results['recommendation']}")
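A t-test suits continuous outcomes such as ratings. For a binary outcome like the thumbs-up / thumbs-down votes from Step 2, a two-proportion z-test is a common alternative. The following is a minimal sketch using only the standard library; the function name and the vote counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions (b minus a)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: thumbs-up votes out of rated interactions
# control: 620 of 1000, variant: 670 of 1000
z, p_value = two_proportion_z_test(620, 1000, 670, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The normal approximation behind this test assumes reasonably large counts in each arm; with very few votes an exact test is a safer choice.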
Running Experiments in Production
Integrate A/B testing into your application:
from datetime import datetime, timezone

class ABTestManager:
    def __init__(self, registry, event_logger):
        self.registry = registry          # prompt store exposing fetch(name, version)
        self.event_logger = event_logger  # logs outcomes for analysis

    def get_prompt_for_user(self, user_id: str, prompt_name: str,
                            experiment_id: str) -> str:
        """
        Get the appropriate prompt (control or variant) for this user.
        """
        variant = get_variant_for_user(user_id, experiment_id)
        # Versions match the experiment definition from Step 1
        if variant == "control":
            prompt_version = "1.9.0"       # baseline
        else:
            prompt_version = "2.0.0-rc.1"  # variant
        prompt = self.registry.fetch(prompt_name, prompt_version)

        # Log the assignment
        self.event_logger.log_assignment(
            user_id=user_id,
            experiment_id=experiment_id,
            variant=variant,
            prompt_version=prompt_version,
            timestamp=datetime.now(timezone.utc).isoformat()
        )
        return prompt

    def log_outcome(self, user_id: str, experiment_id: str,
                    outcome_name: str, outcome_value: float):
        """
        Log an outcome (e.g., satisfaction rating) for a user.
        """
        self.event_logger.log_outcome(
            user_id=user_id,
            experiment_id=experiment_id,
            outcome_name=outcome_name,
            outcome_value=outcome_value,
            timestamp=datetime.now(timezone.utc).isoformat()
        )

# Usage in a chatbot
# (registry, event_logger, call_model, user_message, and user_feedback
# are assumed to be provided by your application)
ab_mgr = ABTestManager(registry, event_logger)
system_prompt = ab_mgr.get_prompt_for_user(user_id="user_123",
                                           prompt_name="customer-support",
                                           experiment_id="cust-support-2.0.0-001")
response = call_model(system_prompt, user_message)

# After user rates the interaction
satisfaction = user_feedback["rating"]  # 1-5
ab_mgr.log_outcome(user_id="user_123",
                   experiment_id="cust-support-2.0.0-001",
                   outcome_name="customer_satisfaction",
                   outcome_value=satisfaction)
Common Pitfalls
Peeking: Checking results before collecting the target sample size inflates false positives. Decide the sample size upfront; don't stop early just because results look good.
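To see how much peeking inflates false positives, you can simulate A/A tests, where both arms draw from the same distribution and every "significant" result is a false positive. The sketch below is a simplified simulation under stated assumptions: unit-variance normal outcomes, ten evenly spaced looks, and a fixed z threshold of 1.96. The function name is hypothetical.

```python
import random

def run_aa_test(rng, looks=10, batch=50, z_crit=1.96):
    """Simulate one A/A test (no true difference) with unit-variance outcomes.
    Returns (significant at any look, significant at the final look)."""
    sum_control = sum_variant = 0.0
    n = 0
    hit_at_any_look = False
    z = 0.0
    for _look in range(looks):
        for _i in range(batch):
            sum_control += rng.gauss(0.0, 1.0)
            sum_variant += rng.gauss(0.0, 1.0)
        n += batch
        # z-statistic for the difference in means, known variance of 1 per arm
        z = (sum_variant - sum_control) / n / (2.0 / n) ** 0.5
        if abs(z) > z_crit:
            hit_at_any_look = True
    return hit_at_any_look, abs(z) > z_crit

rng = random.Random(42)
results = [run_aa_test(rng) for _ in range(1000)]
peeking_rate = sum(any_look for any_look, _ in results) / len(results)
fixed_rate = sum(final for _, final in results) / len(results)
print(f"false positives when stopping at the first significant look: {peeking_rate:.1%}")
print(f"false positives when analyzing once at the end: {fixed_rate:.1%}")
```

The single final analysis should land near the nominal 5%, while stopping at the first significant look produces a noticeably higher rate.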
Simpson's Paradox: A variant might win overall but lose in important subgroups. Always stratify by user segment, geography, or query type.
# Stratified analysis: check variant performance per segment
# (control_new, variant_new, etc. are lists of outcomes you have
# already collected for each segment and arm)
segments = {"new_users": [control_new, variant_new],
            "returning_users": [control_returning, variant_returning]}
for segment, (control, variant) in segments.items():
    results = analyze_ab_test(control, variant)
    print(f"{segment}: {results['recommendation']}")
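A small numeric example makes the paradox concrete. In the hypothetical counts below, the variant has a lower success rate in both segments yet a higher rate overall, because most of its traffic comes from new users, who rate interactions more favorably in either arm. All numbers are invented for illustration.

```python
# Hypothetical (successes, total) counts per arm and segment
data = {
    "control": {"new_users": (81, 87), "returning_users": (192, 263)},
    "variant": {"new_users": (234, 270), "returning_users": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(arm):
    """Success rate for an arm with all segments pooled together."""
    successes = sum(s for s, _ in data[arm].values())
    total = sum(t for _, t in data[arm].values())
    return successes / total

for segment in ("new_users", "returning_users"):
    c = rate(*data["control"][segment])
    v = rate(*data["variant"][segment])
    print(f"{segment}: control {c:.1%}, variant {v:.1%}")

print(f"overall: control {overall_rate('control'):.1%}, "
      f"variant {overall_rate('variant'):.1%}")
```

With properly randomized assignment the segment mix should be similar in both arms, so an imbalance this large is itself a sign that assignment or logging needs checking.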
Multiple comparisons: If you test multiple metrics, the chance that at least one of them looks significant purely by chance grows with each additional test. The Bonferroni correction compensates for this: divide alpha by the number of tests.
alpha = 0.05
num_metrics = 4
alpha_corrected = alpha / num_metrics # 0.0125 instead of 0.05
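Applied to a set of results, the correction might look like the sketch below. The p-values are hypothetical, one per metric from Step 2. Note that `false_refund_rate` would pass the uncorrected 0.05 threshold but fails the corrected one.

```python
alpha = 0.05

# Hypothetical p-values, one per metric
p_values = {
    "customer_satisfaction": 0.004,
    "false_refund_rate": 0.030,
    "response_time_ms": 0.200,
    "user_votes": 0.012,
}

alpha_corrected = alpha / len(p_values)  # 0.0125 for four metrics
significant = {name: p < alpha_corrected for name, p in p_values.items()}

for name, p in p_values.items():
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict} at alpha = {alpha_corrected}")
```

Bonferroni is conservative, so with many metrics it helps to name one primary metric upfront and treat the rest as supporting evidence.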
Key Takeaways
- A/B tests isolate the effect of a prompt change by comparing it to a baseline under controlled conditions.
- Design tests upfront: define baseline, variant, metrics, and sample size.
- Use deterministic hashing so that users always see the same variant across sessions.
- Analyze with statistics; report p-values, confidence intervals, and a recommendation.
- Watch for Simpson's Paradox and multiple-comparison problems; stratify results.
Frequently Asked Questions
How long should an A/B test run?
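Until you reach the target sample size, not for a fixed period chosen in advance. Stopping short of the target leaves the test underpowered (a higher Type II error rate), and stopping early because results look good inflates Type I error; running far past the target wastes time. Most tests run 1–4 weeks depending on traffic volume.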
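A rough duration estimate follows from the target sample size and your daily traffic. The sketch below assumes an even split and that every eligible interaction produces an outcome. Rounding up to whole weeks is a common convention for covering weekday and weekend behavior equally, not a requirement. The function names and traffic figures are hypothetical.

```python
import math

def estimate_duration_days(n_per_arm: int, daily_units: int, arms: int = 2) -> int:
    """Days needed to collect n_per_arm observations in each arm,
    assuming traffic is split evenly and every unit yields an outcome."""
    return math.ceil(n_per_arm * arms / daily_units)

def round_up_to_full_weeks(days: int) -> int:
    """Round up to whole weeks so each weekday is represented equally."""
    return math.ceil(days / 7) * 7

# Hypothetical: 1200 observations per arm, 150 rated interactions per day
days = estimate_duration_days(n_per_arm=1200, daily_units=150)
print(f"minimum: {days} days, planned: {round_up_to_full_weeks(days)} days")
```

If only a fraction of users leave a rating, divide the daily traffic by that response rate before estimating.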
What if both arms perform similarly?
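A "draw" is still useful, but read it carefully: a non-significant result means the test did not detect a difference, not that none exists. If the confidence interval for the difference is narrow and sits inside a margin you consider negligible, the change is unlikely to hurt and you can ship it on other grounds (cost, maintainability). If the interval is wide, the test was underpowered; re-run with a larger sample, sized for a smaller minimum detectable effect.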
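One way to separate a genuine draw from an underpowered test is to compare the confidence interval for the difference (variant minus control) against a margin you would treat as practically irrelevant. The sketch below is a simple decision rule under that assumption; the function name, labels, and margin are hypothetical and should be agreed on before the test starts.

```python
def interpret_difference(ci_lower: float, ci_upper: float, margin: float) -> str:
    """Interpret a confidence interval for (variant - control).
    margin is the largest difference you would consider practically irrelevant."""
    if ci_lower > 0:
        return "variant better"
    if ci_upper < 0:
        return "variant worse"
    if -margin < ci_lower and ci_upper < margin:
        return "equivalent within margin"
    return "inconclusive"

# Narrow interval around zero: the arms are practically the same
print(interpret_difference(-0.03, 0.04, margin=0.10))
# Wide interval around zero: too little data to tell
print(interpret_difference(-0.25, 0.30, margin=0.10))
```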
Can I run multiple experiments simultaneously?
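Yes, but keep the assignments independent so that one experiment does not confound another. Hashing the user ID together with the experiment ID (test 1: prompt A vs. B; test 2: parameter X vs. Y) gives each experiment its own split. Independent assignment does not rule out interaction effects, so if two changes might influence each other, analyze the combinations as well.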
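Whether two concurrent experiments are assigned independently can be checked by cross-tabulating the arms: with two independent 50/50 splits, each of the four combinations should hold roughly a quarter of users. The sketch below uses synthetic user IDs, hypothetical experiment IDs, and a hypothetical `assign_arm` helper that hashes the user ID together with the experiment ID.

```python
import hashlib
from collections import Counter

def assign_arm(user_id: str, experiment_id: str) -> str:
    """Hypothetical 50/50 assignment keyed on both user and experiment ID."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 10000 < 5000 else "variant"

user_ids = [f"user_{i}" for i in range(10_000)]  # synthetic IDs
cells = Counter(
    (assign_arm(uid, "prompt-test"), assign_arm(uid, "temperature-test"))
    for uid in user_ids
)
shares = {cell: count / len(user_ids) for cell, count in cells.items()}
for cell, share in sorted(shares.items()):
    print(cell, f"{share:.3f}")
```

Hashing the user ID alone would put every user in the same arm of every experiment, which makes the effects of the two changes impossible to separate.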
Should I use Bayesian or frequentist statistics?
Both work. Frequentist (t-tests, p-values) is standard and interpretable. Bayesian allows you to incorporate prior beliefs (e.g., "variant is probably better"). Pick whichever your team understands.
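For a binary metric, a minimal Bayesian analysis can be sketched with a Beta-Binomial model: place a Beta prior on each arm's success rate, update it with the observed counts, and estimate the probability that the variant's rate exceeds the control's. The example below assumes uniform Beta(1, 1) priors and hypothetical thumbs-up counts, and uses only the standard library.

```python
import random

def prob_variant_beats_control(succ_c, n_c, succ_v, n_v, draws=20_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) under
    independent Beta(1, 1) priors on each arm's success rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + successes, 1 + failures)
        rate_c = rng.betavariate(1 + succ_c, 1 + n_c - succ_c)
        rate_v = rng.betavariate(1 + succ_v, 1 + n_v - succ_v)
        wins += rate_v > rate_c
    return wins / draws

# Hypothetical thumbs-up counts: control 620 of 1000, variant 670 of 1000
prob = prob_variant_beats_control(620, 1000, 670, 1000)
print(f"P(variant > control) = {prob:.3f}")
```

An informative prior would replace the two 1s with values reflecting earlier experiments; whichever prior you use, fix it before looking at the data.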
How do I handle incomplete data (users who don't rate their experience)?
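One option is to analyze only completers; another is to treat non-responses as neutral (mid-scale) ratings. Either choice can bias the result if users who skip the rating differ from those who respond, so document your approach and report dropout rates per arm; noticeably higher dropout in one arm suggests a problem.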
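Reporting both treatments side by side shows how sensitive the conclusion is to missing ratings. In the sketch below, `None` marks a user who did not rate, the neutral value of 3 is the assumed mid-point of the 1-5 scale, and all ratings are hypothetical.

```python
from statistics import mean

NEUTRAL_RATING = 3  # assumed mid-point of the 1-5 scale

def summarize_arm(ratings):
    """ratings: list of 1-5 scores, with None for users who did not rate."""
    observed = [r for r in ratings if r is not None]
    imputed = [NEUTRAL_RATING if r is None else r for r in ratings]
    return {
        "response_rate": len(observed) / len(ratings),
        "completers_mean": mean(observed),
        "neutral_imputed_mean": mean(imputed),
    }

# Hypothetical ratings
control = summarize_arm([5, 4, None, 3, 4, None, 4, 5])
variant = summarize_arm([5, 5, None, None, 4, None, 5, None])
for name, summary in (("control", control), ("variant", variant)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Here the variant looks clearly better among completers, but the gap disappears once non-responses are imputed, and the variant's lower response rate is itself worth investigating.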
Further Reading
- A/B Testing Best Practices from Google — Real-world lessons from large-scale experiments.
- Designing Experiments — Statistical foundations for A/B testing.
- Sequential Testing (SPRT) — Stop tests early without sacrificing validity.
- MLflow: Experiment Tracking for ML — A tool for logging and comparing experiments at scale.