Designing Prompt Experiments: Hypothesis-Driven Testing
Designing a prompt experiment means applying the scientific method to prompt iteration. Instead of trying random variations and hoping one is better, you form a hypothesis ("adding reasoning steps will improve accuracy"), design a test, collect data, and analyze the results. Compared with trial-and-error, hypothesis-driven experimentation wastes fewer iterations and produces results that are reproducible and defensible.
A well-designed prompt experiment has five elements: a clear hypothesis, a baseline prompt, a variant prompt that tests the hypothesis, defined success metrics, and a sample size target. If any one of these is missing, you have a hunch, not an experiment.
The Five Elements of Prompt Experiment Design
Element 1: The Hypothesis states what you expect to improve and why. Write it in the form: "If I [change], then [outcome] will [improve], because [mechanism]."
Hypothesis 1: "If I add step-by-step reasoning instructions to the system prompt, then refund approval accuracy will improve by 5 percentage points (for example, from 75% to 80%), because explicit reasoning reduces hasty decisions."
Hypothesis 2: "If I reduce the system prompt from 400 to 200 words by removing redundant examples, then inference latency will decrease 15%, because shorter context reduces token processing."
Hypothesis 3: "If I add a safety guardrail warning about hallucinations, then false-fact rate will decrease 10%, because explicit warnings improve factual caution."
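A hypothesis must be testable (not "the variant will feel more natural"), specific (a stated target such as 5 percentage points, not "better"), and it should articulate a mechanism (because...). State whether the target is absolute or relative: "5%" on a 75% baseline can mean 80% (absolute) or 78.75% (relative).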
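One way to keep hypotheses in this form is to store them as structured records, so that a missing target or mechanism is caught before the experiment starts. The sketch below is illustrative only: the `Hypothesis` class and its field names are assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record for a pre-specified prompt hypothesis."""
    change: str          # what you will change in the prompt
    metric: str          # the outcome you expect to move
    target_delta: float  # expected absolute change (0.05 = 5 points)
    mechanism: str       # why the change should move the metric

    def __post_init__(self):
        # Reject hypotheses that are not specific or lack a mechanism.
        if self.target_delta == 0:
            raise ValueError("target_delta must be non-zero (be specific)")
        if not self.mechanism.strip():
            raise ValueError("mechanism must not be empty")

    def statement(self) -> str:
        """Render the hypothesis in the If/then/because form."""
        direction = "improve" if self.target_delta > 0 else "decrease"
        points = abs(self.target_delta) * 100
        return (
            f"If I {self.change}, then {self.metric} will {direction} "
            f"by {points:g} percentage points, because {self.mechanism}."
        )


h1 = Hypothesis(
    change="add step-by-step reasoning instructions to the system prompt",
    metric="refund approval accuracy",
    target_delta=0.05,
    mechanism="explicit reasoning reduces hasty decisions",
)
print(h1.statement())
```

Freezing the dataclass is a deliberate choice here: once written down, the hypothesis should not be edited after results arrive.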
Element 2: The Baseline is the current production prompt or a well-established variant. You measure everything against it. The baseline is immutable; you don't optimize it during the experiment.
baseline:
  name: "customer-support"
  version: "1.9.0"
  description: "Current production (live as of 2026-05-01)"
  created_at: "2026-04-01T09:00:00Z"
  context: "Handles refund requests, complaints, escalations"
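To make "immutable" checkable, one option is to record a fingerprint of the baseline prompt text when the experiment starts and verify it again before analysis. This is a minimal sketch; `fingerprint`, `verify_baseline`, and the sample prompt text are hypothetical.

```python
import hashlib


def fingerprint(prompt_text: str) -> str:
    """Return a stable SHA-256 hex digest of the prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def verify_baseline(prompt_text: str, recorded: str) -> bool:
    """True if the prompt still matches the fingerprint recorded at start."""
    return fingerprint(prompt_text) == recorded


# Placeholder text standing in for the real v1.9.0 prompt.
baseline_prompt = "You are a customer support agent. Handle refund requests."
recorded = fingerprint(baseline_prompt)  # store alongside the experiment config

print(verify_baseline(baseline_prompt, recorded))        # unchanged prompt
print(verify_baseline(baseline_prompt + " ", recorded))  # edited prompt
```

Any edit to the text, even a trailing space, changes the digest, so a silent mid-experiment tweak to the baseline shows up as a failed check.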
Element 3: The Variant is a single, focused change. Change one thing at a time (add reasoning, OR simplify, OR add guardrails). Multi-variable changes make it impossible to know what caused improvement.
variant:
  name: "customer-support"
  version: "2.0.0-rc.1"
  description: "Baseline + step-by-step reasoning in refund decision logic"
  change_description: |
    Added a 'reasoning' section to the system prompt:
    "Explain your reasoning for each refund decision in 2-3 sentences
    before announcing the decision. Consider: item condition, return window,
    customer history, and return reason."
  created_at: "2026-05-20T14:00:00Z"
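A quick way to confirm that the variant really is a single, focused change is to diff it against the baseline. The sketch below uses Python's standard `difflib`; the two prompt strings are shortened placeholders, not the real prompts.

```python
import difflib

baseline_prompt = """You are a customer support agent.
Decide whether to approve each refund request.
Be polite and concise."""

variant_prompt = """You are a customer support agent.
Decide whether to approve each refund request.
Explain your reasoning for each refund decision in 2-3 sentences before announcing the decision.
Be polite and concise."""

diff = list(difflib.unified_diff(
    baseline_prompt.splitlines(),
    variant_prompt.splitlines(),
    fromfile="customer-support:v1.9.0",
    tofile="customer-support:v2.0.0-rc.1",
    lineterm="",
))
print("\n".join(diff))

# Count changed lines, skipping the '---' / '+++' file headers.
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]
print(f"{len(added)} line(s) added, {len(removed)} line(s) removed")
```

If the diff shows edits in several unrelated places, that is a hint the variant bundles more than one change and should be split into separate experiments.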
Element 4: Success Metrics are the outcomes you'll measure. Define them before running the experiment. Examples: accuracy, latency, cost, user satisfaction.
class ExperimentMetrics:
    """Define how success is measured."""

    metrics = {
        # Primary metric: what the hypothesis directly claims
        "refund_approval_accuracy": {
            "definition": "% of refund approvals that are correct (not disputed later)",
            "aggregation": "mean",
            "target_improvement": 0.05,  # 5-point absolute improvement
            "direction": "higher"
        },
        # Secondary metrics: what should not regress
        "customer_satisfaction": {
            "definition": "Post-interaction rating (1-5 scale)",
            "aggregation": "mean",
            "target_floor": 4.0,  # Must stay >= 4.0
            "direction": "higher"
        },
        "inference_latency_ms": {
            "definition": "Time from input to model output (ms)",
            "aggregation": "p99",
            "target_ceiling": 2500,  # Must stay <= 2500 ms
            "direction": "lower"
        }
    }
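The `aggregation` field only helps if something applies it. As a rough sketch, the helpers below aggregate raw observations by `mean` or `p99` and compare the result to a limit. The function names, the nearest-rank percentile definition, and the sample data are assumptions for illustration.

```python
from statistics import mean


def aggregate(values: list[float], how: str) -> float:
    """Aggregate raw observations using the named method."""
    if how == "mean":
        return mean(values)
    if how == "p99":
        # Nearest-rank percentile: smallest value with >= 99% of data at or below it.
        ordered = sorted(values)
        rank = (99 * len(ordered) + 99) // 100  # integer ceiling of 0.99 * n
        return ordered[rank - 1]
    raise ValueError(f"unknown aggregation: {how}")


def within_limit(value: float, limit: float, direction: str) -> bool:
    """'higher' metrics must stay >= limit; 'lower' metrics must stay <= limit."""
    return value >= limit if direction == "higher" else value <= limit


# Made-up observations for the two guardrail metrics.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
latencies_ms = [900 + 10 * i for i in range(100)]  # 900, 910, ..., 1890

csat = aggregate(ratings, "mean")
p99 = aggregate(latencies_ms, "p99")
print(csat, within_limit(csat, 4.0, "higher"))
print(p99, within_limit(p99, 2500, "lower"))
```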
Element 5: Sample Size is the number of users or interactions you need to detect the target improvement with statistical confidence. Calculate upfront; don't decide during the experiment.
import math

from scipy import stats


def calculate_sample_size_for_proportion(
    baseline_rate: float,
    target_improvement: float,
    alpha: float = 0.05,
    beta: float = 0.20
) -> int:
    """
    Calculate the per-arm sample size for comparing two proportions
    (two-sided test, equal arm sizes, arcsine approximation).

    Args:
        baseline_rate: baseline success rate (e.g., 0.75)
        target_improvement: desired absolute improvement (e.g., 0.05 for 5 points)
        alpha: false positive rate (default 0.05 = 5%)
        beta: false negative rate (default 0.20 = 20%, i.e. 80% power)

    Returns:
        Sample size per arm
    """
    variant_rate = baseline_rate + target_improvement
    # Effect size (Cohen's h for proportions)
    h = 2 * (math.asin(math.sqrt(variant_rate)) - math.asin(math.sqrt(baseline_rate)))
    # Critical values
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(1 - beta)
    # Sample size per arm; the factor of 2 accounts for sampling noise in both arms
    n = 2 * ((z_alpha + z_beta) / h) ** 2
    return int(math.ceil(n))


# Example: refund accuracy
baseline_accuracy = 0.75
target_improvement = 0.05  # 5-point absolute improvement (75% -> 80%)
n = calculate_sample_size_for_proportion(baseline_accuracy, target_improvement)
print(f"Sample size needed: {n} interactions per arm, {n*2} total")
# Output: Sample size needed: 1092 interactions per arm, 2184 total
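Because the required sample grows with the inverse square of the effect size, a smaller target improvement gets expensive quickly. The sketch below tabulates the per-arm sample size for a few targets. It assumes a two-sided test with equal arm sizes, includes the factor of 2 for two independent arms, and uses the standard library's `NormalDist` rather than SciPy; the helper name `per_arm_n` is made up for this example.

```python
import math
from statistics import NormalDist


def per_arm_n(p0: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for two proportions via Cohen's h (arcsine approximation)."""
    h = 2 * (math.asin(math.sqrt(p0 + delta)) - math.asin(math.sqrt(p0)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Factor of 2: both arms contribute sampling noise to the difference.
    return math.ceil(2 * ((z_alpha + z_power) / h) ** 2)


for delta in (0.02, 0.05, 0.10):
    n = per_arm_n(0.75, delta)
    print(f"target +{delta:.0%}: {n} per arm, {2 * n} total")
```

For a 75% baseline and a 5-point target this gives 1092 per arm. Running the table before committing to a target makes the cost of chasing a small effect explicit.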
End-to-End Experiment Design Document
Create a document before running the experiment:
# Prompt Experiment: Step-by-Step Reasoning for Refunds
## Hypothesis
If I add step-by-step reasoning instructions to the refund decision prompt,
then refund approval accuracy will improve by 5 percentage points (75% to 80%),
because explicit reasoning reduces hasty and context-unaware decisions.
## Baseline
- Prompt: customer-support:v1.9.0
- Live since: 2026-04-01
- Current accuracy: 75% (non-disputed refunds)
- Sample size needed: 1092 per arm (80% power, two-sided alpha 0.05)
## Variant
- Prompt: customer-support:v2.0.0-rc.1
- Change: Add "explain your reasoning before deciding" instruction
- Diff: [show exact prompt changes]
## Success Metrics
| Metric | Baseline | Target | Direction |
|--------|----------|--------|-----------|
| Refund approval accuracy | 75% | >= 80% | Higher |
| Customer satisfaction | 4.1/5 | >= 4.0/5 | Higher |
| Inference latency p99 | 2200 ms | <= 2500 ms | Lower |
## Experimental Design
- Traffic split: 50/50 (control vs. variant)
- Duration: Until 2184 interactions collected (1092 per arm; expected 1-2 weeks)
- Success criteria: Accuracy improves by 5+ points AND satisfaction stays >= 4.0 AND p99 latency stays <= 2500 ms
- Rollback criteria: Accuracy drops OR satisfaction drops > 0.1 points OR latency exceeds 3000 ms p99
## Analysis Plan
- Primary: two-proportion z-test for approval accuracy
- Secondary: logistic regression with an arm-by-segment interaction term to check for segment effects (new vs. returning users)
- Decision rule: Promote if accuracy improvement p < 0.05 AND other metrics green
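The decision rule and rollback criteria in the design document can be written down as a small function, so the call is mechanical once the data is in. This is only a sketch: the `decide` function, its argument names, the "hold" outcome, and the reading of "accuracy drops" as any decrease versus control are assumptions. The guardrail thresholds are taken from the Success Metrics table.

```python
def decide(
    control_acc: float,
    variant_acc: float,
    p_value: float,
    control_csat: float,
    variant_csat: float,
    variant_p99_ms: float,
) -> str:
    """Return 'rollback', 'promote', or 'hold' from the pre-specified criteria."""
    # Rollback criteria: any one of these ends the experiment for the variant.
    if (
        variant_acc < control_acc
        or control_csat - variant_csat > 0.1
        or variant_p99_ms > 3000
    ):
        return "rollback"
    # Promote: target met, statistically significant, guardrails green.
    guardrails_green = variant_csat >= 4.0 and variant_p99_ms <= 2500
    target_met = variant_acc - control_acc >= 0.05
    if target_met and p_value < 0.05 and guardrails_green:
        return "promote"
    return "hold"


print(decide(0.75, 0.81, 0.003, 4.1, 4.1, 2300))  # promote
print(decide(0.75, 0.77, 0.20, 4.1, 4.1, 2300))   # hold
print(decide(0.75, 0.81, 0.003, 4.1, 3.9, 2300))  # rollback
```

Writing the rule as code before the experiment starts removes the temptation to reinterpret the criteria after seeing the numbers.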
Running the Experiment
Implement tracking at inference time:
import hashlib
import time
from datetime import datetime, timezone
from typing import Optional

import anthropic


class ExperimentRunner:
    def __init__(self, registry, event_logger):
        self.registry = registry
        self.logger = event_logger

    def run_inference_with_tracking(
        self,
        user_id: str,
        prompt_name: str,
        experiment_id: str,
        user_input: str,
        metadata: Optional[dict] = None
    ) -> str:
        """
        Run inference as part of an experiment.
        Track which prompt variant was served and how the call performed.
        """
        # Assign user to control or variant
        assignment = self._get_assignment(user_id, experiment_id)
        prompt_version = assignment["variant_version"]
        # Fetch the prompt
        prompt = self.registry.fetch(prompt_name, version=prompt_version)
        # Run inference, timing the call with a monotonic clock
        client = anthropic.Anthropic()
        start_time = time.perf_counter()
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            system=prompt,
            messages=[{"role": "user", "content": user_input}]
        )
        latency_ms = (time.perf_counter() - start_time) * 1000
        # Log the event
        self.logger.log_inference(
            user_id=user_id,
            experiment_id=experiment_id,
            prompt_version=prompt_version,
            assignment=assignment["arm"],
            latency_ms=latency_ms,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            timestamp=datetime.now(timezone.utc).isoformat(),
            metadata=metadata
        )
        return response.content[0].text

    def log_outcome(
        self,
        user_id: str,
        experiment_id: str,
        outcome_name: str,
        outcome_value: float
    ):
        """
        Log an outcome (e.g., accuracy, satisfaction rating).
        Called after the interaction is complete.
        """
        self.logger.log_outcome(
            user_id=user_id,
            experiment_id=experiment_id,
            outcome_name=outcome_name,
            outcome_value=outcome_value,
            timestamp=datetime.now(timezone.utc).isoformat()
        )

    def _get_assignment(self, user_id: str, experiment_id: str) -> dict:
        """Deterministically assign user to control or variant."""
        # Built-in hash() is salted per process for strings, so it is not stable
        # across restarts; a SHA-256 digest keeps each user in the same arm.
        digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < 50:
            return {"arm": "control", "variant_version": "1.9.0"}
        else:
            return {"arm": "variant", "variant_version": "2.0.0-rc.1"}
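Deterministic assignment needs a hash that is stable across processes. Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is fixed, so the sketch below uses SHA-256 from `hashlib` and then checks that the split comes out close to 50/50. The `assign_arm` helper, the experiment ID, and the synthetic user IDs are illustrative assumptions.

```python
import hashlib
from collections import Counter


def assign_arm(user_id: str, experiment_id: str, variant_pct: int = 50) -> str:
    """Map a user to 'control' or 'variant' using a stable hash bucket (0-99)."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "variant" if bucket >= 100 - variant_pct else "control"


# Same inputs always give the same arm, in any process.
first = assign_arm("user-42", "exp-reasoning")
second = assign_arm("user-42", "exp-reasoning")
print(first == second)

# Check that the split is close to 50/50 on synthetic IDs.
counts = Counter(assign_arm(f"user-{i}", "exp-reasoning") for i in range(10_000))
print(counts)
```

Including the experiment ID in the hashed key means a user's arm in one experiment does not predict their arm in the next.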
Analyzing Results
After collecting the target sample size:
import numpy as np
from scipy import stats


class ExperimentAnalysis:
    def __init__(self, event_store):
        self.events = event_store

    def analyze_experiment(self, experiment_id: str) -> dict:
        """Analyze complete experiment results."""
        # Fetch all outcomes for this experiment (expected to be 0/1 values)
        control_outcomes = self.events.fetch_outcomes(experiment_id, arm="control")
        variant_outcomes = self.events.fetch_outcomes(experiment_id, arm="variant")
        n_control, n_variant = len(control_outcomes), len(variant_outcomes)
        # Analyze primary metric: approval accuracy
        control_accuracy = sum(control_outcomes) / n_control
        variant_accuracy = sum(variant_outcomes) / n_variant
        # Relative change versus control, in percent
        improvement_pct = (variant_accuracy - control_accuracy) / control_accuracy * 100
        # Two-proportion z-test (pooled standard error, two-sided)
        pooled = (sum(control_outcomes) + sum(variant_outcomes)) / (n_control + n_variant)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
        z_stat = (variant_accuracy - control_accuracy) / se
        p_value = float(2 * stats.norm.sf(abs(z_stat)))
        is_significant = p_value < 0.05
        improvement_direction = "improvement" if variant_accuracy > control_accuracy else "regression"
        return {
            "experiment_id": experiment_id,
            "control_accuracy": control_accuracy,
            "variant_accuracy": variant_accuracy,
            "improvement_pct": improvement_pct,
            "improvement_direction": improvement_direction,
            "z_stat": float(z_stat),
            "p_value": p_value,
            "is_significant": is_significant,
            "control_sample": n_control,
            "variant_sample": n_variant,
            "recommendation": "promote" if is_significant and improvement_direction == "improvement" else "revert"
        }
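A p-value alone does not say how large the improvement is. As a complement, the sketch below computes a 95% Wald confidence interval for the difference in accuracy between arms, using only the standard library. The function name and the example counts are invented for illustration.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(
    control_successes: int, control_n: int,
    variant_successes: int, variant_n: int,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """Wald interval for (variant rate - control rate), unpooled standard error."""
    p_c = control_successes / control_n
    p_v = variant_successes / variant_n
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_v - p_c
    return diff - z * se, diff + z * se


# Made-up counts: 75% vs. 80% accuracy with 1,100 interactions per arm.
low, high = diff_confidence_interval(825, 1100, 880, 1100)
print(f"95% CI for the accuracy difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero generally agrees with a significant two-sided test at the same level, and its width shows how precisely the effect has been estimated.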
Key Takeaways
- A hypothesis must be testable, specific, and articulate a mechanism for improvement.
- Five elements: hypothesis, baseline, variant (one change), metrics, sample size.
- Calculate sample size upfront; design the experiment before running it.
- Track assignments and outcomes scrupulously for analysis.
- Analyze with statistics; recommend promotion only if improvement is significant.
Frequently Asked Questions
Can I test two variants against one baseline?
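Yes (three-arm test). Calculate sample size for the comparison that matters most. With the same per-arm sample size, three arms need roughly 50% more total data than a two-arm test, and more still if you correct for multiple comparisons.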
How do I choose my target improvement?
Choose based on business impact. A 2% accuracy improvement might save 10 customer support hours/day. Is that worth the effort? If yes, design the test for 2% improvement.
Should I run multiple experiments in parallel?
Yes, on different prompts or user segments. Avoid overlapping experiments on the same prompt; they'll confound results.
What if the results are borderline (p = 0.08)?
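Treat it as inconclusive. Don't ship a borderline improvement, and don't keep extending the run until p dips below 0.05, because that kind of optional stopping inflates the false-positive rate. Either declare no winner and try a different approach, or pre-specify a follow-up experiment with a larger sample.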
Can I change my hypothesis after seeing results?
No; you'd be p-hacking. Hypotheses must be pre-specified. If you want to explore, run a follow-up experiment.