Statistical Analysis of Prompt Results: Measuring Impact

Statistical analysis is the rigorous method for determining whether a prompt change actually improved performance or if observed differences are just noise. Without statistics, you cannot distinguish signal from random variation. With statistics, you can make claims like "the new prompt is 8% more accurate with 95% confidence" instead of "it seems better."

Statistical analysis answers three questions: (1) Is the difference real? (2) How big is it? (3) How confident are we?

Core Statistical Concepts

Null hypothesis (H0): There is no difference between the baseline and variant. Assume this is true unless proven otherwise.

Alternative hypothesis (H1): There is a difference. We reject H0 if evidence is strong enough.

P-value: The probability of observing a difference this large (or larger) if H0 were true. A p-value < 0.05 means that, if there were truly no difference, random variation alone would produce a result this extreme less than 5% of the time. It is not the probability that H0 is true. If p < 0.05, we reject H0 and conclude the difference is statistically significant.

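Confidence interval: A range around the observed difference that likely contains the true difference. E.g., "the variant is 5 percentage points more accurate, with 95% CI [2, 8]" means the interval was produced by a procedure that captures the true improvement in 95% of repeated experiments; informally, we're 95% confident the true improvement is between 2 and 8 percentage points.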

Effect size: The magnitude of the difference, independent of sample size. E.g., Cohen's d = 0.5 is a "medium" effect. Small effects require large sample sizes to detect; large effects are detectable with small samples.
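To make the link between effect size and sample size concrete, here is a minimal sketch of the standard normal-approximation formula for a two-sided, two-sample comparison of means. The function name `required_n_per_group` and the 80% power default are assumptions for illustration, not part of the original workflow; an exact t-based calculation gives slightly larger numbers.

```python
import math
from scipy import stats

def required_n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples needed per arm to detect Cohen's d (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = stats.norm.ppf(power)           # quantile corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {required_n_per_group(d)} samples per group")
```

Under these assumptions, a small effect (d = 0.2) needs roughly 393 samples per group, while a large effect (d = 0.8) needs about 25.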

Two-Sample T-Test

Use a t-test to compare the mean outcome of two groups (control and variant). You have data like satisfaction ratings [4.1, 3.9, 4.3, 4.5, ...] for control and [4.4, 4.2, 4.6, 4.7, ...] for variant.

import numpy as np
from scipy import stats

def two_sample_ttest(control: list, variant: list, alpha: float = 0.05) -> dict:
    """
    Perform a two-sample t-test comparing control and variant.

    Args:
        control: list of outcomes for control arm
        variant: list of outcomes for variant arm
        alpha: significance level (typically 0.05)

    Returns:
        Dictionary with test results and recommendation
    """
    control = np.array(control)
    variant = np.array(variant)

    # Compute descriptive statistics (ddof=1 gives the sample standard deviation)
    control_mean = np.mean(control)
    variant_mean = np.mean(variant)
    control_std = np.std(control, ddof=1)
    variant_std = np.std(variant, ddof=1)

    # Perform t-test
    # equal_var=True is the classic Student's t-test and assumes both arms have
    # equal variance. If that is doubtful, equal_var=False (Welch) is the safer choice.
    # Variant is passed first so the sign of t matches (variant - control).
    t_stat, p_value = stats.ttest_ind(variant, control, equal_var=True)

    # Compute confidence interval for the difference
    n_control = len(control)
    n_variant = len(variant)
    df = n_control + n_variant - 2

    # Pooled standard deviation and standard error of the difference
    pooled_std = np.sqrt(((n_control - 1) * control_std**2 + (n_variant - 1) * variant_std**2) / df)
    se_diff = pooled_std * np.sqrt(1 / n_control + 1 / n_variant)

    t_critical = stats.t.ppf(1 - alpha / 2, df)
    ci_lower = (variant_mean - control_mean) - t_critical * se_diff
    ci_upper = (variant_mean - control_mean) + t_critical * se_diff

    # Effect size (Cohen's d)
    cohens_d = (variant_mean - control_mean) / pooled_std

    # Recommendation: promote only if the variant is significantly better
    is_significant = p_value < alpha
    recommendation = "promote" if is_significant and variant_mean > control_mean else "no promotion"

    return {
        "control_mean": control_mean,
        "variant_mean": variant_mean,
        "difference": variant_mean - control_mean,
        "difference_pct": ((variant_mean - control_mean) / control_mean) * 100,
        "t_statistic": t_stat,
        "p_value": p_value,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "cohens_d": cohens_d,
        "is_significant": is_significant,
        "recommendation": recommendation,
        "control_n": n_control,
        "variant_n": n_variant
    }

# Example: customer satisfaction ratings (1-5 scale)
control_ratings = [4.1, 3.9, 4.3, 4.5, 4.0, 4.2, 3.8, 4.4, 4.1, 4.0]
variant_ratings = [4.4, 4.2, 4.6, 4.7, 4.3, 4.5, 4.1, 4.8, 4.4, 4.3]

results = two_sample_ttest(control_ratings, variant_ratings)
print(f"Control mean: {results['control_mean']:.2f}")
print(f"Variant mean: {results['variant_mean']:.2f}")
print(f"Difference: {results['difference']:.2f} ({results['difference_pct']:.1f}%)")
print(f"95% CI: [{results['ci_lower']:.2f}, {results['ci_upper']:.2f}]")
print(f"P-value: {results['p_value']:.4f}")
print(f"Cohen's d: {results['cohens_d']:.2f}")
print(f"Recommendation: {results['recommendation']}")

Two-Proportion Z-Test

For binary outcomes (pass/fail, approved/rejected), use a z-test instead of a t-test.

import numpy as np
from scipy import stats

def two_proportion_ztest(
    control_successes: int,
    control_total: int,
    variant_successes: int,
    variant_total: int,
    alpha: float = 0.05
) -> dict:
    """
    Perform a two-proportion z-test.

    Args:
        control_successes: number of successes in control
        control_total: total trials in control
        variant_successes: number of successes in variant
        variant_total: total trials in variant
        alpha: significance level (typically 0.05)

    Returns:
        Dictionary with test results
    """
    control_prop = control_successes / control_total
    variant_prop = variant_successes / variant_total

    # Pooled proportion for null hypothesis
    pooled_prop = (control_successes + variant_successes) / (control_total + variant_total)

    # Standard error under null
    se = np.sqrt(pooled_prop * (1 - pooled_prop) * (1/control_total + 1/variant_total))

    # Z-statistic
    z_stat = (variant_prop - control_prop) / se

    # P-value (two-tailed); sf(x) is 1 - cdf(x), computed more accurately in the tail
    p_value = 2 * stats.norm.sf(abs(z_stat))

    # Confidence interval (unpooled standard error, critical value derived from alpha)
    se_ci = np.sqrt(variant_prop * (1 - variant_prop) / variant_total +
                    control_prop * (1 - control_prop) / control_total)
    z_critical = stats.norm.ppf(1 - alpha / 2)  # 1.96 when alpha = 0.05
    ci_lower = (variant_prop - control_prop) - z_critical * se_ci
    ci_upper = (variant_prop - control_prop) + z_critical * se_ci

    is_significant = p_value < alpha
    recommendation = "promote" if is_significant and variant_prop > control_prop else "no promotion"

    return {
        "control_prop": control_prop,
        "variant_prop": variant_prop,
        "difference": variant_prop - control_prop,
        "difference_pct": ((variant_prop - control_prop) / control_prop) * 100,
        "z_statistic": z_stat,
        "p_value": p_value,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "is_significant": is_significant,
        "recommendation": recommendation
    }

# Example: refund approval accuracy
# Control: 150 correct out of 200, Variant: 170 correct out of 200
results = two_proportion_ztest(
    control_successes=150,
    control_total=200,
    variant_successes=170,
    variant_total=200
)
print(f"Control accuracy: {results['control_prop']:.1%}")
print(f"Variant accuracy: {results['variant_prop']:.1%}")
print(f"Absolute improvement: {results['difference'] * 100:.1f} percentage points")
print(f"Relative improvement: {results['difference_pct']:.1f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Recommendation: {results['recommendation']}")

Interpreting P-Values and Confidence Intervals

P-value < 0.001: Very strong evidence against H0. A difference this large would be very rare if the prompts performed the same.

P-value 0.001–0.05: Strong evidence against H0. Likely a real difference.

P-value 0.05–0.10: Weak evidence. Inconclusive. Plan a follow-up experiment with a larger, pre-specified sample size or try a different approach.

P-value > 0.10: No significant evidence. You cannot conclude that the prompts differ, although this does not prove they are equivalent.

Confidence interval (CI): If the CI includes zero (e.g., [-1%, 3%]), the difference is not significant. If it excludes zero (e.g., [2%, 8%]), the difference is significant. The width of the CI reflects precision: narrow CIs (e.g., [5%, 7%]) mean high precision; wide CIs (e.g., [1%, 11%]) mean low precision.

Reporting Results Clearly

Write a results table and narrative:

## Results

### Primary Metric: Refund Approval Accuracy

| Metric | Control | Variant | Difference | 95% CI | P-value | Significant? |
|--------|---------|---------|-----------|--------|---------|--------------|
| Accuracy | 75.0% | 80.0% | +5.0 pp | [+1.2 pp, +8.8 pp] | 0.008 | Yes |

**Interpretation:** The variant prompt improves accuracy by 5.0 percentage points.
We are 95% confident the true improvement is between 1.2 and 8.8 percentage points.
The p-value (0.008) is well below 0.05, so a difference this large would be unlikely if the prompts performed equally.

### Secondary Metrics (Must Not Regress)

| Metric | Control | Variant | Difference | P-value | Status |
|--------|---------|---------|-----------|---------|--------|
| Customer satisfaction | 4.10/5 | 4.12/5 | +0.02 | 0.67 | No change (green) |
| Inference latency p99 | 2200 ms | 2180 ms | -20 ms | 0.45 | No change (green) |

**Interpretation:** Secondary metrics show no statistically significant change. There is no evidence
that the variant hurts user experience or performance.

## Recommendation: PROMOTE to Staging

Avoiding Common Pitfalls

Peeking: Checking results mid-experiment inflates false positives. Fix the sample size upfront; resist the urge to check early.
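A small simulation can show why peeking is a problem. The sketch below assumes both arms are drawn from the same normal distribution (so H0 is true and every "significant" result is a false positive) and compares a single test at the end with a rule that stops at the first p < 0.05 across several interim looks. The function name `false_positive_rate`, the sample sizes, and the look schedule are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def false_positive_rate(n_experiments=1000, n_total=200, peek_every=None,
                        alpha=0.05, seed=0):
    """Fraction of null experiments declared significant under a given checking rule."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both arms share one distribution, so any detected "difference" is noise
        control = rng.normal(0, 1, n_total)
        variant = rng.normal(0, 1, n_total)
        if peek_every is None:
            checkpoints = [n_total]  # one look, at the planned sample size
        else:
            checkpoints = range(peek_every, n_total + 1, peek_every)
        for n in checkpoints:
            _, p = stats.ttest_ind(control[:n], variant[:n])
            if p < alpha:
                false_positives += 1
                break  # stop at the first "significant" look
    return false_positives / n_experiments

fixed = false_positive_rate(peek_every=None)
peeking = false_positive_rate(peek_every=20)
print(f"Single look at n=200:  {fixed:.3f}")
print(f"Look every 20 samples: {peeking:.3f}")
```

With a single look, the false positive rate stays near the nominal 5%. With ten looks, it is typically several times higher.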

Multiple comparisons: If you test 10 metrics, you expect ~0.5 false positives by chance. Adjust your alpha (Bonferroni: divide alpha by number of tests).

alpha = 0.05
num_metrics = 5
alpha_corrected = alpha / num_metrics # 0.01, stricter threshold
print(f"Use p < {alpha_corrected} for significance")
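The arithmetic behind the correction can be checked directly. This sketch assumes the metrics are independent and that none of them truly differs between arms; under those assumptions it computes the expected number of false positives and the chance of at least one, before and after the Bonferroni correction.

```python
alpha = 0.05
num_metrics = 10

# Expected number of false positives when every null hypothesis is true
expected_false_positives = alpha * num_metrics

# Probability of at least one false positive (family-wise error rate),
# assuming the metrics are independent
fwer_uncorrected = 1 - (1 - alpha) ** num_metrics

# Bonferroni: test each metric at alpha / num_metrics
alpha_corrected = alpha / num_metrics
fwer_corrected = 1 - (1 - alpha_corrected) ** num_metrics

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one (uncorrected): {fwer_uncorrected:.1%}")
print(f"Chance of at least one (Bonferroni):  {fwer_corrected:.1%}")
```

Without correction, the chance of at least one spurious "win" across 10 metrics is about 40%; with the correction it drops back below 5%.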

Non-independent samples: If the same user appears in both control and variant (because you didn't randomize properly), results are biased. Use paired t-tests if samples are matched; use independent t-tests if randomized.
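When the same test inputs are scored under both prompts, the samples are matched, and a paired test can be much more sensitive. The following sketch uses hypothetical scores for 8 inputs (invented for illustration) and compares `scipy.stats.ttest_rel` with the independent test on the same numbers.

```python
from scipy import stats

# Hypothetical scores: the same 8 test inputs, scored once under each prompt
control_scores = [3.8, 4.1, 3.5, 4.4, 3.9, 4.0, 3.6, 4.2]
variant_scores = [4.0, 4.3, 3.9, 4.5, 4.2, 4.1, 3.9, 4.6]

# Paired test: works on the per-input differences, removing input-to-input variation
paired = stats.ttest_rel(variant_scores, control_scores)

# Independent test: treats the two lists as unrelated samples
independent = stats.ttest_ind(variant_scores, control_scores)

print(f"Paired t-test p-value:      {paired.pvalue:.4f}")
print(f"Independent t-test p-value: {independent.pvalue:.4f}")
```

Because every input improves by a similar amount, the paired test detects the shift even though the spread between inputs is larger than the shift itself. Applying the paired test to data that is not actually matched would be incorrect.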

Unequal variances: If control and variant have very different standard deviations, use Welch's t-test, which does not assume equal variances.

# Welch's t-test (unequal variance)
t_stat, p_value = stats.ttest_ind(control, variant, equal_var=False)

Effect Size Interpretation

Cohen's guidelines for effect size (d):

| Effect Size | d | Interpretation |
|-------------|---|----------------|
| Small | 0.2 | Noticeable but small difference |
| Medium | 0.5 | Moderate difference |
| Large | 0.8 | Large, obvious difference |
| Very Large | > 1.2 | Huge difference |

A d = 0.5 (medium) is often the threshold for a worthwhile improvement.

Key Takeaways

  • P-values < 0.05 indicate statistical significance at the 5% level (the counterpart of a 95% confidence interval).
  • Confidence intervals show the range of plausible true differences.
  • T-tests are for continuous outcomes; z-tests are for binary outcomes.
  • Always report means, differences, CIs, and p-values; allow readers to judge magnitude.
  • Avoid peeking and multiple-comparisons problems.

Frequently Asked Questions

Can I use a t-test with small samples (n < 30)?

Yes. The t-distribution is designed for small samples, but the test becomes less reliable if the data is non-normal. Use Welch's t-test if the variances differ, or a non-parametric test (Mann-Whitney U) if the data looks skewed.

What if my outcome is non-binary but not normally distributed?

Use the Mann-Whitney U test (non-parametric). It doesn't assume normality but has slightly less power than the t-test when the data really is normal.
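As a rough illustration, the sketch below runs `scipy.stats.mannwhitneyu` on hypothetical right-skewed latency data (the values are invented) and, for comparison, Welch's t-test on the same numbers. A few extreme values inflate the variance that the t-test relies on, while the rank-based test is largely unaffected by them.

```python
from scipy import stats

# Hypothetical response latencies in seconds; a few slow outliers skew each arm
control_latency = [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 6.8, 1.4, 1.3, 9.5]
variant_latency = [0.9, 1.0, 0.8, 1.1, 1.0, 0.9, 5.9, 1.0, 0.9, 1.1]

# Rank-based test: compares the ordering of values, not their means
u_stat, p_value = stats.mannwhitneyu(
    control_latency, variant_latency, alternative="two-sided"
)

# Welch's t-test on the same data, for comparison
_, t_p_value = stats.ttest_ind(control_latency, variant_latency, equal_var=False)

print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")
print(f"Welch t-test p = {t_p_value:.4f}")
```

Note that the two tests answer slightly different questions: Mann-Whitney asks whether values in one arm tend to be larger than in the other, not whether the means differ.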

How do I interpret a p-value of exactly 0.05?

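Borderline. Conventionally, p = 0.05 is the threshold, but it's arbitrary. A p = 0.049 is not dramatically different from p = 0.051. If borderline, report the result as inconclusive or plan a follow-up experiment with a pre-specified sample size; extending the current run until it crosses the threshold is a form of peeking.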

Should I use one-tailed or two-tailed tests?

Two-tailed (default). One-tailed tests are more powerful but require you to pre-specify the direction. Use one-tailed only if you're 100% certain the variant cannot be worse.
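For reference, SciPy (version 1.6 and later) exposes the direction of the test through the `alternative` parameter of `ttest_ind`. This sketch reuses the satisfaction ratings from the t-test example and assumes the direction "variant is greater" was chosen before looking at the data.

```python
from scipy import stats

control = [4.1, 3.9, 4.3, 4.5, 4.0, 4.2, 3.8, 4.4, 4.1, 4.0]
variant = [4.4, 4.2, 4.6, 4.7, 4.3, 4.5, 4.1, 4.8, 4.4, 4.3]

# Two-tailed: H1 is "variant mean differs from control mean"
_, p_two_sided = stats.ttest_ind(variant, control, alternative="two-sided")

# One-tailed: H1 is "variant mean is greater than control mean"
_, p_one_sided = stats.ttest_ind(variant, control, alternative="greater")

print(f"Two-tailed p-value: {p_two_sided:.4f}")
print(f"One-tailed p-value: {p_one_sided:.4f}")
```

When the observed difference is in the pre-specified direction, the one-tailed p-value is half the two-tailed one, which is where the extra power comes from. It is also why choosing the direction after seeing the data is not legitimate.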

Can I stop the experiment early if results look good?

No; this inflates false positives. Define the stopping rule (sample size or max duration) upfront. If you must stop early, apply a correction (e.g., use a stricter p-threshold).

Further Reading