Statistical Significance Testing for LLM Improvements

A 4-point improvement in your LLM's accuracy sounds good until you realize it could be statistical noise. Suppose you ran 100 examples and your model got 52 right, up from 48 before. Is that a real improvement or random variation? Statistical significance testing answers this: given your sample size, how likely is a difference at least this large if there were no real difference between the models? In production, teams that skip significance testing can ship regressions masked as improvements. This article teaches you to compute effect sizes, run hypothesis tests, determine sample size requirements, and build confidence intervals around your evaluation results.

Effect Size and Sample Size Planning

Before running an experiment, estimate how many examples you need to detect an improvement. This depends on:

  1. Baseline accuracy (current performance)
  2. Minimum detectable effect (smallest improvement you care about)
  3. Statistical power (probability of detecting a true improvement; typically 0.80–0.90)
  4. Significance level (probability of false positive; typically 0.05)

from scipy.stats import norm
import math

def compute_sample_size_for_proportion(
    baseline_proportion: float,
    minimum_effect_size: float,
    significance_level: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Compute sample size needed to detect an improvement in binary metric
    (e.g., "answer is correct" vs. "answer is wrong").

    baseline_proportion: current accuracy (0–1)
    minimum_effect_size: smallest improvement you care about (e.g., 0.03 for 3%)
    significance_level: alpha (typically 0.05 for 95% confidence)
    power: 1 - beta, probability of detecting effect if it exists (typically 0.80)

    Returns: required sample size per model (two independent samples)
    """

    # Z-score for significance level (two-tailed)
    z_alpha = norm.ppf(1 - significance_level / 2)

    # Z-score for power
    z_beta = norm.ppf(power)

    p1 = baseline_proportion
    p2 = baseline_proportion + minimum_effect_size

    # Sample size formula for comparing two proportions (unpooled variance)
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    denominator = (p1 - p2) ** 2

    return math.ceil(numerator / denominator)

# Example: baseline 75% accuracy, want to detect 3% improvement with 80% power
sample_size = compute_sample_size_for_proportion(
    baseline_proportion=0.75,
    minimum_effect_size=0.03,
    significance_level=0.05,
    power=0.80
)

print(f"Required sample size: {sample_size}")
# Output: Required sample size: 3132
# Interpretation: You need ~3,100 examples per model to reliably detect a 3% improvement

This is crucial: if your golden dataset has only 100 examples but you want to detect a 3% improvement, you're underpowered (high chance of missing real improvements). Plan your golden dataset size based on the improvements you care about detecting.
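To make "underpowered" concrete, the sketch below inverts the sample-size formula above and estimates the power you actually get from a fixed number of examples. It is a rough normal-approximation sketch, not part of the original code: the function name approximate_power is hypothetical, it assumes two independent samples of equal size, and it uses the standard library's statistics.NormalDist in place of scipy so it runs without extra dependencies.

```python
from statistics import NormalDist

def approximate_power(
    baseline: float,
    effect: float,
    n_per_model: int,
    significance_level: float = 0.05
) -> float:
    """Approximate power of a two-sided two-proportion z-test
    with n_per_model independent examples per model."""
    std_normal = NormalDist()
    p1, p2 = baseline, baseline + effect
    z_alpha = std_normal.inv_cdf(1 - significance_level / 2)

    # Standard error of the difference between two independent proportions
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_model) ** 0.5

    # Probability the observed difference clears the critical value
    # (ignores the tiny chance of rejecting in the wrong direction)
    return std_normal.cdf(abs(p2 - p1) / se - z_alpha)

# How power grows with dataset size for a 75% baseline and a 3-point gain
for n in (100, 500, 1000, 3132):
    print(f"n={n}: power ~ {approximate_power(0.75, 0.03, n):.2f}")
```

Under these assumptions, 100 examples per model gives well under 10% power for a 3-point gain, which means a real improvement of that size would be missed most of the time.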

Hypothesis Testing: Null and Alternative Hypotheses

Set up the statistical test as a hypothesis:

  • Null hypothesis (H0): Model v2 = Model v1 (no real improvement)
  • Alternative hypothesis (H1): Model v2 > Model v1 (improvement exists)

Then run a test to reject or fail to reject H0.

# binom_test was removed from recent SciPy releases; binomtest replaces it
from scipy.stats import binomtest

def test_improvement_binomial(
    improved: int,
    total: int,
    baseline_rate: float = 0.5,
    alternative: str = 'greater'
) -> dict:
    """
    Binomial test: Did model v2 show improvement over baseline?

    improved: number of examples where v2 beats v1
    total: total examples evaluated
    baseline_rate: if no real difference, expect baseline_rate wins (0.5 for tied)
    alternative: 'greater' (v2 better), 'less' (v2 worse), 'two-sided' (different)

    Returns: p-value and interpretation
    """
    # binomtest returns a result object; the p-value is its .pvalue attribute
    p_value = binomtest(
        improved,
        total,
        baseline_rate,
        alternative=alternative
    ).pvalue

    win_rate = improved / total

    return {
        'win_rate': win_rate,
        'p_value': p_value,
        'is_significant': p_value < 0.05,
        'interpretation': (
            f"Model v2 won {improved}/{total} ({win_rate:.1%}) comparisons. "
            f"p={p_value:.4f}. "
            f"{'Statistically significant improvement' if p_value < 0.05 else 'Not significant'} at alpha=0.05."
        )
    }

# Example: v2 beats v1 on 58 of 100 examples
result = test_improvement_binomial(improved=58, total=100, baseline_rate=0.5)
print(result['interpretation'])
# Output: Model v2 won 58/100 (58.0%) comparisons. p=0.0666. Not significant at alpha=0.05.
# Note: 58% looks good but isn't statistically significant with only 100 samples!

# With more examples (500 total, 58% still):
result_large = test_improvement_binomial(improved=290, total=500, baseline_rate=0.5)
print(result_large['interpretation'])
# Output: ...p=0.0002. Statistically significant improvement at alpha=0.05.

The key insight: with n=100, a 58% win rate cannot be distinguished from chance at alpha=0.05. With n=500, the same win rate is clearly significant. Sample size matters immensely.
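The win counts fed into the binomial test have to come from somewhere. When both models are run on the same examples, one common approach is a sign test: count the examples where only v2 is correct (wins) and where only v1 is correct (losses), and drop ties. The sketch below is an illustration under those assumptions; the function name and the input format (two parallel lists of booleans) are hypothetical, and it computes the exact binomial tail with the standard library instead of scipy.

```python
from math import comb

def sign_test_from_pairs(v1_correct: list, v2_correct: list) -> dict:
    """Exact sign test on paired per-example outcomes.
    Ties (both right or both wrong) carry no information and are dropped."""
    wins = sum(1 for a, b in zip(v1_correct, v2_correct) if b and not a)
    losses = sum(1 for a, b in zip(v1_correct, v2_correct) if a and not b)
    n = wins + losses  # number of informative (discordant) pairs

    if n == 0:
        return {'wins': 0, 'losses': 0, 'p_greater': 1.0, 'p_two_sided': 1.0}

    # Under H0 each discordant pair is a fair coin flip
    upper_tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    lower_tail = sum(comb(n, k) for k in range(0, wins + 1)) / 2 ** n

    return {
        'wins': wins,
        'losses': losses,
        'p_greater': upper_tail,  # H1: v2 better than v1
        'p_two_sided': min(1.0, 2 * min(upper_tail, lower_tail)),
    }

# Toy data: 8 wins, 2 losses, 10 ties
v1 = [False] * 8 + [True] * 2 + [True] * 10
v2 = [True] * 8 + [False] * 2 + [True] * 10
print(sign_test_from_pairs(v1, v2))
```

Note that the effective sample size is the number of discordant pairs, not the dataset size: two models that agree on most examples leave few informative comparisons.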

Confidence Intervals Around Metrics

Report not just a point estimate but a confidence interval. This shows uncertainty.

import math
from scipy.stats import norm

def compute_confidence_interval_proportion(
    successes: int,
    total: int,
    confidence: float = 0.95
) -> tuple:
    """
    Compute confidence interval for a binary metric (e.g., accuracy).
    Uses Wilson score interval (better for small samples than normal approximation).

    Returns: (lower_bound, point_estimate, upper_bound)
    """
    # Guard first: with no data, dividing by total would raise ZeroDivisionError
    if total == 0:
        return (0, 0, 1)

    p = successes / total

    z = norm.ppf((1 + confidence) / 2)  # e.g., z=1.96 for 95% CI

    denominator = 1 + z**2 / total

    center = (p + z**2 / (2 * total)) / denominator
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denominator

    lower = max(0, center - margin)
    upper = min(1, center + margin)

    return (lower, p, upper)

# Example: 75% accuracy on 100 examples
lower, point, upper = compute_confidence_interval_proportion(75, 100, confidence=0.95)
print(f"Accuracy: {point:.1%} (95% CI: [{lower:.1%}, {upper:.1%}])")
# Output: Accuracy: 75.0% (95% CI: [65.7%, 82.5%])
# Interpretation: We're 95% confident the true accuracy is between roughly 66% and 82%.

Always report confidence intervals in production. A single number hides uncertainty and invites over-confident decisions.
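When the question is "how much better is v2 than v1?", an interval around the difference is often more useful than two separate intervals. One way to get it, sketched below, is a percentile bootstrap over per-example outcomes. This swaps in a different technique from the Wilson interval above, and everything here is an assumption made for illustration: the function name, the paired boolean inputs, the 2,000 resamples, and the fixed seed.

```python
import random

def bootstrap_accuracy_difference(
    v1_correct: list,
    v2_correct: list,
    confidence: float = 0.95,
    num_resamples: int = 2000,
    seed: int = 0
) -> tuple:
    """Percentile-bootstrap CI for accuracy(v2) - accuracy(v1)
    when both models were scored on the same examples.

    Returns: (lower_bound, observed_difference, upper_bound)
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(v1_correct)

    # Per-example difference: +1 if only v2 is right, -1 if only v1, 0 if tied
    diffs = [int(b) - int(a) for a, b in zip(v1_correct, v2_correct)]
    observed = sum(diffs) / n

    # Resample examples with replacement and recompute the mean difference
    resampled = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(num_resamples)
    )

    lower_idx = int((1 - confidence) / 2 * num_resamples)
    upper_idx = int((1 + confidence) / 2 * num_resamples) - 1
    return (resampled[lower_idx], observed, resampled[upper_idx])

# Toy data: v1 gets 70/100 right, v2 gets the same 70 plus 8 more
v1 = [True] * 70 + [False] * 30
v2 = [True] * 78 + [False] * 22
lower, diff, upper = bootstrap_accuracy_difference(v1, v2)
print(f"Difference: {diff:+.1%} (95% CI: [{lower:+.1%}, {upper:+.1%}])")
```

If the interval for the difference includes zero, the data do not rule out "no improvement", regardless of how the two point estimates look side by side.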

Multi-Metric Significance

Most evaluation uses multiple metrics (accuracy, speed, coverage). Test improvement on all metrics together.

def test_improvement_multiple_metrics(
    metric_results: dict,
    baseline_results: dict
) -> dict:
    """
    Compare new vs. baseline scores across multiple metrics.
    metric_results, baseline_results: {metric_name: score}
    Assumes higher is better for every metric (negate latency-style metrics first).

    NOTE: the per-metric check below is a placeholder threshold, not a
    statistical test. In practice, run an appropriate test per metric and
    compare each p-value against bonferroni_corrected_alpha.

    Returns: per-metric results + overall recommendation
    """
    num_metrics = len(metric_results)
    bonferroni_alpha = 0.05 / num_metrics  # Correct for multiple comparisons

    results = {
        'per_metric': {},
        'bonferroni_corrected_alpha': bonferroni_alpha,
        'recommendation': None
    }

    improved_count = 0

    for metric_name, new_score in metric_results.items():
        baseline_score = baseline_results.get(metric_name, 0)
        improvement = new_score - baseline_score

        results['per_metric'][metric_name] = {
            'baseline': baseline_score,
            'new': new_score,
            'improvement': improvement,
            'significant': abs(improvement) > 0.05  # Dummy threshold
        }

        if improvement > 0:
            improved_count += 1

    # Recommendation: only claim improvement if at least half of the metrics
    # improved and none regressed by more than the dummy threshold
    regressions = [
        m for m, r in results['per_metric'].items()
        if r['improvement'] < -0.05
    ]

    if len(regressions) > 0:
        results['recommendation'] = (
            f"REGRESSION DETECTED on {len(regressions)} metric(s). "
            f"Do not deploy: {', '.join(regressions)}"
        )
    elif improved_count >= num_metrics * 0.5:
        results['recommendation'] = "SAFE TO DEPLOY: improvements detected on majority of metrics."
    else:
        results['recommendation'] = "INCONCLUSIVE: no clear improvement or regression."

    return results

Don't accept a gain on one metric without checking what it cost on the others. A 5% accuracy improvement that doubles latency is not a net win. Multi-metric testing prevents one-dimensional optimization.
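For binary metrics, the placeholder threshold can be swapped for an actual test with the Bonferroni-corrected alpha applied. The sketch below is one possible version under stated assumptions: it uses a pooled two-proportion z-test on independent samples, the function names and the input format of (baseline_successes, baseline_n, new_successes, new_n) tuples are hypothetical, and the metric names and counts are made-up illustration data.

```python
from statistics import NormalDist

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both proportions are equal (pooled z-test)."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # all successes or all failures: no evidence of a difference
    z = (succ_b / n_b - succ_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bonferroni_decisions(metric_counts: dict, family_alpha: float = 0.05) -> dict:
    """metric_counts: {name: (baseline_successes, baseline_n, new_successes, new_n)}"""
    per_test_alpha = family_alpha / len(metric_counts)
    decisions = {}
    for name, (b_succ, b_n, n_succ, n_n) in metric_counts.items():
        p = two_proportion_p_value(b_succ, b_n, n_succ, n_n)
        decisions[name] = {
            'p_value': p,
            'significant_uncorrected': p < family_alpha,
            'significant': p < per_test_alpha,  # Bonferroni-corrected decision
        }
    return decisions

# Made-up counts for three binary metrics, 1,000 examples per model
counts = {
    'accuracy': (750, 1000, 790, 1000),
    'format_valid': (900, 1000, 905, 1000),
    'cites_source': (600, 1000, 680, 1000),
}
for name, d in bonferroni_decisions(counts).items():
    print(name, f"p={d['p_value']:.4f}", d)
```

With these numbers, the accuracy gain passes an uncorrected 0.05 threshold but not the corrected one, which is exactly the kind of borderline result the correction is meant to catch.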

P-Hacking and Multiple Comparisons Problem

Beware of p-hacking: running many tests, then reporting only the "significant" results. This inflates false positives.

from typing import Optional

def detect_p_hacking(
    metric_p_values: dict,
    num_hypotheses_tested: Optional[int] = None
) -> dict:
    """
    Detect signs of p-hacking: more low p-values than survive correction.
    Uses Benjamini-Hochberg FDR correction for multiple comparisons.

    metric_p_values: {metric_name: p_value}
    num_hypotheses_tested: total tests actually run, including unreported ones
    """
    p_vals = list(metric_p_values.values())
    num_hypotheses = num_hypotheses_tested or len(p_vals)

    # Sort by p-value, keeping the metric name attached
    sorted_pvals = sorted(metric_p_values.items(), key=lambda x: x[1])

    # Benjamini-Hochberg: adjusted p = p * m / rank, made monotone with a
    # running minimum from the largest p-value down; reject if adjusted p < 0.05
    adjusted_results = {}
    running_min = 1.0

    for rank in range(len(sorted_pvals), 0, -1):
        name, pval = sorted_pvals[rank - 1]
        running_min = min(running_min, pval * num_hypotheses / rank)
        adjusted_results[name] = {
            'original_p': pval,
            'adjusted_p': running_min,
            'reject_h0': running_min < 0.05
        }

    return {
        'tests_conducted': num_hypotheses,
        'significant_at_alpha_0_05_uncorrected': sum(1 for p in p_vals if p < 0.05),
        'significant_after_fdr_correction': sum(
            1 for r in adjusted_results.values() if r['reject_h0']
        ),
        'adjusted_results': adjusted_results
    }

If you test 20 metrics independently and none of them truly changed, expect ~1 false positive by chance (0.05 * 20). Use Bonferroni or Benjamini-Hochberg correction when testing multiple metrics.
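The arithmetic behind that warning can be worked through directly. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across num_tests
    independent tests when no metric has truly changed."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20):
    expected_false_positives = 0.05 * m
    # Bonferroni tests each metric at alpha / m to hold the family rate near alpha
    corrected = familywise_error_rate(m, alpha=0.05 / m)
    print(
        f"{m} tests: expected false positives={expected_false_positives:.2f}, "
        f"P(at least one)={familywise_error_rate(m):.2f}, "
        f"after Bonferroni={corrected:.3f}"
    )
```

With 20 uncorrected tests, the chance of at least one spurious "significant" result is roughly 64%, while testing each metric at 0.05 / 20 keeps it just under 5%.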

Key Takeaways

  • Plan sample size upfront: Compute required n based on baseline, effect size, and power before collecting data.
  • Report p-values and confidence intervals: A 75% accuracy with 95% CI [66%, 82%] is honest; "75% accuracy" is not.
  • Use multi-metric testing: Test improvement across all metrics; don't celebrate one metric while ignoring others.
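  • Correct for multiple comparisons: Apply a Bonferroni or Benjamini-Hochberg adjustment whenever you test more than one metric.
  • Be skeptical of small improvements on small samples: A 58% win rate on 100 examples is not distinguishable from noise. By the sample-size formula above, detecting a 5-point accuracy gain from a 75% baseline with 80% power takes roughly 1,100 examples per model.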

Frequently Asked Questions

What's the difference between significance and importance?

Significance: does an observed difference likely reflect a real effect? Importance: does the effect size matter for your application? A 0.1% improvement might be statistically significant on 100,000 examples but unimportant for UX.

Should I always use alpha=0.05?

0.05 is convention but not law. For high-stakes decisions (deploying to production), use alpha=0.01 (stronger evidence required). For exploratory analysis, 0.10 is acceptable. Document your choice.

How do I choose between one-tailed and two-tailed tests?

Use two-tailed if you care about improvement or regression. Use one-tailed only if you're sure any change in one direction is meaningless (rare). Two-tailed is safer and more defensible.

Can I use statistical significance to validate model choice?

Yes, but it's not sufficient. A model with 75.1% vs. 75.0% accuracy might be statistically significantly better on 100,000 examples but not practically better. Always pair significance with effect size.

What if my metric isn't binary (e.g., semantic similarity score)?

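Use t-tests instead of binomial tests. If both models are scored on the same examples, run a paired t-test on the per-example score differences (scipy.stats.ttest_rel). If the models were evaluated on different, independent samples, compute the mean and standard deviation for each and run a two-sample t-test (scipy.stats.ttest_ind).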
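For the common case where both models are scored on the same examples, the test statistic is easy to compute by hand from the per-example differences. The sketch below is a rough illustration, not a replacement for scipy: the function name is hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples (a few dozen examples or more). For an exact t-based p-value, scipy.stats.ttest_rel is the standard choice.

```python
from statistics import NormalDist, mean, stdev

def paired_t_statistic(scores_v1: list, scores_v2: list) -> dict:
    """Paired t statistic for continuous per-example scores
    (e.g., semantic similarity) from two models on the same examples."""
    diffs = [b - a for a, b in zip(scores_v1, scores_v2)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired examples")

    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")

    mean_diff = mean(diffs)
    t_stat = mean_diff / (sd / n ** 0.5)

    # Large-sample approximation: treat t as standard normal.
    # This understates the p-value when n is small.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

    return {
        'mean_difference': mean_diff,
        't_statistic': t_stat,
        'degrees_of_freedom': n - 1,
        'p_value_normal_approx': p_approx,
    }

# Toy scores for four examples (far too few for the approximation; shape only)
print(paired_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 5.0, 5.0]))
```

Pairing matters because it removes per-example difficulty from the noise: each example serves as its own control, so the same mean difference usually yields a larger test statistic than an unpaired comparison.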

Further Reading