Running Regression Tests for RAG Systems
Regression testing catches changes that degrade RAG quality before they reach users. A regression test compares metrics (faithfulness, context relevance, citation accuracy) from your latest code against a baseline and alerts you if quality drops. Without regression tests, you risk deploying harmful changes: a seemingly innocent prompt tweak might increase hallucinations or reduce answer relevance by several percentage points.
Regression testing for RAG differs from traditional software testing. You cannot assert exact outputs, because the same input does not always produce the same output. Instead, you compare statistical properties: does the new system's faithfulness score stay above 0.75 on average? This requires a carefully established baseline and tuned thresholds.
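As a minimal sketch of what "comparing statistical properties" can look like, the snippet below averages per-example scores and reports a rough normal-approximation interval around the mean. The function names, the sample scores, and the choice of a 1.96 multiplier are illustrative assumptions, not part of any particular evaluation library.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def mean_with_interval(scores: List[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, m - half_width, m + half_width


def stays_above(scores: List[float], floor: float) -> bool:
    """Pass only if the average score is at or above the floor."""
    m, _, _ = mean_with_interval(scores)
    return m >= floor


# Hypothetical per-example faithfulness scores from one evaluation run.
faithfulness_scores = [0.9, 0.7, 0.8, 0.85, 0.75, 0.8]
avg, low, high = mean_with_interval(faithfulness_scores)
print(f"mean={avg:.3f}, approx. 95% interval=({low:.3f}, {high:.3f})")
print("passes 0.75 floor:", stays_above(faithfulness_scores, 0.75))
```

The width of the interval is a reminder that a small golden dataset gives a noisy average, which is one reason thresholds need tuning rather than guessing.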
Establishing a Baseline
A baseline is a snapshot of your RAG system's metrics on a fixed golden dataset. Use this as the reference for detecting regressions. Baseline metrics should be recorded alongside your code version, prompt text, and model version for reproducibility.
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class RAGBaseline:
    """Frozen snapshot of RAG system metrics."""
    version: str               # e.g., "1.0.0"
    timestamp: str             # ISO 8601 datetime
    model_version: str         # e.g., "gpt-4-2024-01"
    prompt_hash: str           # SHA256 of system prompt
    retriever: str             # e.g., "bm25", "dense-embedding"
    metrics: Dict[str, float]  # Aggregate scores
    examples_count: int

    def save(self, filepath: str) -> None:
        """Save baseline to JSON file, creating parent directories if needed."""
        Path(filepath).parent.mkdir(parents=True, exist_ok=True)
        with open(filepath, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @staticmethod
    def load(filepath: str) -> "RAGBaseline":
        """Load baseline from JSON file."""
        with open(filepath) as f:
            data = json.load(f)
        return RAGBaseline(**data)


# Example baseline
baseline = RAGBaseline(
    version="1.0.0",
    timestamp=datetime.now().isoformat(),
    model_version="gpt-4-2024-01",
    prompt_hash="abc123def456",  # placeholder; store the real digest here
    retriever="bm25-lucene",
    metrics={
        "faithfulness": 0.82,
        "answer_relevance": 0.79,
        "context_precision": 0.71,
        "context_recall": 0.88,
        "citation_precision": 0.95,
    },
    examples_count=150,
)
baseline.save("baselines/v1.0.0.json")
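The prompt_hash field is described as a SHA256 of the system prompt. One way to compute it is sketched below with the standard hashlib module. The helper name prompt_fingerprint and the whitespace normalization step are assumptions for illustration; whether to normalize at all is a design choice you should make deliberately.

```python
import hashlib


def prompt_fingerprint(prompt_text: str) -> str:
    """Return the SHA-256 hex digest of the prompt after trimming whitespace."""
    # Strip trailing spaces per line so cosmetic edits do not change the hash.
    normalized = "\n".join(
        line.rstrip() for line in prompt_text.strip().splitlines()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical system prompt, used only to demonstrate the helper.
system_prompt = "Answer using only the provided passages.\nCite each passage you use."
print(prompt_fingerprint(system_prompt))
```

Hashing the normalized text means two baselines with the same digest used the same prompt wording, while any substantive edit yields a different digest.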
Building a Regression Test Harness
Create a test suite that runs evaluation on your golden dataset and compares metrics to the baseline. Fail the test if any metric drops by more than a specified threshold.
import json
from typing import Dict, List

import pytest


def evaluate_single_example(answer, passages, query) -> Dict[str, float]:
    """Placeholder: return per-metric scores for one example.

    Replace this with your actual evaluator (RAGAS or custom).
    """
    raise NotImplementedError("plug in your evaluator here")


class RAGRegressionTester:
    """Harness for continuous RAG regression testing."""

    def __init__(self, golden_dataset_path: str,
                 baseline_path: str,
                 regression_thresholds: Dict[str, float]):
        """
        Args:
            golden_dataset_path: Path to golden dataset JSON.
            baseline_path: Path to baseline metrics JSON.
            regression_thresholds: Maximum relative drop allowed per metric,
                e.g. {"faithfulness": 0.05} allows a 5% drop relative to
                the baseline value.
        """
        self.golden_dataset = self._load_dataset(golden_dataset_path)
        self.baseline = RAGBaseline.load(baseline_path)
        self.regression_thresholds = regression_thresholds

    def _load_dataset(self, path: str) -> List[Dict]:
        """Load golden dataset from JSON."""
        with open(path) as f:
            return json.load(f)

    def run_evaluation(self, rag_system) -> Dict[str, float]:
        """
        Evaluate RAG system on golden dataset.

        Args:
            rag_system: Callable that takes query and returns
                (answer, retrieved_passages).
        Returns:
            Aggregated metrics dict.
        """
        all_scores = {metric: [] for metric in self.baseline.metrics}
        for example in self.golden_dataset:
            query = example["query"]
            answer, passages = rag_system(query)
            # Compute per-example metrics with your evaluator
            scores = evaluate_single_example(answer, passages, query)
            for metric, score in scores.items():
                # setdefault tolerates metrics that the baseline does not list
                all_scores.setdefault(metric, []).append(score)
        # Average each metric, skipping metrics with no scores
        aggregate_scores = {
            metric: sum(scores) / len(scores)
            for metric, scores in all_scores.items()
            if scores
        }
        return aggregate_scores

    def check_regressions(self, new_scores: Dict[str, float]) -> List[str]:
        """
        Compare new scores to baseline. Return list of regressions.

        Args:
            new_scores: Latest metrics from run_evaluation.
        Returns:
            List of strings describing detected regressions.
        """
        regressions = []
        for metric_name, threshold in self.regression_thresholds.items():
            if metric_name not in self.baseline.metrics:
                continue
            baseline_value = self.baseline.metrics[metric_name]
            if metric_name not in new_scores:
                # A missing metric should not pass silently
                regressions.append(f"{metric_name}: missing from new scores")
                continue
            new_value = new_scores[metric_name]
            if baseline_value == 0:
                continue  # relative drop is undefined for a zero baseline
            # Compute fractional (relative) drop
            drop = (baseline_value - new_value) / baseline_value
            if drop > threshold:
                regressions.append(
                    f"{metric_name}: {baseline_value:.3f} -> {new_value:.3f} "
                    f"(drop {drop:.1%}, threshold {threshold:.1%})"
                )
        return regressions

    def test_no_regression(self, rag_system) -> bool:
        """
        Main test method to use in pytest.

        Args:
            rag_system: RAG system to test.
        Raises:
            AssertionError if regressions detected.
        """
        print(f"Running regression test against baseline v{self.baseline.version}...")
        new_scores = self.run_evaluation(rag_system)
        regressions = self.check_regressions(new_scores)
        if regressions:
            error_msg = "Regressions detected:\n" + "\n".join(regressions)
            raise AssertionError(error_msg)
        print("PASS: All metrics within thresholds")
        return True
# Example pytest integration
@pytest.fixture
def tester():
    return RAGRegressionTester(
        golden_dataset_path="data/golden_dataset.json",
        baseline_path="baselines/v1.0.0.json",
        regression_thresholds={
            "faithfulness": 0.05,        # Allow 5% relative drop
            "answer_relevance": 0.05,
            "context_precision": 0.10,   # More lenient for noisier metric
            "citation_precision": 0.02,  # Strict for citation accuracy
        },
    )


def test_rag_no_regression(tester):
    """Pytest test case for RAG regression detection."""
    # get_rag_output is assumed to be your application's query function,
    # returning (answer, retrieved_passages); import it from your own code.
    tester.test_no_regression(get_rag_output)
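Because the thresholds are relative drops rather than absolute point differences, a worked example may help. The sketch below repeats the same arithmetic as check_regressions in a standalone function; the function name find_regressions and the new scores are hypothetical.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     new: Dict[str, float],
                     thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose relative drop exceeds its threshold."""
    failed = []
    for name, allowed in thresholds.items():
        base = baseline[name]
        drop = (base - new[name]) / base  # relative drop, not absolute points
        if drop > allowed:
            failed.append(name)
    return failed


baseline_scores = {"faithfulness": 0.82, "citation_precision": 0.95}
new_scores = {"faithfulness": 0.77, "citation_precision": 0.94}
thresholds = {"faithfulness": 0.05, "citation_precision": 0.02}

# faithfulness: (0.82 - 0.77) / 0.82 is about 0.061, above 0.05, so it is flagged.
# citation_precision: (0.95 - 0.94) / 0.95 is about 0.011, below 0.02, so it passes.
print(find_regressions(baseline_scores, new_scores, thresholds))
```

Note that a 0.05 absolute drop in faithfulness counts as a 6.1% relative drop here, so a "5%" threshold is stricter than it may first appear for metrics with baselines below 1.0.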
Integrating Regression Tests into CI/CD
Add regression tests to your continuous integration pipeline (GitHub Actions, GitLab CI, Jenkins) to automatically catch quality drops before merging.
# Example: GitHub Actions workflow
name: RAG Regression Tests
on: [pull_request, push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Download golden dataset
        run: |
          # Download or generate golden dataset
          python scripts/download_golden_dataset.py
      - name: Run RAG regression tests
        run: |
          pytest tests/test_rag_regression.py -v --tb=short
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          RAG_MODEL: "gpt-4"
      - name: Generate regression report
        if: always()
        run: python scripts/generate_regression_report.py
      - name: Comment PR with results
        # always() keeps this step running when the test step fails
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const report = fs.readFileSync('regression_report.md', 'utf-8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: report
            });
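The workflow calls scripts/generate_regression_report.py, which is not shown in this article. A minimal sketch of what such a script might contain follows; the function name render_report, the table layout, and the sample scores are all assumptions. A real script would load the scores produced by the test run and write the result to regression_report.md, the file the comment step reads.

```python
from typing import Dict


def render_report(baseline: Dict[str, float], new: Dict[str, float]) -> str:
    """Build a Markdown table comparing new scores with the baseline."""
    lines = [
        "## RAG regression report",
        "",
        "| Metric | Baseline | New | Change |",
        "| --- | --- | --- | --- |",
    ]
    for name in sorted(baseline):
        base = baseline[name]
        if name not in new:
            lines.append(f"| {name} | {base:.3f} | missing | n/a |")
            continue
        delta = new[name] - base
        lines.append(f"| {name} | {base:.3f} | {new[name]:.3f} | {delta:+.3f} |")
    return "\n".join(lines) + "\n"


# Hypothetical scores, used only to show the output format.
report = render_report(
    {"faithfulness": 0.82, "answer_relevance": 0.79},
    {"faithfulness": 0.80, "answer_relevance": 0.81},
)
print(report)
```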
Tracking Metrics Over Time
Maintain a historical log of regression test results to detect trends and identify when quality started degrading.
import csv
from datetime import datetime
from pathlib import Path
from typing import Dict


class RegressionMetricsLog:
    """Track regression test results over time."""

    METRICS = ["faithfulness", "answer_relevance",
               "context_precision", "context_recall"]
    # Single source of truth for the CSV columns
    FIELDNAMES = ["timestamp", "commit_sha", "branch", *METRICS,
                  "status"]  # status is "pass" or "fail"

    def __init__(self, log_file: str = "regression_metrics_log.csv"):
        self.log_file = log_file
        self._ensure_header()

    def _ensure_header(self) -> None:
        """Create CSV with header if not exists."""
        if not Path(self.log_file).exists():
            with open(self.log_file, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=self.FIELDNAMES)
                writer.writeheader()

    def log_result(self, commit_sha: str, branch: str,
                   scores: Dict[str, float], status: str) -> None:
        """Append one test result as a CSV row."""
        row = {
            "timestamp": datetime.now().isoformat(),
            "commit_sha": commit_sha[:8],
            "branch": branch,
            "status": status,
        }
        for metric in self.METRICS:
            # Metrics absent from scores are logged as 0.000
            row[metric] = f"{scores.get(metric, 0):.3f}"
        with open(self.log_file, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.FIELDNAMES)
            writer.writerow(row)

    def plot_trends(self) -> str:
        """Save a 2x2 trend plot as a PNG and return the file path."""
        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv(self.log_file)
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))
        for ax, metric in zip(axes.flat, self.METRICS):
            df.plot(x="timestamp", y=metric, ax=ax, legend=True)
            ax.set_title(f"{metric} over time")
            ax.set_ylabel("Score")
        plt.tight_layout()
        plt.savefig("regression_trends.png", dpi=100)
        plt.close(fig)  # release the figure in long-running processes
        return "regression_trends.png"
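Plots show trends to a human, but a slow decline can also be checked in code. The sketch below is one simple heuristic, not an established method: it compares the mean of the latest few runs with the mean of the runs just before them. The function names, the window size of 3, and the sample log (a subset of the columns the log class writes) are assumptions for illustration.

```python
import csv
import io
from statistics import mean
from typing import List


def load_metric(csv_text: str, metric: str) -> List[float]:
    """Read one metric column, in file order, from CSV text."""
    return [float(row[metric]) for row in csv.DictReader(io.StringIO(csv_text))]


def recent_decline(values: List[float], window: int = 3) -> float:
    """Mean of the previous window minus mean of the latest window."""
    if len(values) < 2 * window:
        raise ValueError("not enough runs for two full windows")
    previous = values[-2 * window:-window]
    latest = values[-window:]
    return mean(previous) - mean(latest)


# Hypothetical log contents; each run passed, yet the average is drifting down.
log_text = """timestamp,commit_sha,branch,faithfulness,status
2024-01-01T10:00:00,aaaaaaa1,main,0.820,pass
2024-01-02T10:00:00,aaaaaaa2,main,0.830,pass
2024-01-03T10:00:00,aaaaaaa3,main,0.810,pass
2024-01-04T10:00:00,aaaaaaa4,main,0.800,pass
2024-01-05T10:00:00,aaaaaaa5,main,0.790,pass
2024-01-06T10:00:00,aaaaaaa6,main,0.780,pass
"""
values = load_metric(log_text, "faithfulness")
print(f"decline over last window: {recent_decline(values):.3f}")
```

A check like this catches gradual drift that stays under the per-commit threshold on every individual run.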
Key Takeaways
- Establish a baseline of metrics on a golden dataset before making changes.
- Create a regression test harness that compares new metrics to the baseline.
- Define per-metric regression thresholds (e.g., allow a 5% relative drop in faithfulness).
- Integrate regression tests into CI/CD to catch quality degradation before deployment.
- Track metrics over time to identify trends and debug quality issues early.
Frequently Asked Questions
What regression thresholds should I use?
This depends on your use case. For production medical or legal RAG, be strict: 2–3% threshold. For experimental systems, allow 5–10%. Start conservative; adjust based on observed variance.
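To adjust thresholds based on observed variance, one option is to evaluate the unchanged system several times and see how much the averages move on their own. The sketch below turns that spread into a suggested relative threshold. The function name, the multiplier k=2.0, and the sample run averages are assumptions; treat the result as a starting point rather than a rule.

```python
from statistics import mean, stdev
from typing import List


def suggest_threshold(run_means: List[float], k: float = 2.0) -> float:
    """Suggest a relative-drop threshold of k standard deviations over the mean."""
    if len(run_means) < 2:
        raise ValueError("need at least two repeated runs")
    return k * stdev(run_means) / mean(run_means)


# Hypothetical faithfulness averages from five runs of the same, unchanged system.
repeat_runs = [0.82, 0.81, 0.83, 0.82, 0.80]
print(f"suggested threshold: {suggest_threshold(repeat_runs):.1%}")
```

A threshold tighter than the system's own run-to-run noise will produce false alarms, so the noise level sets a practical lower bound.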
Can I run regression tests locally before pushing?
Yes. Use the same test harness locally. Hook it into a pre-commit script to fail the commit if metrics regress.
What if a regression is intentional (e.g., trading faithfulness for latency)?
Document it. Let the test harness accept an explicit override that carries a written explanation, then either raise the threshold for that metric or approve the change manually.
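One way to make overrides explicit is to keep them in a small reviewed mapping and refuse any entry without a reason. The sketch below assumes a hypothetical override format with "threshold" and "reason" keys; neither the format nor the function name comes from an existing tool.

```python
from typing import Any, Dict


def apply_overrides(thresholds: Dict[str, float],
                    overrides: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Return thresholds with approved overrides applied; each needs a reason."""
    result = dict(thresholds)  # copy so the defaults stay untouched
    for metric, entry in overrides.items():
        reason = str(entry.get("reason", "")).strip()
        if not reason:
            raise ValueError(f"override for {metric!r} has no reason")
        result[metric] = float(entry["threshold"])
    return result


defaults = {"faithfulness": 0.05, "answer_relevance": 0.05}
# Hypothetical override, e.g. loaded from a version-controlled file.
overrides = {
    "faithfulness": {
        "threshold": 0.10,
        "reason": "Switched to a smaller model to cut latency; approved in review.",
    }
}
print(apply_overrides(defaults, overrides))
```

Keeping the override in version control means the explanation is reviewed together with the change that needs it.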
How often should I rebaseline?
Rebaseline when you make intentional, approved changes (model upgrade, retriever improvement, prompt redesign). Version-control baselines alongside code. Never silently update baselines—always record the reason.