Integrating LLM Evaluation into CI/CD Pipelines
In 2026, shipping an LLM change without running evaluation is reckless. Yet many teams still do: they modify a prompt, deploy it, and find out it broke something in production. Integrating evaluation into your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins) means every change runs through deterministic checks, metrics, and LLM-as-judge before it can be merged. This catches regressions in minutes, not weeks.
This article teaches you to set up evaluation gates in CI/CD, define pass/fail thresholds based on golden data, implement parallel evaluation for speed, and automate quality gates that prevent deployment of degraded models.
CI/CD Pipeline Architecture for LLM Evaluation
A typical CI/CD flow for LLM changes:
Commit to branch
↓
Deterministic checks (format, safety, syntax)
↓ (if all pass)
Fast metrics (ROUGE, semantic similarity)
↓ (if metric > threshold)
LLM-as-judge sampling (a small sample of the most uncertain examples, e.g. 20)
↓ (if judge score > threshold)
Optional: pairwise tournament vs. production model
↓ (if tournament passes)
APPROVED TO MERGE
At each step, failures block merge and post comments on the PR. Successes proceed to the next stage.
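The staged flow above can be sketched as a small runner that executes gates in order of cost and stops at the first failure. This is a minimal illustration, not the pipeline's actual implementation: the `Gate` type, the gate names, and the lambda checks are hypothetical stand-ins for real evaluation scripts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    """One evaluation stage (hypothetical type for this sketch)."""
    name: str
    check: Callable[[], bool]  # Returns True when the gate passes


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns (approved, log). Later, more expensive gates never run
    once a cheaper one has failed.
    """
    log = []
    for gate in gates:
        if not gate.check():
            log.append(f"{gate.name}: FAIL (merge blocked)")
            return False, log
        log.append(f"{gate.name}: PASS")
    return True, log


# Stand-in checks; real gates would call the evaluation scripts.
example_gates = [
    Gate("deterministic checks", lambda: True),
    Gate("fast metrics", lambda: 0.68 > 0.70),  # Below threshold, so this fails
    Gate("LLM-as-judge", lambda: True),         # Never reached
]
approved, log = run_gates(example_gates)
```

Ordering gates from cheapest to most expensive means a formatting failure costs seconds rather than a round of judge API calls.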
GitHub Actions Workflow for LLM Evaluation
Here's a GitHub Actions workflow that implements this flow:
name: LLM Evaluation Gate
on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/models/**'
      - 'evaluation/**'
permissions:
  contents: read
  pull-requests: write  # Required to post the results comment
jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for diffing
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install anthropic openai
      - name: Run deterministic checks
        run: |
          python evaluation/run_deterministic_checks.py \
            --golden-dataset data/golden_dataset.jsonl \
            --output results/deterministic_checks.json
      - name: Run metrics evaluation
        run: |
          python evaluation/run_metrics.py \
            --golden-dataset data/golden_dataset.jsonl \
            --output results/metrics.json
      - name: Check metrics thresholds
        id: metrics  # Referenced by the judge step's `if` condition
        run: |
          python evaluation/check_thresholds.py \
            --metrics results/metrics.json \
            --baseline-metrics results/baseline_metrics.json \
            --threshold-regression 0.02 \
            --output results/threshold_check.json
      - name: Run LLM-as-judge on uncertain examples
        if: steps.metrics.outcome == 'success'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python evaluation/run_judge_sampling.py \
            --golden-dataset data/golden_dataset.jsonl \
            --metrics results/metrics.json \
            --sample-size 20 \
            --output results/judge_scores.json
      - name: Post evaluation results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            // A stage that failed or was skipped leaves no results file.
            const load = (p) => fs.existsSync(p) ? JSON.parse(fs.readFileSync(p, 'utf8')) : null;
            const checks = load('results/deterministic_checks.json');
            const metrics = load('results/metrics.json');
            const judge = load('results/judge_scores.json');
            const checksText = checks ? (checks.passed_all ? '✓ PASS' : '✗ FAIL') : 'not run';
            const metricsText = metrics ? `Semantic Sim = ${metrics.mean_semantic_sim.toFixed(3)}` : 'not run';
            const judgeText = judge ? `${judge.mean_score.toFixed(1)}/10` : 'not run';
            let comment = '## LLM Evaluation Results\n\n';
            comment += `**Deterministic Checks:** ${checksText}\n`;
            comment += `**Metrics:** ${metricsText}\n`;
            comment += `**LLM Judge:** ${judgeText}\n`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
      - name: Determine pass/fail
        run: |
          python evaluation/check_all_gates.py \
            --deterministic results/deterministic_checks.json \
            --metrics results/metrics.json \
            --judge results/judge_scores.json \
            --required-metrics-score 0.70 \
            --required-judge-score 7.0 \
            --exit-on-fail
Key features:
- Staged execution: Deterministic checks run first (fastest), then metrics, then judge (if needed).
- Conditional steps: Only run expensive judge evaluation if metrics pass.
- PR comments: Automatically post results for visibility.
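- Fail-fast: Block merge if any gate fails. For the failure to actually block merging, the job must be marked as a required status check in the repository's branch protection rules.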
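The final step calls `evaluation/check_all_gates.py`, which the workflow does not show. One way such a script might combine the three result files is sketched below. The JSON field names (`passed_all`, `mean_semantic_sim`, `mean_score`) are taken from the PR-comment step; the function signature and the returned list of messages are assumptions.

```python
from typing import List, Optional


def check_all_gates(
    deterministic: dict,
    metrics: dict,
    judge: Optional[dict],
    required_metrics_score: float = 0.70,
    required_judge_score: float = 7.0,
) -> List[str]:
    """Return a list of failure messages; an empty list means every gate passed."""
    failures = []
    if not deterministic.get("passed_all", False):
        failures.append("deterministic checks failed")
    sim = metrics.get("mean_semantic_sim")
    if sim is None or sim < required_metrics_score:
        failures.append(f"semantic similarity {sim} below {required_metrics_score}")
    # A missing judge result is treated as a failure rather than a silent pass.
    score = judge.get("mean_score") if judge else None
    if score is None or score < required_judge_score:
        failures.append(f"judge score {score} below {required_judge_score}")
    return failures


failures = check_all_gates(
    {"passed_all": True},
    {"mean_semantic_sim": 0.74},
    {"mean_score": 6.5},
)
# In CI, a non-empty list would translate to a non-zero exit code
# (for example via sys.exit(1)), which fails the job.
```

Treating missing data as a failure is a deliberate choice in this sketch: a gate that passes when its input file is absent would hide exactly the breakage it is meant to catch.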
Threshold Selection and Regression Detection
Set thresholds based on your golden dataset baseline. A regression is any metric whose mean falls below its minimum acceptable value, defined as the baseline minus an allowed tolerance.
import numpy as np

# Thresholds use the same keys as the results dicts, so that
# check_regression matches every metric instead of silently skipping some.
METRIC_KEYS = ('semantic_sim', 'rouge_l', 'judge_scores')

def set_evaluation_thresholds(
    golden_dataset_results: dict,
    acceptable_regression_pct: float = 2.0
) -> dict:
    """
    Set pass/fail thresholds based on golden dataset performance.
    acceptable_regression_pct: allow this much relative regression before failing (e.g., 2%)
    """
    thresholds = {}
    for metric in METRIC_KEYS:
        scores = golden_dataset_results[metric]
        baseline = float(np.mean(scores))
        # Min acceptable = baseline minus the allowed relative regression
        regression_delta = baseline * (acceptable_regression_pct / 100.0)
        thresholds[metric] = {
            'baseline': baseline,
            'std_dev': float(np.std(scores)),
            'min_acceptable': baseline - regression_delta
        }
    return thresholds

def check_regression(
    new_results: dict,
    thresholds: dict
) -> dict:
    """
    Evaluate new model against thresholds.
    Returns: per-metric regression status + overall pass/fail.
    """
    regression_results = {
        'metrics': {},
        'passed': True,
        'summary': []
    }
    for metric_name, metric_data in new_results.items():
        if metric_name not in thresholds:
            continue  # No threshold defined for this metric
        baseline = thresholds[metric_name]['baseline']
        min_acceptable = thresholds[metric_name]['min_acceptable']
        # Convert NumPy scalars to plain Python types so the result is JSON-serializable
        new_mean = float(np.mean(metric_data))
        passed = bool(new_mean >= min_acceptable)
        # Relative change vs. baseline; negative values are regressions
        regression_pct = (new_mean - baseline) / baseline * 100 if baseline else 0.0
        regression_results['metrics'][metric_name] = {
            'baseline': baseline,
            'new_mean': new_mean,
            'min_acceptable': min_acceptable,
            'passed': passed,
            'regression_pct': regression_pct
        }
        if not passed:
            regression_results['passed'] = False
            regression_results['summary'].append(
                f"{metric_name}: REGRESSION ({regression_pct:.1f}%)"
            )
        else:
            regression_results['summary'].append(
                f"{metric_name}: OK ({regression_pct:+.1f}%)"
            )
    return regression_results
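Thresholds prevent silent regressions. A 2% regression on semantic similarity might seem minor, but if it happens on every merge, the losses compound and quality degrades fast. Measure against a fixed baseline rather than the previous commit, and catch the drop with automated gates.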
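To see why small regressions matter, consider what happens if each merge is allowed to lose 2% relative to the previous commit. The following sketch works through the arithmetic; the function name and the starting score of 0.80 are illustrative assumptions.

```python
def score_after_merges(baseline: float, regression_pct: float, merges: int) -> float:
    """Score after `merges` merges that each lose `regression_pct` percent
    relative to the previous commit (the baseline is re-anchored each time)."""
    return baseline * (1 - regression_pct / 100.0) ** merges


start = 0.80
for n in (1, 5, 10, 20):
    drifted = score_after_merges(start, 2.0, n)
    total_drop_pct = (start - drifted) / start * 100
    print(f"after {n:2d} merges: {drifted:.3f} ({total_drop_pct:.1f}% below start)")
```

Ten such merges take a 0.80 score to roughly 0.654, about 18% below where it started. When every pull request is compared against one stored baseline instead, the same 2% tolerance caps the total loss at 2%.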
Parallel Evaluation for Speed
Evaluating 500 examples serially can take hours when each example needs model or embedding API calls. Parallelize: split examples across N workers, evaluate in parallel, and aggregate the results.
import asyncio
import math
from typing import Dict, List, Optional

async def evaluate_batch_parallel(
    examples: List[dict],
    num_workers: int = 4,
    metrics_to_compute: Optional[List[str]] = None
) -> Dict:
    """
    Evaluate a batch of examples in parallel.
    num_workers: number of concurrent evaluation tasks.
    """
    if metrics_to_compute is None:  # Avoid a mutable default argument
        metrics_to_compute = ['semantic_sim', 'rouge_l']
    if not examples:
        return {'results': [], 'total_examples': 0,
                'mean_semantic_sim': 0.0, 'mean_rouge_l': 0.0}
    # Split examples into chunks for workers. Rounding the chunk size up
    # ensures no examples are dropped when the split is uneven.
    chunk_size = math.ceil(len(examples) / num_workers)
    chunks = [
        examples[i:i + chunk_size]
        for i in range(0, len(examples), chunk_size)
    ]
    async def evaluate_chunk(chunk):
        results = []
        for example in chunk:
            metrics = {}
            if 'semantic_sim' in metrics_to_compute:
                # Network call: while this chunk awaits, other chunks make progress
                metrics['semantic_sim'] = await compute_semantic_sim_async(
                    example['output'],
                    example['reference']
                )
            if 'rouge_l' in metrics_to_compute:
                # compute_rouge_l is assumed to be defined elsewhere (synchronous, CPU-bound)
                metrics['rouge_l'] = compute_rouge_l(
                    example['output'],
                    example['reference']
                )
            results.append({'example_id': example['id'], **metrics})
        return results
    # Run all chunks concurrently
    chunk_results = await asyncio.gather(*[
        evaluate_chunk(chunk) for chunk in chunks
    ])
    # Flatten results
    all_results = []
    for chunk_result in chunk_results:
        all_results.extend(chunk_result)
    return {
        'results': all_results,
        'total_examples': len(examples),
        'mean_semantic_sim': sum(r.get('semantic_sim', 0) for r in all_results) / len(all_results),
        'mean_rouge_l': sum(r.get('rouge_l', 0) for r in all_results) / len(all_results)
    }

async def compute_semantic_sim_async(output, reference):
    """Async wrapper for semantic similarity (network call)."""
    # Placeholder: call your embedding API asynchronously here
    raise NotImplementedError
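Parallel evaluation can reduce CI time from hours to minutes. Use multiprocessing for CPU-bound tasks (metrics) and asyncio for I/O-bound tasks (LLM calls, embedding API). asyncio alone does not speed up CPU-bound work, because all coroutines share a single thread.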
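For I/O-bound stages, a common alternative to fixed chunks is to launch one task per example and cap concurrency with a semaphore, which keeps every slot busy when call latencies vary. The sketch below assumes a hypothetical `fake_judge_call` coroutine standing in for a real LLM or embedding request.

```python
import asyncio
from typing import Awaitable, Callable, List


async def evaluate_with_limit(
    examples: List[dict],
    score_fn: Callable[[dict], Awaitable[float]],
    max_concurrency: int = 8,
) -> List[float]:
    """Score every example concurrently, with at most `max_concurrency`
    calls in flight. Results are returned in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(example: dict) -> float:
        async with semaphore:
            return await score_fn(example)

    return await asyncio.gather(*(score_one(ex) for ex in examples))


async def fake_judge_call(example: dict) -> float:
    """Hypothetical stand-in for a network call to a judge or embedding API."""
    await asyncio.sleep(0.01)  # Simulated latency
    return float(len(example["output"]))


examples = [{"id": i, "output": "x" * i} for i in range(10)]
scores = asyncio.run(evaluate_with_limit(examples, fake_judge_call, max_concurrency=3))
```

The concurrency cap also doubles as a crude rate limiter, which matters when the provider enforces request quotas.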
Caching Evaluation Results
Don't re-evaluate unchanged examples. Cache results by content hash.
import hashlib
import json
from typing import Optional

class EvaluationCache:
    """Cache evaluation results keyed by example content hash."""
    def __init__(self, cache_file: str = 'evaluation_cache.json'):
        self.cache_file = cache_file
        self.cache = self._load_cache()
    def _load_cache(self) -> dict:
        try:
            with open(self.cache_file) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)
    def get_example_hash(self, example: dict) -> str:
        """Compute hash of example (input + reference + model output)."""
        content = json.dumps({
            'input': example['input'],
            'reference': example['reference'],
            # Include the output being scored; otherwise a prompt change
            # that alters the output would be served stale metrics.
            'output': example.get('output')
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()
    def get(self, example: dict) -> Optional[dict]:
        """Retrieve cached metrics if available, else None."""
        key = self.get_example_hash(example)
        return self.cache.get(key)
    def set(self, example: dict, metrics: dict):
        """Store metrics in cache."""
        key = self.get_example_hash(example)
        self.cache[key] = metrics
        self._save_cache()

# Usage in CI (golden_dataset and compute_all_metrics are defined elsewhere)
cache = EvaluationCache()
for example in golden_dataset:
    cached = cache.get(example)
    if cached is not None:  # An empty dict is still a valid cache hit
        metrics = cached  # Use cached result
    else:
        metrics = compute_all_metrics(example)
        cache.set(example, metrics)
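Caching can reduce evaluation time by 50% or more when most outputs don't change between commits. The cache key must cover everything the metrics depend on: if it omits the model output or the prompt and model version, a changed prompt will be scored with stale results. Invalidate the cache whenever the model, prompt, or metric implementation changes.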
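One way to make invalidation automatic is to fold the prompt and model identifiers into the cache key, so that changing either produces a different key and old entries are simply never hit. The sketch below is illustrative; `versioned_cache_key`, `prompt_text`, and `model_id` are hypothetical names, not part of the class above.

```python
import hashlib
import json


def versioned_cache_key(example: dict, prompt_text: str, model_id: str) -> str:
    """Hash the example together with the prompt and model that produce its output."""
    content = json.dumps(
        {
            "input": example["input"],
            "reference": example["reference"],
            # Hash the prompt so long templates don't bloat the key material.
            "prompt_sha": hashlib.sha256(prompt_text.encode()).hexdigest(),
            "model_id": model_id,
        },
        sort_keys=True,
    )
    return hashlib.sha256(content.encode()).hexdigest()


example = {"input": "Summarize the release notes.", "reference": "A short summary."}
key_v1 = versioned_cache_key(example, "You are a helpful summarizer.", "model-a")
key_v2 = versioned_cache_key(example, "You are a concise summarizer.", "model-a")
# key_v1 and key_v2 differ, so editing the prompt never reuses old metrics.
```

Stale entries accumulate under this scheme, so a real cache would probably need periodic pruning.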
Key Takeaways
- Gate every deployment: Evaluation in CI/CD catches regressions immediately, not after deployment.
- Use staged gates: Deterministic checks → fast metrics → expensive judge. Stop early if possible.
- Set thresholds based on baseline: Automatic regression detection prevents degradation.
- Parallelize evaluation: a 500-example run that takes hours serially can finish in minutes when evaluated in parallel.
- Cache evaluation results: Most examples don't change; reuse cached results from prior commits.
Frequently Asked Questions
How long should CI/CD evaluation take?
Aim for under 10 minutes. If deterministic checks plus metrics take longer than that, you are probably evaluating the full golden dataset on every commit, which is too expensive. Sample about 500 examples per pull request, and evaluate the full set only on the main branch or before a release.
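Sampling should be reproducible so that two runs on the same commit score the same examples. A minimal sketch using a fixed seed follows; the function name, the default seed, and the toy dataset are assumptions.

```python
import random
from typing import List


def sample_golden_dataset(
    examples: List[dict], sample_size: int = 500, seed: int = 1234
) -> List[dict]:
    """Return a reproducible random subset of the golden dataset."""
    if len(examples) <= sample_size:
        return list(examples)
    rng = random.Random(seed)  # Local RNG; does not touch global random state
    return rng.sample(examples, sample_size)


dataset = [{"id": i} for i in range(2000)]
pr_sample = sample_golden_dataset(dataset)
```

Keeping the seed fixed across commits also means metric differences reflect the change under test rather than a different draw of examples.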
Should I block merge on metrics regression?
Yes, if acceptable_regression_pct is set carefully and sits above the metric's run-to-run noise. A tolerance of around 1-2% catches real problems; 5% is usually too lenient. Document thresholds and revisit them monthly as baselines shift.
What if evaluation is flaky (metric varies 5% between runs)?
Your metric is noisy. Increase the sample size, reduce metric noise (semantic similarity tends to be noisier than ROUGE), or both. Flaky tests are worse than no tests: they train developers to ignore failures.
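Before tightening a threshold, it helps to measure how much the metric moves when nothing has changed. The sketch below compares an observed score against run-to-run noise estimated from repeated evaluations of the same commit. The function name, the two-standard-deviation rule, and the sample numbers are illustrative assumptions, not a prescribed statistical test.

```python
import statistics
from typing import List


def is_real_regression(
    baseline_runs: List[float], new_score: float, num_std: float = 2.0
) -> bool:
    """Flag a regression only if the new score falls more than `num_std`
    standard deviations below the mean of repeated baseline runs."""
    mean = statistics.mean(baseline_runs)
    noise = statistics.stdev(baseline_runs)  # Sample std dev; needs at least 2 runs
    return new_score < mean - num_std * noise


# Five evaluations of the same commit (made-up numbers)
baseline_runs = [0.81, 0.79, 0.80, 0.82, 0.78]
within_noise = is_real_regression(baseline_runs, 0.785)  # Inside the noise band
clear_drop = is_real_regression(baseline_runs, 0.74)     # Well below the noise band
```

If the noise band turns out wider than the tolerance you want to enforce, the gate cannot distinguish real regressions from chance, and the sample size needs to grow first.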
Can I use production data for evaluation?
Yes, but carefully. Production data is gold for detecting regressions, but it's not controlled and may expose privacy information. Use sampled, anonymized production data as a separate evaluation set from your golden dataset.
How do I handle LLM-as-judge model updates?
When you upgrade the judge model (for example, from GPT-3.5 to GPT-4), re-calibrate it on the golden dataset. Compare the new judge's scores to the old judge's on the same examples; if the correlation is below 0.80, update your thresholds. Record the judge version in the CI logs.