Skip to main content

Model versioning and promotion strategies

Model versioning and promotion is the process of assigning version identifiers to model checkpoints (trained weights, configuration, and tokenizer), staging them across environments (dev, staging, production), and controlling which version is active in each environment. Semantic versioning (MAJOR.MINOR.PATCH) tracks breaking changes, new capabilities, and bug fixes. Promotion workflows gate transitions between environments with approval, testing, and rollout strategies. Promotion strategies answer three questions: when do you update a model, who approves it, and how do you validate it is safe before production? Well-designed versioning and promotion reduces the blast radius of bad model updates and enables quick rollback if issues arise.

Semantic Versioning for Models

Adapt semantic versioning to LLMs by treating training runs, fine-tuning, and config changes as version increments.

  • MAJOR version (e.g., 3.0.0): Breaking change in model behavior, output format, or prompt compatibility. Requires code changes in consuming applications. Example: switching from Claude 2 to Claude 3 requires updating prompt engineering.
  • MINOR version (e.g., 2.3.0): New capability or improvement (better accuracy, faster inference, lower cost) that is backward-compatible. Existing prompts continue to work but may benefit from optimization. Example: Claude 3.5 Sonnet improves reasoning without breaking existing code.
  • PATCH version (e.g., 2.2.1): Bug fix or minor optimization (lower latency, reduced hallucinations) with no observable change to output semantics. Transparent upgrade, no code changes needed. Example: fixing a tokenizer bug.
# models/manifest.yaml - Model registry
models:
- name: customer-support
current_production: "claude-3-5-sonnet-v1.2.3"
staging: "claude-3-5-sonnet-v1.3.0-rc.1"
dev: "claude-opus-v4.0.0-rc.1"

- name: code-generator
current_production: "claude-3-5-sonnet-v1.2.3"
staging: "claude-3-5-sonnet-v1.2.3"
dev: "claude-opus-v4.0.0"

version_history:
- version: "claude-3-5-sonnet-v1.3.0-rc.1"
base_model: "claude-3-5-sonnet"
training_date: "2026-05-15"
status: "release_candidate"
changelog: "Improved accuracy on math reasoning by 7%, latency -200ms"

- version: "claude-3-5-sonnet-v1.2.3"
base_model: "claude-3-5-sonnet"
training_date: "2026-04-10"
status: "stable"
changelog: "Fixed tokenizer edge case on multi-byte characters"

Promotion Workflow: Dev → Staging → Production

Implement a three-tier promotion process. Every new model version starts in dev, passes testing, moves to staging for A/B testing, and finally reaches production.

# .github/workflows/promote-model.yml
name: Promote Model
on:
workflow_dispatch:
inputs:
model_name:
description: "Model name (e.g., customer-support)"
required: true
version:
description: "Version to promote (e.g., claude-3-5-sonnet-v1.3.0)"
required: true
target_env:
description: "Target environment (staging or production)"
required: true
type: choice
options: [staging, production]

jobs:
promote:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Validate version exists in dev
run: |
# Check version is deployed in dev
DEV_VERSION=$(grep "dev:" models/manifest.yaml | grep ${{ inputs.model_name }} | awk '{print $2}')
if [ "$DEV_VERSION" != "${{ inputs.version }}" ]; then
echo "Version not found in dev environment"
exit 1
fi

- name: Run promotion tests
run: |
# Run full eval suite against version
python -m pytest tests/eval_suite.py \
--model ${{ inputs.model_name }} \
--version ${{ inputs.version }} \
--threshold 0.85

- name: Update manifest
if: success()
run: |
# Update manifest to promote version
sed -i "s/${{ inputs.target_env }}:.*/${{ inputs.target_env }}: \"${{ inputs.version }}\"/g" models/manifest.yaml
git config user.email "[email protected]"
git config user.name "CI Bot"
git add models/manifest.yaml
git commit -m "promote: ${{ inputs.model_name }} to ${{ inputs.target_env }} v${{ inputs.version }}"
git push

- name: Trigger deployment
if: success() && inputs.target_env == 'production'
run: |
# Trigger canary deploy for production
curl -X POST https://api.company.com/deploy \
-H "Authorization: Bearer ${{ secrets.DEPLOY_TOKEN }}" \
-d '{
"model": "${{ inputs.model_name }}",
"version": "${{ inputs.version }}",
"strategy": "canary",
"canary_percentage": 5
}'

Environment-Specific Configuration

Different environments may have different requirements. Dev can use high-cost models and experimental prompts. Staging mirrors production but with staging API credentials. Production uses validated, cost-optimized models.

# config/model_config.py
from dataclasses import dataclass

@dataclass
class ModelConfig:
environment: str # dev, staging, production
model_version: str # from manifest.yaml
max_tokens: int
temperature: float
timeout_ms: int
enable_caching: bool
fallback_model: str

CONFIG = {
"dev": ModelConfig(
environment="dev",
model_version="claude-opus-v4.0.0-rc.1", # latest experimental
max_tokens=2048,
temperature=0.7,
timeout_ms=5000,
enable_caching=False,
fallback_model="claude-3-5-sonnet-v1.2.3"
),
"staging": ModelConfig(
environment="staging",
model_version="claude-3-5-sonnet-v1.3.0-rc.1", # candidate for production
max_tokens=1024,
temperature=0.5,
timeout_ms=3000,
enable_caching=True,
fallback_model="claude-3-5-sonnet-v1.2.3"
),
"production": ModelConfig(
environment="production",
model_version="claude-3-5-sonnet-v1.2.3", # stable, well-tested
max_tokens=1024,
temperature=0.3,
timeout_ms=2000,
enable_caching=True,
fallback_model="claude-3-sonnet-20240229" # older fallback
)
}

def get_config(env: str) -> ModelConfig:
return CONFIG[env]

A/B Testing During Staging

Before promoting a model to production, run an A/B test in staging: route 50% of requests to the new model and 50% to the current production model, compare metrics, and decide whether to promote.

import random
from config.model_config import get_config

def route_request_for_ab_test(request_id: str, model_name: str, test_group: str) -> str:
"""Route request to control (old model) or treatment (new model) group."""
# Hash request ID to ensure consistent routing
hash_value = hash(f"{request_id}-{test_group}") % 100

config = get_config("staging")

if hash_value < 50:
# Control group: use production model
return "claude-3-5-sonnet-v1.2.3"
else:
# Treatment group: use new model
return config.model_version

def log_ab_test_metrics(request_id: str, model_version: str, metrics: dict):
"""Log metrics for A/B test analysis."""
# Send to analytics/monitoring system
print(f"AB_TEST: request_id={request_id}, model={model_version}, latency={metrics['latency_ms']}, quality={metrics['quality_score']}")

# Example usage
request_id = "req-12345"
model = route_request_for_ab_test(request_id, "customer-support", "v1.3.0")
result = call_model(model)
log_ab_test_metrics(request_id, model, {"latency_ms": 450, "quality_score": 0.87})

Automated Rollback Triggers

Define conditions that automatically roll back a model version to the previous one.

def should_rollback(metrics: dict, thresholds: dict) -> bool:
"""Check if metrics indicate a rollback is needed."""
checks = [
("latency_p99", metrics.get("latency_p99_ms", 0) > thresholds.get("latency_p99_ms", 2000)),
("error_rate", metrics.get("error_rate", 0) > thresholds.get("error_rate", 0.01)),
("quality_score", metrics.get("quality_score", 1.0) < thresholds.get("quality_score", 0.80)),
("hallucination_rate", metrics.get("hallucination_rate", 0) > thresholds.get("hallucination_rate", 0.05)),
("cost_per_request", metrics.get("cost_per_request_cents", 0) > thresholds.get("cost_per_request_cents", 2.0))
]

failed_checks = [name for name, failed in checks if failed]

if len(failed_checks) >= 2: # rollback if 2+ metrics fail
print(f"ROLLBACK TRIGGERED: {failed_checks}")
return True
return False

Key Takeaways

  • Use semantic versioning (MAJOR.MINOR.PATCH) for model versions to track breaking changes, improvements, and bug fixes.
  • Implement a three-tier promotion workflow: dev (experimental) → staging (candidate) → production (stable).
  • Store model versions and promotion history in a manifest file (manifest.yaml) committed to Git.
  • Run A/B tests in staging before promoting to production, comparing new and old models on live traffic.
  • Define automated rollback triggers based on quality, latency, error rate, and cost metrics.

Frequently Asked Questions

Should I version the prompt and model separately or together?

Version them together. A model version paired with a specific prompt is what you deploy. Use model-version_prompt-version naming (e.g., claude-3-5-sonnet-v1.2.3_prompt-v4). Store both in the manifest and promote them as a unit.

How long should I keep old model versions in production as fallbacks?

Keep the previous two versions (current and current-1) as fallbacks. Retire older versions after 30 days of successful deployment. Document fallback chains: if v1.3.0 fails, fall back to v1.2.3; if v1.2.3 fails, fall back to v1.1.0.

What should I do if a model version fails its promotion test?

Root-cause the failure. Was the model undertrained? Is the test itself flaky? Fix the issue and re-run. If the model is fundamentally unfit (lower accuracy than expected), retire it and train a new version. Log the failure in your model registry so you avoid retrying it.

Can I run multiple model versions in production simultaneously?

Yes, via shadow traffic or A/B testing. Shadow traffic sends a copy of production requests to the new model (not used by users) so you can validate quality offline. A/B testing sends a percentage of real traffic to the new model. Both generate data for informed promotion decisions.

How do I handle a model version that is production-critical but requires a code change?

Mark it as a MAJOR version bump. Create a feature branch that updates consuming code (prompts, API calls, output parsing). Test thoroughly in dev. Only after code and model are aligned, promote together to staging and production. Coordinate with the team so deployments are synchronized.

Further Reading