Monitoring Prompt Performance: Metrics and Alerts
Production monitoring is the continuous observation of prompt behavior after deployment. Without monitoring, you won't know if a prompt is degrading until users complain. With monitoring, you detect regressions within minutes, set automatic alerts, and make data-driven decisions about rollback.
Prompt monitoring answers: (1) Is the prompt performing as expected? (2) When did performance change? (3) Why did it change? The answers drive fast rollbacks, debugging, and confidence in new deployments.
Key Metrics to Monitor
Latency: How long does inference take? Track p50 and p99. Alert if p99 latency rises 20% or more above baseline; a jump of that size often points to a model change or context bloat.
Error rate: Percentage of requests that fail (timeouts, API errors). Alert if the error rate exceeds a fixed threshold, typically 1–2%.
Token usage: Average input and output tokens per request. Alert if input tokens spike (a sign of context bloat) or if output tokens drop below baseline (a sign that the model or prompt changed).
Quality metrics (domain-specific):
- Customer support: refund approval rate, customer satisfaction (post-interaction ratings), escalation rate.
- Content moderation: false-positive rate (incorrectly flagged content), false-negative rate (missed violations).
- Summarization: length-to-target ratio, readability score, factuality (if human-reviewed).
Cost: Inferred from token usage and model pricing. Alert if cost per request exceeds budget.
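To make the latency and cost metrics concrete, here is a minimal sketch of how they might be computed from raw samples. The helper names (`percentile`, `cost_per_request`), the sample latencies, and the per-1,000-token prices are illustrative assumptions, not values from any provider.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_request(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one request from token counts and per-1,000-token prices."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


# Hypothetical latency samples (ms) from a recent window
latencies_ms = [1800, 1950, 2100, 2150, 2200, 2300, 2400, 2600, 2900, 3100]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Baseline-relative check: has p99 risen 20% or more above a 2200 ms baseline?
baseline_p99_ms = 2200
latency_regressed = p99 > baseline_p99_ms * 1.2

# Hypothetical prices: 0.003 per 1k input tokens, 0.015 per 1k output tokens
request_cost = cost_per_request(1200, 300, input_price_per_1k=0.003, output_price_per_1k=0.015)
```

With only ten samples, p99 is simply the maximum; in production the window should hold enough requests for the tail percentile to be meaningful.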
from datetime import datetime, timezone
from typing import Optional


class PromptMetricsCollector:
    """Collect runtime metrics for prompts."""

    def log_inference(self,
                      prompt_version: str,
                      user_id: str,
                      latency_ms: float,
                      input_tokens: int,
                      output_tokens: int,
                      success: bool,
                      error_message: Optional[str] = None):
        """Log an inference event."""
        # Timezone-aware UTC keeps timestamps comparable across hosts
        timestamp = datetime.now(timezone.utc).isoformat()
        # Base metrics (always collected).
        # user_id is not emitted here; add it only if your backend
        # copes with high-cardinality labels.
        metrics = {
            "prompt_version": prompt_version,
            "timestamp": timestamp,
            "latency_ms": latency_ms,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
            "success": success,
            "error": error_message,
        }
        # Send to time-series database (e.g., Prometheus, CloudWatch)
        self._emit_to_tsdb(metrics)

    def log_outcome(self,
                    prompt_version: str,
                    user_id: str,
                    outcome_name: str,
                    outcome_value: float,
                    metadata: Optional[dict] = None):
        """Log a quality outcome (e.g., satisfaction, accuracy)."""
        self._emit_to_tsdb({
            "prompt_version": prompt_version,
            "outcome": outcome_name,
            "value": outcome_value,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "metadata": metadata,
        })

    def _emit_to_tsdb(self, metrics: dict):
        """Send metrics to a time-series database."""
        # Implementation depends on your TSDB (Prometheus, CloudWatch, DataDog, etc.)
        pass
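Because `_emit_to_tsdb` is left as a stub, it can help to have an in-memory stand-in while developing alert logic locally. The sketch below is one possible shape, not a real TSDB client: `InMemoryMetricsStore`, its `record` method, and the optional `now` argument are hypothetical names. It exposes `get_error_rate` and `get_p99_latency` with the same call signature the alerting code later in this section assumes, and it stores `datetime` objects rather than ISO strings to keep window filtering simple.

```python
import math
from datetime import datetime, timedelta, timezone


class InMemoryMetricsStore:
    """Hypothetical stand-in for a TSDB client; keeps events in a plain list."""

    def __init__(self):
        self._events = []

    def record(self, prompt_version, timestamp, latency_ms, success):
        """Store one inference event."""
        self._events.append({
            "prompt_version": prompt_version,
            "timestamp": timestamp,
            "latency_ms": latency_ms,
            "success": success,
        })

    def _window(self, prompt_version, time_window_min, now=None):
        """Return events for one version inside the trailing time window."""
        now = now or datetime.now(timezone.utc)
        cutoff = now - timedelta(minutes=time_window_min)
        return [e for e in self._events
                if e["prompt_version"] == prompt_version
                and cutoff <= e["timestamp"] <= now]

    def get_error_rate(self, prompt_version, time_window_min, now=None):
        """Fraction of failed requests in the window (0.0 if there is no traffic)."""
        events = self._window(prompt_version, time_window_min, now)
        if not events:
            return 0.0
        failures = sum(1 for e in events if not e["success"])
        return failures / len(events)

    def get_p99_latency(self, prompt_version, time_window_min, now=None):
        """Nearest-rank p99 latency in the window (0.0 if there is no traffic)."""
        latencies = sorted(e["latency_ms"]
                           for e in self._window(prompt_version, time_window_min, now))
        if not latencies:
            return 0.0
        rank = max(1, math.ceil(99 * len(latencies) / 100))
        return latencies[rank - 1]


# Usage with synthetic data: 100 recent requests (3 failures) plus one old failure
store = InMemoryMetricsStore()
now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
version = "customer-support:v2.0.0"
for i in range(100):
    store.record(version, now - timedelta(seconds=i),
                 latency_ms=2000 + 10 * i, success=(i >= 3))
store.record(version, now - timedelta(minutes=10), latency_ms=9000, success=False)

error_rate = store.get_error_rate(version, time_window_min=2, now=now)
p99_latency = store.get_p99_latency(version, time_window_min=2, now=now)
```

The ten-minute-old failure falls outside the two-minute window, so it affects neither figure.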
Dashboards: Visualizing Prompt Health
Create dashboards showing prompt performance over time:
## Prompt Monitoring Dashboard: customer-support
### Real-time Summary (last 1 hour)
- Version deployed: v2.0.0 (promoted 2026-06-01 10:00 UTC)
- Requests: 12,450
- Success rate: 99.92% (10 failures)
- Avg latency: 2150 ms (p99: 3100 ms)
- Avg satisfaction (post-interaction): 4.12 / 5.0
- Refund approval rate: 24.3%
### Trend: Latency (p99)
[Time-series graph showing p99 latency over 24 hours]
- 24h ago: 2200 ms
- Now: 3100 ms
- Trend: UP (about 41% increase)
- Status: WARNING (exceeds 3000 ms threshold)
### Trend: Customer Satisfaction
[Graph showing mean satisfaction rating]
- Stable at 4.1–4.2 / 5.0
- Status: GREEN
### Trend: Refund Approval Accuracy
[Graph showing % of approvals later confirmed correct]
- v1.9.0: 75% accuracy (baseline)
- v2.0.0: 80% accuracy (improvement)
- Status: GREEN
### Segment Breakdown
[Tables showing metrics by user segment]
- New users: satisfaction 4.0, approval accuracy 78%
- Returning users: satisfaction 4.3, approval accuracy 82%
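A segment table like the one in this dashboard is typically a group-by over logged outcomes. The following is a minimal sketch, assuming each outcome record carries a hypothetical `segment` field (the collector in this section does not define one) and using made-up ratings.

```python
from collections import defaultdict
from statistics import mean


def segment_breakdown(outcomes):
    """Average satisfaction per user segment, rounded to two decimals."""
    by_segment = defaultdict(list)
    for outcome in outcomes:
        by_segment[outcome["segment"]].append(outcome["satisfaction"])
    return {segment: round(mean(values), 2) for segment, values in by_segment.items()}


# Illustrative post-interaction ratings on a 1-5 scale
outcomes = [
    {"segment": "new", "satisfaction": 4},
    {"segment": "new", "satisfaction": 4},
    {"segment": "new", "satisfaction": 4},
    {"segment": "returning", "satisfaction": 4},
    {"segment": "returning", "satisfaction": 5},
    {"segment": "returning", "satisfaction": 4},
]
breakdown = segment_breakdown(outcomes)
```

Segments with few samples produce unstable averages, so it is worth showing the sample count next to each figure.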
Alert Rules
Define alerts based on business impact:
# alerts.yaml
alerts:
  - name: "prompt-latency-spike"
    description: "P99 latency increased 20%+ over baseline"
    condition: "latency_p99 > baseline * 1.2"
    for_duration: "5 minutes"
    threshold: "if true for 5+ min, alert"
    actions:
      - "notify #llmops-team on Slack"
      - "create PagerDuty incident"
      - "check recent prompt/model changes"
  - name: "prompt-error-rate"
    description: "Error rate exceeds 2%"
    condition: "error_rate_pct > 2.0"
    for_duration: "2 minutes"
    actions:
      - "notify on-call engineer"
      - "trigger automated rollback if sustained > 5 min"
  - name: "quality-metric-regression"
    description: "Customer satisfaction drops > 0.2 points"
    condition: "satisfaction_mean < baseline - 0.2"
    for_duration: "30 minutes"
    actions:
      - "notify product + ML team"
      - "suggest rollback review"
  - name: "cost-spike"
    description: "Token usage per request increases > 15%"
    condition: "avg_tokens > baseline * 1.15"
    for_duration: "1 hour"
    actions:
      - "notify finance team"
      - "investigate prompt bloat"
Implement alerts in code:
class AlertManager:
    """Evaluate alert conditions and send notifications."""

    def __init__(self, metrics_store, notifier):
        self.metrics = metrics_store
        self.notifier = notifier

    def check_latency_alert(self, prompt_version: str, baseline_ms: float):
        """Alert if latency spikes."""
        current_latency = self.metrics.get_p99_latency(prompt_version, time_window_min=5)
        threshold = baseline_ms * 1.2
        if current_latency > threshold:
            self.notifier.alert(
                title=f"Latency spike: {prompt_version}",
                message=f"P99 latency {current_latency:.0f}ms exceeds threshold {threshold:.0f}ms",
                severity="warning",
                channels=["#llmops-team"]
            )

    def check_error_rate_alert(self, prompt_version: str):
        """Alert if error rate exceeds threshold."""
        # error_rate is a fraction (0.02 == 2%)
        error_rate = self.metrics.get_error_rate(prompt_version, time_window_min=2)
        if error_rate > 0.02:  # 2%
            self.notifier.alert(
                title=f"High error rate: {prompt_version}",
                message=f"Error rate {error_rate:.2%} exceeds 2% threshold",
                severity="critical",
                channels=["on-call", "#llmops-alerts"],
                trigger_autorollback=True
            )

    def check_quality_alert(self, prompt_version: str, baseline_satisfaction: float):
        """Alert if quality metric regresses."""
        current_satisfaction = self.metrics.get_mean_satisfaction(
            prompt_version, time_window_min=30
        )
        if current_satisfaction < baseline_satisfaction - 0.2:
            self.notifier.alert(
                title=f"Quality regression: {prompt_version}",
                message=f"Satisfaction {current_satisfaction:.2f} dropped from baseline {baseline_satisfaction:.2f}",
                severity="warning",
                channels=["#product", "#llmops-team"]
            )


# Usage: metrics_store and notifier come from your monitoring stack
alert_mgr = AlertManager(metrics_store, notifier)


# Run every minute (e.g., from a scheduler)
def monitor_loop():
    prompt_version = "customer-support:v2.0.0"
    alert_mgr.check_latency_alert(prompt_version, baseline_ms=2200)
    alert_mgr.check_error_rate_alert(prompt_version)
    alert_mgr.check_quality_alert(prompt_version, baseline_satisfaction=4.1)
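The checks in `AlertManager` fire as soon as a single evaluation crosses its threshold, whereas the rules in `alerts.yaml` specify a `for_duration` that the condition must hold first. One way to bridge that gap is a small stateful helper; the sketch below assumes a hypothetical `SustainedCondition` class and made-up per-minute p99 readings.

```python
from datetime import datetime, timedelta, timezone


class SustainedCondition:
    """Fire only after a condition has been continuously true for `for_duration`."""

    def __init__(self, for_duration: timedelta):
        self.for_duration = for_duration
        self._true_since = None  # time at which the condition last became true

    def update(self, condition_met: bool, now: datetime) -> bool:
        """Record one evaluation; return True if the alert should fire."""
        if not condition_met:
            self._true_since = None  # any false reading resets the timer
            return False
        if self._true_since is None:
            self._true_since = now
        return now - self._true_since >= self.for_duration


# One p99 reading per minute, checked against baseline * 1.2 (2640 ms)
latency_rule = SustainedCondition(timedelta(minutes=5))
start = datetime(2026, 6, 1, 10, 0, tzinfo=timezone.utc)
baseline_ms = 2200
readings_ms = [2700, 2800, 2500, 2900, 3000, 3100, 3100, 3050, 3100]

fired = [
    latency_rule.update(p99 > baseline_ms * 1.2, start + timedelta(minutes=i))
    for i, p99 in enumerate(readings_ms)
]
```

The dip to 2500 ms at minute 2 resets the timer, so the rule only fires at minute 8, five minutes after the condition became true again at minute 3.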
Tracking Version Deployments
Log every version change for correlation with metric shifts:
from datetime import datetime, timezone


class DeploymentLog:
    """Track prompt version deployments."""

    def __init__(self, db):
        # db: any client exposing insert(table, row) and query(sql, params)
        self.db = db

    def log_deployment(self,
                       prompt_name: str,
                       old_version: str,
                       new_version: str,
                       deployed_by: str,
                       environment: str,
                       reason: str):
        """Log a version change."""
        self.db.insert("deployments", {
            "prompt_name": prompt_name,
            "old_version": old_version,
            "new_version": new_version,
            "deployed_by": deployed_by,
            # UTC timestamps line up with metric timestamps during correlation
            "deployed_at": datetime.now(timezone.utc).isoformat(),
            "environment": environment,
            "reason": reason
        })

    def get_deployments_in_window(self,
                                  prompt_name: str,
                                  start_time: str,
                                  end_time: str) -> list:
        """Get all deployments in a time window (for correlation)."""
        # Parameterized query; start_time and end_time are ISO 8601 strings
        return self.db.query(
            "SELECT * FROM deployments WHERE prompt_name = ? AND deployed_at BETWEEN ? AND ?",
            [prompt_name, start_time, end_time]
        )


# Usage: when deploying a version (db is your database client)
deploy_log = DeploymentLog(db)
deploy_log.log_deployment(
    prompt_name="customer-support",
    old_version="1.9.0",
    new_version="2.0.0",
    deployed_by="[email protected]",
    environment="production",
    reason="A/B test passed; accuracy +5%"
)
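Once deployments are logged, correlating a metric shift with a version change amounts to asking which deployments landed shortly before the shift. The sketch below works on plain dictionaries shaped like the rows above; the function name `deployments_preceding`, the two-hour lookback, and the sample rows are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def deployments_preceding(anomaly_time, deployments, lookback=timedelta(hours=2)):
    """Return deployments that landed within `lookback` before the anomaly, newest first."""
    window_start = anomaly_time - lookback
    candidates = []
    for deployment in deployments:
        deployed_at = datetime.fromisoformat(deployment["deployed_at"])
        if window_start <= deployed_at <= anomaly_time:
            candidates.append((deployed_at, deployment))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [deployment for _, deployment in candidates]


# Illustrative rows; timestamps are timezone-aware ISO 8601 strings
deployments = [
    {"prompt_name": "customer-support", "old_version": "1.8.0",
     "new_version": "1.9.0", "deployed_at": "2026-05-20T09:00:00+00:00"},
    {"prompt_name": "customer-support", "old_version": "1.9.0",
     "new_version": "2.0.0", "deployed_at": "2026-06-01T10:00:00+00:00"},
]

# A latency anomaly detected 45 minutes after the v2.0.0 promotion
anomaly_time = datetime(2026, 6, 1, 10, 45, tzinfo=timezone.utc)
suspects = deployments_preceding(anomaly_time, deployments)
```

A deployment inside the window is a lead to investigate, not proof of cause; model-side changes and traffic shifts can produce the same pattern.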
Anomaly Detection
Automatically detect unusual patterns:
import numpy as np


class AnomalyDetector:
    def __init__(self, metrics_store):
        self.metrics = metrics_store

    def detect_anomalies(self, prompt_version: str, metric_name: str,
                         time_window_min: int = 60,
                         sensitivity: float = 2.0) -> dict:
        """
        Detect anomalies in a metric using Z-score.
        sensitivity: number of standard deviations (2.0 = 2-sigma alert)
        """
        # Fetch historical data (e.g., last 7 days)
        historical = self.metrics.get_historical(
            prompt_version, metric_name, days=7
        )
        # Fetch recent data (last `time_window_min` minutes)
        recent = self.metrics.get_recent(
            prompt_version, metric_name, time_window_min=time_window_min
        )
        # Compute baseline (mean and std of historical)
        baseline_mean = float(np.mean(historical))
        baseline_std = float(np.std(historical))
        # Compute recent mean
        recent_mean = float(np.mean(recent))
        deviation = recent_mean - baseline_mean
        # Z-score
        if baseline_std > 0:
            z_score = deviation / baseline_std
        else:
            # Flat baseline: treat any deviation at all as anomalous
            z_score = 0.0 if deviation == 0 else float(np.sign(deviation)) * float("inf")
        is_anomaly = bool(abs(z_score) > sensitivity)
        return {
            "is_anomaly": is_anomaly,
            "z_score": z_score,
            "baseline_mean": baseline_mean,
            "baseline_std": baseline_std,
            "recent_mean": recent_mean,
            "deviation": deviation,
            # Percentage change is undefined when the baseline mean is zero
            "deviation_pct": (deviation / baseline_mean) * 100 if baseline_mean else float("nan")
        }


# Usage (metrics_store is your metrics client)
detector = AnomalyDetector(metrics_store)
result = detector.detect_anomalies(
    prompt_version="customer-support:v2.0.0",
    metric_name="latency_p99_ms",
    time_window_min=10,
    sensitivity=2.0
)
if result["is_anomaly"]:
    print(f"ANOMALY DETECTED: Latency changed {result['deviation_pct']:+.1f}%")
    print(f"Z-score: {result['z_score']:.2f}")
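A worked example with concrete numbers may make the Z-score rule easier to check by hand. This sketch uses only the standard library (`statistics.pstdev` is the population standard deviation, matching NumPy's default `np.std`); the function name `zscore_check` and the sample latencies are assumptions for illustration.

```python
from statistics import mean, pstdev


def zscore_check(historical, recent, sensitivity=2.0):
    """Return (z_score, is_anomaly) for the recent mean against a historical baseline."""
    baseline_mean = mean(historical)
    baseline_std = pstdev(historical)
    if baseline_std == 0:
        raise ValueError("historical data has zero variance; a Z-score is undefined")
    z_score = (mean(recent) - baseline_mean) / baseline_std
    return z_score, abs(z_score) > sensitivity


# Hypothetical p99 latency samples (ms): mean 2200, population std about 29.3
historical = [2150, 2200, 2180, 2250, 2220, 2190, 2210]

# A window close to the baseline: z is about 0.34, well inside 2 sigma
normal_z, normal_flag = zscore_check(historical, [2210, 2190, 2230])

# A window averaging 3000 ms: z is about 27, far outside 2 sigma
spike_z, spike_flag = zscore_check(historical, [2900, 3100, 3000])
```

Note that this compares a window mean against the spread of individual historical samples, which is what the detector above does as well. If the baseline is very stable, even small shifts produce large Z-scores, so it can be worth pairing the rule with a minimum absolute change.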
Key Takeaways
- Monitor latency, error rate, token usage, and domain-specific quality metrics continuously.
- Set alerts on thresholds and deviations; trigger rollbacks if critical.
- Correlate metric shifts with version deployments to understand causes.
- Use anomaly detection (Z-scores) to auto-flag unusual patterns.
- Build dashboards to visualize prompt health and trends.
Frequently Asked Questions
How often should I check metrics?
For latency and error rate: every 1–5 minutes. For quality metrics (satisfaction, accuracy): every 10–30 minutes (they change more slowly due to lag in collecting user feedback).
Should I use static thresholds or dynamic (baseline-relative) thresholds?
Use dynamic thresholds when baselines vary (e.g., satisfaction varies by day of week). Use static thresholds for absolute requirements (e.g., error rate must be < 1%).
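To illustrate the difference, here is a minimal sketch of a day-of-week baseline next to a static limit. The helper names, the sample ratings, and the 0.2-point allowed drop (borrowed from the quality alert rule earlier) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_baselines(history):
    """Map weekday (0 = Monday) to the mean of past values seen on that weekday."""
    buckets = defaultdict(list)
    for day, value in history:
        buckets[day.weekday()].append(value)
    return {weekday: mean(values) for weekday, values in buckets.items()}


def below_dynamic_threshold(value, day, baselines, max_drop=0.2):
    """True if value is more than max_drop below the baseline for that weekday."""
    return value < baselines[day.weekday()] - max_drop


def violates_static_threshold(error_rate, limit=0.01):
    """True if the error rate breaks an absolute requirement (here: under 1%)."""
    return error_rate >= limit


# Hypothetical satisfaction history: Mondays run lower than Saturdays
history = [
    (date(2026, 5, 18), 4.0), (date(2026, 5, 25), 4.0),  # Mondays
    (date(2026, 5, 23), 4.4), (date(2026, 5, 30), 4.4),  # Saturdays
]
baselines = weekday_baselines(history)

# The same 4.1 rating is a regression on a Saturday but normal on a Monday
saturday_regressed = below_dynamic_threshold(4.1, date(2026, 6, 6), baselines)
monday_regressed = below_dynamic_threshold(4.1, date(2026, 6, 8), baselines)
```

A single static satisfaction threshold would either miss the Saturday regression or fire every Monday.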
Can I auto-rollback on alert?
Yes, but carefully. Reserve auto-rollback for critical alerts (error rate > 5%, latency p99 > 10x baseline). For quality alerts, notify humans first.
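That policy can be written down as a small decision function. The sketch below is one possible encoding: the function name `decide_action` and its return labels are hypothetical, the rollback limits come from the answer above, and the notification limits reuse the alert rules earlier in this section.

```python
def decide_action(error_rate, p99_latency_ms, baseline_p99_ms, satisfaction_drop=0.0):
    """Pick a response tier: 'auto_rollback', 'notify_humans', or 'ok'."""
    # Critical tier: error rate above 5% or p99 latency above 10x baseline
    if error_rate > 0.05 or p99_latency_ms > 10 * baseline_p99_ms:
        return "auto_rollback"
    # Warning tier: thresholds from the alert rules; humans decide what to do
    if (error_rate > 0.02
            or p99_latency_ms > 1.2 * baseline_p99_ms
            or satisfaction_drop > 0.2):
        return "notify_humans"
    return "ok"


critical = decide_action(error_rate=0.08, p99_latency_ms=2300, baseline_p99_ms=2200)
quality_only = decide_action(error_rate=0.001, p99_latency_ms=2300,
                             baseline_p99_ms=2200, satisfaction_drop=0.3)
healthy = decide_action(error_rate=0.001, p99_latency_ms=2300, baseline_p99_ms=2200)
```

Quality regressions never reach the rollback tier here, which keeps a noisy satisfaction signal from reverting a deployment without review.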
What if a metric is noisy?
Use a longer time window (e.g., 10-minute average instead of 1-minute) or smooth with a moving average.
# Smooth with an exponential moving average; a smaller alpha smooths more
alpha = 0.7  # weight on the newest sample
ema = alpha * current + (1 - alpha) * ema_previous
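To see how the smoothing weight changes the result, the update above can be applied across a whole series. This is a minimal sketch with a hypothetical helper name and made-up data; the first sample seeds the average.

```python
def exponential_moving_average(values, alpha):
    """Apply ema = alpha * current + (1 - alpha) * ema_previous across a series."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    ema = None
    for value in values:
        # The first sample has no previous average, so it seeds the EMA
        ema = value if ema is None else alpha * value + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed


# A steady metric with a one-sample spike
noisy = [100, 100, 200, 100, 100]
responsive = exponential_moving_average(noisy, alpha=0.7)  # follows the spike closely
smooth = exponential_moving_average(noisy, alpha=0.3)      # damps the spike
```

With alpha at 0.7 the spike still reaches about 170, while alpha at 0.3 limits it to about 130 at the cost of slower recovery afterwards.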
How do I distinguish a real problem from normal variation?
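Use statistical tests. For example, compare the recent mean to the historical mean with a t-test and alert only if p < 0.05. Because the check runs repeatedly, some false alarms will still get through at that cutoff, so consider also requiring a minimum effect size.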
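As a rough illustration, the sketch below computes Welch's t statistic (a t-test variant that does not assume equal variances) using only the standard library. The p-value comes from a normal approximation, which is an assumption that is reasonable only for samples of a few dozen points or more; with SciPy available, `scipy.stats.ttest_ind(recent, historical, equal_var=False)` gives the exact result. The function name and the synthetic data are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(recent, historical):
    """Welch's t statistic with a two-sided p-value from a normal approximation."""
    if len(recent) < 2 or len(historical) < 2:
        raise ValueError("each sample needs at least two values")
    standard_error = sqrt(variance(recent) / len(recent)
                          + variance(historical) / len(historical))
    if standard_error == 0:
        raise ValueError("both samples are constant; the test is undefined")
    t_stat = (mean(recent) - mean(historical)) / standard_error
    # Normal approximation to the t distribution (assumes fairly large samples)
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Synthetic latency samples (ms) with deterministic jitter
historical = [2200 + (i % 7 - 3) * 20 for i in range(60)]
recent_same = [2200 + (i % 5 - 2) * 20 for i in range(30)]     # no real shift
recent_shifted = [2300 + (i % 5 - 2) * 20 for i in range(30)]  # about 100 ms slower

_, p_same = welch_test(recent_same, historical)
_, p_shifted = welch_test(recent_shifted, historical)

alert_on_same = p_same < 0.05
alert_on_shifted = p_shifted < 0.05
```

Only the shifted window crosses the p < 0.05 cutoff; the unshifted window differs from the baseline by about 2 ms, which the test treats as noise.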
Further Reading
- Prometheus: Time-Series Metrics Monitoring — Open-source metrics collection and alerting.
- Datadog Monitoring and Observability — Commercial tool for monitoring LLM systems.
- Detecting Anomalies with Z-Scores — Statistical foundations.
- SLO and SLI: Service Level Objectives — Google SRE guide to defining reliability targets.