Alerting on LLM Degradation: Costs, Latency, Quality

Alerting on LLM performance means defining threshold-based rules that detect cost spikes, latency increases, and quality degradation, then triggering notifications so on-call engineers can investigate and remediate. Unlike traditional alerting that focuses on availability (is the service up?), LLM alerting focuses on degradation: cost per inference might double due to an inefficient prompt; latency might increase due to model saturation; output quality might drop silently if a model parameter was misconfigured. This article walks you through designing alert rules, setting appropriate thresholds, and integrating alerting with incident response workflows.

Alert Dimensions for LLM Apps

Cost Monitoring

Track total and per-unit costs to detect runaway token consumption:

# Example: Detect daily cost anomaly
# Rule: Alert if today's cost exceeds 150% of the average of the previous 30 days

import statistics

def check_daily_cost_anomaly(daily_costs: list[float], threshold_pct: float = 1.5):
    """Detect cost spikes above the rolling average (the last entry is today)."""

    if len(daily_costs) < 31:
        return False, None  # Need 30 days of history plus today

    # Baseline is the 30 days before today, so a spike does not inflate its own average
    avg_30d = statistics.mean(daily_costs[-31:-1])
    today_cost = daily_costs[-1]

    if avg_30d <= 0:
        return False, None  # No meaningful baseline to compare against

    # Check if today exceeds the threshold (1.5 means 150% of the average)
    if today_cost > avg_30d * threshold_pct:
        spike_pct = ((today_cost - avg_30d) / avg_30d) * 100
        return True, f"Cost spike: ${today_cost:.2f} ({spike_pct:.0f}% above 30-day avg of ${avg_30d:.2f})"

    return False, None

# Example
daily_costs = [10.50] * 29 + [14.75] + [45.00]  # Normal costs, then spike to $45
is_anomaly, message = check_daily_cost_anomaly(daily_costs)
if is_anomaly:
    print(f"ALERT: {message}")
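
The per-unit side of cost tracking can be sketched in the same way. This is a minimal example, assuming input and output token counts are recorded for each call. The prices in `PRICE_PER_MILLION_TOKENS` and both helper functions are hypothetical placeholders, not real provider rates or part of any library.

```python
# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MILLION_TOKENS = {"input": 3.00, "output": 15.00}


def cost_per_inference(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    input_cost = input_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS["input"]
    output_cost = output_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS["output"]
    return input_cost + output_cost


def mean_cost_per_inference(calls: list[tuple[int, int]]) -> float:
    """Average cost across (input_tokens, output_tokens) pairs."""
    if not calls:
        return 0.0
    return sum(cost_per_inference(i, o) for i, o in calls) / len(calls)
```

Feeding the daily mean of this value into the anomaly check above would flag prompts that became more expensive per call even when total traffic is flat.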

Latency Monitoring

Track percentile latencies (p50, p95, p99) to detect degradation:

# Prometheus alert: Latency p99 increases by >50%
- alert: HighLatencyDegradation
  expr: |
    histogram_quantile(0.99, rate(llm_latency_seconds_bucket[5m])) >
    histogram_quantile(0.99, rate(llm_latency_seconds_bucket[1h] offset 1h)) * 1.5
  for: 10m
  annotations:
    summary: "LLM latency p99 degraded by >50%"
    description: "Current p99: {{ $value | humanizeDuration }}. Investigate API provider status, model load, and network latency."

Quality Monitoring

For applications that score outputs (relevance, correctness, safety), alert on quality degradation:

# Example: Alert if average quality score drops below threshold

import statistics

def check_quality_degradation(recent_scores: list[float], window_size: int = 100, quality_threshold: float = 7.0):
    """Detect output quality degradation over the most recent scores."""

    if not recent_scores:
        return False, None

    # Average the latest window_size scores (100 approximates one hour at ~100 inferences/hour)
    current_avg = statistics.mean(recent_scores[-window_size:])

    # Alert when the average falls below the acceptable threshold (e.g., 7.0 out of 10)
    if current_avg < quality_threshold:
        return True, f"Quality degradation: avg score {current_avg:.2f} below threshold {quality_threshold}"

    return False, None

Error Rate Monitoring

Track error rates by category to detect systematic failures:

# Alert: Parsing error rate exceeds 5% (indicates prompt/output mismatch)
# sum() drops labels so the error and call series divide cleanly even if their label sets differ
- alert: HighParsingErrorRate
  expr: |
    (sum(increase(llm_errors_total{category="parsing_error"}[5m])) /
    sum(increase(llm_calls_total[5m]))) > 0.05
  for: 5m
  annotations:
    summary: "Parsing error rate exceeds 5%"
    description: "{{ $value | humanizePercentage }} of responses are unparseable. Check prompt format and model behavior."

# Alert: API errors exceed 1%
- alert: HighAPIErrorRate
  expr: |
    (sum(increase(llm_errors_total{category="api_error"}[5m])) /
    sum(increase(llm_calls_total[5m]))) > 0.01
  for: 5m
  annotations:
    summary: "API error rate exceeds 1%"
    description: "Check LLM API provider status page and credential validity."

Threshold Setting Strategy

Well-chosen alert thresholds reduce both false positives (noisy alerts) and false negatives (missed issues).

Approach 1: Static Thresholds

Simple and transparent but requires manual tuning:

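# Define static thresholds
THRESHOLDS = {
    "cost_per_inference_usd": 0.05,  # Alert if single call costs > $0.05
    "latency_p99_seconds": 3.0,      # Alert if p99 latency > 3 seconds
    "error_rate_pct": 1.0,           # Alert if error rate > 1%
    "quality_score_min": 7.0,        # Alert if avg quality < 7.0
}

def check_static_thresholds(metrics: dict) -> list[str]:
    """Check metrics against static thresholds; metrics that are missing are skipped."""

    alerts = []

    cost = metrics.get("cost_per_inference")
    if cost is not None and cost > THRESHOLDS["cost_per_inference_usd"]:
        alerts.append(f"Cost alert: ${cost:.4f} > ${THRESHOLDS['cost_per_inference_usd']}")

    latency = metrics.get("latency_p99")
    if latency is not None and latency > THRESHOLDS["latency_p99_seconds"]:
        alerts.append(f"Latency alert: {latency:.2f}s > {THRESHOLDS['latency_p99_seconds']}s")

    # The "error_rate_pct" and "quality_score" metric keys are assumed names
    error_rate = metrics.get("error_rate_pct")
    if error_rate is not None and error_rate > THRESHOLDS["error_rate_pct"]:
        alerts.append(f"Error rate alert: {error_rate:.2f}% > {THRESHOLDS['error_rate_pct']}%")

    quality = metrics.get("quality_score")
    if quality is not None and quality < THRESHOLDS["quality_score_min"]:
        alerts.append(f"Quality alert: {quality:.2f} < {THRESHOLDS['quality_score_min']}")

    return alerts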

Approach 2: Dynamic/Adaptive Thresholds

Thresholds adjust based on recent history (e.g., 2 standard deviations above recent mean):

import statistics
from typing import Optional

def check_adaptive_thresholds(metric_history: list[float], stddev_multiplier: float = 2.0) -> tuple[bool, Optional[str]]:
    """Alert if the latest value exceeds the recent mean by N standard deviations."""

    if len(metric_history) < 11:
        return False, None  # Need at least 10 baseline points plus the current value

    # Baseline excludes the current value so a spike cannot mask itself
    baseline = metric_history[-101:-1]
    recent_mean = statistics.mean(baseline)
    recent_stdev = statistics.stdev(baseline)
    current_value = metric_history[-1]

    # Alert if current value exceeds mean + (stdev * multiplier)
    upper_bound = recent_mean + (recent_stdev * stddev_multiplier)

    if current_value > upper_bound:
        if recent_stdev > 0:
            z_score = (current_value - recent_mean) / recent_stdev
            return True, f"Anomaly: {current_value:.2f} is {z_score:.1f} stddevs above mean {recent_mean:.2f}"
        return True, f"Anomaly: {current_value:.2f} exceeds a constant baseline of {recent_mean:.2f}"

    return False, None

# Example: ten normal readings, then a spike to 2.5s
latency_history = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 1.2, 1.1, 1.3, 1.2, 2.5]
is_anomaly, message = check_adaptive_thresholds(latency_history)
if is_anomaly:
    print(f"ALERT: {message}")

Approach 3: SLO-Based Thresholds

Define Service-Level Objectives (SLOs) and alert when they are at risk:

import math
from typing import Optional

# SLO: 95% of requests return within 2 seconds
slo_latency_seconds = 2.0
slo_target_pct = 95

def check_slo_health(latencies: list[float], slo_seconds: float, slo_pct: float) -> tuple[bool, Optional[str]]:
    """Check if SLO is at risk."""

    if not latencies:
        return False, None

    # Nearest-rank percentile, clamped so slo_pct=100 stays within the list
    sorted_latencies = sorted(latencies)
    rank = math.ceil(len(sorted_latencies) * slo_pct / 100)
    index = min(max(rank - 1, 0), len(sorted_latencies) - 1)
    latency_pct = sorted_latencies[index]

    # Alert if percentile exceeds SLO
    if latency_pct > slo_seconds:
        overage_pct = ((latency_pct - slo_seconds) / slo_seconds) * 100
        return True, f"SLO at risk: {slo_pct}th percentile {latency_pct:.2f}s exceeds {slo_seconds}s target ({overage_pct:.0f}% over target)"

    return False, None
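
A percentile check only reports whether the target is already missed. To see how close the SLO is to being at risk, one option is to track how much of the error budget has been used. The following is a minimal sketch under that assumption; `error_budget_consumed` is a hypothetical helper, not part of any monitoring library.

```python
def error_budget_consumed(latencies: list[float], slo_seconds: float, slo_pct: float) -> float:
    """Return the fraction of the error budget used (1.0 means fully spent)."""
    if not latencies:
        return 0.0

    # With a 95% SLO, up to 5% of requests may exceed the latency target
    allowed_slow_fraction = 1 - slo_pct / 100
    if allowed_slow_fraction <= 0:
        raise ValueError("slo_pct must be below 100 to leave an error budget")

    slow_requests = sum(1 for latency in latencies if latency > slo_seconds)
    slow_fraction = slow_requests / len(latencies)
    return slow_fraction / allowed_slow_fraction
```

A value approaching 1.0 means the window has nearly exhausted its allowance of slow requests, which can serve as a warning before the percentile check itself fires.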

Multi-Condition Alerts

Combine multiple metrics to reduce false positives. For example, alert on high cost only if it is due to increased inference volume, not just a few expensive calls:

# Alert: Significant cost spike AND call volume is unusually high
- alert: CostSpikeWithHighVolume
  expr: |
    (increase(llm_cost_usd_total[1h]) > 50) and
    (increase(llm_calls_total[1h]) > 10000)
  for: 10m
  annotations:
    summary: "Cost spike with high volume"
    description: "Cost increased $50 in 1h with >10k calls. Check for runaway loops or inefficient prompts."
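
The distinction between a volume-driven spike and a few expensive calls can also be made explicit in application code. The sketch below is one possible approach, assuming cost and call totals are available for a baseline window and the current window. The function name, its return labels, and the default `ratio_threshold` are illustrative assumptions.

```python
def classify_cost_spike(
    baseline_cost: float,
    baseline_calls: int,
    current_cost: float,
    current_calls: int,
    ratio_threshold: float = 1.5,  # Assumed: a 50% increase counts as a spike
) -> str:
    """Label a cost change as driven by volume, per-call cost, both, or neither."""
    if baseline_cost <= 0 or baseline_calls <= 0 or current_calls <= 0:
        return "insufficient_data"

    # How much did traffic grow, and how much did the average call cost grow?
    volume_ratio = current_calls / baseline_calls
    per_call_ratio = (current_cost / current_calls) / (baseline_cost / baseline_calls)

    volume_up = volume_ratio >= ratio_threshold
    per_call_up = per_call_ratio >= ratio_threshold
    if volume_up and per_call_up:
        return "volume_and_per_call"
    if volume_up:
        return "volume"
    if per_call_up:
        return "per_call"
    return "none"
```

A "volume" result points toward runaway loops or a traffic surge, while "per_call" points toward longer prompts or outputs.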

Alert Notification Routing

Route alerts to appropriate channels based on severity:

from enum import Enum

import requests

class AlertSeverity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3

def route_alert(alert_message: str, severity: AlertSeverity):
    """Route alert to appropriate notification channel."""

    channels = {
        AlertSeverity.INFO: ["#llm-ops-logs"],  # Slack channel for logs
        AlertSeverity.WARNING: ["#llm-ops-alerts"],  # Slack channel for warnings
        AlertSeverity.CRITICAL: ["#llm-ops-critical", "pagerduty-llm-on-call"],  # Slack + PagerDuty
    }

    for channel in channels.get(severity, []):
        if channel.startswith("#"):
            # Slack channel. Some webhook types post only to the channel chosen
            # at setup and may ignore the "channel" field.
            requests.post(
                "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
                json={"text": f"[{severity.name}] {alert_message}", "channel": channel},
                timeout=5,  # Avoid hanging the alerting path on a slow endpoint
            )
        elif channel.startswith("pagerduty"):
            # Trigger PagerDuty incident
            requests.post(
                "https://events.pagerduty.com/v2/enqueue",
                json={
                    "routing_key": "YOUR_PAGERDUTY_ROUTING_KEY",
                    "event_action": "trigger",
                    "payload": {
                        "summary": alert_message,
                        "severity": severity.name.lower(),
                        "source": "llm-observability",
                    },
                },
                timeout=5,
            )

# Example
route_alert("Cost exceeded $100/hour", AlertSeverity.CRITICAL)

Alert Runbooks

For each alert, provide a runbook with debugging steps:

# Alert: HighLatencyDegradation

## What it means
LLM API latency p99 (99th percentile) has increased by more than 50% compared to the previous hour.

## Common causes
1. API provider is experiencing saturation (check their status page)
2. Model is overwhelmed (high inference queue)
3. Network latency has increased (check AWS/Azure region latency)
4. Prompt size is unusually large (check recent prompt changes)

## Debug steps
1. Check Langfuse dashboard: "Latency by Model" and "Latency by Hour"
2. Check API provider status: [Anthropic Status](https://status.anthropic.com)
3. Check distributed traces: filter for traces with latency > 3s
4. Correlate with recent deployments: did a prompt change or new feature ship?

## Remediation
- **Immediate**: Reduce max_tokens or simplify prompts to reduce LLM inference time
- **Short-term**: Implement request queuing or rate limiting to reduce API load
- **Long-term**: Migrate to a faster model (e.g., Claude 3.5 Sonnet) or increase API quota

Key Takeaways

  • Alert on cost spikes (anomalies vs rolling average), latency degradation (percentile increases), quality scores (thresholds), and error rates (category-specific).
  • Use adaptive thresholds (based on recent standard deviation) to reduce false positives compared to static thresholds.
  • Combine multiple conditions (high cost AND high volume) to filter noisy alerts.
  • Route alerts by severity: info to logs, warnings to team Slack, critical to PagerDuty.
  • Provide runbooks (documented debugging steps) for every alert to speed incident response.

Frequently Asked Questions

How long should the evaluation window be for alerts?

Use 5–10 minutes for real-time detection of acute issues (cost spikes, rate-limit errors). Use 1 hour for gradual degradation (latency increase, quality decline). Avoid windows shorter than 5 minutes to reduce noise.
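
An evaluation window can be approximated in application code by requiring a breach to persist across consecutive checks, similar in spirit to the `for:` clause in the Prometheus rules above. The class below is a minimal sketch; `SustainedCondition` is a hypothetical name, and the check interval is whatever schedule the caller uses.

```python
class SustainedCondition:
    """Fire only after a condition has held for N consecutive evaluations."""

    def __init__(self, required_consecutive: int):
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be at least 1")
        self.required_consecutive = required_consecutive
        self._streak = 0

    def observe(self, breached: bool) -> bool:
        """Record one evaluation; return True while the alert should fire."""
        # Any healthy evaluation resets the streak
        self._streak = self._streak + 1 if breached else 0
        return self._streak >= self.required_consecutive
```

With one evaluation per minute, `SustainedCondition(5)` would correspond roughly to a 5-minute window.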

What is a reasonable SLO for LLM latency?

For interactive applications, a reasonable starting point is p95 latency under 2 seconds and p99 under 5 seconds. For background batch jobs, the SLO can be much looser (30–60 seconds). Adjust based on user expectations and business impact.

Should I alert on every single error or only error rate anomalies?

Alert on error rate anomalies (e.g., error rate exceeds 1%) rather than individual errors. Individual error alerts are too noisy for production systems. Exception: alert immediately on critical errors (API auth failure, system crash).
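
The rule and its exception can be combined in one decision function. This is a sketch under stated assumptions: the category names in `IMMEDIATE_CATEGORIES` and the `should_alert` helper are hypothetical and should be mapped to whatever error taxonomy your application uses.

```python
# Hypothetical categories that warrant an alert on the first occurrence
IMMEDIATE_CATEGORIES = {"auth_failure", "system_crash"}


def should_alert(
    category: str,
    errors_in_window: int,
    calls_in_window: int,
    rate_threshold: float = 0.01,  # 1% error rate
) -> bool:
    """Alert at once for critical categories, otherwise only on error rate."""
    if category in IMMEDIATE_CATEGORIES:
        return errors_in_window > 0
    if calls_in_window == 0:
        return False  # No traffic means no meaningful rate
    return errors_in_window / calls_in_window > rate_threshold
```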

How do I test alert thresholds before deploying?

Simulate the condition in a staging environment and verify the alert fires. Use a dry-run mode to test notification routing without actually sending Slack/PagerDuty messages. Once confident, deploy to production with the threshold initially set to conservative values (higher cost threshold, longer evaluation window), then tighten over time.
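
A dry-run mode can be as simple as a wrapper that records notifications instead of sending them. The sketch below assumes the real sender is a function taking a channel and a message; `make_notifier` and its return shape are illustrative, not an existing API.

```python
from typing import Callable


def make_notifier(send: Callable[[str, str], None], dry_run: bool = True):
    """Wrap a send function so that dry-run mode records instead of sending."""
    recorded: list[tuple[str, str]] = []

    def notify(channel: str, message: str) -> None:
        if dry_run:
            recorded.append((channel, message))  # Nothing leaves the process
        else:
            send(channel, message)

    return notify, recorded
```

In staging, the recorded list can be inspected to confirm that each severity reached the expected channel before the wrapper is switched to live sending.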
