AI SaaS Monitoring: Production Observability

Production failures in AI SaaS are often silent and expensive. A degraded LLM response can go unnoticed for hours, costing you thousands of dollars in wasted tokens and user churn. Monitoring is not optional; it is the difference between a thriving service and a disaster. This article covers the key metrics to track (cost, latency, error rates), how to set up dashboards and alerts, and how to detect and respond to incidents when they occur.

What is Observability in AI SaaS?

Observability is the ability to understand the internal state of your system based on its outputs: logs, metrics, and traces. Unlike a traditional CRUD application where you can count successful transactions, an AI SaaS system is harder to observe because "success" is subjective (Did the LLM response actually help the user?). You need metrics that capture latency (how long did the LLM take?), cost (how many tokens did we use?), error rates (how often did the API fail?), and user satisfaction (did the response meet expectations?). Observability lets you detect when something is wrong before your users complain.
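
As a minimal sketch of what "capturing" these four signals can look like, the record below bundles latency, token counts, result status, and optional user feedback into one structured log line per LLM call. The class name, field names, and the `org_123` identifier are illustrative assumptions, not part of any library.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("llm_observability")


@dataclass
class LLMCallRecord:
    """One observation per LLM call (all field names are illustrative)."""
    organization_id: str
    model: str
    latency_seconds: float             # how long did the LLM take?
    input_tokens: int                  # how many tokens did we use?
    output_tokens: int
    status: str                        # e.g. "success", "rate_limited", "error"
    user_rating: Optional[int] = None  # e.g. +1 / -1 feedback, if collected


def to_log_line(record: LLMCallRecord) -> str:
    """Serialize the record as a single JSON object so a log pipeline can parse it."""
    return json.dumps(asdict(record), sort_keys=True)


record = LLMCallRecord(
    organization_id="org_123",
    model="claude-3-5-sonnet-20241022",
    latency_seconds=1.42,
    input_tokens=350,
    output_tokens=120,
    status="success",
    user_rating=1,
)
logger.info(to_log_line(record))
```

Logging one record per call keeps the raw data available for debugging, while the Prometheus metrics in the following sections provide the aggregated view.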

Key Metrics to Track

Cost Metrics

# Track LLM costs in real time
# Assumes a FastAPI `app`, plus `llm_provider`, `db`, `Organization`,
# `calculate_cost`, and `get_month_usage` defined elsewhere in the application.
from fastapi import Request
from prometheus_client import Counter, Gauge

MODEL = "claude-3-5-sonnet-20241022"

# Prometheus metrics for cost tracking
tokens_processed = Counter(
    'llm_tokens_processed_total',
    'Total tokens processed by model',
    ['model', 'organization_id']
)

cost_usd = Counter(
    'llm_cost_usd_total',
    'Total LLM cost in USD',
    ['model', 'organization_id']
)

monthly_budget_remaining = Gauge(
    'organization_budget_remaining_usd',
    'Remaining budget for organization this month',
    ['organization_id']
)

@app.post("/api/v1/completions")
async def create_completion(request: Request, prompt: str):
    """Track cost metrics for every request."""

    org_id = request.state.organization_id

    # Generate response
    response = await llm_provider.generate(prompt, model=MODEL)

    # Record metrics
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    request_cost = calculate_cost(MODEL, input_tokens, output_tokens)

    tokens_processed.labels(
        model=MODEL,
        organization_id=org_id
    ).inc(input_tokens + output_tokens)

    cost_usd.labels(
        model=MODEL,
        organization_id=org_id
    ).inc(request_cost)

    # Update budget gauge (never report a negative remaining budget)
    org = db.query(Organization).filter(Organization.id == org_id).first()
    month_usage = get_month_usage(org_id)
    remaining = max(0, org.monthly_budget_usd - month_usage)

    monthly_budget_remaining.labels(organization_id=org_id).set(remaining)

    return {"response": response.content}
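
The handler above calls calculate_cost without showing it. One way such a helper might look is sketched below, assuming per-model prices quoted in USD per million tokens. The PRICING_PER_MILLION table and its values are placeholders for illustration; substitute your provider's current published rates.

```python
from decimal import Decimal

# PLACEHOLDER prices in USD per million tokens (assumed for illustration only).
PRICING_PER_MILLION = {
    "claude-3-5-sonnet-20241022": {
        "input": Decimal("3.00"),
        "output": Decimal("15.00"),
    },
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request, pricing input and output tokens separately."""
    try:
        prices = PRICING_PER_MILLION[model]
    except KeyError:
        # Fail loudly rather than silently recording a cost of zero.
        raise ValueError(f"No pricing configured for model {model!r}") from None

    # Decimal avoids accumulating float rounding error while summing the two parts.
    total = (
        Decimal(input_tokens) * prices["input"]
        + Decimal(output_tokens) * prices["output"]
    ) / Decimal(1_000_000)
    return float(total)


example_cost = calculate_cost("claude-3-5-sonnet-20241022", 1000, 500)
```

Raising on an unknown model is a deliberate choice in this sketch: a missing price would otherwise show up as a quiet gap in the cost dashboard.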

Performance Metrics

# Track latency and throughput
# Assumes `app`, `llm_provider`, and the provider SDK's `RateLimitError`
# are defined or imported elsewhere.
import time

from fastapi import Request
from prometheus_client import Histogram

response_latency = Histogram(
    'llm_response_latency_seconds',
    'LLM response latency in seconds',
    ['model', 'result_status'],
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)
)

@app.post("/api/v1/completions")
async def create_completion_with_latency(request: Request, prompt: str):
    """Track latency for every request."""

    # perf_counter() is monotonic, so wall-clock adjustments cannot skew the measurement
    start = time.perf_counter()
    # Default to "error" so the finally block always has a status,
    # even if the request is cancelled or fails unexpectedly
    result_status = "error"

    try:
        response = await llm_provider.generate(prompt)
        result_status = "success"
    except RateLimitError:
        result_status = "rate_limited"
        raise
    finally:
        # Runs on success and on failure, so every request is observed
        elapsed = time.perf_counter() - start
        response_latency.labels(
            model="claude-3-5-sonnet-20241022",
            result_status=result_status
        ).observe(elapsed)

    return {"response": response.content}

Error and Failure Metrics

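# Track error rates
# Assumes `app` is the FastAPI instance defined elsewhere.
import logging

from fastapi import Request
from fastapi.responses import JSONResponse
from prometheus_client import Counter

logger = logging.getLogger(__name__)

errors = Counter(
    'llm_errors_total',
    'Total LLM errors',
    ['error_type', 'organization_id']
)

@app.exception_handler(Exception)
async def error_handler(request: Request, exc: Exception):
    """Track errors globally."""

    # request.state raises AttributeError for unset attributes, so use getattr;
    # the error may have occurred before authentication populated organization_id
    org_id = getattr(request.state, "organization_id", None) or "unknown"
    error_type = type(exc).__name__

    errors.labels(error_type=error_type, organization_id=org_id).inc()

    logger.error(f"Error in {request.url.path}: {exc}", exc_info=True)

    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error"}
    )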

Setting Up Dashboards

A good dashboard shows the health of your service at a glance: cost, latency, error rate, and quota status.

Grafana Dashboard Query Examples

# Cost per organization (USD over the last 24 hours)
sum by (organization_id) (increase(llm_cost_usd_total[24h]))

# P95 latency by model
histogram_quantile(0.95, sum by (le, model) (rate(llm_response_latency_seconds_bucket[5m])))

# Error rate (%)
sum(rate(llm_errors_total[5m])) / sum(rate(llm_response_latency_seconds_count[5m])) * 100

# Budget headroom (%)
# Assumes a second gauge, organization_monthly_budget_usd, exports each organization's monthly budget
organization_budget_remaining_usd / organization_monthly_budget_usd * 100

Alerting Rules

Define alerts that trigger when something goes wrong:

# Prometheus alert rules (alerting.yaml)
groups:
  - name: llm_saas_alerts
    interval: 30s
    rules:
      # Alert if error rate >5% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(llm_errors_total[5m])) / sum(rate(llm_response_latency_seconds_count[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Alert if p95 latency >5 seconds
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(llm_response_latency_seconds_bucket[5m]))) > 5
        for: 10m
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      # Alert if organization approaches budget
      # (assumes a gauge organization_monthly_budget_usd is also exported)
      - alert: OrganizationBudgetWarning
        expr: |
          organization_budget_remaining_usd / organization_monthly_budget_usd < 0.2
        for: 5m
        annotations:
          summary: "Organization approaching budget limit"
          description: "{{ $labels.organization_id }} has {{ $value | humanizePercentage }} budget remaining"

      # Alert if a specific organization's cost increases >50% day-over-day
      - alert: AnomalousOrganizationCost
        expr: |
          sum by (organization_id) (increase(llm_cost_usd_total[24h]))
            / sum by (organization_id) (increase(llm_cost_usd_total[24h] offset 1d)) > 1.5
        for: 30m
        annotations:
          summary: "Anomalous cost increase"
          description: "{{ $labels.organization_id }} spent {{ $value | humanizePercentage }} of the previous day's cost"
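
The for: clause in each rule means a breach must persist across evaluations before the alert fires, which filters out short blips. The sketch below imitates that pending-then-firing behavior in plain Python to make the semantics concrete. It is an illustration only, not how Prometheus is implemented, and the HoldForAlert class is a hypothetical name.

```python
from typing import Optional


class HoldForAlert:
    """Fire only after a value has stayed above a threshold for hold_seconds."""

    def __init__(self, threshold: float, hold_seconds: float) -> None:
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        """Return "inactive", "pending", or "firing" for one evaluation cycle."""
        if value <= self.threshold:
            # Condition cleared: forget the earlier breach entirely.
            self._breach_started = None
            return "inactive"
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started >= self.hold_seconds:
            return "firing"
        return "pending"


# Mirrors HighErrorRate: error rate above 5% held for 5 minutes (300 seconds).
alert = HoldForAlert(threshold=0.05, hold_seconds=300)
states = [
    alert.evaluate(0.08, now=0),    # breach begins
    alert.evaluate(0.09, now=150),  # still breaching, not long enough yet
    alert.evaluate(0.02, now=200),  # recovered, so the timer resets
    alert.evaluate(0.07, now=230),  # new breach begins
    alert.evaluate(0.07, now=530),  # breach has now lasted 300 seconds
]
```

A longer hold period reduces false alarms at the cost of slower detection, which is why the latency rule above waits 10 minutes while the error-rate rule waits 5.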

Distributed Tracing

For complex flows (request -> gateway -> service -> LLM -> cache), use distributed tracing to see where latency is coming from:

# OpenTelemetry tracing setup
# Assumes `app`, `cache` (an async cache client), and `llm_provider` are defined elsewhere.
import hashlib

from fastapi import Request
from opentelemetry import trace
# Note: newer OpenTelemetry Python releases deprecate the Jaeger Thrift exporter
# in favor of OTLP, which Jaeger also accepts; pin versions or switch accordingly.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))

tracer = trace.get_tracer(__name__)

@app.post("/api/v1/completions")
async def create_completion_with_tracing(request: Request, prompt: str):
    """Trace request flow for observability."""

    # Derive a fixed-length cache key from the prompt (SHA-256 is one reasonable choice)
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    with tracer.start_as_current_span("completion_request") as span:
        span.set_attribute("prompt_length", len(prompt))
        span.set_attribute("organization_id", request.state.organization_id)

        # Trace cache lookup
        with tracer.start_as_current_span("cache_lookup"):
            cached = await cache.get(prompt_hash)
        span.set_attribute("cache_hit", cached is not None)

        if cached is not None:
            return {"response": cached}

        # Trace LLM call
        with tracer.start_as_current_span("llm_generation"):
            response = await llm_provider.generate(prompt)

        # Trace storage (cache the text so hits and misses return the same shape)
        with tracer.start_as_current_span("store_result"):
            await cache.set(prompt_hash, response.content)

        return {"response": response.content}

Health Checks and Uptime Monitoring

Expose a health endpoint that third-party monitoring services can poll:

# Assumes `app`, `db` (a SQLAlchemy session), `cache` (an async client),
# `llm_provider`, and `logger` are defined elsewhere.
import asyncio
import os
from datetime import datetime, timezone

from fastapi.responses import JSONResponse
from sqlalchemy import text

@app.get("/health")
async def health_check():
    """Simple health check endpoint."""

    # Check database connectivity (SQLAlchemy 2.x requires text() for raw SQL)
    try:
        db.execute(text("SELECT 1"))
    except Exception:
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "reason": "database unavailable"}
        )

    # Check LLM provider connectivity (non-fatal: failures are logged,
    # but they do not mark the service unhealthy)
    try:
        # Make a minimal API call to check connectivity
        # timeout=2 to avoid blocking the health check
        await asyncio.wait_for(
            llm_provider.health_check(),
            timeout=2
        )
    except asyncio.TimeoutError:
        logger.warning("LLM provider slow to respond")
    except Exception as e:
        logger.warning(f"LLM provider health check failed: {e}")

    # Check cache connectivity
    try:
        await cache.ping()
    except Exception:
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "reason": "cache unavailable"}
        )

    return {
        "status": "healthy",
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": os.environ.get("VERSION", "unknown")
    }

Monitoring Metrics Comparison

| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Error Rate | Early detection of bugs or API issues | >5% for 5 min |
| Latency (p95) | Detects performance degradation | >5 seconds |
| Cost per Org | Detects abuse or runaway bugs | >150% of daily average |
| Token Usage | Early warning of token exhaustion | >80% of quota |
| Uptime | Service availability | <99.5% |

Key Takeaways

  • Instrument every LLM API call with metrics: cost, latency, and result status.
  • Set up Prometheus/Grafana dashboards showing organization cost, error rate, and latency percentiles.
  • Define alerting rules for high error rate, high latency, and anomalous cost spikes.
  • Use distributed tracing (OpenTelemetry/Jaeger) to pinpoint which component is slow.
  • Expose a health check endpoint for uptime monitoring services to poll regularly.

Frequently Asked Questions

What should my SLO (Service Level Objective) be for an AI SaaS?

Typical SLOs: 99.5% uptime (about 3.6 hours of downtime in a 30-day month), P95 latency <5 seconds, error rate <0.5%. Adjust based on your use case: real-time chat needs tighter SLOs than batch processing.
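
A downtime budget follows directly from the availability target: multiply the allowed failure fraction by the length of the period. The helper below is a small sketch of that arithmetic; the function name and the 30-day default are assumptions chosen for illustration.

```python
def allowed_downtime_hours(slo: float, period_days: float = 30) -> float:
    """Hours of downtime permitted by an availability SLO over a period.

    slo is a fraction, e.g. 0.995 for 99.5% uptime.
    """
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in the range (0, 1]")
    return (1 - slo) * period_days * 24


budget_995 = allowed_downtime_hours(0.995)  # 99.5% over 30 days
budget_999 = allowed_downtime_hours(0.999)  # 99.9% over 30 days
print(f"99.5%: {budget_995:.2f} h, 99.9%: {budget_999:.2f} h")
```

For a 30-day month this gives roughly 3.6 hours at 99.5% and roughly 0.72 hours (about 43 minutes) at 99.9%, which shows how quickly the budget shrinks as the target tightens.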

How do I detect if my LLM provider is having an outage?

Monitor error rates and latency spikes. If both increase simultaneously across all models, the provider is likely down. Check the provider's status page. Implement automatic fallback to a backup provider if available.
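
Automatic fallback can be sketched as trying providers in priority order and moving on when one fails or times out. The example below assumes each provider is an async callable that takes a prompt and returns text; generate_with_fallback, primary, and backup are hypothetical names, and a real integration would wrap your actual provider SDK calls.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# Assumed provider shape: an async function from prompt to response text.
Provider = Callable[[str], Awaitable[str]]


async def generate_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    timeout_seconds: float = 10.0,
) -> str:
    """Try each provider in order; return the first successful response."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return await asyncio.wait_for(provider(prompt), timeout=timeout_seconds)
        except Exception as exc:  # includes timeouts raised by wait_for
            last_error = exc
    raise RuntimeError("All LLM providers failed") from last_error


async def primary(prompt: str) -> str:
    # Stand-in for a provider that is currently down.
    raise ConnectionError("primary provider unavailable")


async def backup(prompt: str) -> str:
    # Stand-in for a healthy backup provider.
    return f"backup answer to: {prompt}"


result = asyncio.run(generate_with_fallback("ping", [primary, backup]))
```

In production you would also want to record which provider served each request (for example as a metric label), so that a silent shift to the backup is visible on the dashboard.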

Should I alert on every error, or just error rate spikes?

Alert on error rate spikes (e.g., "error rate >5% for 5 min"), not individual errors. Individual errors are noise; spikes indicate a real problem. Log all errors for debugging, but alert only on thresholds.
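
The difference between alerting on individual errors and alerting on a rate can be shown with a sliding window. The sketch below keeps recent request outcomes in memory and flags a problem only when the error fraction over the window exceeds a threshold. ErrorRateWindow and its min_requests guard are illustrative assumptions; in practice Prometheus computes this from counters.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateWindow:
    """Track request outcomes over a sliding time window."""

    def __init__(self, window_seconds: float, threshold: float, min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        return sum(1 for _, is_error in self._events if is_error) / len(self._events)

    def should_alert(self, now: float) -> bool:
        self._evict(now)
        return len(self._events) >= self.min_requests and self.error_rate(now) > self.threshold


window = ErrorRateWindow(window_seconds=300, threshold=0.05)
for t in range(100):
    window.record(float(t), is_error=(t == 0))      # one error in 100 requests
single_error_alert = window.should_alert(now=99.0)  # 1% error rate

for t in range(100, 110):
    window.record(float(t), is_error=True)          # a burst of 10 errors
spike_alert = window.should_alert(now=109.0)        # 11 errors in 110 requests
```

Here the lone error leaves the rate at 1% and stays quiet, while the burst pushes it to 10% and trips the threshold.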

How do I correlate a cost spike with the root cause (which organization, which model)?

Tag all metrics with organization_id and model. Use Grafana's drill-down: click a cost spike to see which org caused it, then drill into that org's request logs to find the problematic prompts.
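
The same drill-down can be approximated offline once each request is tagged. As a sketch, assuming you can export per-request cost records that carry organization_id and model, grouping by that pair and sorting by total cost surfaces the likely culprit. CostRecord and top_cost_drivers are hypothetical names, and the sample data is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CostRecord(NamedTuple):
    organization_id: str
    model: str
    cost_usd: float


def top_cost_drivers(
    records: Iterable[CostRecord], limit: int = 3
) -> List[Tuple[Tuple[str, str], float]]:
    """Sum cost per (organization_id, model) pair and return the largest totals first."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for record in records:
        totals[(record.organization_id, record.model)] += record.cost_usd
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Invented sample data for illustration.
sample = [
    CostRecord("org_a", "model_small", 1.0),
    CostRecord("org_b", "model_large", 5.0),
    CostRecord("org_a", "model_large", 0.5),
    CostRecord("org_b", "model_large", 2.5),
]
drivers = top_cost_drivers(sample)
```

In this sample the top entry is org_b on model_large, so that organization's request logs would be the next place to look.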

Further Reading