Observability and Monitoring for LLM Systems
Observability in LLM systems is the ability to understand what is happening in production inference by collecting and analyzing metrics, logs, and traces. Bugs in traditional software usually show up as hard, reproducible failures, whereas LLM failures are often soft (low quality, hallucinations, drift) and require rich context to debug. Observability infrastructure captures request-level details: which model version served the request, the input, the output, a quality score, latency, cost, and any errors. Dashboards aggregate this data to show trends: is accuracy stable, is latency increasing, which prompts fail most often? Alerts notify your team when metrics deviate from baseline. With strong observability, you catch quality regressions within hours of deployment, not days later when users complain.
Three Pillars of Observability: Metrics, Logs, Traces
Metrics are time-series data points (e.g., accuracy: 0.87 at 10:05 UTC, latency_p99: 1200ms at 10:05 UTC). Metrics answer: "Is the system healthy?" Aggregate them into dashboards, and set thresholds and alerts on them.
Logs are structured events (e.g., {"request_id": "req-123", "model": "claude-3-5-sonnet-v1.2.3", "query": "...", "output": "...", "quality": 0.91}). Logs provide detail; you query them to debug specific failures.
Traces are end-to-end request flows showing latency breakdown (time in queue, model inference, post-processing). Traces help you understand why a request is slow.
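To make the latency breakdown concrete, here is a minimal sketch of a per-request trace that times named stages. The `RequestTrace` class and the stage names are illustrative assumptions, not part of any tracing library; a production system would more likely use a tracing SDK that exports spans to a backend.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects (stage name, duration in ms) pairs for a single request."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.spans: list[tuple[str, float]] = []

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.spans.append((name, (time.perf_counter() - start) * 1000))

    def slowest(self) -> str:
        """Return the name of the stage that took the longest."""
        return max(self.spans, key=lambda s: s[1])[0]


trace = RequestTrace("req-123")
with trace.span("queue"):
    time.sleep(0.002)   # stand-in for time spent waiting in a queue
with trace.span("model_inference"):
    time.sleep(0.05)    # stand-in for the model call
with trace.span("post_processing"):
    time.sleep(0.001)   # stand-in for parsing and scoring

for name, ms in trace.spans:
    print(f"{name}: {ms:.1f} ms")
print("slowest stage:", trace.slowest())
```

Reading the spans in order shows where the time went, which is the question a trace is meant to answer when a request is slow.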
Instrumentation: Collecting Data
Instrument your LLM service to emit metrics, logs, and traces.
import json
import logging
import random
import time
import uuid
from datetime import datetime, timezone
from typing import Optional

from anthropic import Anthropic

# Set up structured logging: each log line is a single JSON object.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-inference")

# Placeholder cost per request, in cents; replace with a value derived from token usage.
PLACEHOLDER_COST_CENTS = 1.5


class InstrumentedLLMService:
    def __init__(self, model: str):
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    def infer(self, query: str, system_prompt: str, request_id: Optional[str] = None) -> dict:
        """Inference with full observability instrumentation."""
        # A UUID avoids the collisions a 6-digit random number would produce at scale.
        request_id = request_id or f"req-{uuid.uuid4().hex}"
        start_time = time.perf_counter()

        # Emit trace event: inference start
        self._emit_trace("inference_start", request_id, {
            "model": self.model,
            "query_length": len(query),
        })

        try:
            # Call the model and time the call on its own
            inference_start = time.perf_counter()
            response = self.client.messages.create(
                model=self.model,
                max_tokens=500,
                system=system_prompt,
                messages=[{"role": "user", "content": query}],
            )
            inference_latency_ms = (time.perf_counter() - inference_start) * 1000
            output = response.content[0].text

            # Compute quality metric (placeholder)
            quality_score = self._score_output(query, output)
            cost_cents = PLACEHOLDER_COST_CENTS

            # Total latency includes scoring and other post-processing
            total_latency_ms = (time.perf_counter() - start_time) * 1000

            # Emit raw per-request values. Percentiles such as p99 are computed by the
            # monitoring backend over many observations, so they are not tagged here.
            self._emit_metric("llm_inference_latency_ms", inference_latency_ms, {"model": self.model})
            self._emit_metric("llm_output_quality", quality_score, {"model": self.model})
            self._emit_metric("llm_cost_cents", cost_cents, {"model": self.model})

            # Emit trace event: inference end
            self._emit_trace("inference_end", request_id, {
                "model": self.model,
                "inference_latency_ms": inference_latency_ms,
            })

            # Emit structured log
            log_entry = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "request_id": request_id,
                "event": "inference_complete",
                "model": self.model,
                "query_length": len(query),
                "output_length": len(output),
                "quality_score": quality_score,
                "latency_ms": total_latency_ms,
                "cost_cents": cost_cents,
                "status": "success",
            }
            logger.info(json.dumps(log_entry))

            return {
                "request_id": request_id,
                "output": output,
                "quality_score": quality_score,
                "latency_ms": total_latency_ms,
                "cost_cents": cost_cents,
            }
        except Exception as e:
            # Emit error metric and log, then re-raise so the caller still sees the failure
            self._emit_metric("llm_inference_error_total", 1, {
                "model": self.model,
                "error_type": type(e).__name__,
            })
            latency_ms = (time.perf_counter() - start_time) * 1000
            log_entry = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "request_id": request_id,
                "event": "inference_failed",
                "model": self.model,
                "error": str(e),
                "error_type": type(e).__name__,
                "latency_ms": latency_ms,
                "status": "error",
            }
            logger.error(json.dumps(log_entry))
            raise

    def _score_output(self, query: str, output: str) -> float:
        """Score output quality (placeholder: random score)."""
        return random.uniform(0.7, 0.95)

    def _emit_metric(self, metric_name: str, value: float, tags: dict):
        """Emit metric to monitoring system (Datadog, Prometheus, etc.)."""
        # Pseudo-code: datadog_client.gauge(metric_name, value, tags=tags)
        print(f"METRIC: {metric_name}={value} {json.dumps(tags)}")

    def _emit_trace(self, event: str, request_id: str, data: dict):
        """Emit trace event to distributed tracing system (Jaeger, etc.)."""
        # Pseudo-code: tracer.start_span(event, attributes={"request_id": request_id, **data})
        print(f"TRACE: {event} request_id={request_id} {json.dumps(data)}")
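The instrumented service above reports a fixed placeholder cost of 1.5 cents. One way to replace it is to derive the cost from token counts, which the Anthropic SDK exposes on the response as `response.usage.input_tokens` and `response.usage.output_tokens`. The sketch below assumes a hypothetical price table; the model name and the prices are made up for illustration and should be replaced with your provider's current rates.

```python
# Hypothetical prices in US dollars per million tokens (illustrative only).
PRICE_PER_MILLION_TOKENS_USD = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def estimate_cost_cents(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in cents from its token counts."""
    prices = PRICE_PER_MILLION_TOKENS_USD[model]
    dollars = (
        input_tokens * prices["input"] + output_tokens * prices["output"]
    ) / 1_000_000
    return dollars * 100


cost = estimate_cost_cents("example-model", input_tokens=1200, output_tokens=400)
print(f"{cost:.2f} cents")  # prints "0.96 cents"
```

Logging the token counts alongside the computed cost keeps the numbers auditable if prices change later.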
Key Metrics for LLM Systems
Define metrics aligned with your business and operational goals:
| Metric | Unit | Threshold | Interpretation |
|---|---|---|---|
| inference_latency_p99 | milliseconds | < 2000 ms | 99th percentile request latency |
| output_quality_score | 0-1 | >= 0.85 | Semantic quality (accuracy, relevance) |
| error_rate | percent | < 1% | % of requests that fail |
| hallucination_rate | percent | < 2% | % outputs containing false information |
| toxicity_rate | percent | < 0.5% | % outputs violating content policy |
| cost_per_request | cents | < 2.0 | API cost per inference |
| model_version | categorical (label) | - | Model version that served the request; used to segment the metrics above |
| prompt_version | categorical (label) | - | Prompt version that was used; used to segment the metrics above |
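The thresholds in the table can be encoded as data and checked against a snapshot of current values. This is a minimal sketch: the `THRESHOLDS` mapping mirrors the table, while `find_violations` and the sample snapshot are hypothetical names and numbers chosen for illustration.

```python
import operator

# Each entry maps a metric to (comparison that must hold, limit), taken from the table above.
THRESHOLDS = {
    "inference_latency_p99": (operator.lt, 2000),   # milliseconds
    "output_quality_score": (operator.ge, 0.85),    # 0-1
    "error_rate": (operator.lt, 1.0),               # percent
    "hallucination_rate": (operator.lt, 2.0),       # percent
    "toxicity_rate": (operator.lt, 0.5),            # percent
    "cost_per_request": (operator.lt, 2.0),         # cents
}


def find_violations(snapshot: dict[str, float]) -> list[str]:
    """Return the names of metrics in the snapshot that miss their threshold."""
    violations = []
    for name, (compare, limit) in THRESHOLDS.items():
        if name in snapshot and not compare(snapshot[name], limit):
            violations.append(name)
    return violations


snapshot = {
    "inference_latency_p99": 2400,
    "output_quality_score": 0.88,
    "error_rate": 0.4,
    "cost_per_request": 2.0,
}
print(find_violations(snapshot))
# ['inference_latency_p99', 'cost_per_request']
```

Note that a cost of exactly 2.0 cents counts as a violation because the table specifies a strict `< 2.0` limit.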
from sentence_transformers import SentenceTransformer

# Load the embedding model once; reloading it on every call is slow.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def compute_quality_metrics(outputs: list[str], references: list[str]) -> dict:
    """Compute aggregate quality metrics over a batch."""
    if not outputs or len(outputs) != len(references):
        raise ValueError("outputs and references must be non-empty and the same length")

    # Normalized embeddings make the dot product equal to cosine similarity.
    output_embs = _embedder.encode(outputs, normalize_embeddings=True)
    reference_embs = _embedder.encode(references, normalize_embeddings=True)
    similarities = [float(o @ r) for o, r in zip(output_embs, reference_embs)]

    return {
        # "accuracy" here means the share of outputs with similarity above 0.85
        "accuracy": sum(1 for s in similarities if s > 0.85) / len(similarities),
        "avg_similarity": sum(similarities) / len(similarities),
        "min_similarity": min(similarities),
        "max_similarity": max(similarities),
    }
Building Dashboards
Create dashboards that surface the most important metrics at a glance. A typical LLM operations dashboard shows:
- Overview: current accuracy, error rate, latency, cost (big numbers, red if bad).
- Trend graphs: accuracy over 24 hours, error rate over 7 days, cost per day.
- Model breakdown: accuracy by model version (Claude 3 vs 3.5, versions 1.2 vs 1.3).
- Prompt breakdown: accuracy by prompt version, to detect which prompts underperform.
- Latency distribution: histogram of latencies, p50 / p95 / p99 percentiles.
- Cost tracking: total cost, cost per request, cost trend.
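The latency panel relies on percentiles rather than averages. As a rough illustration of why, the sketch below computes p50, p95, and p99 with the nearest-rank method over a small made-up sample; the `percentile` helper and the latency values are assumptions for this example, and monitoring backends typically estimate percentiles from histograms instead.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 135, 150, 180, 200, 220, 250, 300, 450, 2900]

print("mean:", sum(latencies_ms) / len(latencies_ms))  # 490.5
print("p50:", percentile(latencies_ms, 50))            # 200
print("p95:", percentile(latencies_ms, 95))            # 2900
print("p99:", percentile(latencies_ms, 99))            # 2900
```

The mean hides the fact that most requests finish in about 200 ms while a few take seconds; the p50 and p99 together make that visible.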
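# Dashboard config using Prometheus queries (simplified pseudo-config, not a specific tool's schema)
# Assumes a request counter (llm_inference_requests_total) is emitted alongside the error
# counter, and that latency is exported as a Prometheus histogram (llm_inference_latency_ms_bucket).
dashboard:
  title: "LLM Operations"
  refresh: 30s
  panels:
    - title: "Current Accuracy"
      type: stat
      metric: "llm_output_quality"
      condition: "avg(avg_over_time(llm_output_quality[1h]))"
      threshold: {green: 0.85, red: 0.75}
    - title: "Error Rate"
      type: stat
      metric: "llm_inference_error_total"
      # Ratio of failed requests to all requests, so 0.01 means 1%
      condition: "sum(rate(llm_inference_error_total[5m])) / sum(rate(llm_inference_requests_total[5m]))"
      threshold: {green: 0.01, red: 0.05}
    - title: "Accuracy Trend (24h)"
      type: graph
      metrics:
        # Group by the label your instrumentation actually emits (here: model)
        - "avg(llm_output_quality) by (model)"
      time_range: 24h
    - title: "Latency Distribution (p50, p95, p99)"
      type: graph
      metrics:
        - "histogram_quantile(0.5, sum(rate(llm_inference_latency_ms_bucket[5m])) by (le))"
        - "histogram_quantile(0.95, sum(rate(llm_inference_latency_ms_bucket[5m])) by (le))"
        - "histogram_quantile(0.99, sum(rate(llm_inference_latency_ms_bucket[5m])) by (le))"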
Alerting
Define alerts that notify the team of issues. Start with a few critical alerts and expand as your system matures.
# Prometheus alert rules
# Assumes a request counter (llm_inference_requests_total) is emitted alongside the error
# counter, and that latency is exported as a Prometheus histogram (llm_inference_latency_ms_bucket).
groups:
  - name: llm_system
    rules:
      - alert: HighErrorRate
        # Ratio of failed requests to all requests; rate() alone would give errors per second
        expr: "sum(rate(llm_inference_error_total[5m])) / sum(rate(llm_inference_requests_total[5m])) > 0.05"
        for: 5m
        annotations:
          summary: "LLM error rate high (5%+)"
          description: "Error rate is {{ $value | humanizePercentage }}"
          action: "Page on-call; check logs for error patterns"
      - alert: AccuracyDegraded
        expr: "avg(llm_output_quality) < 0.75"
        for: 30m
        annotations:
          summary: "LLM accuracy below 75%"
          description: "Accuracy is {{ $value | humanize }}"
          action: "Investigate latest deployment; consider rollback"
      - alert: HighLatency
        expr: "histogram_quantile(0.99, sum(rate(llm_inference_latency_ms_bucket[5m])) by (le)) > 3000"
        for: 10m
        annotations:
          summary: "LLM p99 latency high (>3s)"
          action: "Check model load; consider scaling"
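Each rule above pairs a threshold with a `for` duration, so a brief spike does not page anyone. The following sketch approximates that behavior in plain Python: an alert fires only if every sample since the breach began has stayed above the threshold for at least the given duration. The function name and the sample data are illustrative assumptions, and the real Prometheus evaluation loop differs in detail.

```python
def sustained_breach(
    samples: list[tuple[float, float]], threshold: float, for_seconds: float
) -> bool:
    """Return True if the value has been above threshold continuously for
    at least for_seconds, measured up to the most recent sample.

    samples is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # breach begins here
        else:
            breach_start = None  # condition cleared; reset the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Error-rate samples taken once per minute (made-up values).
spike = [(0, 0.02), (60, 0.09), (120, 0.02), (180, 0.06)]
outage = [(0, 0.02), (60, 0.06), (120, 0.07), (180, 0.08),
          (240, 0.09), (300, 0.06), (360, 0.07)]

print(sustained_breach(spike, threshold=0.05, for_seconds=300))   # False
print(sustained_breach(outage, threshold=0.05, for_seconds=300))  # True
```

Longer durations reduce noise at the cost of slower detection, which is why the accuracy rule waits 30 minutes while the error-rate rule waits only 5.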
Debugging with Logs and Traces
When an alert fires or a user reports an issue, use logs and traces to debug. Query logs for failures in a specific time window or for a specific model version. Look for patterns: do all failures share a common input pattern (e.g., non-English text)? Did accuracy drop right after a deployment?
def debug_accuracy_regression(model: str, start_time: str, end_time: str):
    """Query logs to debug accuracy drop for a model."""
    # query_logs is a stand-in for your log store's query API; it is not defined here.
    logs = query_logs({
        "model": model,
        "status": "success",
        "timestamp": {
            "gte": start_time,
            "lte": end_time,
        },
    })
    if not logs:
        print("No matching logs found.")
        return

    # Group by prompt_version and compare quality.
    # This requires prompt_version to be included in each log entry.
    by_prompt = {}
    for log in logs:
        quality = log.get("quality_score")
        if quality is None:
            continue  # skip unscored entries instead of counting them as 0
        prompt = log.get("prompt_version", "unknown")
        by_prompt.setdefault(prompt, []).append(quality)

    # Compute average quality per prompt
    for prompt, scores in by_prompt.items():
        avg = sum(scores) / len(scores)
        print(f"{prompt}: avg quality {avg:.2f} (n={len(scores)})")

    # Identify low-performing inputs
    low_quality = [log for log in logs if log.get("quality_score", 1) < 0.70]
    print(f"\nLow-quality outputs ({len(low_quality)}):")
    for log in low_quality[:5]:
        # The query text is only available if it was logged; fall back to an empty string.
        print(f"  Query: {log.get('query', '')[:50]}...")
        print(f"  Quality: {log['quality_score']:.2f}")
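The debugging function above calls `query_logs`, which the text leaves undefined. As a stand-in for local experiments, the sketch below filters an in-memory list of JSON log lines using the same filter shape (exact match, or a `gte`/`lte` range). The `LOG_LINES` data and the `_matches` helper are hypothetical; a real deployment would query a log store instead.

```python
import json

# Hypothetical in-memory log store: one JSON object per line.
LOG_LINES = [
    '{"timestamp": "2024-05-01T10:00:00+00:00", "model": "model-a", "status": "success", "prompt_version": "v1", "quality_score": 0.91}',
    '{"timestamp": "2024-05-01T11:00:00+00:00", "model": "model-a", "status": "success", "prompt_version": "v2", "quality_score": 0.62}',
    '{"timestamp": "2024-05-01T12:00:00+00:00", "model": "model-b", "status": "success", "prompt_version": "v2", "quality_score": 0.88}',
    '{"timestamp": "2024-05-02T09:00:00+00:00", "model": "model-a", "status": "error"}',
]


def _matches(entry: dict, filters: dict) -> bool:
    """Check one log entry against exact-match and range filters."""
    for field, expected in filters.items():
        value = entry.get(field)
        if isinstance(expected, dict):
            # Range filter. ISO 8601 timestamps in the same format and
            # timezone compare correctly as plain strings.
            if value is None:
                return False
            if "gte" in expected and value < expected["gte"]:
                return False
            if "lte" in expected and value > expected["lte"]:
                return False
        elif value != expected:
            return False
    return True


def query_logs(filters: dict) -> list[dict]:
    """Return all log entries that satisfy every filter."""
    entries = (json.loads(line) for line in LOG_LINES)
    return [entry for entry in entries if _matches(entry, filters)]


results = query_logs({
    "model": "model-a",
    "status": "success",
    "timestamp": {
        "gte": "2024-05-01T00:00:00+00:00",
        "lte": "2024-05-01T23:59:59+00:00",
    },
})
for entry in results:
    print(entry["prompt_version"], entry["quality_score"])
```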
Key Takeaways
- Observability in LLM systems requires collecting metrics (time-series), logs (structured events), and traces (request flows).
- Key metrics include inference latency, output quality, error rate, hallucination rate, toxicity rate, and cost per request.
- Build dashboards that surface accuracy, error rate, and latency trends; break down by model version and prompt version.
- Define alerts for critical conditions: high error rate, accuracy below threshold, latency spike, cost spike.
- Use logs and traces to debug failures; query for patterns (e.g., which prompts fail most often).
Frequently Asked Questions
What is the difference between monitoring and observability?
Monitoring means watching predefined metrics and alerting on thresholds (e.g., alert if accuracy < 0.75). Observability means having enough visibility into your system to ask arbitrary questions (e.g., "Why did accuracy drop for Spanish queries?"). Strong observability enables ad hoc debugging; monitoring alone detects that something is wrong.
Should I log every inference request?
Ideally, yes. Log at least request_id, model, query length, output length, quality score, latency, and cost. For large-scale systems, you can sample routine successful requests (e.g., keep 10% of them) to reduce storage costs, but always log errors and low-quality outputs.
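A sampling policy along these lines can be sketched as a single decision function. The version below always keeps errors and low-quality outputs and keeps a fixed share of everything else, hashing the request ID so the decision is repeatable for the same request. The function name, the 0.70 quality floor, and the hashing approach are assumptions made for this illustration.

```python
import hashlib
from typing import Optional


def should_log(
    request_id: str,
    status: str,
    quality_score: Optional[float],
    sample_rate: float = 0.10,
    quality_floor: float = 0.70,
) -> bool:
    """Decide whether to keep the detailed log entry for a request."""
    if status != "success":
        return True  # always keep errors
    if quality_score is not None and quality_score < quality_floor:
        return True  # always keep low-quality outputs
    # Deterministic sampling: map the request ID to a bucket in [0, 10000).
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < sample_rate * 10_000


print(should_log("req-1", "error", None))    # True: errors are always kept
print(should_log("req-2", "success", 0.40))  # True: low quality is always kept
kept = sum(should_log(f"req-{i}", "success", 0.90) for i in range(10_000))
print(f"kept {kept} of 10000 routine requests")  # roughly 10%
```

Hashing rather than calling a random number generator means every service that sees the same request ID reaches the same decision, so sampled requests keep complete logs end to end.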
How long should I retain logs?
Retain detailed logs for 30 days (enough for post-mortems and debugging). Archive older logs to cold storage (S3, GCS) for compliance. Aggregate logs to metrics and retain metrics for 1-2 years for trend analysis.
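Aggregating logs into metrics before archiving them might look like the daily rollup below. It is a sketch under the assumption that each log entry carries an ISO 8601 `timestamp`, a `status`, and an optional `quality_score`, as in the instrumentation earlier; the `rollup_daily` name and the sample entries are hypothetical.

```python
from collections import defaultdict


def rollup_daily(logs: list[dict]) -> dict[str, dict]:
    """Summarize request-level logs into one small record per calendar day."""
    by_day = defaultdict(list)
    for log in logs:
        day = log["timestamp"][:10]  # "YYYY-MM-DD" prefix of an ISO 8601 timestamp
        by_day[day].append(log)

    summary = {}
    for day, entries in sorted(by_day.items()):
        scores = [e["quality_score"] for e in entries if e.get("quality_score") is not None]
        errors = sum(1 for e in entries if e.get("status") == "error")
        summary[day] = {
            "requests": len(entries),
            "error_rate": errors / len(entries),
            "avg_quality": sum(scores) / len(scores) if scores else None,
        }
    return summary


sample_logs = [
    {"timestamp": "2024-05-01T10:00:00+00:00", "status": "success", "quality_score": 0.9},
    {"timestamp": "2024-05-01T11:00:00+00:00", "status": "success", "quality_score": 0.8},
    {"timestamp": "2024-05-01T12:00:00+00:00", "status": "error"},
    {"timestamp": "2024-05-01T13:00:00+00:00", "status": "success", "quality_score": 0.7},
    {"timestamp": "2024-05-02T09:00:00+00:00", "status": "success", "quality_score": 0.95},
]

daily = rollup_daily(sample_logs)
for day, stats in daily.items():
    print(day, stats)
```

The daily records are tiny compared with the raw logs, which is what makes a retention period of a year or more practical for trend analysis.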
Can I use the same monitoring system for LLMs as traditional services?
Yes, mostly. Prometheus, Datadog, New Relic, and Grafana work fine. You just need to instrument LLM-specific metrics (quality score, hallucination rate, model version). Some specialized platforms (Weights & Biases, Arize) offer LLM-specific monitoring; consider them if you need deep ML insights.
What should I do if I find a pattern of failures in logs (e.g., all failures are non-English queries)?
Document the pattern and decide: refine the prompt to handle that case better, add a note to the prompt about limitations, or file a bug to improve the model. Use logs to track the fix: compare outputs before and after the fix to confirm the pattern is resolved.
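To confirm that a fix resolved a pattern, one option is to compare quality scores for the affected segment before and after the fix was deployed. The sketch below assumes log entries carry a `language` field, which the instrumentation in this article does not emit; that field, the function name, and the sample data are all hypothetical.

```python
def compare_before_after(logs: list[dict], fix_time: str, field: str, value: str) -> dict:
    """Compare average quality for one segment before and after a fix.

    Timestamps are ISO 8601 strings in the same format and timezone,
    so they can be compared as plain strings.
    """
    before, after = [], []
    for log in logs:
        score = log.get("quality_score")
        if log.get(field) != value or score is None:
            continue  # outside the segment, or not scored
        (before if log["timestamp"] < fix_time else after).append(score)

    def mean(scores):
        return sum(scores) / len(scores) if scores else None

    return {
        "before": mean(before),
        "after": mean(after),
        "n_before": len(before),
        "n_after": len(after),
    }


segment_logs = [
    {"timestamp": "2024-05-01T10:00:00+00:00", "language": "es", "quality_score": 0.55},
    {"timestamp": "2024-05-01T11:00:00+00:00", "language": "es", "quality_score": 0.60},
    {"timestamp": "2024-05-01T12:00:00+00:00", "language": "en", "quality_score": 0.90},
    {"timestamp": "2024-05-03T10:00:00+00:00", "language": "es", "quality_score": 0.85},
    {"timestamp": "2024-05-03T11:00:00+00:00", "language": "es", "quality_score": 0.90},
]

report = compare_before_after(segment_logs, "2024-05-02T00:00:00+00:00", "language", "es")
print(report)
```

With only a handful of samples the comparison is anecdotal, so it is worth checking the `n_before` and `n_after` counts before declaring the pattern resolved.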