Observability and Monitoring for LLM Systems

Observability in LLM systems is the ability to understand what is happening in production inference by collecting and analyzing metrics, logs, and traces. Unlike traditional software, where bugs are usually deterministic and reproducible, LLM failures are often soft (low quality, hallucinations, drift) and require rich context to debug. Observability infrastructure captures request-level details: which model version served the request, the input, the output, the quality score, latency, cost, and any errors. Dashboards aggregate this data to show trends: is accuracy stable, is latency increasing, which prompts fail most often? Alerts notify your team when metrics deviate from baseline. With strong observability, you catch quality regressions within hours of deployment rather than days later, when users complain.

Three Pillars of Observability: Metrics, Logs, Traces

Metrics are time-series data points (e.g., accuracy: 0.87 at 10:05 UTC, latency_p99: 1200ms at 10:05 UTC). Metrics answer: "Is the system healthy?" Aggregate metrics into dashboards, and set thresholds and alerts on them.
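To make a value like latency_p99 concrete, here is a minimal sketch of how a percentile can be derived from raw latency samples using the nearest-rank method. The percentile function and the sample values are illustrative assumptions; monitoring backends usually compute percentiles for you, often from histogram buckets rather than raw samples.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies (ms) collected during one scrape window
latencies_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 1200]
summary = {
    "latency_p50": percentile(latencies_ms, 50),
    "latency_p99": percentile(latencies_ms, 99),
}
print(summary)
```

Note how a single slow request dominates the p99 while leaving the p50 untouched, which is why tail percentiles are tracked separately from the median.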

Logs are structured events (e.g., {"request_id": "req-123", "model": "claude-3-5-sonnet-v1.2.3", "query": "...", "output": "...", "quality": 0.91}). Logs provide detail; you query them to debug specific failures.

Traces are end-to-end request flows showing latency breakdown (time in queue, model inference, post-processing). Traces help you understand why a request is slow.
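As a rough illustration of a latency breakdown, the sketch below times each stage of a request with a context manager. RequestTrace and the stage names are hypothetical, and the sleeps stand in for real work; a production system would more likely use a tracing library such as OpenTelemetry and export spans to a backend.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects (stage name, duration in ms) pairs for a single request."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.spans: list[tuple[str, float]] = []

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append((name, elapsed_ms))

    def slowest(self) -> str:
        """Name of the stage that took the longest."""
        return max(self.spans, key=lambda s: s[1])[0]


trace = RequestTrace("req-123")
with trace.span("queue"):
    time.sleep(0.01)   # stand-in for time spent waiting in the queue
with trace.span("model_inference"):
    time.sleep(0.05)   # stand-in for the model call
with trace.span("post_processing"):
    time.sleep(0.005)  # stand-in for parsing and scoring

for name, ms in trace.spans:
    print(f"{trace.request_id} {name}: {ms:.1f} ms")
print("slowest stage:", trace.slowest())
```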

Instrumentation: Collecting Data

Instrument your LLM service to emit metrics, logs, and traces.

import json
import logging
import random
import time
import uuid
from datetime import datetime, timezone
from typing import Optional

from anthropic import Anthropic

# Set up structured logging (one JSON object per log line)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-inference")

# Placeholder cost; in practice derive it from response.usage token counts
# and your provider's current pricing.
PLACEHOLDER_COST_CENTS = 1.5


class InstrumentedLLMService:
    def __init__(self, model: str):
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    def infer(self, query: str, system_prompt: str, request_id: Optional[str] = None) -> dict:
        """Inference with full observability instrumentation."""
        request_id = request_id or f"req-{uuid.uuid4().hex[:12]}"
        start_time = time.time()

        # Emit trace event: inference start
        self._emit_trace("inference_start", request_id, {
            "model": self.model,
            "query_length": len(query)
        })
        # Count every request so error rate can be computed as errors / requests
        self._emit_metric("llm_inference_requests_total", 1, {"model": self.model})

        try:
            # Call model
            inference_start = time.time()
            response = self.client.messages.create(
                model=self.model,
                max_tokens=500,
                system=system_prompt,
                messages=[{"role": "user", "content": query}]
            )
            inference_latency_ms = (time.time() - inference_start) * 1000
            output = response.content[0].text

            # Compute quality metric (placeholder)
            quality_score = self._score_output(query, output)

            # Total latency includes scoring, not just the model call
            total_latency_ms = (time.time() - start_time) * 1000

            # Emit raw per-request values; percentiles (p50, p99) are computed
            # by the monitoring system at aggregation time, not tagged here.
            self._emit_metric("llm_inference_latency_ms", inference_latency_ms, {
                "model": self.model
            })
            self._emit_metric("llm_output_quality", quality_score, {
                "model": self.model
            })
            self._emit_metric("llm_cost_cents", PLACEHOLDER_COST_CENTS, {
                "model": self.model
            })

            # Emit structured log
            log_entry = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "request_id": request_id,
                "event": "inference_complete",
                "model": self.model,
                "query_length": len(query),
                "output_length": len(output),
                "quality_score": quality_score,
                "latency_ms": total_latency_ms,
                "cost_cents": PLACEHOLDER_COST_CENTS,
                "status": "success"
            }
            logger.info(json.dumps(log_entry))

            return {
                "request_id": request_id,
                "output": output,
                "quality_score": quality_score,
                "latency_ms": total_latency_ms,
                "cost_cents": PLACEHOLDER_COST_CENTS
            }

        except Exception as e:
            # Emit error metric and log, then re-raise so callers see the failure
            self._emit_metric("llm_inference_error_total", 1, {
                "model": self.model,
                "error_type": type(e).__name__
            })

            latency_ms = (time.time() - start_time) * 1000
            log_entry = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "request_id": request_id,
                "event": "inference_failed",
                "model": self.model,
                "error": str(e),
                "error_type": type(e).__name__,
                "latency_ms": latency_ms,
                "status": "error"
            }
            logger.error(json.dumps(log_entry))
            raise

    def _score_output(self, query: str, output: str) -> float:
        """Score output quality (placeholder: random score)."""
        return random.uniform(0.7, 0.95)

    def _emit_metric(self, metric_name: str, value: float, tags: dict):
        """Emit metric to monitoring system (Datadog, Prometheus, etc.)."""
        # Pseudo-code: datadog_client.gauge(metric_name, value, tags=tags)
        print(f"METRIC: {metric_name}={value} {json.dumps(tags)}")

    def _emit_trace(self, event: str, request_id: str, data: dict):
        """Emit trace event to distributed tracing system (Jaeger, etc.)."""
        # Pseudo-code: tracer.start_span(event, attributes={request_id, **data})
        print(f"TRACE: {event} request_id={request_id} {json.dumps(data)}")

Key Metrics for LLM Systems

Define metrics aligned with your business and operational goals:

| Metric | Unit | Threshold | Interpretation |
| inference_latency_p99 | milliseconds | < 2000 ms | 99th percentile request latency |
| output_quality_score | 0-1 | >= 0.85 | Semantic quality (accuracy, relevance) |
| error_rate | percent | < 1% | % of requests that fail |
| hallucination_rate | percent | < 2% | % of outputs containing false information |
| toxicity_rate | percent | < 0.5% | % of outputs violating content policy |
| cost_per_request | cents | < 2.0 | API cost per inference |
| model_version | categorical | - | Which model version served the request |
| prompt_version | categorical | - | Which prompt version was used |

from sentence_transformers import SentenceTransformer

# Load the embedding model once, not on every call
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def compute_quality_metrics(outputs: list[str], references: list[str]) -> dict:
    """Compute aggregate quality metrics over a batch."""
    if not outputs or len(outputs) != len(references):
        raise ValueError("outputs and references must be non-empty and equal in length")

    # Unit-normalized embeddings make the dot product a cosine similarity
    output_embs = embedder.encode(outputs, normalize_embeddings=True)
    reference_embs = embedder.encode(references, normalize_embeddings=True)
    similarities = [float(o @ r) for o, r in zip(output_embs, reference_embs)]

    return {
        # "accuracy" here means the share of outputs with similarity above 0.85
        "accuracy": sum(1 for s in similarities if s > 0.85) / len(similarities),
        "avg_similarity": sum(similarities) / len(similarities),
        "min_similarity": min(similarities),
        "max_similarity": max(similarities)
    }
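The thresholds in the table can also be checked programmatically. The following is a minimal sketch, assuming log entries shaped like the structured logs emitted earlier (status, quality_score, cost_cents); the THRESHOLDS mapping and the summarize and violations helpers are hypothetical names introduced for illustration.

```python
# Thresholds taken from the metrics table: (comparison, limit)
THRESHOLDS = {
    "error_rate": ("<", 0.01),
    "output_quality_score": (">=", 0.85),
    "cost_per_request": ("<", 2.0),
}


def summarize(logs: list[dict]) -> dict:
    """Aggregate request-level log entries into table-style metrics."""
    if not logs:
        raise ValueError("need at least one log entry")
    ok = [e for e in logs if e["status"] == "success"]
    return {
        "error_rate": (len(logs) - len(ok)) / len(logs),
        "output_quality_score": sum(e["quality_score"] for e in ok) / len(ok) if ok else 0.0,
        "cost_per_request": sum(e["cost_cents"] for e in ok) / len(ok) if ok else 0.0,
    }


def violations(summary: dict) -> list[str]:
    """Return the names of metrics that miss their threshold."""
    failed = []
    for name, (op, limit) in THRESHOLDS.items():
        value = summary[name]
        within = value < limit if op == "<" else value >= limit
        if not within:
            failed.append(name)
    return failed


# Illustrative batch: three successes and one failure
logs = [
    {"status": "success", "quality_score": 0.9, "cost_cents": 1.5},
    {"status": "success", "quality_score": 0.8, "cost_cents": 1.5},
    {"status": "success", "quality_score": 0.7, "cost_cents": 1.5},
    {"status": "error"},
]
summary = summarize(logs)
print(summary, violations(summary))
```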

Building Dashboards

Create dashboards that surface the most important metrics at a glance. A typical LLM operations dashboard shows:

  • Overview: current accuracy, error rate, latency, cost (big numbers, red if bad).
  • Trend graphs: accuracy over 24 hours, error rate over 7 days, cost per day.
  • Model breakdown: accuracy by model version (Claude 3 vs 3.5, versions 1.2 vs 1.3).
  • Prompt breakdown: accuracy by prompt version, detect which prompts underperform.
  • Latency distribution: histogram of latencies, p50 / p95 / p99 percentiles.
  • Cost tracking: total cost, cost per request, cost trend.
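
# Illustrative dashboard config (simplified pseudo-schema, not a real
# Grafana or Prometheus file format; the queries are PromQL)
dashboard:
  title: "LLM Operations"
  refresh: 30s
  panels:
    - title: "Current Accuracy"
      type: stat
      metric: "llm_output_quality"
      condition: "avg(avg_over_time(llm_output_quality[1h]))"
      threshold: {green: 0.85, red: 0.75}

    - title: "Error Rate"
      type: stat
      metric: "llm_inference_error_total"
      # Fraction of requests that fail. Assumes a request counter
      # (llm_inference_requests_total) is emitted alongside the error counter;
      # rate() on the error counter alone yields errors per second, not a ratio.
      condition: "sum(rate(llm_inference_error_total[5m])) / sum(rate(llm_inference_requests_total[5m]))"
      threshold: {green: 0.01, red: 0.05}

    - title: "Accuracy Trend (24h)"
      type: graph
      metrics:
        - "avg by (model) (llm_output_quality)"
      time_range: 24h

    - title: "Latency Distribution (p50, p95, p99)"
      type: graph
      metrics:
        # Assumes latency is exported as a Prometheus histogram (_bucket series)
        - "histogram_quantile(0.5, sum by (le) (rate(llm_inference_latency_ms_bucket[5m])))"
        - "histogram_quantile(0.95, sum by (le) (rate(llm_inference_latency_ms_bucket[5m])))"
        - "histogram_quantile(0.99, sum by (le) (rate(llm_inference_latency_ms_bucket[5m])))"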

Alerting

Define alerts that notify the team of issues. Start with a few critical alerts and expand as your system matures.

# Prometheus alert rules
groups:
  - name: llm_system
    rules:
      - alert: HighErrorRate
        # Error ratio, not errors per second. Assumes a request counter
        # (llm_inference_requests_total) is emitted alongside the error counter.
        expr: "sum(rate(llm_inference_error_total[5m])) / sum(rate(llm_inference_requests_total[5m])) > 0.05"
        for: 5m
        annotations:
          summary: "LLM error rate high (5%+)"
          description: "Error rate is {{ $value | humanizePercentage }}"
          action: "Page on-call; check logs for error patterns"

      - alert: AccuracyDegraded
        expr: "avg(llm_output_quality) < 0.75"
        for: 30m
        annotations:
          summary: "LLM accuracy below 75%"
          description: "Accuracy is {{ $value | humanize }}"
          action: "Investigate latest deployment; consider rollback"

      - alert: HighLatency
        # Assumes latency is exported as a Prometheus histogram (_bucket series)
        expr: "histogram_quantile(0.99, sum by (le) (rate(llm_inference_latency_ms_bucket[5m]))) > 3000"
        for: 10m
        annotations:
          summary: "LLM p99 latency high (>3s)"
          action: "Check model load; consider scaling"
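The for: clause in each rule means the condition must hold continuously for that long before the alert fires, which filters out brief spikes. The sketch below illustrates that behavior in plain Python; PendingAlert is a hypothetical class written for this example, not part of Prometheus, and it simplifies how Prometheus tracks pending and firing states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None  # when the current breach began

    def observe(self, value: float, now: float) -> bool:
        """Return True once value has stayed above threshold for hold_seconds."""
        if value <= self.threshold:
            self.breach_started = None  # condition cleared: reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now   # first breaching observation
        return now - self.breach_started >= self.hold_seconds


# Error-rate samples as (timestamp in seconds, value), illustrative only
alert = PendingAlert(threshold=0.05, hold_seconds=300)
samples = [(0, 0.08), (120, 0.09), (180, 0.02), (240, 0.07), (540, 0.07)]
fired = [alert.observe(value, now) for now, value in samples]
print(fired)
```

The dip below the threshold at t=180 resets the timer, so the alert only fires once the breach that began at t=240 has lasted the full five minutes.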

Debugging with Logs and Traces

When an alert fires or a user reports an issue, use logs and traces to debug. Query logs for failures in a specific time window or for a specific model version. Look for patterns: do all failures share a common input pattern (e.g., non-English text)? Did accuracy drop right after a deployment?

def debug_accuracy_regression(model: str, start_time: str, end_time: str, query_logs):
    """Query logs to debug an accuracy drop for a model.

    query_logs is a placeholder for your log store's query function
    (Elasticsearch, Loki, BigQuery, ...). It is assumed to accept a filter
    dict and return an iterable of log-entry dicts.
    """
    # Query successful requests for the model and time range
    logs = list(query_logs({
        "model": model,
        "status": "success",
        "timestamp": {
            "gte": start_time,
            "lte": end_time
        }
    }))

    # Ignore entries without a score so they do not skew the averages
    scored = [log for log in logs if log.get("quality_score") is not None]

    # Group by prompt_version and compare quality
    by_prompt = {}
    for log in scored:
        prompt = log.get("prompt_version", "unknown")
        by_prompt.setdefault(prompt, []).append(log["quality_score"])

    # Compute average quality per prompt
    for prompt, scores in by_prompt.items():
        avg = sum(scores) / len(scores)
        print(f"{prompt}: avg quality {avg:.2f} (n={len(scores)})")

    # Identify low-performing inputs (query text is shown only if it was logged)
    low_quality = [log for log in scored if log["quality_score"] < 0.70]
    print(f"\nLow-quality outputs ({len(low_quality)}):")
    for log in low_quality[:5]:
        print(f"  Query: {log.get('query', '<not logged>')[:50]}...")
        print(f"  Quality: {log['quality_score']:.2f}")

Key Takeaways

  • Observability in LLM systems requires collecting metrics (time-series), logs (structured events), and traces (request flows).
  • Key metrics include inference latency, output quality, error rate, hallucination rate, toxicity rate, and cost per request.
  • Build dashboards that surface accuracy, error rate, and latency trends; break down by model version and prompt version.
  • Define alerts for critical conditions: high error rate, accuracy below threshold, latency spike, cost spike.
  • Use logs and traces to debug failures; query for patterns (e.g., which prompts fail most often).

Frequently Asked Questions

What is the difference between monitoring and observability?

Monitoring means watching predefined metrics and alerting on thresholds (e.g., alert if accuracy < 0.75). Observability means having enough visibility into your system to ask arbitrary questions (e.g., "Why did accuracy drop for Spanish queries?"). Strong observability enables ad hoc debugging; monitoring alone detects that something is wrong.

Should I log every inference request?

Ideally, yes. Log at least request_id, model, query length, output length, quality score, latency, and cost. For large-scale systems, you may sample successful requests (e.g., keep 10% of them) to reduce storage costs, but always log errors and low-quality outputs.
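One way to implement that sampling rule is sketched below. The should_log function, its default sample rate, and the 0.70 quality floor are assumptions for illustration; hashing the request_id makes the decision deterministic, so every service handling the same request reaches the same verdict.

```python
import hashlib


def should_log(entry: dict, sample_rate: float = 0.10, quality_floor: float = 0.70) -> bool:
    """Always keep errors and low-quality outputs; sample everything else."""
    if entry.get("status") != "success":
        return True
    if entry.get("quality_score", 1.0) < quality_floor:
        return True
    # Map the request_id to a stable value in [0, 1) and compare to the rate
    digest = hashlib.sha256(entry["request_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


error_entry = {"request_id": "req-1", "status": "error"}
poor_entry = {"request_id": "req-2", "status": "success", "quality_score": 0.55}
healthy_entry = {"request_id": "req-3", "status": "success", "quality_score": 0.92}

print(should_log(error_entry), should_log(poor_entry), should_log(healthy_entry))
```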

How long should I retain logs?

Retain detailed logs for 30 days (enough for post-mortems and debugging). Archive older logs to cold storage (S3, GCS) for compliance. Aggregate logs to metrics and retain metrics for 1-2 years for trend analysis.

Can I use the same monitoring system for LLMs as traditional services?

Yes, mostly. Prometheus, Datadog, New Relic, and Grafana work fine. You just need to instrument LLM-specific metrics (quality score, hallucination rate, model version). Some specialized platforms (Weights & Biases, Arize) offer LLM-specific monitoring; consider them if you need deep ML insights.

What should I do if I find a pattern of failures in logs (e.g., all failures are non-English queries)?

Document the pattern and decide: refine the prompt to handle that case better, add a note to the prompt about limitations, or file a bug to improve the model. Use logs to track the fix: compare outputs before and after the fix to confirm the pattern is resolved.
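A before-and-after comparison of this kind might look like the sketch below. It assumes each log entry carries a quality_score and some field that identifies the affected segment; the language field, the helper names, and the sample values are all hypothetical.

```python
from typing import Callable, Optional


def mean_quality(logs: list[dict], in_segment: Callable[[dict], bool]) -> Optional[float]:
    """Average quality score over entries in the segment, or None if empty."""
    scores = [e["quality_score"] for e in logs if in_segment(e)]
    return sum(scores) / len(scores) if scores else None


def fix_effect(before: list[dict], after: list[dict], in_segment: Callable[[dict], bool]) -> dict:
    """Compare segment quality before and after a fix was deployed."""
    b = mean_quality(before, in_segment)
    a = mean_quality(after, in_segment)
    delta = a - b if a is not None and b is not None else None
    return {"before": b, "after": a, "delta": delta}


def non_english(entry: dict) -> bool:
    # "language" is an assumed log field used only for this illustration
    return entry.get("language") != "en"


before_fix = [
    {"language": "es", "quality_score": 0.60},
    {"language": "de", "quality_score": 0.64},
    {"language": "en", "quality_score": 0.90},
]
after_fix = [
    {"language": "es", "quality_score": 0.80},
    {"language": "de", "quality_score": 0.84},
    {"language": "en", "quality_score": 0.91},
]
result = fix_effect(before_fix, after_fix, non_english)
print(result)
```

With real traffic, compare windows of similar size and traffic mix so that the difference reflects the fix rather than a change in the inputs.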

Further Reading