Advanced Observability: Traces, Metrics, and Dashboards
Advanced observability goes beyond simple monitoring: it combines distributed tracing (detailed request flow), metrics aggregation (performance signals), and dashboarding (human-readable visualizations) into a unified view of production LLM behavior. This article covers how to instrument your LLM system for observability, emit structured logs and traces, aggregate metrics, and build actionable dashboards that enable operators to diagnose issues in minutes, not hours.
The observability stack: traces, metrics, logs
Logs are unstructured text (or semi-structured JSON) describing discrete events. Example: "Generated output for request_id=123 in 1.2s". Logs are easy to produce but hard to query and correlate.
Metrics are quantitative measurements: counts, histograms, gauges. Example: latency p95, error rate, tokens per second. Metrics are efficient to store and query but lossy (you can't reconstruct individual requests).
Traces are detailed records of a single request's journey through your system: which models called which APIs, how long each step took, what data flowed where. Traces are expensive to store but invaluable for diagnosing issues.
Best practice: use all three. Logs for narrative details, metrics for trends, traces for debugging.
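Logs become far easier to query and correlate when each event is emitted as a structured record that carries a shared identifier. The following is a minimal sketch using only the Python standard library; the `JsonFormatter` class, the `build_logger` helper, and the field names (`request_id`, `model`, `latency_ms`) are illustrative assumptions, not part of any particular logging product.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("request_id", "model", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name="llm_pipeline"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
request_id = str(uuid.uuid4())
logger.info(
    "generation_complete",
    extra={"request_id": request_id, "model": "gpt-4", "latency_ms": 1200},
)
```

Because every line is valid JSON with a `request_id` field, a log backend can filter on that field and join the log line to the trace for the same request.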
Instrumenting your LLM system for tracing
Add tracing hooks at key points in your LLM pipeline:
# Structured tracing with OpenTelemetry (OTel)
import time

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the Jaeger exporter (sends spans to a local Jaeger agent).
# Note: newer OpenTelemetry releases deprecate the dedicated Jaeger exporter
# in favor of OTLP, which Jaeger can ingest directly; check your installed version.
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Register a tracer provider that batches spans before exporting them
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)
class InstrumentedLLMPipeline:
    def __init__(self, model, evaluator):
        self.model = model
        self.evaluator = evaluator

    def generate_and_evaluate(self, prompt, user_id=None):
        """
        Full pipeline with tracing.
        """
        with tracer.start_as_current_span("llm_request") as span:
            # Attribute values cannot be None, so set user_id only when provided
            if user_id is not None:
                span.set_attribute("user_id", user_id)
            span.set_attribute("prompt_length", len(prompt))

            # Tracing: preprocessing
            with tracer.start_as_current_span("preprocess_prompt") as preprocess_span:
                cleaned_prompt = self._preprocess(prompt)
                preprocess_span.set_attribute("original_length", len(prompt))
                preprocess_span.set_attribute("cleaned_length", len(cleaned_prompt))

            # Tracing: LLM inference
            with tracer.start_as_current_span("llm_inference") as inference_span:
                start = time.perf_counter()  # monotonic clock for durations
                output = self.model.generate(cleaned_prompt)
                latency_ms = (time.perf_counter() - start) * 1000
                inference_span.set_attribute("latency_ms", latency_ms)
                inference_span.set_attribute("output_length", len(output))

            # Tracing: evaluation
            with tracer.start_as_current_span("evaluate_output") as eval_span:
                scores = self.evaluator.evaluate(cleaned_prompt, output)
                eval_span.set_attribute("overall_score", scores.get("overall", 0.5))
                eval_span.set_attribute("relevance", scores.get("relevance", 0.5))

            # Span-level events for notable occurrences
            if scores.get("overall", 0.5) < 0.6:
                span.add_event("low_quality_output")

            return {
                "output": output,
                "scores": scores,
                "latency_ms": latency_ms
            }

    def _preprocess(self, prompt):
        return prompt.strip()
Metrics: aggregating quantitative signals
Emit metrics at key checkpoints. Use a metrics client such as the Prometheus Python client (prometheus_client) or StatsD:
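# Prometheus metrics for LLM systems
from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry, push_to_gateway

# Define metrics
request_counter = Counter(
    "llm_requests_total",
    "Total LLM requests",
    ["model", "status"]
)

# The top finite bucket should sit above your alert thresholds: percentiles
# estimated from a histogram cannot exceed the largest finite bucket bound.
inference_latency = Histogram(
    "llm_inference_latency_ms",
    "LLM inference latency in milliseconds",
    ["model"],
    buckets=(10, 50, 100, 250, 500, 1000, 2500, 5000, 10000)
)

# A Gauge keeps only the most recent value per label set. To chart averages
# or distributions of quality, a Histogram or Summary is the better fit.
quality_score = Gauge(
    "llm_output_quality_score",
    "Output quality score (0-1)",
    ["model"]
)

error_rate = Gauge(
    "llm_error_rate",
    "Current error rate",
    ["model"]
)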
class MetricsEmitter:
    def __init__(self, model_name="gpt-4"):
        self.model_name = model_name

    def record_request(self, latency_ms, quality_score_value, error=False):
        """
        Record a single LLM request.
        """
        # Counter: one increment per request, labeled by outcome
        status = "error" if error else "success"
        request_counter.labels(model=self.model_name, status=status).inc()
        # Histogram: latency distribution
        inference_latency.labels(model=self.model_name).observe(latency_ms)
        # Gauge: most recent quality score
        quality_score.labels(model=self.model_name).set(quality_score_value)
        # No explicit push is needed here: Prometheus scrapes these values from
        # an HTTP endpoint (e.g. one started with prometheus_client.start_http_server).
        # Use a Pushgateway only for short-lived batch jobs.

    def compute_rolling_error_rate(self, recent_requests, window_size=100):
        """
        Compute the error rate over the last `window_size` requests for the dashboard.
        """
        window = list(recent_requests)[-window_size:]
        if not window:
            return 0.0
        error_count = sum(1 for r in window if r.get("error", False))
        rate = error_count / len(window)
        error_rate.labels(model=self.model_name).set(rate)
        return rate

# Example
emitter = MetricsEmitter(model_name="gpt-4")
emitter.record_request(latency_ms=150, quality_score_value=0.85)
emitter.record_request(latency_ms=200, quality_score_value=0.72)
emitter.record_request(latency_ms=1500, quality_score_value=0.3, error=True)
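Dashboards usually derive percentiles such as p95 from histogram buckets rather than from raw observations. The sketch below shows one common way to do that, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to how Prometheus-style quantile estimation works. The function name `estimate_quantile` and the example bucket bounds are assumptions made for illustration.

```python
import math


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """
    Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    upper_bounds: ascending bucket upper bounds, ending with math.inf.
    cumulative_counts: number of observations <= each bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (bound, count) in enumerate(zip(upper_bounds, cumulative_counts)):
        if count < rank:
            continue
        if math.isinf(bound):
            # The rank falls past the last finite bucket, so the best
            # available answer is the largest finite bound.
            return upper_bounds[i - 1] if i > 0 else math.nan
        lower = upper_bounds[i - 1] if i > 0 else 0.0
        prev_count = cumulative_counts[i - 1] if i > 0 else 0
        in_bucket = count - prev_count
        if in_bucket == 0:
            return lower
        # Assume observations are spread evenly within the bucket.
        return lower + (bound - lower) * (rank - prev_count) / in_bucket
    return math.nan


# Three observations (150 ms, 200 ms, 1500 ms) against example bounds
bounds = (10, 50, 100, 250, 500, 1000, math.inf)
counts = (0, 0, 0, 2, 2, 2, 3)
p50 = estimate_quantile(0.50, bounds, counts)
p95 = estimate_quantile(0.95, bounds, counts)
print(p50, p95)
```

Note the ceiling effect: the 1500 ms request lands in the overflow bucket, so the estimated p95 is capped at the largest finite bound (1000 ms). Choosing bucket bounds that extend past your alert thresholds avoids that blind spot.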
Dashboard design for LLM observability
A good observability dashboard combines multiple visualizations to answer key questions:
1. System health panel (top-left):
- Error rate (red if > 5%)
- P95 latency (yellow if > 500ms, red if > 1s)
- Quality score (green if > 0.8, yellow if > 0.7, red otherwise)
2. Time-series charts (center):
- Latency over time (with baseline band)
- Error rate over time
- Quality score over time
- Request volume per minute
3. Distribution charts (bottom):
- Histogram of latency distribution
- Quality score distribution
- Output length distribution
4. Cohort breakdown (right):
- Error rate by user tier
- Quality score by domain
- Latency by endpoint
5. Alerts and events timeline (bottom):
- Recent alerts (quality drop, error spike, regression)
- Deploys, config changes, incidents
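The color rules in the system health panel can be expressed as a small classification function. This is a sketch under the thresholds listed above; the function name `classify` and its parameters are hypothetical, and a real dashboard tool would normally apply these rules through its own threshold settings.

```python
def classify(value, warn, crit, higher_is_worse=True):
    """
    Map a metric value to a panel color.

    higher_is_worse=True  -> red above crit, yellow above warn (latency, errors)
    higher_is_worse=False -> red at or below crit, yellow at or below warn (quality)
    """
    if higher_is_worse:
        if value > crit:
            return "red"
        if value > warn:
            return "yellow"
        return "green"
    if value <= crit:
        return "red"
    if value <= warn:
        return "yellow"
    return "green"


# Error rate: red if > 5% (no yellow band, so warn == crit)
print(classify(0.07, warn=0.05, crit=0.05))
# P95 latency: yellow if > 500 ms, red if > 1 s
print(classify(750, warn=500, crit=1000))
# Quality: green if > 0.8, yellow if > 0.7, red otherwise
print(classify(0.75, warn=0.8, crit=0.7, higher_is_worse=False))
```

Keeping the direction of "bad" explicit matters because latency and error rate degrade upward while quality degrades downward.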
# Dashboard layout configuration (pseudo-code)
dashboard_config = {
    "title": "LLM Production Observability",
    "refresh_interval_seconds": 5,
    "panels": [
        {
            "type": "stat",
            "title": "Error Rate",
            "metric": "llm_error_rate",
            # Red above 5%, matching the system health panel rules
            "threshold": {"red": 0.05}
        },
        {
            "type": "stat",
            "title": "P95 Latency (ms)",
            "metric": "llm_inference_latency_ms",
            "percentile": 95,
            "threshold": {"yellow": 500, "red": 1000}
        },
        {
            "type": "stat",
            "title": "Avg Quality Score",
            "metric": "llm_output_quality_score",
            # Lower is worse: yellow at or below 0.8, red at or below 0.7
            "threshold": {"yellow": 0.8, "red": 0.7},
            "threshold_direction": "below"
        },
        {
            "type": "line_chart",
            "title": "Latency Over Time",
            "metrics": ["llm_inference_latency_ms"],
            "aggregations": ["p50", "p95", "p99"]
        },
        {
            "type": "line_chart",
            "title": "Error Rate Over Time",
            "metrics": ["llm_error_rate"]
        },
        {
            "type": "bar_chart",
            "title": "Request Volume",
            "metrics": ["llm_requests_total"],
            "group_by": "status"
        },
        {
            "type": "heatmap",
            "title": "Quality by Cohort",
            "metrics": ["llm_output_quality_score"],
            "dimensions": ["user_tier", "domain"]
        },
        {
            "type": "table",
            "title": "Recent Alerts",
            "data_source": "alerts_table"
        }
    ]
}
Correlating traces with metrics
Use traces to diagnose why metrics degrade:
# Trace-based debugging: find outliers
import numpy as np

class TraceAnalyzer:
    def __init__(self, trace_store):
        # trace_store: any client exposing query(filters) -> list of trace dicts
        self.trace_store = trace_store

    def find_slow_requests(self, model, percentile=95, limit=10):
        """
        Find the slowest requests for a given model.
        """
        traces = self.trace_store.query({
            "model": model,
            "span_type": "llm_request"
        })
        if not traces:
            return []  # np.percentile raises on an empty input
        # Sort by total latency, slowest first
        traces.sort(key=lambda t: t["total_duration_ms"], reverse=True)
        # Latency at the requested percentile
        slow_threshold = np.percentile(
            [t["total_duration_ms"] for t in traces],
            percentile
        )
        slow_requests = [t for t in traces if t["total_duration_ms"] > slow_threshold][:limit]
        # Break each slow request down by pipeline stage
        for request in slow_requests:
            spans = request.get("spans", {})
            print(f"Request {request['request_id']}:")
            print(f"  Total: {request['total_duration_ms']}ms")
            print(f"  Preprocessing: {spans.get('preprocess_prompt', {}).get('duration_ms', 0)}ms")
            print(f"  Inference: {spans.get('llm_inference', {}).get('duration_ms', 0)}ms")
            print(f"  Evaluation: {spans.get('evaluate_output', {}).get('duration_ms', 0)}ms")
        return slow_requests

    def find_low_quality_traces(self, model, threshold=0.6, limit=10):
        """
        Find requests with low quality scores and examine their traces.
        """
        traces = self.trace_store.query({
            "model": model,
            "quality_score": {"$lt": threshold}
        })
        # Extract common attributes
        low_quality = []
        for record in traces[:limit]:
            low_quality.append({
                "request_id": record["request_id"],
                "quality_score": record["quality_score"],
                "prompt_length": record.get("prompt_length"),
                "output_length": record.get("output_length"),
                "latency_ms": record["total_duration_ms"]
            })
        return low_quality

# Example (trace_store is assumed to be your trace backend client)
analyzer = TraceAnalyzer(trace_store)
slow_requests = analyzer.find_slow_requests("gpt-4", percentile=95)
low_quality = analyzer.find_low_quality_traces("gpt-4", threshold=0.6)
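The analyzer above only needs a store with a `query(filters)` method that understands equality filters and comparison filters such as `{"$lt": 0.6}`. For local experiments, an in-memory stand-in along the following lines may be enough. `InMemoryTraceStore` and the sample traces are hypothetical; a production trace backend will have its own query interface.

```python
import operator

# Supported comparison operators for dict-valued filter conditions
_OPERATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


class InMemoryTraceStore:
    """Hypothetical trace store holding trace summaries as plain dicts."""

    def __init__(self, traces=None):
        self._traces = list(traces or [])

    def add(self, trace):
        self._traces.append(trace)

    def query(self, filters):
        # Return copies so callers can sort or mutate results safely.
        return [dict(t) for t in self._traces if self._matches(t, filters)]

    @staticmethod
    def _matches(trace, filters):
        for key, condition in filters.items():
            value = trace.get(key)
            if isinstance(condition, dict):
                for op_name, operand in condition.items():
                    compare = _OPERATORS.get(op_name)
                    if compare is None:
                        raise ValueError(f"unsupported operator: {op_name}")
                    if value is None or not compare(value, operand):
                        return False
            elif value != condition:
                return False
        return True


sample_store = InMemoryTraceStore([
    {"request_id": "r1", "model": "gpt-4", "span_type": "llm_request",
     "total_duration_ms": 180, "quality_score": 0.82},
    {"request_id": "r2", "model": "gpt-4", "span_type": "llm_request",
     "total_duration_ms": 1500, "quality_score": 0.30},
    {"request_id": "r3", "model": "other-model", "span_type": "llm_request",
     "total_duration_ms": 90, "quality_score": 0.55},
])
low = sample_store.query({"model": "gpt-4", "quality_score": {"$lt": 0.6}})
print([t["request_id"] for t in low])
```

An instance like `sample_store` can be passed wherever a `trace_store` is expected, which makes the analysis logic testable without a running tracing backend.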
Contextual alerts from observability data
Use your observability stack to trigger smart alerts:
# Smart alerts based on correlated signals
import numpy as np

class ContextualAlertGenerator:
    def __init__(self, metrics_store, trace_store):
        self.metrics = metrics_store
        self.traces = trace_store

    @staticmethod
    def _latest_and_baseline(series):
        """Return (latest point, mean of earlier points), or (None, None) if too short."""
        values = np.asarray(series, dtype=float)
        if values.size < 2:
            return None, None
        return values[-1], values[:-1].mean()

    def detect_issue_patterns(self):
        """
        Detect correlated issues that might indicate a real problem.
        """
        alerts = []
        recent_latency = self.metrics.get_timeseries("llm_inference_latency_ms", minutes=10)
        recent_quality = self.metrics.get_timeseries("llm_output_quality_score", minutes=10)
        recent_errors = self.metrics.get_timeseries("llm_error_rate", minutes=10)
        latency_now, latency_base = self._latest_and_baseline(recent_latency)
        quality_now, quality_base = self._latest_and_baseline(recent_quality)
        errors_now, errors_base = self._latest_and_baseline(recent_errors)

        # Pattern 1: Latency spike + quality drop
        if latency_now is not None and quality_now is not None:
            if latency_now > latency_base * 1.5 and quality_now < quality_base * 0.9:
                alerts.append({
                    "type": "LATENCY_AND_QUALITY_DROP",
                    "severity": "high",
                    "message": "Latency spike correlated with quality drop. Possible API degradation or model issue."
                })

        # Pattern 2: Error rate increase without latency change
        if errors_now is not None and latency_now is not None:
            # Exact float equality almost never holds; treat +/-10% as "unchanged"
            latency_stable = abs(latency_now - latency_base) <= 0.1 * latency_base
            if errors_now > errors_base * 2 and latency_stable:
                alerts.append({
                    "type": "ERROR_SPIKE_NO_LATENCY",
                    "severity": "medium",
                    "message": "Error rate spike without latency change. Likely infrastructure issue."
                })

        # Pattern 3: Quality drop on specific cohort
        quality_by_cohort = self.metrics.get_by_dimension("llm_output_quality_score", "user_tier")
        for cohort, quality in quality_by_cohort.items():
            if quality < 0.65:
                alerts.append({
                    "type": "COHORT_QUALITY_DROP",
                    "severity": "medium",
                    "cohort": cohort,
                    "message": f"Quality dropped for cohort {cohort}: {quality:.2f}"
                })
        return alerts

# Example (metrics_store and trace_store are assumed to be your backend clients)
alert_generator = ContextualAlertGenerator(metrics_store, trace_store)
alerts = alert_generator.detect_issue_patterns()
for alert in alerts:
    print(f"ALERT: {alert['message']} (Severity: {alert['severity']})")
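Comparing the latest point against a fixed multiple of the earlier mean is simple, but it ignores how noisy the series normally is. One alternative, sketched here with the standard library only, is to score the latest point against the baseline's spread (a z-score). The function name `is_spike` and its default parameters are assumptions to tune for your own traffic.

```python
import statistics


def is_spike(series, baseline_points=9, z_threshold=3.0, min_rel_spread=0.05):
    """
    Return True if the latest point is unusually high versus recent history.

    The baseline is the `baseline_points` values before the latest one.
    `min_rel_spread` puts a floor on the spread (as a fraction of the mean)
    so a perfectly flat baseline does not flag tiny fluctuations.
    """
    if len(series) < baseline_points + 1:
        return False  # not enough history to judge
    baseline = series[-(baseline_points + 1):-1]
    latest = series[-1]
    mean = statistics.fmean(baseline)
    spread = max(statistics.pstdev(baseline), min_rel_spread * abs(mean), 1e-12)
    z_score = (latest - mean) / spread
    return z_score > z_threshold


steady = [200, 210, 190, 205, 195, 200, 208, 192, 201, 215]
spiking = [200, 210, 190, 205, 195, 200, 208, 192, 201, 420]
print(is_spike(steady), is_spike(spiking))
```

For metrics where lower is worse, such as quality, the same function can be applied to the negated series.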
Key Takeaways
- Observability = traces (request flow) + metrics (quantitative trends) + logs (narrative details).
- Instrument your LLM pipeline with tracing libraries (OpenTelemetry) to capture request journey and timing.
- Emit structured metrics (latency, error rate, quality) to track trends over time.
- Build dashboards that show system health, trends, distributions, and cohort breakdowns.
- Use traces to diagnose metric anomalies: drill down from "quality dropped" to specific slow/failed requests.
Frequently Asked Questions
How much tracing overhead is acceptable?
Under 1% added latency is ideal; under 5% is acceptable. Sample traces (e.g., 10% of requests) to reduce overhead while maintaining visibility.
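A sampling decision is easiest to reason about when it is deterministic per request, so that every service handling the same request agrees on whether to record it. The sketch below hashes a request ID into a sampling decision using only the standard library. The function name `should_sample` is hypothetical; tracing SDKs such as OpenTelemetry ship their own ratio-based samplers, which are usually the better choice in production.

```python
import hashlib


def should_sample(request_id, rate=0.10):
    """
    Decide deterministically whether to trace a request.

    The request ID is hashed to a 64-bit integer and compared against
    rate * 2**64, so about `rate` of all IDs are selected and the same
    ID always gets the same decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


sampled = sum(should_sample(f"req-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 requests")
```

Errors and slow requests are often worth keeping regardless of the sampling decision, which requires deciding after the request completes rather than at its start.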
Should I log every request, or sample?
Sample logs (1-5% for normal operation), but log 100% of errors and anomalies. This keeps storage costs down while preserving debugging information.
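With Python's standard `logging` module, this policy can be approximated with a filter that always passes warnings and errors and passes only a random fraction of lower-severity records. `SamplingFilter` is a hypothetical name, and the 5% default is just the upper end of the range mentioned above.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-higher record; sample the rest."""

    def __init__(self, sample_rate=0.05, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        # An injectable RNG makes the filter reproducible in tests.
        self._rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings, errors, or criticals
        return self._rng.random() < self.sample_rate


sampled_logger = logging.getLogger("llm_pipeline.sampled")
sampled_logger.setLevel(logging.INFO)
sampled_logger.addFilter(SamplingFilter(sample_rate=0.05))
sampled_logger.error("inference_failed")  # always passes the filter
sampled_logger.info("generation_complete")  # passes about 5% of the time
```

Anomalies that are not logged at WARNING or above (for example, a successful but very slow request) would need an explicit check in the filter or a higher log level at the call site.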
What's a good alert threshold for quality score?
Depends on your SLA. For critical systems, alert if quality < 0.75. For less critical systems, alert if quality < 0.65. Use your historical data to find the right level.
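One way to let historical data set the level is to alert when quality falls below a low percentile of what the system normally produces. The sketch below uses `statistics.quantiles` from the standard library; the function name `derive_alert_threshold`, the 5th-percentile default, and the synthetic history are all illustrative assumptions.

```python
import statistics


def derive_alert_threshold(history, percentile=5):
    """
    Return the quality score at the given percentile of historical values.

    Alerting below this level means roughly `percentile`% of past
    observations would have triggered the alert.
    """
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    cut_points = statistics.quantiles(history, n=100)  # 99 cut points
    return cut_points[percentile - 1]


# Synthetic history: 100 scores spread evenly between 0.70 and about 1.0
history = [0.70 + 0.003 * i for i in range(100)]
threshold = derive_alert_threshold(history, percentile=5)
print(f"alert if quality < {threshold:.3f}")
```

A percentile-based level adapts to each system's normal range, but it should still be checked against the SLA so that a persistently poor baseline does not silently lower the bar.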
How do I correlate traces with metrics and alerts?
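Use request_id (or the trace ID) as the common key for per-request records: every trace, log line, and alert payload should carry it so you can move between them in your dashboard. Aggregated metrics are the exception. Labeling a metric with request_id creates a separate time series per request, which defeats the efficiency that makes metrics useful. Link metrics to traces through shared low-cardinality labels (model, endpoint), the time window of the anomaly, or exemplars if your metrics backend supports them.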
Can I use APM tools (Datadog, New Relic) for LLM observability?
Yes. Most APM tools support custom metrics and tracing. Use them if you're already invested; otherwise OpenTelemetry + Prometheus + Grafana is a solid open-source stack.