
LLM Observability Dashboard: Langfuse to Production

An LLM observability dashboard consolidates logs, traces, and metrics into a single interface for real-time monitoring and historical analysis. Tools like Langfuse, Datadog, and Arize provide purpose-built dashboards that understand LLM costs, latency, token counts, and model versions. A production dashboard should display: daily and hourly cost trends, latency percentiles (p50, p95, p99), error rates by model, top users by token consumption, and drill-down capability to inspect individual traces. This article walks you through building a dashboard with Langfuse (a free, open-source LLM observability platform) and integrating it into a production LLM application.

Langfuse: A Lightweight LLM Observability Platform

Langfuse is a purpose-built observability platform for LLM applications. It provides:

  • Trace visualization — View the complete execution flow of a request, including all API calls, latency, tokens, and costs.
  • Cost tracking — Automatic cost calculation per inference, user, and model.
  • Latency metrics — TTFT, total duration, and latency percentiles.
  • Error tracking — Categorized errors with root-cause details.
  • User analytics — Token consumption and cost per user over time.

You can self-host Langfuse or use their cloud service.

Setting Up Langfuse

Option 1: Cloud (Fastest)

Sign up at langfuse.com, create a project, and copy your API keys.

Option 2: Self-Hosted (Docker)

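# Single-container setup as used by Langfuse v2. Langfuse v3 additionally needs
# ClickHouse, Redis, and blob storage; follow the self-hosting docs for that version.
docker run -d \
  -e DATABASE_URL="postgresql://user:password@db:5432/langfuse" \
  -e NEXTAUTH_URL="http://localhost:3000" \
  -e NEXTAUTH_SECRET="your-secret" \
  -e SALT="your-salt" \
  -p 3000:3000 \
  langfuse/langfuse:2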

Then navigate to http://localhost:3000.

Instrumenting Your LLM App with Langfuse

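Install the Python SDK. The examples in this article use the v2 client interface (langfuse.trace(), trace.span()); SDK v3 replaced it with an OpenTelemetry-based API, so pin the major version:

pip install "langfuse<3" anthropic

Initialize Langfuse and wrap your LLM calls: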

from typing import Optional

from anthropic import Anthropic
from langfuse import Langfuse

MODEL = "claude-3-opus-20240229"

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
langfuse = Langfuse()
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def query_llm(user_message: str, user_id: Optional[str] = None) -> str:
    """LLM query instrumented with Langfuse (v2 SDK interface)."""

    # Start a trace (top-level observation)
    trace = langfuse.trace(
        name="llm_query",
        user_id=user_id,
        input=user_message,
        metadata={"model": MODEL},
    )

    # Create a span for embedding (if using embeddings)
    embedding_span = trace.span(name="embedding")
    # ... compute embedding ...
    embedding_span.end(output={"embedding_dim": 1536})

    # Create a span for vector search
    search_span = trace.span(name="vector_search")
    # ... search vector database ...
    search_span.end(output={"results": 5})

    # Create a span for the LLM call, with a nested generation that is
    # started before the request so its duration covers the API round trip
    llm_span = trace.span(name="llm_call")
    generation = llm_span.generation(
        name="claude_generation",
        model=MODEL,
        input=user_message,
    )

    try:
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": user_message}],
        )
        answer = response.content[0].text

        # Record the output and token usage; Langfuse derives cost from these
        generation.end(
            output=answer,
            usage={
                "input": response.usage.input_tokens,
                "output": response.usage.output_tokens,
                "unit": "TOKENS",
            },
        )
        llm_span.end()

        # Traces have no end() method; attach the final output with update()
        trace.update(output=answer)

        return answer

    except Exception as e:
        # Mark the failing observations as errors so they show up in error filters
        generation.end(level="ERROR", status_message=str(e))
        llm_span.end(level="ERROR", status_message=str(e))
        trace.update(output={"error": str(e)})
        raise

# Example
result = query_llm("What is machine learning?", user_id="user_42")
langfuse.flush()  # send buffered events before a short-lived script exits

Langfuse automatically:

  • Calculates cost based on model and token counts
  • Measures latency for each span
  • Groups traces by user and model
  • Detects errors and categorizes them
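The cost figure is, in essence, token counts multiplied by a per-model price. The following is a minimal sketch of that arithmetic, not Langfuse's actual implementation; the price table, the model name "example-model", and the function estimate_cost are all hypothetical and the prices are placeholders rather than real provider rates.

```python
from decimal import Decimal

# Hypothetical price table in USD per one million tokens.
# These numbers are placeholders for illustration, not real provider pricing.
PRICES_PER_MILLION = {
    "example-model": {"input": Decimal("15.00"), "output": Decimal("75.00")},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one call; raises KeyError for unknown models."""
    prices = PRICES_PER_MILLION[model]
    total = prices["input"] * input_tokens + prices["output"] * output_tokens
    return total / Decimal(1_000_000)

cost = estimate_cost("example-model", input_tokens=1_200, output_tokens=300)
print(f"${cost:.4f}")  # prints $0.0405
```

Decimal is used instead of float so that per-call costs can be summed over many traces without accumulating rounding error.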

Langfuse Dashboard Views

Once traces are flowing into Langfuse, access these dashboards:

Dashboard: Overview

Shows KPIs for the selected time window:

  • Total traces
  • Total cost (USD)
  • Median latency
  • Error rate
  • Requests per minute

Dashboard: Traces

Filter and search traces by:

  • User ID
  • Model
  • Latency range
  • Error status
  • Custom metadata

Click a trace to see the full span tree, token counts, and cost breakdown.

Dashboard: Analytics

Visualizations for historical trends:

  • Cost over time — Line graph of daily spending.
  • Latency percentiles — p50, p95, p99 latency trends.
  • Token consumption by user — Bar chart of top users by tokens used.
  • Error rate by model — Stacked bar chart showing error rate per model.
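To make the charts above concrete, here is a small sketch of how latency percentiles and daily cost could be computed from raw trace records. It uses the nearest-rank percentile definition, which may differ from the interpolation a given dashboard applies. The record fields (timestamp, latency_s, cost) and the sample data are assumptions for illustration.

```python
import math
from collections import defaultdict

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def daily_cost(traces: list[dict]) -> dict[str, float]:
    """Sum cost per calendar day, keyed by the YYYY-MM-DD prefix of the timestamp."""
    totals: dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t["timestamp"][:10]] += t["cost"]
    return dict(totals)

# Hypothetical sample records; the field names are assumptions for this sketch.
traces = [
    {"timestamp": "2025-01-01T09:00:00Z", "latency_s": 0.8, "cost": 0.5},
    {"timestamp": "2025-01-01T10:00:00Z", "latency_s": 1.2, "cost": 0.25},
    {"timestamp": "2025-01-01T11:00:00Z", "latency_s": 0.9, "cost": 0.25},
    {"timestamp": "2025-01-02T09:00:00Z", "latency_s": 4.0, "cost": 1.0},
    {"timestamp": "2025-01-02T10:00:00Z", "latency_s": 1.0, "cost": 0.5},
]

latencies = [t["latency_s"] for t in traces]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies, pct)}s")
print(daily_cost(traces))
```

With only five samples, p95 and p99 both land on the slowest request, which is a reminder that tail percentiles need a reasonable sample size before they mean much.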

Dashboard: Users

Per-user analytics:

  • Total tokens consumed
  • Total cost
  • Number of inferences
  • Error count
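The per-user view is a group-by over trace records. The sketch below shows one way to reproduce it offline, assuming each record carries user_id, tokens, cost, and error fields; those names, the UserStats class, and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class UserStats:
    tokens: int = 0
    cost: float = 0.0
    inferences: int = 0
    errors: int = 0

def rollup_by_user(traces: list[dict]) -> dict[str, UserStats]:
    """Aggregate tokens, cost, inference count, and error count per user."""
    stats: dict[str, UserStats] = defaultdict(UserStats)
    for t in traces:
        s = stats[t["user_id"]]
        s.tokens += t["tokens"]
        s.cost += t["cost"]
        s.inferences += 1
        if t["error"]:
            s.errors += 1
    return dict(stats)

# Hypothetical sample records.
user_traces = [
    {"user_id": "user_42", "tokens": 100, "cost": 0.25, "error": False},
    {"user_id": "user_42", "tokens": 200, "cost": 0.5, "error": True},
    {"user_id": "user_7", "tokens": 50, "cost": 0.125, "error": False},
]

per_user = rollup_by_user(user_traces)
# Rank users by token consumption, highest first.
top_users = sorted(per_user, key=lambda u: per_user[u].tokens, reverse=True)
print(top_users, per_user["user_42"])
```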

Custom Metrics in Langfuse

You can attach custom metrics to traces for business logic insights:

from langfuse import Langfuse

langfuse = Langfuse()

def score_response_quality(response: str) -> float:
    """Toy heuristic: vocabulary diversity (unique words / total words) on a 0-10 scale."""
    tokens = response.split()
    if not tokens:
        return 0.0
    diversity = len(set(tokens)) / len(tokens)
    return round(diversity * 10, 2)

def score_response(user_message: str, response: str, user_id: str) -> None:
    """Attach a quality score and user feedback to a trace."""

    trace = langfuse.trace(name="scored_inference", user_id=user_id, input=user_message)

    # ... perform inference ...

    # Score the response (0-10 scale)
    quality_score = score_response_quality(response)  # Your scoring function

    # Log the score as a custom metric
    trace.score(
        name="response_quality",
        value=quality_score,
        data_type="NUMERIC",
    )

    # Log feedback (e.g., user thumbs up/down)
    trace.event(
        name="user_feedback",
        input={"response": response},
        output={"rating": "thumbs_up"},
    )

    # Traces have no end() method; attach the final output with update()
    trace.update(output=response)

In the Langfuse dashboard, you can now filter traces by the response_quality score and correlate it with cost or latency.
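If you want a number rather than a visual impression, the correlation can also be computed offline from exported scores and costs. The following is a minimal sketch using the Pearson coefficient; the function name and the sample values are made up for illustration.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-trace values: quality score and cost in USD.
quality = [2.0, 4.0, 6.0, 8.0]
cost_usd = [0.1, 0.2, 0.3, 0.4]

r = pearson(quality, cost_usd)
print(f"quality vs cost: r = {r:.2f}")
```

A value near +1 would suggest that more expensive calls tend to score higher; a value near 0 would suggest that spending more per call is not buying quality under this particular scoring heuristic.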

Exporting Dashboard Data

Langfuse provides an API to programmatically fetch data for custom analysis or visualization:

import os
from datetime import datetime, timedelta, timezone

import requests

def fetch_traces_api(hours_back: int = 24, limit: int = 100) -> list:
    """Fetch the first page of traces from the last `hours_back` hours via the Langfuse public API."""

    # The public API uses HTTP Basic auth: public key as username, secret key as password
    auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])
    base_url = os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com")  # or your self-hosted URL

    # ISO 8601 timestamp in UTC
    start = datetime.now(timezone.utc) - timedelta(hours=hours_back)
    from_timestamp = start.strftime("%Y-%m-%dT%H:%M:%SZ")

    response = requests.get(
        f"{base_url}/api/public/traces",
        auth=auth,
        params={"fromTimestamp": from_timestamp, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()

    # Results are paginated; request further pages with the `page` parameter
    return response.json().get("data", [])

# Analyze traces (first page only)
traces = fetch_traces_api(hours_back=24)
total_cost = sum(t.get("totalCost") or 0 for t in traces)
avg_latency = sum(t.get("latency") or 0 for t in traces) / len(traces) if traces else 0

# The API reports trace latency in seconds
print(f"24h cost: ${total_cost:.2f}, avg latency: {avg_latency:.2f}s")

Use this to build custom reports or feed data into BI tools like Tableau or Looker.
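Most BI tools ingest CSV, so a thin conversion layer is often all that is needed. Here is a sketch that flattens trace dictionaries into CSV text; the selected fields (id, userId, totalCost, latency) are assumed to be present in the API response, and the function name and sample records are hypothetical.

```python
import csv
import io

# Columns to export; assumed to match keys in the trace records.
FIELDS = ["id", "userId", "totalCost", "latency"]

def traces_to_csv(traces: list[dict]) -> str:
    """Serialize selected trace fields to CSV text; missing or None values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for t in traces:
        # Keep only the chosen columns so unexpected keys do not break the writer.
        writer.writerow({field: t.get(field) for field in FIELDS})
    return buffer.getvalue()

# Hypothetical sample records.
sample = [
    {"id": "t1", "userId": "user_42", "totalCost": 0.02, "latency": 1.5, "extra": "ignored"},
    {"id": "t2", "userId": None, "totalCost": 0.01},
]

csv_text = traces_to_csv(sample)
print(csv_text, end="")
```

Writing to a string buffer keeps the function easy to test; in practice the same writer can target a file opened with newline="".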

Alerting on Dashboard Metrics

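Most observability platforms support alert rules. With a webhook-based setup, you can forward alerts from Langfuse to a notification channel. The payload shape below (a totalCost field) is an assumption, so adjust it to what your alert source actually sends:

# Example webhook receiver that forwards cost alerts to Slack
import requests
from fastapi import FastAPI

app = FastAPI()

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
DAILY_COST_LIMIT_USD = 100

# Declared with plain `def` so FastAPI runs the blocking requests call in a worker thread
@app.post("/langfuse_webhook")
def langfuse_webhook(payload: dict):
    """Receive a cost alert and trigger a Slack notification."""

    # `totalCost` is an assumed payload field; a missing value is treated as 0
    cost = payload.get("totalCost") or 0
    if cost > DAILY_COST_LIMIT_USD:  # Alert if daily cost exceeds the limit
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"LLM cost spike: ${cost:.2f} exceeds the daily limit of ${DAILY_COST_LIMIT_USD}"},
            timeout=10,
        )

    return {"status": "ok"}

Configure the webhook in Langfuse settings to post to this endpoint whenever metrics cross thresholds. If your Langfuse version does not offer threshold-based webhooks, a scheduled job that polls the API can post the same payload.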
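The threshold logic itself is worth keeping separate from the HTTP handler so it can be tested without a server. The sketch below combines an absolute limit with a relative spike check against the trailing average; the function name, the default limit, and the spike factor are illustrative assumptions, not Langfuse settings.

```python
def should_alert(
    hourly_costs: list[float],
    absolute_limit: float = 100.0,
    spike_factor: float = 3.0,
) -> bool:
    """Decide whether the most recent hour warrants an alert.

    hourly_costs is ordered oldest first; the last entry is the hour under test.
    """
    if not hourly_costs:
        return False
    *history, latest = hourly_costs
    # Absolute check: the latest hour alone exceeds the hard limit.
    if latest > absolute_limit:
        return True
    # Relative check: the latest hour is far above the trailing average.
    if not history:
        return False
    baseline = sum(history) / len(history)
    return baseline > 0 and latest > spike_factor * baseline

print(should_alert([2.0, 2.0, 2.0, 8.0]))  # True: 8.0 is more than 3x the 2.0 baseline
print(should_alert([2.0, 2.0, 2.0, 5.0]))  # False: 5.0 is below 3x the baseline
```

A relative check catches runaway loops early, while the absolute limit covers the case where spend has crept up slowly and the baseline itself is already too high.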

Comparison: Langfuse vs Datadog vs Arize

Feature          Langfuse                    Datadog                       Arize
Cost tracking    Native                      Via custom metrics            Native
Token metrics    Native                      Via custom spans              Native
Self-hosted      Yes, free                   No                            No
Latency metrics  Yes                         Yes                           Limited
Error tracking   Yes                         Yes                           Yes
Pricing          Free tier or pay-per-trace  Per-metric, ~$0.10/K traces   Per-trace, ~$0.001–0.01

Langfuse is ideal for cost-conscious startups and open-source projects. Datadog is best for enterprises already using Datadog for infrastructure. Arize specializes in LLM/ML model monitoring and drift detection.

Key Takeaways

  • An LLM observability dashboard consolidates metrics, costs, and traces for real-time and historical analysis.
  • Langfuse is a lightweight, open-source platform purpose-built for LLM observability; self-hosted or cloud.
  • Automatic cost calculation per trace, user, and model enables cost attribution and quota enforcement.
  • Custom metrics (quality scores, user feedback) enable correlation between cost and business outcomes.
  • API access allows custom reporting and integration with BI tools.

Frequently Asked Questions

How much does Langfuse cost?

Langfuse offers a free tier for up to 10,000 traces/month. Cloud pricing is $0.10 per 1,000 traces thereafter. Self-hosted is free; you pay only for infrastructure.

Can I use Langfuse alongside OpenTelemetry?

Yes. Langfuse accepts OpenTelemetry data: traces created with an OpenTelemetry SDK can be sent to Langfuse's OTLP endpoint using a standard OTLP exporter.

How long does Langfuse retain trace data?

Cloud retention is 90 days by default; self-hosted depends on your database. You can export traces to cold storage (S3, GCS) for long-term archival.

Can I correlate Langfuse traces with external logs (Datadog, Splunk)?

Yes, via trace IDs. Include the trace ID in your logs, and log aggregation tools can cross-reference traces and logs by ID.
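One way to get the trace ID into every log line is Python's standard logging.LoggerAdapter. This is a sketch under assumptions: the logger name, the log format, and the hard-coded ID "trace-abc123" are illustrative, and in a real application the ID would come from the trace object created by the SDK.

```python
import io
import logging

# Write to an in-memory stream here; a real setup would log to stdout or a file.
stream = io.StringIO()

logger = logging.getLogger("llm_app")
logger.setLevel(logging.INFO)
logger.propagate = False

handler = logging.StreamHandler(stream)
# The trace_id attribute is supplied per record by the adapter below.
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger.addHandler(handler)

def logger_for_trace(trace_id: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the given trace ID."""
    return logging.LoggerAdapter(logger, {"trace_id": trace_id})

log = logger_for_trace("trace-abc123")  # placeholder ID for illustration
log.info("vector search returned 5 results")

log_output = stream.getvalue()
print(log_output, end="")
```

Once the ID appears as a key=value pair, a log aggregation tool can index it and link from a log line to the matching trace.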

Further Reading