Distributed Tracing Basics: Follow LLM Chains
Distributed tracing is a technique for tracking the execution path of a request across multiple services, processes, or asynchronous tasks. A trace is composed of spans, where each span represents a unit of work: an HTTP request, a database query, an LLM API call, or a function execution. Spans are linked by parent–child relationships, forming a tree; together with each span's timestamps, the tree shows the order and concurrency of operations. For LLM applications, distributed tracing reveals bottlenecks in multi-step chains (prompt preprocessing, embeddings, vector search, LLM inference, post-processing) and helps debug failures where a single request flows through multiple services or retries.
Span Structure and Relationships
A span has three identifying fields: a unique span_id, a parent_span_id (linking to its parent; absent on the root span), and a trace_id (shared by every span in the same request). Every span also records a start time and an end time (from which the duration is derived), a name, attributes (key-value metadata), events (timestamped log entries emitted within the span), and an optional error status.
Example span for an LLM API call:
{
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "span_id": "f0ca7b1a-51e2-4d37-9476-b5e9a3d5c9f2",
  "parent_span_id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
  "name": "llm.call.claude",
  "start_time": "2026-06-02T14:30:00Z",
  "end_time": "2026-06-02T14:30:01.200Z",
  "duration_ms": 1200,
  "attributes": {
    "model": "claude-3-opus-20240229",
    "max_tokens": 1024,
    "temperature": 0.7
  },
  "events": [
    {"timestamp": "2026-06-02T14:30:00.050Z", "message": "API request sent"}
  ],
  "status": "ok"
}
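As a minimal sketch of the same structure in code, the dataclass below mirrors the fields of the JSON example and derives the duration from the two timestamps. The class name `Span` and the `duration_ms` property are illustrative assumptions, not the data model of any particular tracing library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; field names follow the JSON example."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_time: datetime
    end_time: datetime
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "ok"

    @property
    def duration_ms(self) -> float:
        # Duration is derived from the two timestamps rather than stored
        return (self.end_time - self.start_time) / timedelta(milliseconds=1)


span = Span(
    trace_id="550e8400-e29b-41d4-a716-446655440000",
    span_id="f0ca7b1a-51e2-4d37-9476-b5e9a3d5c9f2",
    parent_span_id="a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
    name="llm.call.claude",
    start_time=datetime.fromisoformat("2026-06-02T14:30:00+00:00"),
    end_time=datetime.fromisoformat("2026-06-02T14:30:01.200+00:00"),
    attributes={"max_tokens": 1024, "temperature": 0.7},
)
print(span.duration_ms)  # 1200.0
```

Deriving the duration instead of storing it avoids the two values drifting apart when a span is edited or re-serialized.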
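Parent–child relationships always form a tree, because each span has at most one parent. Sequential and concurrent execution differ only in timing: sequential siblings (A, then B, then C, all under the same parent) have back-to-back time ranges, while concurrent siblings (B and C both children of A, executed simultaneously) have overlapping time ranges. Here is a sequential LLM chain: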
Trace ID: req-123
Root span (process_request): 0–2000 ms
├─ Span 1 (fetch_context): 0–50 ms
├─ Span 2 (embed_query): 50–150 ms
├─ Span 3 (vector_search): 150–200 ms
└─ Span 4 (llm_call): 200–1900 ms
   ├─ Span 4a (prompt_assembly): 200–210 ms
   └─ Span 4b (api_call): 210–1900 ms
In this trace, spans 1 through 4 are sequential (each starts when the previous one finishes), all under the root span. Span 4 contains two sub-spans: prompt assembly, then the API call.
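A tracing backend typically receives spans as a flat list and rebuilds the hierarchy from the parent ids. The sketch below shows one way to do that for the trace above; the tuple layout and the short ids (`root`, `s1`, and so on) are assumptions made for illustration.

```python
from collections import defaultdict

# Flat span records: (span_id, parent_id, name, start_ms, end_ms)
spans = [
    ("root", None, "process_request", 0, 2000),
    ("s1", "root", "fetch_context", 0, 50),
    ("s2", "root", "embed_query", 50, 150),
    ("s3", "root", "vector_search", 150, 200),
    ("s4", "root", "llm_call", 200, 1900),
    ("s4a", "s4", "prompt_assembly", 200, 210),
    ("s4b", "s4", "api_call", 210, 1900),
]


def build_children(records):
    """Group spans by parent id so the tree can be walked from the root."""
    children = defaultdict(list)
    for record in records:
        children[record[1]].append(record)
    for siblings in children.values():
        siblings.sort(key=lambda r: r[3])  # order siblings by start time
    return children


def render(children, parent_id=None, depth=0):
    """Return one indented line per span, in depth-first order."""
    lines = []
    for span_id, _, name, start, end in children.get(parent_id, []):
        lines.append(f"{'  ' * depth}{name}: {start}-{end} ms")
        lines.extend(render(children, span_id, depth + 1))
    return lines


tree_lines = render(build_children(spans))
print("\n".join(tree_lines))
```

Because each span names only its parent, spans can arrive in any order and from any service; the tree is assembled once all of them share the same trace id.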
Creating and Ending Spans in Your Code
Most distributed tracing toolkits (OpenTelemetry, Datadog's tracer, and the older Jaeger clients) provide a tracer API. Here is a Python example using OpenTelemetry:
from opentelemetry import trace
# Note: the opentelemetry-exporter-jaeger packages are deprecated; newer setups
# typically send spans to Jaeger through an OTLP exporter instead.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(agent_host_name="localhost", agent_port=6831)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))

tracer = trace.get_tracer(__name__)

from anthropic import Anthropic


def process_user_query(user_message: str):
    """Process a user query with multi-step LLM chain."""
    # Root span for the entire request
    with tracer.start_as_current_span("process_query") as root_span:
        root_span.set_attribute("user_message_length", len(user_message))

        # Child span 1: Fetch context
        with tracer.start_as_current_span("fetch_context") as span:
            span.set_attribute("user_id", 42)
            context = fetch_user_context(42)
            span.set_attribute("context_size_bytes", len(str(context)))

        # Child span 2: Generate embeddings
        with tracer.start_as_current_span("embed_query") as span:
            span.set_attribute("model", "text-embedding-3-small")
            embedding = embed_text(user_message)
            span.set_attribute("embedding_dim", len(embedding))

        # Child span 3: Vector search
        with tracer.start_as_current_span("vector_search") as span:
            span.set_attribute("database", "pinecone")
            results = vector_search(embedding, top_k=5)
            span.set_attribute("results_found", len(results))

        # Child span 4: LLM call (with sub-spans)
        with tracer.start_as_current_span("llm_call") as span:
            span.set_attribute("model", "claude-3-opus-20240229")

            # Sub-span: Assemble prompt
            with tracer.start_as_current_span("prompt_assembly") as sub_span:
                prompt = assemble_prompt(user_message, context, results)
                sub_span.set_attribute("prompt_length", len(prompt))

            # Sub-span: API call
            with tracer.start_as_current_span("api_call") as sub_span:
                client = Anthropic()
                response = client.messages.create(
                    model="claude-3-opus-20240229",
                    max_tokens=1024,
                    messages=[{"role": "user", "content": prompt}],
                )
                sub_span.set_attribute("input_tokens", response.usage.input_tokens)
                sub_span.set_attribute("output_tokens", response.usage.output_tokens)
                sub_span.set_attribute("stop_reason", response.stop_reason)

        return response.content[0].text

import time


def fetch_user_context(user_id):
    """Simulate fetching user context from database."""
    time.sleep(0.05)
    return {"orders": 3, "lifetime_value": 450}


def embed_text(text):
    """Simulate embedding (in practice, call OpenAI Embeddings API)."""
    time.sleep(0.1)
    return [0.1] * 1536  # Mock embedding


def vector_search(embedding, top_k):
    """Simulate vector search."""
    time.sleep(0.05)
    return [{"id": i, "score": 0.95 - i * 0.05} for i in range(top_k)]


def assemble_prompt(user_message, context, results):
    """Assemble the final prompt."""
    return f"User: {user_message}\nContext: {context}\nRelevant: {results}"

When this code runs, the tracer automatically records:
- The start and end time of each span
- The duration (computed as end time minus start time)
- Parent–child relationships (a span started with start_as_current_span() becomes a child of whichever span is current at that moment)
- Attributes you set with set_attribute()
The tracer then exports these spans to Jaeger (or another backend) for visualization.
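To make the "current span" mechanism less opaque, here is a toy tracer built on Python's `contextvars`. It is a simplified sketch of the idea, not OpenTelemetry's implementation; `start_span`, `finished_spans`, and the integer ids are hypothetical names chosen for this example.

```python
import contextvars
import itertools
import time
from contextlib import contextmanager

# Holds the id of the span that is "current" in this execution context
_current_span_id = contextvars.ContextVar("current_span_id", default=None)
_ids = itertools.count(1)
finished_spans = []  # stands in for an exporter


@contextmanager
def start_span(name):
    """Toy counterpart of start_as_current_span (not OpenTelemetry's code)."""
    span = {
        "span_id": next(_ids),
        "parent_span_id": _current_span_id.get(),  # parent is whoever is current
        "name": name,
        "start": time.monotonic(),
    }
    token = _current_span_id.set(span["span_id"])  # this span becomes current
    try:
        yield span
    finally:
        _current_span_id.reset(token)  # restore the previous current span
        span["end"] = time.monotonic()
        finished_spans.append(span)


with start_span("process_query") as root:
    with start_span("fetch_context") as child:
        pass
    with start_span("llm_call") as llm:
        with start_span("api_call") as api:
            pass

for s in finished_spans:
    print(s["name"], "parent:", s["parent_span_id"])
```

Spans finish innermost first, so a child is always recorded before its parent; this is why backends assemble the tree from ids rather than from arrival order.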
Trace Visualization
Jaeger and similar tools render traces as Gantt charts. For the example above, the trace UI shows:
process_query          [=========================================] 1200 ms
├─ fetch_context       [==] 50 ms
├─ embed_query         [======] 100 ms
├─ vector_search       [===] 50 ms
└─ llm_call            [==========================] 950 ms
   ├─ prompt_assembly  [=] 10 ms
   └─ api_call         [======================] 920 ms
From this visualization, you immediately see:
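- The llm_call dominates the request (950 ms of the 1200 ms total).
- Within llm_call, the api_call is slow (920 ms); prompt assembly is negligible.
- The four steps (fetch_context, embed_query, vector_search, llm_call) run one after another and account for about 1150 ms of the 1200 ms total. Parallelization only helps for steps with no data dependency between them: fetch_context does not need the output of embed_query or vector_search, so running it alongside them could save about 50 ms, but llm_call needs all three results and stays on the critical path.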
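How much parallelization can help depends on which steps actually depend on each other. The sketch below estimates the best-case latency from the span durations in the chart; the `depends_on` mapping is an assumption about this chain's data flow (embeddings feed the vector search, and the LLM call needs both the context and the search results).

```python
from functools import lru_cache

# Durations (ms) taken from the Gantt chart
durations = {"fetch_context": 50, "embed_query": 100, "vector_search": 50, "llm_call": 950}

# Assumed data dependencies: step -> steps that must finish first
depends_on = {
    "fetch_context": [],
    "embed_query": [],
    "vector_search": ["embed_query"],
    "llm_call": ["fetch_context", "vector_search"],
}


@lru_cache(maxsize=None)
def earliest_finish(step):
    """Earliest finish time if every step starts as soon as its inputs are ready."""
    start = max((earliest_finish(dep) for dep in depends_on[step]), default=0)
    return start + durations[step]


sequential_ms = sum(durations.values())
critical_path_ms = max(earliest_finish(step) for step in durations)
print(sequential_ms, critical_path_ms)  # 1150 1100
```

Under these assumptions the critical path is embed_query, vector_search, llm_call, so full parallelization saves only the 50 ms spent in fetch_context; reducing the LLM call itself is where the larger gains would come from.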
Handling Concurrency: Parallel Spans
If your LLM chain fetches context and embeddings in parallel, the span timeline changes:
import asyncio


async def process_query_parallel(user_message: str):
    """Process query with parallel context fetch and embedding."""
    # The async_* helpers are assumed to be async versions of the earlier helpers
    with tracer.start_as_current_span("process_query"):

        async def traced_fetch_context():
            with tracer.start_as_current_span("fetch_context"):
                return await async_fetch_context(42)

        async def traced_embed_query():
            with tracer.start_as_current_span("embed_query"):
                return await async_embed(user_message)

        # Run both tasks concurrently; each task inherits the current context,
        # so both spans are recorded as children of process_query
        context, embedding = await asyncio.gather(
            traced_fetch_context(), traced_embed_query()
        )

        # Vector search (sequential after embeddings)
        with tracer.start_as_current_span("vector_search"):
            results = await async_vector_search(embedding)

        # LLM call (sequential)
        with tracer.start_as_current_span("llm_call"):
            prompt = assemble_prompt(user_message, context, results)
            response = await async_llm_call(prompt)

        return response

In the trace visualization, fetch_context and embed_query now overlap on the timeline, both starting at time 0 under the root span, rather than running back to back. With the durations from the earlier chart, the two parallel steps finish in 100 ms (the longer of 50 ms and 100 ms) instead of 150 ms, so total time drops from about 1200 ms to about 1150 ms. The saving is small because the 950 ms LLM call still dominates.
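Concurrent spans end up under the right parent because asyncio runs each task in a copy of the context that was active when the task was created. The sketch below demonstrates that behavior with a plain `contextvars.ContextVar`; the names `current_span`, `traced`, and `recorded` are illustrative stand-ins, not part of any tracing library.

```python
import asyncio
import contextvars

current_span = contextvars.ContextVar("current_span", default=None)
recorded = []  # (span name, parent name) pairs


async def traced(name, delay):
    """Record which span was current when this coroutine started."""
    recorded.append((name, current_span.get()))
    token = current_span.set(name)  # visible only inside this task's context
    try:
        await asyncio.sleep(delay)
    finally:
        current_span.reset(token)


async def main():
    token = current_span.set("process_query")
    try:
        # gather wraps each coroutine in a task, and each task runs in a
        # copy of the context that is current at this point
        await asyncio.gather(traced("fetch_context", 0.01), traced("embed_query", 0.02))
    finally:
        current_span.reset(token)
    return current_span.get()


after = asyncio.run(main())
print(recorded, after)
```

Both coroutines see `process_query` as their parent, and neither task's own span leaks into the other or back into the caller.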
Error Tracking Within Spans
If an operation fails, record the error in the span:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from anthropic import Anthropic


def llm_call_with_error_tracking(prompt: str):
    """LLM call with error recording."""
    # Turn off the automatic exception handling so the error is recorded once, below
    with tracer.start_as_current_span(
        "llm_call", record_exception=False, set_status_on_exception=False
    ) as span:
        try:
            client = Anthropic()
            response = client.messages.create(
                model="claude-3-opus-20240229",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            span.set_status(Status(StatusCode.OK))
            return response
        except Exception as e:
            # Mark the span as failed so backends can flag and filter it
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.set_attribute("error_type", type(e).__name__)
            # Attach the exception type, message, and stack trace as a span event
            span.record_exception(e)
            raise

The span now records the error type, message, and stack trace. Backends such as Jaeger flag spans that carry an error status, and the exception details are visible in the trace detail panel. This makes queries such as "all llm_call spans that failed with RateLimitError in the last hour" possible.
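Most backends expose this kind of query through their own search UI or API. As a backend-neutral sketch, the function below runs the same filter over a list of exported span records; the record layout and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed "now" so the example is repeatable
now = datetime(2026, 6, 2, 15, 0, tzinfo=timezone.utc)

# Hypothetical exported spans, reduced to the fields the query needs
spans = [
    {"name": "llm_call", "end": now - timedelta(minutes=5), "status": "error", "error_type": "RateLimitError"},
    {"name": "llm_call", "end": now - timedelta(minutes=20), "status": "error", "error_type": "RateLimitError"},
    {"name": "llm_call", "end": now - timedelta(minutes=30), "status": "error", "error_type": "APITimeoutError"},
    {"name": "llm_call", "end": now - timedelta(minutes=40), "status": "success"},
    {"name": "llm_call", "end": now - timedelta(hours=3), "status": "error", "error_type": "RateLimitError"},
    {"name": "vector_search", "end": now - timedelta(minutes=10), "status": "error", "error_type": "TimeoutError"},
]


def failures_by_type(records, span_name, since):
    """Count failed spans with the given name that ended at or after `since`."""
    return Counter(
        r["error_type"]
        for r in records
        if r["name"] == span_name and r["status"] == "error" and r["end"] >= since
    )


counts = failures_by_type(spans, "llm_call", since=now - timedelta(hours=1))
print(counts)
```

Recording the error type as its own attribute, rather than only inside a message string, is what makes this grouping cheap.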
Key Takeaways
- Distributed tracing records the execution flow of a request as a tree of spans, each with a name, start time, end time, and attributes.
- Parent–child relationships show causality; the spans' time ranges show serial vs. parallel execution.
- Trace visualization (Gantt charts) reveals bottlenecks: if the root span is 1200 ms and one child span is 950 ms, that span dominates the request's latency.
- Attributes (model name, token counts, user ID, etc.) and events (log messages within spans) add context for debugging.
- Error recording within spans links failures to specific operations, enabling root-cause analysis.
Frequently Asked Questions
What is the difference between a trace and a span?
A span is a single unit of work (a function call, an API request). A trace is a collection of spans linked by parent–child relationships that together represent the execution flow of a single request from start to finish.
Do I need to manually create a span for every function call?
No. Use automatic instrumentation libraries (OpenTelemetry has instrumentation packages for popular frameworks and clients such as Django, FastAPI, and httpx) that wrap incoming requests and outgoing calls and create spans automatically. For LLM-specific operations (embeddings, vector searches), you may need to create spans manually.
How do I correlate logs with spans?
Emit logs with the current span's trace_id and span_id embedded in the log entry. Log aggregation tools can then render logs alongside spans in the same trace detail view.
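One way to do this with Python's standard `logging` module is a filter that stamps the ids onto every record. In the sketch below the ids live in two hypothetical context variables; with OpenTelemetry they would instead be read from the current span's context (for example via `trace.get_current_span().get_span_context()`), which is left out here to keep the example dependency-free.

```python
import contextvars
import io
import logging

# Hypothetical holders for the active ids; a tracing library keeps these for you
current_trace_id = contextvars.ContextVar("trace_id", default="-")
current_span_id = contextvars.ContextVar("span_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span ids onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()  # captured in memory so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("llm_chain_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

current_trace_id.set("550e8400")
current_span_id.set("f0ca7b1a")
logger.info("API request sent")
print(stream.getvalue())
```

Attaching the filter to the handler rather than to individual log calls means existing `logger.info(...)` calls pick up the ids without being changed.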
What sampling strategy should I use for traces?
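Keep 100% of error traces and slow traces (latency above p99), and sample 1–5% of fast, successful traces to control cost and storage. Because errors and latency are only known once a trace has finished, this requires tail-based sampling, where the keep-or-drop decision is made after the trace completes rather than at the start of the request. This ensures you see failures and degradation while keeping storage costs reasonable. Adjust the sample rate based on your inference volume and storage budget.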
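The policy can be expressed as a small decision function that runs once a trace is complete. This is a sketch under stated assumptions: `keep_trace`, its parameters, and the 5% default are illustrative, and the p99 threshold is assumed to be computed elsewhere.

```python
import hashlib


def keep_trace(trace_id, had_error, duration_ms, p99_ms, success_rate=0.05):
    """Decide, after a trace completes, whether to store it."""
    if had_error or duration_ms > p99_ms:
        return True  # always keep failures and slow outliers
    # Hash the trace id to a stable number in [0, 1) so that repeated
    # evaluations of the same trace always give the same answer
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


# Roughly 5% of fast, successful traces are kept
kept = sum(keep_trace(f"trace-{i}", False, 300, p99_ms=2000) for i in range(10_000))
print(kept)
```

Hashing the trace id instead of calling a random number generator keeps the decision deterministic, which matters when more than one component evaluates the same trace.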