
Advanced LLM Tracing: Agents, Multi-Step Chains

Advanced LLM tracing handles complex scenarios: agent loops (where the LLM decides what action to take next), multi-model chains (combining specialized models), and reasoning steps (breaking a problem into intermediate steps). In these systems, a single user request can spawn dozens of LLM calls, tool invocations, and conditional logic. Tracing must capture the decision tree (why did the agent choose action B instead of action A?), measure cost and latency across the entire reasoning process, and enable debugging when intermediate steps fail or produce unexpected results.

Agent Loop Tracing

An agent loop is a pattern where the LLM acts as a decision-maker:

  1. Human provides initial query
  2. Agent reads the query and decides what action to take (call a tool, search, or answer)
  3. Agent executes the action and observes the result
  4. Agent loops back to step 2 until it decides the answer is final

Here is how to trace this loop:

from opentelemetry import trace
from anthropic import Anthropic
import json

tracer = trace.get_tracer(__name__)

def agent_loop(user_query: str, max_iterations: int = 5):
    """Agent loop with tracing for decision flow."""

    with tracer.start_as_current_span("agent_loop") as root_span:
        root_span.set_attribute("max_iterations", max_iterations)
        root_span.set_attribute("user_query", user_query[:100])

        client = Anthropic()
        iteration = 0

        # Agent state
        messages = [
            {"role": "user", "content": user_query}
        ]

        # Tools available to the agent
        tools = [
            {
                "name": "search",
                "description": "Search the web for information",
                "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}}
            },
            {
                "name": "calculate",
                "description": "Perform a mathematical calculation",
                "input_schema": {"type": "object", "properties": {"expression": {"type": "string"}}}
            }
        ]

        while iteration < max_iterations:
            iteration += 1

            # Create span for this iteration
            with tracer.start_as_current_span(f"agent_iteration_{iteration}") as iter_span:
                iter_span.set_attribute("iteration", iteration)

                # Agent decides what to do
                with tracer.start_as_current_span("agent_decision") as decision_span:
                    response = client.messages.create(
                        model="claude-3-opus-20240229",
                        max_tokens=1024,
                        tools=tools,
                        messages=messages
                    )

                    decision_span.set_attribute("stop_reason", response.stop_reason)
                    decision_span.set_attribute("input_tokens", response.usage.input_tokens)
                    decision_span.set_attribute("output_tokens", response.usage.output_tokens)

                # If agent decided to use a tool, execute it
                if response.stop_reason == "tool_use":
                    # Record the assistant turn once, then answer every
                    # tool_use block in a single user turn
                    messages.append({"role": "assistant", "content": response.content})
                    tool_results = []
                    tools_called = []

                    for content_block in response.content:
                        if content_block.type == "tool_use":
                            tool_name = content_block.name
                            tool_input = content_block.input

                            # Trace tool execution
                            with tracer.start_as_current_span(f"tool_call_{tool_name}") as tool_span:
                                tool_span.set_attribute("tool_name", tool_name)
                                tool_span.set_attribute("tool_input", json.dumps(tool_input)[:200])

                                # Execute tool (simulated)
                                if tool_name == "search":
                                    tool_result = search_web(tool_input.get("query"))
                                elif tool_name == "calculate":
                                    tool_result = evaluate_expression(tool_input.get("expression"))
                                else:
                                    tool_result = "Tool not found"

                                tool_span.set_attribute("tool_result_length", len(str(tool_result)))

                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": content_block.id,
                                "content": str(tool_result)
                            })
                            tools_called.append(tool_name)

                    # Add tool results to messages for next iteration
                    messages.append({"role": "user", "content": tool_results})
                    iter_span.set_attribute("action", "tool_call_" + "+".join(tools_called))

                elif response.stop_reason == "end_turn":
                    # Agent decided it has the final answer
                    iter_span.set_attribute("action", "final_answer")

                    final_answer = "".join(
                        block.text for block in response.content if block.type == "text"
                    )

                    root_span.set_attribute("final_answer", final_answer[:200])
                    return final_answer

                else:
                    # Any other stop reason (for example "max_tokens"): stop
                    # instead of resending the same messages on every iteration
                    iter_span.set_attribute("action", f"stopped_{response.stop_reason}")
                    root_span.set_attribute("status", "unexpected_stop_reason")
                    return f"Agent stopped early: {response.stop_reason}"

        # Max iterations reached
        root_span.set_attribute("status", "max_iterations_exceeded")
        return "Agent reached max iterations without a final answer"

def search_web(query: str) -> str:
    """Simulate web search."""
    return f"Results for '{query}': [Result 1, Result 2, Result 3]"

def evaluate_expression(expr: str) -> str:
    """Simulate mathematical evaluation."""
    try:
        # eval() runs arbitrary code; in production, use a safe math evaluator
        result = eval(expr)
        return str(result)
    except Exception:
        return "Invalid expression"

# Example
result = agent_loop("What is 15 * (10 + 5)?")
print(result)

The trace now shows:

  • Agent loop root span containing all iterations
  • Each iteration as a child span showing the agent's decision and action
  • Tool calls as sub-spans of iterations
  • Latency for each decision and tool call, plus token counts on each decision span

In Jaeger, you see a waterfall: iterations execute sequentially, each containing a decision followed by any tool calls it triggered. From this view you can spot which tool calls are slow, which decisions produced unexpected results, and how much token usage (and therefore cost) the whole agent loop consumed.
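The loop records token counts on each decision span, but nothing totals them for the whole request. Below is a minimal sketch of one way to accumulate usage across iterations and attach it to the root span. `UsageTotals` and its attribute names are hypothetical helpers for illustration, not part of OpenTelemetry or the Anthropic SDK.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Running totals for one agent loop (hypothetical helper)."""
    llm_calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        """Fold one LLM response's usage into the totals."""
        self.llm_calls += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def as_attributes(self) -> dict:
        """Flat dict of the kind you could pass to span.set_attributes()."""
        return {
            "agent.llm_calls": self.llm_calls,
            "agent.input_tokens": self.input_tokens,
            "agent.output_tokens": self.output_tokens,
        }


# Simulated usage from three iterations. Inside agent_loop you would call
# totals.add(response.usage.input_tokens, response.usage.output_tokens)
# after each decision, then root_span.set_attributes(totals.as_attributes()).
totals = UsageTotals()
for input_tokens, output_tokens in [(120, 40), (310, 55), (420, 90)]:
    totals.add(input_tokens, output_tokens)

print(totals.as_attributes())
```

Putting the totals on the root span lets you sort or filter whole traces by usage without opening every child span.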

Recursive Trace Handling: Chain-of-Thought

For chain-of-thought reasoning, the LLM breaks a problem into intermediate steps, then solves each step:

def chain_of_thought_reasoning(problem: str, depth: int = 0, max_depth: int = 3):
    """Recursive chain-of-thought reasoning with tracing."""

    # Create a nested span for this reasoning step
    with tracer.start_as_current_span(f"reasoning_step_{depth}") as span:
        span.set_attribute("depth", depth)
        span.set_attribute("problem", problem[:100])

        if depth >= max_depth:
            span.set_attribute("reason_for_stop", "max_depth_reached")
            return "Base case reached"

        client = Anthropic()

        # Ask LLM to break down the problem
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": (
                    "Break down this problem into one intermediate step and the final answer. "
                    f"Put the intermediate step on its own line starting with 'Step:'.\n{problem}"
                )
            }]
        )

        reasoning = response.content[0].text
        span.set_attribute("reasoning_length", len(reasoning))

        # Check if LLM identified a subproblem (case-insensitive match on "step:")
        marker_index = reasoning.lower().rfind("step:")
        if marker_index != -1:
            # Extract the rest of that line as the subproblem and recurse
            subproblem = reasoning[marker_index + len("step:"):].split("\n")[0].strip()

            if subproblem:
                with tracer.start_as_current_span("recursive_call") as recursive_span:
                    recursive_span.set_attribute("subproblem", subproblem[:100])
                    result = chain_of_thought_reasoning(subproblem, depth=depth + 1, max_depth=max_depth)

                span.set_attribute("recursion_result", result[:100])

        return reasoning

# Example
cot_result = chain_of_thought_reasoning("What is the capital of France and what is its population?")

In Jaeger, recursive traces form a tree: the root span (depth 0) contains a recursive_call span that wraps the depth 1 span, which in turn contains the depth 2 span, and so on. You can see how many recursion levels were needed and where time was spent at each level.

Multi-Model Workflows

Some chains use different models for different steps (a cheap model for classification, an expensive model for generation):

def multi_model_workflow(user_input: str):
    """Workflow combining multiple models."""

    with tracer.start_as_current_span("multi_model_workflow"):
        client_anthropic = Anthropic()

        # Step 1: Classify user intent with a fast model
        with tracer.start_as_current_span("step_1_classify") as span1:
            span1.set_attribute("model", "claude-3-5-haiku-20241022")

            response_classify = client_anthropic.messages.create(
                model="claude-3-5-haiku-20241022",
                max_tokens=50,
                messages=[{
                    "role": "user",
                    "content": f"Classify this as support/sales/general:\n{user_input}"
                }]
            )

            # The model may reply "Support." or a full sentence, so match on the label
            raw_intent = response_classify.content[0].text.strip().lower()
            intent = next(
                (label for label in ("support", "sales", "general") if label in raw_intent),
                "unknown",
            )
            span1.set_attribute("intent", intent)
            # Input-token cost only, at an illustrative $0.80 per million tokens
            span1.set_attribute("cost", response_classify.usage.input_tokens * 0.00000080)

        # Step 2: Generate detailed response based on intent
        with tracer.start_as_current_span("step_2_generate") as span2:
            model_for_intent = {
                "support": "claude-3-opus-20240229",  # Expensive but good for complex support
                "sales": "claude-3-5-sonnet-20241022",  # Balanced for sales pitch
                "general": "claude-3-5-haiku-20241022"  # Cheap for simple general queries
            }

            # Unrecognized intents fall back to the balanced model
            selected_model = model_for_intent.get(intent, "claude-3-5-sonnet-20241022")
            span2.set_attribute("model", selected_model)

            response_generate = client_anthropic.messages.create(
                model=selected_model,
                max_tokens=1024,
                messages=[{
                    "role": "user",
                    "content": f"Based on this intent ({intent}), respond helpfully:\n{user_input}"
                }]
            )

            generated_response = response_generate.content[0].text
            # Output-token cost only, at an illustrative $75 per million tokens;
            # the real rate varies by model
            span2.set_attribute("cost", response_generate.usage.output_tokens * 0.000075)

        # Step 3: Optional safety check with a specialized model
        if intent == "support":
            with tracer.start_as_current_span("step_3_safety_check") as span3:
                span3.set_attribute("model", "claude-3-5-haiku-20241022")

                safety_check = client_anthropic.messages.create(
                    model="claude-3-5-haiku-20241022",
                    max_tokens=10,
                    messages=[{
                        "role": "user",
                        "content": f"Does this response contain harmful content? yes/no:\n{generated_response}"
                    }]
                )

                # Match the start of the answer; a bare substring test would
                # also match words such as "unknown"
                verdict = safety_check.content[0].text.strip().lower()
                is_safe = verdict.startswith("no")
                span3.set_attribute("is_safe", is_safe)

        return generated_response

# Example
output = multi_model_workflow("I have a billing issue with my account")

In the trace, you see three steps: classify (cheap, fast), generate (model selected based on intent), and an optional safety check. The code above only records is_safe on the span and returns the response either way. If the safety check fails, you might also route the response to a human reviewer rather than sending it to the user.
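The per-step cost attributes in the workflow use a single hard-coded rate and count only input or only output tokens. A minimal sketch of a per-model estimate that covers both directions follows. The model names and rates in `RATES_PER_MILLION` are placeholder assumptions, and `estimate_cost_usd` is a hypothetical helper; substitute the current published pricing for the models you actually call.

```python
from typing import Optional

# Placeholder rates in USD per million tokens (assumptions, not real pricing)
RATES_PER_MILLION = {
    "small-model": {"input": 0.80, "output": 4.00},
    "large-model": {"input": 15.00, "output": 75.00},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Estimate the cost of one call, or return None for an unknown model."""
    rates = RATES_PER_MILLION.get(model)
    if rates is None:
        return None
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Example: one classification call and one generation call
classify_cost = estimate_cost_usd("small-model", input_tokens=1000, output_tokens=100)
generate_cost = estimate_cost_usd("large-model", input_tokens=2000, output_tokens=500)
print(classify_cost, generate_cost)
```

Returning None for an unknown model lets the caller skip the cost attribute instead of recording a misleading zero.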

Measuring Branching and Concurrency

When a workflow has conditional branches or parallel operations, trace them properly:

import asyncio
from anthropic import AsyncAnthropic

async def parallel_research(query: str):
    """Execute parallel research tasks with tracing."""

    with tracer.start_as_current_span("parallel_research") as root_span:
        root_span.set_attribute("query", query[:100])

        # Async client: a synchronous call here would block the event loop
        # and serialize the three tasks
        client = AsyncAnthropic()

        # Parallel task 1: Search for academic papers
        async def research_academic():
            with tracer.start_as_current_span("research_academic"):
                # Simulate API call
                await asyncio.sleep(0.5)
                return "Academic paper summary: ..."

        # Parallel task 2: Fetch news articles
        async def research_news():
            with tracer.start_as_current_span("research_news"):
                # Simulate API call
                await asyncio.sleep(0.3)
                return "News headline: ..."

        # Parallel task 3: Generate analysis via LLM
        async def analyze_with_llm():
            with tracer.start_as_current_span("analyze_with_llm"):
                response = await client.messages.create(
                    model="claude-3-opus-20240229",
                    max_tokens=512,
                    messages=[{
                        "role": "user",
                        "content": f"Analyze this query:\n{query}"
                    }]
                )
                return response.content[0].text

        # Run all tasks in parallel
        academic, news, analysis = await asyncio.gather(
            research_academic(),
            research_news(),
            analyze_with_llm()
        )

        root_span.set_attribute("total_subtasks", 3)
        return f"Results: {academic}\n{news}\n{analysis}"

# Run async workflow
result = asyncio.run(parallel_research("quantum computing applications"))

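In Jaeger, parallel tasks appear side-by-side on the timeline, and the parent span lasts about as long as its slowest child. Between the two simulated tasks, research_academic (500ms) outlasts research_news (300ms), so it sets a floor of roughly 500ms. In practice the LLM call in analyze_with_llm often takes longer than either, and the timeline shows at a glance which task is the bottleneck.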
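The claim that the parent duration tracks the slowest child can be checked without any tracing backend. The sketch below uses only the standard library; `fake_task` and `timed_gather` are hypothetical stand-ins for the research tasks, with shorter sleeps so it runs quickly.

```python
import asyncio
import time


async def fake_task(name: str, seconds: float) -> str:
    """Stand-in for an I/O-bound call such as a search or LLM request."""
    await asyncio.sleep(seconds)
    return name


async def timed_gather():
    """Run two tasks concurrently and measure total wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_task("academic", 0.2),
        fake_task("news", 0.15),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(timed_gather())
# Elapsed time is close to the slowest task (0.2s), not the sum (0.35s)
print(results, round(elapsed, 2))
```

If the elapsed time were close to the sum instead, that would suggest one of the tasks is blocking the event loop, which is the same symptom a sequential waterfall in the trace would reveal.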

Debugging Complex Traces

For deeply nested traces, use span attributes to add context:

def complex_workflow_with_debugging(user_request: str):
    """Complex workflow with rich debugging context."""

    with tracer.start_as_current_span("complex_workflow") as root_span:
        # Placeholder identifiers; pass in real values from the request
        root_span.set_attribute("user_id", "user_123")
        root_span.set_attribute("request_id", "req_456")
        root_span.set_attribute("workflow_version", "2.1")

        # Log decision points (events are timestamped automatically when added)
        root_span.add_event(
            name="workflow_started",
            attributes={
                "input_length": len(user_request)
            }
        )

        # ... perform steps ...

        # Log final state (placeholder totals; record the measured values)
        root_span.add_event(
            name="workflow_completed",
            attributes={
                "success": True,
                "total_llm_calls": 5,
                "total_tokens": 2500,
                "total_cost_usd": 0.045
            }
        )

Events (timestamped logs within a span) help you track workflow progression without creating separate spans for every minor step.

Key Takeaways

  • Agent loops require tracing each iteration (decision, action, observation) so you can debug agent behavior and track cumulative cost.
  • Recursive traces (chain-of-thought) form a tree in Jaeger, showing depth and breadth of reasoning.
  • Multi-model workflows trace model selection decisions and cost per step, enabling you to optimize which model to use for which task.
  • Parallel operations in traces appear side-by-side, revealing bottlenecks and total workflow latency.
  • Rich span attributes and events enable detailed debugging of complex workflows without excessive span proliferation.

Frequently Asked Questions

How do I prevent trace explosion in deeply nested workflows?

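Use tail-based sampling: keep 100% of traces that contain errors or exceed a latency threshold, but only 5–10% of fast, successful ones. The keep-or-drop decision has to be made after the trace completes (for example, with the OpenTelemetry Collector's tail sampling processor), because a head-based sampler in the SDK decides before it knows whether the trace will fail or run slowly. This keeps the number of stored traces manageable while preserving visibility into failures.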

Traces typically go 5–7 levels deep (reasonable for agent loops and chain-of-thought). If you exceed 10 levels, consider restructuring your workflow to use fewer intermediate steps.
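The sampling rule can be sketched as a small decision function that runs once a trace has finished. This is an illustration of the policy only, not the API of any collector or SDK; `keep_trace`, the 2000ms threshold, and the 10% baseline rate are assumptions.

```python
import hashlib


def keep_trace(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 2000.0,
    baseline_rate: float = 0.10,
) -> bool:
    """Decide whether to keep a completed trace.

    Always keep errors and slow traces; keep a fixed fraction of the rest.
    """
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    # Map the trace ID to a stable value in [0, 1) so every service that
    # sees the same trace ID reaches the same decision
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


# Example: errors and slow traces are always kept
print(keep_trace("trace-a", had_error=True, duration_ms=120.0))
print(keep_trace("trace-b", had_error=False, duration_ms=5400.0))
```

Hashing the trace ID rather than calling a random number generator keeps the decision deterministic, so a trace is never half-kept across components.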

How do I correlate an agent's tool calls with external tool logs?

Pass the trace ID to the tool when you invoke it, and have the tool include the trace ID in its own logs. Then cross-reference in your log aggregation platform (Datadog, Splunk) using the shared trace ID.
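One common way to hand the trace ID to an HTTP-based tool is the W3C Trace Context `traceparent` header. The sketch below builds that header by hand to show its shape; `build_traceparent` is a hypothetical helper, and the IDs are the example values from the W3C specification. In real code the IDs would come from the active span's context, and OpenTelemetry's propagation API (`opentelemetry.propagate.inject`) can populate the header for you.

```python
def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a W3C traceparent header: version-traceid-spanid-flags."""
    flags = 1 if sampled else 0
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


# With OpenTelemetry, the IDs would come from the current span, e.g.:
#   ctx = trace.get_current_span().get_span_context()
#   header = build_traceparent(ctx.trace_id, ctx.span_id)
header = build_traceparent(
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
)

# Send this with the tool request; the tool logs the 32-character trace ID
tool_headers = {"traceparent": header}
trace_id_for_logs = header.split("-")[1]
print(tool_headers, trace_id_for_logs)
```

The tool only needs to write the 32-character hex trace ID into each log line for the cross-reference to work.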

Should I trace every tool call in an agent loop?

Yes, trace every tool call so you can debug which tool the agent chose and whether it produced the expected result. This visibility is essential for understanding and improving agent behavior.

Further Reading