Parallel Tool Use: Speed Up Agent Workflows

When an agent needs to gather data from multiple independent sources (checking inventory, fetching weather, querying a database), it can call all of them in parallel instead of sequentially. If each tool takes 1 second and you call 5 tools in sequence, latency is 5 seconds; in parallel, it is about 1 second plus overhead. Modern agent frameworks support tool call batching: the model requests multiple tools in a single response, and your runtime executes them concurrently and returns all results together.

What Is Parallel Tool Execution?

Parallel tool execution is the ability to invoke multiple tools at the same time without waiting for one to complete before starting the next. In the agent loop, this works as follows:

  1. Model outputs multiple tool calls in a single response (instead of one per response).
  2. Runtime batches the execution: each tool runs in its own async task or thread.
  3. Results accumulate as they complete.
  4. All results are appended to the conversation at once.
  5. Model reads all results and decides what to do next.

The key difference from sequential execution: the model requests N tools, the runtime launches N tasks, and they run concurrently. The total time is the longest single tool call, not the sum of all calls.
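
The latency claim above can be checked with a few lines of arithmetic. The sketch below is a simplified model rather than a benchmark: `estimate_latency` and its `overhead` parameter are hypothetical names, and it assumes there are enough workers for every call to start at the same moment.

```python
def estimate_latency(durations, overhead=0.0):
    """Return (sequential, parallel) latency estimates in seconds.

    Assumes every call can start at once (no worker limit), so the
    parallel estimate is the slowest call plus a fixed overhead.
    """
    if not durations:
        return 0.0, 0.0
    sequential = sum(durations)            # Calls run back to back
    parallel = max(durations) + overhead   # Calls overlap; slowest one dominates
    return sequential, parallel


# Five tools at 1 second each, with an assumed 50 ms of scheduling overhead
seq, par = estimate_latency([1.0] * 5, overhead=0.05)
print(f"sequential={seq:.2f}s parallel={par:.2f}s speedup={seq / par:.1f}x")
```

If the runtime caps concurrency below the number of calls, the real figure falls somewhere between the two estimates.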

Example: You have a travel agent that needs to fetch flight prices from 3 airlines, and each lookup takes about 1 second. Sequential: 3 seconds. Parallel: about 1 second (all 3 run at once). That is roughly a 3× speedup with no added hallucination risk: the model has already decided which tools to call, and the runtime just executes them faster.

How Models Output Multiple Tool Calls

Claude can output multiple tool calls in a single response. The response contains a list of tool_use blocks, each with a unique ID. Your framework must handle this:

import anthropic
import json
import random
from concurrent.futures import ThreadPoolExecutor

client = anthropic.Anthropic()

tools = [
    {
        "name": "fetch_flight_prices",
        "description": "Fetch flight prices from an airline",
        "input_schema": {
            "type": "object",
            "properties": {
                "airline": {"type": "string", "enum": ["United", "Delta", "Southwest"]},
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "format": "date"}
            },
            "required": ["airline", "origin", "destination", "date"]
        }
    }
]

def fetch_flight_prices(airline, origin, destination, date):
    """Simulate fetching prices (returns a random price, no real API call)."""
    return {
        "airline": airline,
        "price": round(random.uniform(100, 500), 2),
        "departure": "10:00 AM",
        "duration": "5h 30m"
    }

# Map tool names to the functions that implement them
tool_functions = {"fetch_flight_prices": fetch_flight_prices}

# User request
messages = [
    {"role": "user", "content": "Compare flights from NYC to LA on 2026-06-15 for United, Delta, and Southwest."}
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=messages
)

# Extract all tool calls
tool_calls = [block for block in response.content if block.type == "tool_use"]
print(f"Model requested {len(tool_calls)} parallel tool calls")

# Execute in parallel
results = {}
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {}
    for tool_call in tool_calls:
        func = tool_functions.get(tool_call.name)
        if func is None:
            # Every tool_use block needs a matching tool_result, even for unknown tools
            results[tool_call.id] = {"error": f"Unknown tool: {tool_call.name}"}
            continue
        futures[tool_call.id] = executor.submit(func, **tool_call.input)

    # Collect results; a failure in one tool does not discard the others
    for tool_use_id, future in futures.items():
        try:
            results[tool_use_id] = future.result(timeout=5)
        except Exception as e:
            results[tool_use_id] = {"error": str(e)}

# Append all results to conversation
messages.append({"role": "assistant", "content": response.content})
messages.append({
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": json.dumps(result)
        }
        for tool_use_id, result in results.items()
    ]
})

# Model now sees all results and can compare
final_response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    tools=tools,
    messages=messages
)

for block in final_response.content:
    if block.type == "text":
        print(block.text)

In this example, the model typically outputs 3 tool calls in a single response. The runtime launches 3 threads, all executing fetch_flight_prices concurrently. Results are collected and appended together in a single user message. The model then reads all results and summarizes them, for example: "Delta is cheapest at $249, departing 10:00 AM." (the simulated prices are random, so the actual answer varies). With real lookups that take about 1 second each, the tool execution step completes in ~1 second instead of 3.

Async Implementation (asyncio)

For I/O-bound tools (API calls, database queries), async/await is more efficient than threads:

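import asyncio
import aiohttp

async def fetch_flight_prices_async(session, airline, origin, destination, date):
    """Async version using aiohttp (api.airline.com is a placeholder URL)."""
    url = "https://api.airline.com/prices"
    params = {"airline": airline, "origin": origin,
              "destination": destination, "date": date}
    async with session.get(url, params=params) as resp:
        resp.raise_for_status()
        return await resp.json()

async def execute_tools_parallel(tool_calls):
    """Execute all tool calls concurrently."""
    results = {}
    ids, coros = [], []
    # One shared session for the whole batch instead of one per call
    async with aiohttp.ClientSession() as session:
        for tool_call in tool_calls:
            if tool_call.name == "fetch_flight_prices":
                ids.append(tool_call.id)
                coros.append(fetch_flight_prices_async(session, **tool_call.input))
            else:
                results[tool_call.id] = {"error": f"Unknown tool: {tool_call.name}"}

        # gather() schedules every coroutine at once; awaiting them one by one
        # in a loop would run them sequentially
        outcomes = await asyncio.gather(*coros, return_exceptions=True)

    for tool_use_id, outcome in zip(ids, outcomes):
        if isinstance(outcome, Exception):
            results[tool_use_id] = {"error": str(outcome)}
        else:
            results[tool_use_id] = outcome

    return results

# In your agent loop (inside an async function):
tool_results = await execute_tools_parallel(tool_calls)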

Async is preferred for I/O-bound tools because it requires fewer OS threads and has lower memory overhead. A single asyncio event loop can manage hundreds of concurrent API calls, whereas each thread in a pool carries its own stack and scheduling cost, so thread pools are usually capped at tens to low hundreds of workers.

Coordination and Dependencies

Sometimes tools have dependencies: B needs the result from A, so A must be called first. In that case, you cannot parallelize the two calls; execute them sequentially. Parallel execution is only valid when tools are independent (no data dependencies).

Example: Fetching weather for 3 cities involves 3 independent calls. Fetching weather and then looking up pollen counts for that weather region is a dependent pair and requires sequential execution.

Strategy: Group tools by dependency level. Execute all level-1 tools in parallel. When all complete, execute level-2 tools in parallel, and so on.

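def group_tools_by_dependency(tool_calls):
    """Group tool calls into levels; calls within a level are independent."""
    # Tool dependency graph: tool name -> names of tools it depends on
    dependencies = {
        "fetch_weather": [],                       # No dependencies
        "fetch_inventory": [],
        "lookup_pollen": ["fetch_weather"],        # Depends on weather
        "recommend_product": ["fetch_inventory"],  # Depends on inventory
    }

    requested = {t.name for t in tool_calls}
    completed = set()              # Names of tools placed in earlier levels
    remaining = list(tool_calls)   # A list, since tool call objects may not be hashable
    levels = []

    while remaining:
        # A call is ready when each dependency is either in an earlier level
        # or not part of this batch (assumed to have been satisfied earlier)
        current_level, waiting = [], []
        for t in remaining:
            deps = dependencies.get(t.name, [])
            if all(dep in completed or dep not in requested for dep in deps):
                current_level.append(t)
            else:
                waiting.append(t)

        if not current_level:
            raise ValueError("Circular dependency detected")

        levels.append(current_level)
        completed.update(t.name for t in current_level)
        remaining = waiting

    return levels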
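
Once calls are grouped into levels, the execution strategy described above (run one level concurrently, wait, then run the next) might look like the following sketch. `run_levels`, `ToolCall`, and `demo_tool` are hypothetical names introduced for illustration; `ToolCall` only mimics the `id`, `name`, and `input` attributes of a tool_use block.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical stand-in for a tool_use block."""
    id: str
    name: str
    input: dict = field(default_factory=dict)


async def run_levels(levels, run_tool):
    """Run each level concurrently; start a level only after the previous one ends.

    levels:   list of lists of tool calls, ordered by dependency level
    run_tool: async callable taking (tool_call, earlier_results) -> result
    """
    results = {}
    for level in levels:
        # Snapshot so calls in this level see only results from earlier levels
        snapshot = dict(results)
        outputs = await asyncio.gather(*(run_tool(call, snapshot) for call in level))
        for call, output in zip(level, outputs):
            results[call.id] = output
    return results


async def demo_tool(call, earlier_results):
    await asyncio.sleep(0.01)  # Simulated I/O
    return f"{call.name} saw {len(earlier_results)} earlier result(s)"


levels = [
    [ToolCall("t1", "fetch_weather"), ToolCall("t2", "fetch_inventory")],
    [ToolCall("t3", "lookup_pollen")],
]
results = asyncio.run(run_levels(levels, demo_tool))
print(results)
```

The total latency under this scheme is the sum, over levels, of the slowest call in each level.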

Limitations and Gotchas

Limit 1: Context size. Multiple tool results append to the conversation. If each result is large, context fills up fast. Monitor tokens; truncate large results if necessary.
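
One simple way to keep results from crowding the context is to cap each one before it is appended. This is a minimal sketch: `truncate_tool_result` and the `max_chars` budget are hypothetical, and character count is used only as a rough proxy for tokens.

```python
import json


def truncate_tool_result(result, max_chars=2000):
    """Serialize a tool result and cap its length before it joins the conversation."""
    text = result if isinstance(result, str) else json.dumps(result)
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    # Tell the model that data was cut, so it does not treat the result as complete
    return f"{text[:max_chars]}\n[truncated {omitted} characters]"


print(truncate_tool_result({"rows": list(range(1000))}, max_chars=40))
```

A truncated JSON string is no longer valid JSON, so this suits results the model only reads; for structured data, trimming fields or rows before serializing may be a better fit.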

Limit 2: Execution time. The workflow is still limited by the slowest tool (the critical path). The gain from parallel execution is largest when tools have similar latency; if one tool dominates, the speedup is small.
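
A per-tool timeout is one way to stop a single slow call from holding up the whole batch. The sketch below uses `asyncio.wait_for`; `run_with_timeout` and `fake_tool` are hypothetical helpers, and the timeout values are arbitrary.

```python
import asyncio


async def run_with_timeout(coro, timeout_s):
    """Await coro, returning an error dict instead of waiting past timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": f"timed out after {timeout_s}s"}


async def fake_tool(name, delay):
    """Hypothetical tool that just sleeps for `delay` seconds."""
    await asyncio.sleep(delay)
    return {"tool": name, "ok": True}


async def main():
    # The slow call is cancelled at 0.1s, so the batch does not wait a full second
    return await asyncio.gather(
        run_with_timeout(fake_tool("fast", 0.01), timeout_s=0.1),
        run_with_timeout(fake_tool("slow", 1.0), timeout_s=0.1),
    )


results = asyncio.run(main())
print(results)
```

The timed-out call still yields a result entry, so the model can be told which tool did not finish.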

Limit 3: Rate limits. Parallel calls to the same API may hit rate limits. Consider batching or throttling concurrent requests.

Limit 4: Resource exhaustion. Too many concurrent operations (threads, async tasks) can exhaust file descriptors or memory. Use bounded executors (e.g., max_workers=10).
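
For async tools, the equivalent of a bounded executor is a semaphore that caps how many calls are in flight. This is a sketch under stated assumptions: `gather_bounded` and `fake_api_call` are hypothetical names, and the limit of 5 is arbitrary.

```python
import asyncio
from functools import partial


async def gather_bounded(coro_factories, limit=10):
    """Run coroutines concurrently with at most `limit` in flight at once.

    coro_factories: zero-argument callables that each return a coroutine.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:   # Wait here until a slot is free
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))


stats = {"in_flight": 0, "peak": 0}  # Tracks observed concurrency for the demo


async def fake_api_call(i):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # Simulated network latency
    stats["in_flight"] -= 1
    return i * 2


factories = [partial(fake_api_call, i) for i in range(20)]
results = asyncio.run(gather_bounded(factories, limit=5))
print(results[:3], "peak concurrency:", stats["peak"])
```

A semaphore bounds concurrency, not requests per second; an API with a strict per-second quota may also need a delay or a token-bucket limiter.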

Comparison: Sequential vs. Parallel

| Metric | Sequential (5 tools, 1s each) | Parallel (same) | Speedup / difference |
| --- | --- | --- | --- |
| Total latency | 5s | 1s | ~5× |
| Token usage | Higher (multiple responses) | Lower (single response batch) | 10–20% savings |
| Error recovery | Per-tool | All-or-nothing | Sequential is more granular |
| Model complexity | Simpler (fewer calls) | Slightly higher (track IDs) | Negligible |

Key Takeaways

  • Parallel tool execution cuts latency when tools are independent.
  • Models can output multiple tool calls in a single response; runtimes execute concurrently.
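  • Use asyncio for I/O-bound tools with async clients; use ThreadPoolExecutor to wrap blocking I/O calls. For CPU-bound work in Python, a process pool is usually needed because of the GIL.
  • Group tools by dependency level when some calls depend on the results of others.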
  • Monitor context size, rate limits, and resource usage to avoid bottlenecks.

Frequently Asked Questions

Can the model decide when to parallelize?

The model does not explicitly choose parallelization; it simply outputs multiple tool calls when it judges them to be independent. Your runtime then decides how to execute them (sequentially or in parallel). Many frameworks parallelize by default when multiple calls are present.
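
If you need the model to emit at most one tool call per response, the Anthropic Messages API documents a `disable_parallel_tool_use` field on `tool_choice`. The fragment below only builds request arguments and sends nothing; treat the field name as something to verify against the current API reference, and note that `request_kwargs` is a hypothetical variable name.

```python
# Keyword arguments intended for client.messages.create(); no request is sent here.
# Assumption: field names follow the Anthropic Messages API docs at the time of writing.
request_kwargs = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    # "auto" lets the model decide whether to call a tool at all;
    # the extra flag asks it not to batch several calls into one response
    "tool_choice": {"type": "auto", "disable_parallel_tool_use": True},
}
print(request_kwargs["tool_choice"])
```

These arguments would be merged with the usual `tools` and `messages` values before calling the API.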

What if one parallel tool fails?

Catch the exception, return an error message for that tool's tool_use_id, and include successful results from other tools. The model reads mixed success/error and may retry the failed tool or work around it.
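
A sketch of that pattern with asyncio follows. It assumes tool calls are plain dicts with "id", "name", and "input" keys (mirroring tool_use blocks); `run_batch` and the two demo tools are hypothetical. The `is_error` flag is the field the Anthropic tool_result format uses to mark a failed call.

```python
import asyncio
import json


async def run_batch(calls, handlers):
    """Run tool calls concurrently and build one tool_result block per call."""
    async def run_one(call):
        handler = handlers.get(call["name"])
        if handler is None:
            raise ValueError(f"unknown tool: {call['name']}")
        return await handler(**call["input"])

    # return_exceptions=True returns exceptions as values instead of raising the first
    outcomes = await asyncio.gather(
        *(run_one(c) for c in calls), return_exceptions=True
    )

    blocks = []
    for call, outcome in zip(calls, outcomes):
        failed = isinstance(outcome, Exception)
        blocks.append({
            "type": "tool_result",
            "tool_use_id": call["id"],
            "content": str(outcome) if failed else json.dumps(outcome),
            "is_error": failed,
        })
    return blocks


async def get_weather(city):
    await asyncio.sleep(0.01)  # Simulated I/O
    return {"city": city, "temp_c": 21}


async def get_inventory(sku):
    raise RuntimeError("inventory service unavailable")


calls = [
    {"id": "toolu_1", "name": "get_weather", "input": {"city": "Oslo"}},
    {"id": "toolu_2", "name": "get_inventory", "input": {"sku": "A-1"}},
]
handlers = {"get_weather": get_weather, "get_inventory": get_inventory}
blocks = asyncio.run(run_batch(calls, handlers))
print(blocks)
```

Every tool_use_id receives exactly one block, so the follow-up message stays valid even when some calls fail.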

How many parallel tools is safe?

Start with 5–10. Monitor CPU, memory, and network usage. For API calls, check rate limits (many APIs allow 10–100 requests/sec). For database queries, check connection pool limits. Adjust concurrency based on observed bottlenecks.

Does parallel execution increase model hallucination?

No. The model decides which tools to call independently. Parallelization is a runtime optimization that does not affect the model's reasoning.

Should I use threads or async?

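Use async for I/O-bound work (network, database) when an async client is available. Use threads when you need to run blocking, synchronous I/O libraries concurrently. For CPU-bound work (math, compression) in CPython, threads are limited by the GIL, so a process pool such as ProcessPoolExecutor is the usual choice. For most agent workloads (APIs, databases), async is preferred.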
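
The two approaches can also be combined: when a tool only has a blocking client, `asyncio.to_thread` (Python 3.9+) runs it in a worker thread while the rest of the agent stays async. In this sketch, `blocking_lookup` is a hypothetical stand-in for such a client.

```python
import asyncio
import time


def blocking_lookup(key):
    """Stand-in for a synchronous call, such as a blocking database driver."""
    time.sleep(0.05)
    return f"value-for-{key}"


async def lookup_all(keys):
    # Each blocking call runs in the default thread pool, so the calls overlap
    # and the event loop stays free for other work
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_lookup, k) for k in keys)
    )


values = asyncio.run(lookup_all(["a", "b", "c"]))
print(values)
```

Results come back in the same order as the input keys, regardless of which thread finishes first.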

Further Reading