Orchestrating Dozens of Tools: Scale Agent Complexity

A single agent with 5 tools is manageable. With 50 tools, the agent becomes confused. With 100+ tools, it barely functions. Orchestrating dozens of tools requires a different architecture: stratification by category, smart tool routing, hierarchical state management, and explicit workflow definitions. The difference between a brittle 20-tool agent and a scalable 100-tool agent is structure.

The Complexity Problem

As tools increase, several problems compound:

  1. Cognitive load: The model must reason about too many options. Tool selection accuracy can drop from roughly 95% with 5 tools to roughly 40% with 100 tools.
  2. Context inflation: Listing all tools consumes context tokens. With 100 tools, descriptions alone can use 10K+ tokens.
  3. Ambiguity: Similar tools (query_db, search_db, fetch_from_db) confuse the model.
  4. State sprawl: Tracking which tools have been called, what data they returned, and what remains to be done becomes chaotic.
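
The context-inflation point can be checked before any model call. The sketch below is a rough estimate only: it assumes about 4 characters per token (a rule of thumb, not a tokenizer), and the tool definition it duplicates is a hypothetical placeholder.

```python
import json

def estimate_tool_tokens(tool_definitions, chars_per_token=4):
    """Rough token estimate for a list of tool definitions.

    chars_per_token=4 is a rule-of-thumb assumption, not a real tokenizer.
    """
    serialized = json.dumps(tool_definitions)
    return len(serialized) // chars_per_token

# Hypothetical definition, duplicated to simulate small and large toolsets
example_tool = {
    "name": "query_database",
    "description": "Use for historical data, internal records",
    "input_schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
}
small = [example_tool] * 5
large = [example_tool] * 100

print(estimate_tool_tokens(small), estimate_tool_tokens(large))
```

For real budgeting, replace the character heuristic with the token counter of the model you actually use.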

The solution is stratification: organize tools hierarchically, route the model to the right layer, and manage state explicitly.

Architecture: Hierarchical Tool Organization

Organize tools into layers:

Layer 0: Decision maker. The model decides which task category to use (Data Retrieval, Computation, Business Actions, etc.).

Layer 1: Tool selector. Given a category, the model selects from that category's curated subset (e.g., 5–10 tools).

Layer 2: Specific tool. The model calls a specific tool with concrete arguments.

tool_hierarchy = {
    "data_retrieval": {
        "description": "Fetch information from various sources",
        "tools": ["search_web", "search_internal_docs", "query_database", "fetch_api"],
        "routing_hints": {
            "search_web": "Use for current public information, news, trends",
            "search_internal_docs": "Use for company knowledge, policies",
            "query_database": "Use for historical data, internal records",
            "fetch_api": "Use for real-time data from partner systems",
        },
    },
    "computation": {
        "description": "Perform calculations and analysis",
        "tools": ["calculate", "analyze_statistics", "run_simulation", "ml_predict"],
        "routing_hints": {
            "calculate": "Simple arithmetic, algebra, formulas",
            "analyze_statistics": "Descriptive stats, correlations, distributions",
            "run_simulation": "Monte Carlo, forecasting, scenario analysis",
            "ml_predict": "Machine learning inference",
        },
    },
    "business_actions": {
        "description": "Take action in business systems",
        "tools": ["send_email", "create_ticket", "update_record", "approve_workflow"],
        "routing_hints": {
            "send_email": "Communicate with humans",
            "create_ticket": "Track issues and tasks",
            "update_record": "Modify business data",
            "approve_workflow": "Authorize decisions",
        },
    },
}

The model first chooses a category ("I need to fetch data"), then a specific tool within that category ("search_web for current info"). This two-stage decision reduces cognitive load dramatically.
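
The two-stage decision can be sketched as two small menu builders: one that shows only categories, and one that shows only the tools inside the chosen category. This is a minimal sketch, not a full router; `category_menu` and `tool_menu` are hypothetical helper names, and the hierarchy is a trimmed copy so the example runs on its own.

```python
# Trimmed copy of the hierarchy so the sketch is self-contained
hierarchy = {
    "data_retrieval": {
        "description": "Fetch information from various sources",
        "tools": ["search_web", "query_database"],
        "routing_hints": {
            "search_web": "Use for current public information, news, trends",
            "query_database": "Use for historical data, internal records",
        },
    },
    "computation": {
        "description": "Perform calculations and analysis",
        "tools": ["calculate"],
        "routing_hints": {"calculate": "Simple arithmetic, algebra, formulas"},
    },
}

def category_menu(hierarchy):
    """Stage 1: one line per category, with no individual tools listed."""
    return "\n".join(
        f"- {name}: {cfg['description']}" for name, cfg in hierarchy.items()
    )

def tool_menu(hierarchy, category):
    """Stage 2: only the tools (with routing hints) in the chosen category."""
    if category not in hierarchy:
        raise KeyError(f"Unknown category: {category}")
    cfg = hierarchy[category]
    return "\n".join(
        f"- {tool}: {cfg['routing_hints'].get(tool, '')}" for tool in cfg["tools"]
    )

print(category_menu(hierarchy))
print(tool_menu(hierarchy, "data_retrieval"))
```

At each stage the model sees a handful of options, never the full tool list.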

System Prompt Engineering for Scale

A carefully crafted system prompt guides the model through layered decisions:

system_prompt = """You are an agent with access to 50+ tools organized into categories.

TOOL CATEGORIES (choose one based on your task):
1. Data Retrieval: Fetch information from external sources
   - search_web: Current public information, news, web content
   - search_internal_docs: Company knowledge base, policies
   - query_database: Historical data, internal records
   - fetch_api: Real-time data from partner systems

2. Computation: Analyze data and perform calculations
   - calculate: Arithmetic, algebra, math expressions
   - analyze_statistics: Descriptive stats, correlations
   - run_simulation: Forecasting, scenario analysis
   - ml_predict: ML model inference

3. Business Actions: Modify systems and take action
   - send_email: Communicate with humans
   - create_ticket: Track issues
   - update_record: Modify business data
   - approve_workflow: Authorize decisions

DECISION PROCESS:
1. Understand the user's request
2. Determine which category is most relevant (Data Retrieval, Computation, or Business Actions)
3. Within that category, choose the specific tool that best matches the task
4. Call the tool with precise arguments

EXAMPLE:
User: "What are the latest AI trends?"
→ Category: Data Retrieval (need current info)
→ Tool: search_web (for recent public information)
→ Call: search_web(query="latest AI trends 2026")

User: "Calculate the IRR of this investment"
→ Category: Computation (math analysis)
→ Tool: calculate (using IRR formula)
→ Call: calculate(expression="IRR(cf_0=-1000, cf_1=300, cf_2=400, cf_3=500)")

Now, handle the user's request by choosing the right category and tool."""

This prompt is longer than typical, but the extra length pays off: the model sees the categories, the decision logic, and worked examples in one place, which makes tool selection more reliable.
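
Hand-written category lists tend to drift from the tool configuration as tools are added. One option is to generate the category section of the prompt from the hierarchy itself. The sketch below assumes the same dictionary shape as `tool_hierarchy`; `render_category_section` is a hypothetical helper, not part of any library.

```python
def render_category_section(hierarchy):
    """Render the TOOL CATEGORIES part of the system prompt from the config."""
    lines = ["TOOL CATEGORIES (choose one based on your task):"]
    for index, (name, cfg) in enumerate(hierarchy.items(), start=1):
        title = name.replace("_", " ").title()  # "data_retrieval" -> "Data Retrieval"
        lines.append(f"{index}. {title}: {cfg['description']}")
        for tool in cfg["tools"]:
            lines.append(f"   - {tool}: {cfg['routing_hints'].get(tool, '')}")
        lines.append("")  # blank line between categories
    return "\n".join(lines).rstrip()

# Trimmed sample so the sketch runs on its own
sample = {
    "data_retrieval": {
        "description": "Fetch information from various sources",
        "tools": ["search_web"],
        "routing_hints": {"search_web": "Use for current public information, news, trends"},
    },
    "computation": {
        "description": "Perform calculations and analysis",
        "tools": ["calculate"],
        "routing_hints": {"calculate": "Simple arithmetic, algebra, formulas"},
    },
}

section = render_category_section(sample)
print(section)
```

The decision process and the worked examples can stay hand-written; only the part that mirrors the configuration is generated.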

Dynamic Tool Discovery with Embeddings

Instead of listing all 50 tools, use embeddings to find the top K relevant ones:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class ToolFinder:
    def __init__(self, tool_hierarchy):
        self.tool_hierarchy = tool_hierarchy
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.tool_embeddings = {}

        # Pre-embed all tools once, using "name: routing hint" as the description
        for category, cfg in tool_hierarchy.items():
            for tool_name in cfg["tools"]:
                hint = cfg["routing_hints"].get(tool_name, "")
                description = f"{tool_name}: {hint}"
                embedding = self.embedding_model.encode(description)
                self.tool_embeddings[tool_name] = embedding

    def find_relevant_tools(self, user_request, top_k=5):
        """Find the top-K most relevant tools for a request."""
        request_embedding = self.embedding_model.encode(user_request)

        # Score every tool by cosine similarity to the request
        similarities = {}
        for tool_name, tool_embedding in self.tool_embeddings.items():
            sim = cosine_similarity([request_embedding], [tool_embedding])[0][0]
            similarities[tool_name] = sim

        # Highest similarity first, truncated to top_k
        top_tools = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:top_k]
        return [name for name, _ in top_tools]

# Usage
finder = ToolFinder(tool_hierarchy)
relevant = finder.find_relevant_tools("Calculate ROI of a marketing campaign", top_k=3)
print(relevant)  # e.g., ['calculate', 'analyze_statistics', 'ml_predict']

# Send only these 3 tools to the model, not all 50

This reduces context overhead and improves tool selection accuracy.
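
The listing above stops at a list of tool names. To actually shrink the request, those names still have to be mapped back to full tool definitions. A minimal sketch of that last step follows, assuming each definition is a dict with `name`, `description`, and `input_schema` keys; `select_tool_definitions` and the placeholder definitions are hypothetical.

```python
def select_tool_definitions(all_definitions, relevant_names):
    """Keep only the shortlisted definitions, in shortlist order."""
    by_name = {definition["name"]: definition for definition in all_definitions}
    missing = [name for name in relevant_names if name not in by_name]
    if missing:
        raise KeyError(f"No definition registered for: {missing}")
    return [by_name[name] for name in relevant_names]

# Placeholder definitions standing in for a real tool registry
all_definitions = [
    {
        "name": name,
        "description": f"Placeholder description for {name}",
        "input_schema": {"type": "object", "properties": {}},
    }
    for name in ["search_web", "query_database", "calculate", "analyze_statistics", "send_email"]
]

shortlist = ["calculate", "analyze_statistics"]  # e.g., the output of find_relevant_tools
tools_for_request = select_tool_definitions(all_definitions, shortlist)
print([definition["name"] for definition in tools_for_request])
```

Raising on an unknown name is a deliberate choice here: a shortlist that refers to an unregistered tool usually means the embedding index and the registry are out of sync.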

Explicit State Machine for Complex Workflows

For workflows with many steps, define an explicit state machine:

import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

class WorkflowPhase(Enum):
    REQUIREMENTS = "requirements"
    PLANNING = "planning"
    EXECUTION = "execution"
    VERIFICATION = "verification"
    COMPLETION = "completion"

@dataclass
class WorkflowState:
    phase: WorkflowPhase
    user_request: str
    plan: str = ""
    executed_tools: List[str] = field(default_factory=list)
    results: Dict[str, Any] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)

    def transition_to(self, next_phase: WorkflowPhase):
        # Phases only move forward, one step at a time
        valid_transitions = {
            WorkflowPhase.REQUIREMENTS: [WorkflowPhase.PLANNING],
            WorkflowPhase.PLANNING: [WorkflowPhase.EXECUTION],
            WorkflowPhase.EXECUTION: [WorkflowPhase.VERIFICATION],
            WorkflowPhase.VERIFICATION: [WorkflowPhase.COMPLETION],
        }

        if next_phase not in valid_transitions.get(self.phase, []):
            raise ValueError(f"Cannot transition from {self.phase} to {next_phase}")

        self.phase = next_phase

def orchestrate_complex_workflow(user_request: str, tools_config, max_turns: int = 20):
    """Orchestrate a complex multi-phase workflow.

    get_tools_for_phase, get_next_phase, and execute_tool are
    application-specific helpers that are not defined in this listing.
    max_turns caps the number of model calls so a stuck phase cannot loop forever.
    """
    state = WorkflowState(phase=WorkflowPhase.REQUIREMENTS, user_request=user_request)
    messages = [{"role": "user", "content": user_request}]

    for _ in range(max_turns):
        if state.phase == WorkflowPhase.COMPLETION:
            break

        # Build phase-specific prompt
        system_prompt = build_phase_prompt(state, tools_config)

        # Get model response
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system=system_prompt,
            tools=get_tools_for_phase(state.phase, tools_config),
            messages=messages,
        )

        # Record the assistant turn once, however many tool calls it contains
        messages.append({"role": "assistant", "content": response.content})

        # Execute tool calls and collect one tool_result per tool_use block
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                state.executed_tools.append(block.name)
                result = execute_tool(block.name, block.input, tools_config)
                state.results[block.name] = result
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                })

        if tool_results:
            # All results for this turn go back in a single user message
            messages.append({"role": "user", "content": tool_results})
        elif response.stop_reason == "end_turn":
            # The phase is finished: keep the plan text, then advance
            if state.phase == WorkflowPhase.PLANNING:
                state.plan = "".join(b.text for b in response.content if b.type == "text")
            if state.phase == WorkflowPhase.VERIFICATION:
                state.transition_to(WorkflowPhase.COMPLETION)
            else:
                state.transition_to(get_next_phase(state.phase))
                # The next request must end with a user turn
                messages.append({
                    "role": "user",
                    "content": f"Continue with the {state.phase.value} phase.",
                })
        else:
            # Truncated or otherwise unexpected stop: record it and bail out
            state.errors.append(f"Unexpected stop_reason: {response.stop_reason}")
            break

    return state

def build_phase_prompt(state: WorkflowState, tools_config):
    """Build a prompt specific to the current workflow phase."""

    if state.phase == WorkflowPhase.REQUIREMENTS:
        return (
            "Analyze the user's request. What information do you need to understand the problem?\n"
            "Output: A brief summary of requirements and any clarifying questions."
        )

    elif state.phase == WorkflowPhase.PLANNING:
        return (
            "Plan how to solve this problem. You know the requirements:\n"
            f"{state.user_request}\n\n"
            "Outline the steps you will take using available tools. Output: A numbered plan."
        )

    elif state.phase == WorkflowPhase.EXECUTION:
        return f"Execute your plan. Call tools as needed. Plan: {state.plan}"

    elif state.phase == WorkflowPhase.VERIFICATION:
        results_summary = "\n".join(f"{k}: {v}" for k, v in state.results.items())
        return (
            "Verify the results. Results so far:\n"
            f"{results_summary}\n\n"
            "Output: A summary of what was accomplished and any gaps."
        )

    else:
        return "Complete the task."

This state machine makes the workflow explicit. The model understands what phase it is in and what it should focus on.
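
The orchestration loop calls `get_next_phase` and `get_tools_for_phase` without showing them. One possible shape for these helpers is sketched below. It assumes `tools_config` is a dict keyed by phase value (for example `{"execution": [...]}`), which the original does not specify, and it repeats the `WorkflowPhase` enum so the example runs on its own.

```python
from enum import Enum

class WorkflowPhase(Enum):
    REQUIREMENTS = "requirements"
    PLANNING = "planning"
    EXECUTION = "execution"
    VERIFICATION = "verification"
    COMPLETION = "completion"

# Enum iteration follows definition order, so this is the forward sequence
PHASE_ORDER = list(WorkflowPhase)

def get_next_phase(phase):
    """Return the phase that follows `phase`; COMPLETION is terminal."""
    index = PHASE_ORDER.index(phase)
    if index == len(PHASE_ORDER) - 1:
        raise ValueError(f"{phase} is the final phase")
    return PHASE_ORDER[index + 1]

def get_tools_for_phase(phase, tools_config):
    """Look up the tool definitions exposed during a phase.

    Assumed config shape: {"<phase value>": [tool definitions]}.
    Phases without an entry expose no tools.
    """
    return tools_config.get(phase.value, [])

config = {"execution": [{"name": "calculate"}]}
print(get_next_phase(WorkflowPhase.PLANNING))
print(get_tools_for_phase(WorkflowPhase.PLANNING, config))
```

Exposing tools only during execution keeps the requirements and planning phases focused on text output; whether that restriction suits a given workflow is a design decision.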

Comparison: Flat vs. Hierarchical Architecture

| Aspect | Flat (50 tools) | Hierarchical (50 tools) |
|---|---|---|
| Tool selection accuracy | 40–50% | 85–95% |
| Context tokens (tools) | 15K+ | 5K (top 10) |
| Decision latency | Higher | Lower |
| User complexity | High (confusing) | Low (guided) |
| Maintainability | Difficult | Clear structure |

Key Takeaways

  • Stratify tools into categories and subcategories to reduce cognitive load.
  • Use embedding-based tool discovery to reduce context overhead.
  • Write phase-specific system prompts that guide the model through complex workflows.
  • Implement explicit state machines to track progress and prevent loops.
  • Test tool selection accuracy at scale; aim for >85%.

Frequently Asked Questions

How many tools should I put in each category?

5–10 is ideal. Beyond that, cognitive load increases. If you have more, sub-categorize further (e.g., Data Retrieval → Web APIs, Internal DBs, External Services).
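
A quick check can flag categories that have outgrown this range. The sketch assumes the same hierarchy shape used earlier (a `tools` list per category); `oversized_categories` is a hypothetical helper and the threshold of 10 simply mirrors the guideline above.

```python
def oversized_categories(hierarchy, max_tools=10):
    """Return {category: tool_count} for categories with more than max_tools tools."""
    return {
        name: len(cfg["tools"])
        for name, cfg in hierarchy.items()
        if len(cfg["tools"]) > max_tools
    }

# Hypothetical hierarchy: one category within range, one that needs splitting
sample_hierarchy = {
    "computation": {"tools": ["calculate", "analyze_statistics"]},
    "data_retrieval": {"tools": [f"source_{i}" for i in range(14)]},
}

print(oversized_categories(sample_hierarchy))
```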

Should I always use embedding-based filtering?

Yes, for 20+ tools. The overhead is minimal (embedding is fast), and the accuracy gain is significant. For <10 tools, simpler approaches (descriptions + examples) are sufficient.

What if the model chooses the wrong category?

Build an escape hatch into the system prompt: "If none of these categories seem right, explain why and ask for clarification." With this instruction, the model is more likely to signal confusion than to force a poor match or hallucinate a tool.
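
On the application side, the reply still has to be checked before routing. The sketch below assumes the model states its choice on a line of the form "Category: <name>", as in the prompt's examples; that output format, and the helper names `parse_category` and `next_step`, are assumptions for illustration.

```python
import re

VALID_CATEGORIES = ("data retrieval", "computation", "business actions")

def parse_category(model_reply):
    """Return the chosen category (lowercase), or None if no valid one is named."""
    match = re.search(r"Category:\s*([A-Za-z ]+)", model_reply)
    if match is None:
        return None
    candidate = match.group(1).strip().lower()
    for category in VALID_CATEGORIES:
        if candidate.startswith(category):
            return category
    return None

def next_step(model_reply):
    """Route to a category, or fall back to asking the user."""
    category = parse_category(model_reply)
    if category is None:
        return "ask_user_for_clarification"
    return f"route_to:{category}"

print(next_step("Category: Data Retrieval (need current info)"))
print(next_step("None of these categories seem right; could you clarify the goal?"))
```

Anything that does not name a known category falls through to clarification, so an unexpected reply never reaches a tool call.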

How deep can state machines be?

Start with 4–5 phases (requirements, planning, execution, verification, completion). Beyond that, consider breaking the workflow into sub-workflows or simplifying the process.

Can I test tool selection without a real model?

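Partly. A mock model that picks tools at random lets you exercise the evaluation harness and gives you a chance-level baseline. To identify which tools are frequently misidentified, run the same labeled requests through the real selector (the model, or the embedding-based filter) and compare per-tool accuracy against that baseline. Low-scoring tools highlight areas for refinement (better descriptions, different groupings).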
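
A minimal sketch of such an offline check follows, assuming a small hand-labeled set of (request, expected tool) pairs. The labeled cases, `per_tool_accuracy`, and `make_random_selector` are hypothetical; any callable that maps a request to a tool name can be plugged in as the selector.

```python
import random
from collections import defaultdict

def per_tool_accuracy(select_tool, labeled_cases):
    """Return {expected_tool: fraction of its cases the selector got right}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for request, expected in labeled_cases:
        totals[expected] += 1
        if select_tool(request) == expected:
            hits[expected] += 1
    return {tool: hits[tool] / totals[tool] for tool in totals}

def make_random_selector(tools, seed=0):
    """Mock model: ignores the request and picks a tool at random."""
    rng = random.Random(seed)
    return lambda request: rng.choice(tools)

TOOLS = ["search_web", "query_database", "calculate", "send_email"]

# Hypothetical labeled requests
cases = [
    ("What are the latest AI trends?", "search_web"),
    ("Show last quarter's order history", "query_database"),
    ("Compute 15% of 2,400", "calculate"),
    ("Notify the team about the outage", "send_email"),
]

baseline = per_tool_accuracy(make_random_selector(TOOLS), cases)
print(baseline)  # chance-level scores; the real selector should beat these
```

Swapping the random selector for the real one leaves the harness unchanged, which makes before-and-after comparisons straightforward when descriptions or groupings are revised.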

Further Reading