Advanced Tool Patterns: Context Windows and Caching

As workflows grow—more tools, more steps, more data—agents face a performance challenge: every tool call appends results to the conversation, inflating context. Long contexts mean slower responses, higher token costs, and reduced model focus. Advanced patterns like prompt caching, result compression, and intelligent batching help agents scale without sacrificing speed or coherence. Organizations using these patterns report 40–60% reduction in token usage and 2–3× faster workflows.

Pattern 1: Prompt Caching for Tool Definitions

Tool definitions rarely change. Caching them reduces token overhead on every API call.

import anthropic

client = anthropic.Anthropic()

# Define tools (stable, cacheable)
tools = [
    {
        "name": "search_web",
        "description": "Search the public web for current information...",
        "input_schema": {...}
    },
    # ... 49 more tools ...
]

# Format tools as a stable string (will be cached)
tools_context = """# Available Tools

## Data Retrieval (10 tools)
1. search_web: Search the public web
2. search_internal_docs: Search company documentation
... etc ...

## Computation (10 tools)
1. calculate: Perform arithmetic
... etc ...

## Business Actions (30 tools)
1. send_email: Send emails
... etc ...

Use these tools to complete your tasks."""

# Use prompt caching
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an agent with access to 50+ tools. Your goal is to help the user."
        },
        {
            "type": "text",
            "text": tools_context,
            # Cache breakpoint: the prompt prefix up to and including this block is cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=tools,
    messages=[
        {"role": "user", "content": "What is the weather in San Francisco?"}
    ]
)

# A second call with an identical prefix (same tools and system blocks),
# made while the cache entry is still live, reads the prefix from the cache.
# Savings: roughly 1K tokens per call are served from the cache after the first call

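Setting cache_control: {"type": "ephemeral"} on a content block marks a cache breakpoint: the API caches the prompt prefix up to and including that block. Tool definitions passed via tools= come first in that prefix, so they are cached along with the system blocks. Subsequent calls with an identical prefix read it from the cache, which lowers input cost and latency rather than the number of tokens the model sees; the estimate here is a 20–30% saving per call. The prefix must also meet a model-dependent minimum length to be cached at all (documented as 1,024 tokens for Claude 3.5 Sonnet). This is especially valuable for agents that make many tool calls in a session.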
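
One way to confirm the cache is actually being hit is to look at the usage counters returned with each response. The sketch below is a minimal illustration, not the SDK's own accounting: it takes the counters as a plain dict so it runs offline (the SDK exposes them as attributes on response.usage), and the write and read price multipliers are assumptions chosen for illustration, so check current pricing before relying on them.

```python
# Sketch: weigh a response's input tokens by their assumed relative price.
# write_multiplier and read_multiplier are illustrative assumptions.
def effective_input_tokens(usage, write_multiplier=1.25, read_multiplier=0.10):
    """Return input tokens weighted by how each category is (assumed to be) billed."""
    uncached = usage.get("input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    return uncached + written * write_multiplier + read * read_multiplier

# Hypothetical usage counters for a 4,000-token cached prefix
first_call = {"input_tokens": 50, "cache_creation_input_tokens": 4000, "cache_read_input_tokens": 0}
later_call = {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 4000}

first_cost = effective_input_tokens(first_call)
later_cost = effective_input_tokens(later_call)
print(f"first call: {first_cost:.0f} token-equivalents")
print(f"later call: {later_cost:.0f} token-equivalents")
```

If cache_read_input_tokens stays at zero on repeated calls, the prefix is probably changing between requests or is too short to be cached.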

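A request can carry more than one cache breakpoint, so tool definitions and a stable persona can be cached as separate segments. The cache is not persistent across sessions: an ephemeral entry expires after a short idle period (five minutes by default), and each cache hit refreshes it.

# Two cache breakpoints: tool context, then persona and instructions
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": tools_context,
            "cache_control": {"type": "ephemeral"}  # For per-session reuse
        },
        {
            "type": "text",
            "text": "Stable agent persona and instructions",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=tools,
    messages=messages  # Conversation history built up elsewhere
)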

Pattern 2: Result Compression

Large tool results inflate context. Compress them intelligently.

import json

class ResultCompressor:
    @staticmethod
    def summarize_data(data, max_tokens=500):
        """Compress a large result into a summary string."""
        max_chars = max_tokens * 4  # Rough estimate: ~4 characters per token

        if isinstance(data, str):
            if len(data) > max_chars:
                return f"[Truncated: {len(data)} characters, showing first {max_chars}]\n" + data[:max_chars]
            return data

        elif isinstance(data, list):
            if len(data) > 10:
                summary = f"[List of {len(data)} items, showing first 5]\n"
                return summary + json.dumps(data[:5], indent=2)
            return json.dumps(data, indent=2)

        elif isinstance(data, dict):
            keys_count = len(data)
            if keys_count > 20:
                summary = f"[Dict with {keys_count} keys, showing 10]\n"
                selected_keys = list(data.keys())[:10]
                filtered = {k: data[k] for k in selected_keys}
                return summary + json.dumps(filtered, indent=2)
            return json.dumps(data, indent=2)

        else:
            return str(data)

# Usage
large_result = {
    "users": [{"id": i, "name": f"User{i}", "email": f"user{i}@example.com"} for i in range(1000)]
}

# Only the top level is inspected, so pass the large list itself.
# A one-key dict wrapping the list would be returned in full.
compressed = ResultCompressor.summarize_data(large_result["users"])
print(compressed)
# Output:
# [List of 1000 items, showing first 5]
# [
#   {
#     "id": 0,
#     "name": "User0",
#     "email": "user0@example.com"
#   },
#   ... (four more records) ...
# ]

Compression is aggressive but effective. For a query that returns 1000 rows, showing the first 5 with a note is usually sufficient for the model to understand the data shape and decide next steps.
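
Tool results are often nested, with a large list sitting under a single dictionary key. A recursive variant can handle that case. The following is a minimal sketch under stated assumptions: the shrink function, its max_items and max_chars limits, and the marker strings are all hypothetical choices, not part of any library.

```python
import json

def shrink(value, max_items=5, max_chars=200):
    """Recursively truncate long strings, lists, and dicts, leaving a marker behind."""
    if isinstance(value, str):
        if len(value) > max_chars:
            return value[:max_chars] + f"... [{len(value) - max_chars} more chars]"
        return value
    if isinstance(value, list):
        kept = [shrink(item, max_items, max_chars) for item in value[:max_items]]
        if len(value) > max_items:
            kept.append(f"... [{len(value) - max_items} more items]")
        return kept
    if isinstance(value, dict):
        kept = {key: shrink(value[key], max_items, max_chars) for key in list(value)[:max_items]}
        if len(value) > max_items:
            kept["..."] = f"[{len(value) - max_items} more keys]"
        return kept
    return value  # Numbers, booleans, and None pass through unchanged

large_result = {"users": [{"id": i, "name": f"User{i}"} for i in range(1000)]}
compact = shrink(large_result)
print(json.dumps(compact, indent=2))
```

The markers keep the original sizes visible, so the model can still tell that 1000 users exist and request more if it needs them.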

For analytics results, compress further:

def compress_analytics_result(result):
    """Compress analytics (statistics, aggregates) more aggressively."""

    # Extract key metrics only
    if "summary_stats" in result:
        return {
            "count": result["summary_stats"].get("count"),
            "mean": result["summary_stats"].get("mean"),
            "median": result["summary_stats"].get("median"),
            "min": result["summary_stats"].get("min"),
            "max": result["summary_stats"].get("max")
        }

    # If it's a table, show aggregate row counts and a few samples
    if "rows" in result:
        return {
            "total_rows": len(result["rows"]),
            "sample_rows": result["rows"][:3],
            "columns": list(result["rows"][0].keys()) if result["rows"] else []
        }

    return result
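
When a tool returns raw rows without precomputed statistics, the same five metrics can be derived before the result reaches the model. This is a rough sketch using only the standard library; summarize_column and the sample rows are hypothetical and assume each row is a dict with numeric values in the chosen column.

```python
import statistics

def summarize_column(rows, column):
    """Reduce one numeric column of a row list to count, mean, median, min, and max."""
    values = [row[column] for row in rows if isinstance(row.get(column), (int, float))]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical query result
rows = [
    {"region": "EU", "revenue": 120},
    {"region": "US", "revenue": 300},
    {"region": "APAC", "revenue": 180},
]
stats = summarize_column(rows, "revenue")
print(stats)
```

The returned dict has the same keys that compress_analytics_result reads from summary_stats, so it can be attached under that key.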

Pattern 3: Batch Processing

Instead of calling a tool once per item, batch multiple items into one call.

def batch_search(queries: list, batch_size=5):
    """Search for multiple queries in batches."""

    results = {}
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i+batch_size]

        # Call once with multiple queries.
        # search_tool is assumed to be defined elsewhere and to return
        # one result per query, in the same order as the input.
        batch_result = search_tool(queries=batch)

        for query, result in zip(batch, batch_result):
            results[query] = result

    return results

# Instead of 100 separate tool calls, use 20 batch calls
queries = [f"query_{i}" for i in range(100)]
results = batch_search(queries, batch_size=5)

Batching reduces API call overhead and round-trip latency. If a tool supports batch operations, use them.
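
A batch call can fail as a unit, so it helps to have a per-item fallback. The sketch below is one possible shape for that, assuming a batch function that returns one output per input in order; batch_with_fallback and the two fake search functions are hypothetical stand-ins for real tools.

```python
def batch_with_fallback(items, batch_fn, single_fn, batch_size=5):
    """Process items in batches; if a batch fails, retry its items one at a time."""
    results = {}
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        try:
            outputs = batch_fn(batch)
            if len(outputs) != len(batch):
                raise ValueError("batch tool returned a mismatched result count")
        except Exception:
            # Broad catch keeps the sketch short; narrow it to the tool's real errors.
            outputs = [single_fn(item) for item in batch]
        results.update(zip(batch, outputs))
    return results

# Hypothetical stand-ins for a batch-capable tool and its single-item form
def fake_batch_search(queries):
    if "bad" in queries:
        raise RuntimeError("batch rejected")
    return [f"result for {q}" for q in queries]

def fake_single_search(query):
    return f"result for {query}"

results = batch_with_fallback(
    ["a", "b", "bad", "c"], fake_batch_search, fake_single_search, batch_size=2
)
print(results)
```

Here the second batch is rejected as a whole, and both of its items are then fetched individually, so one problematic item does not cost the entire batch.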

Pattern 4: Conversation Summarization

Long conversations grow unbounded. Periodically summarize and prune history.

class ConversationManager:
    def __init__(self, max_tokens=4000, keep_recent=10):
        self.messages = []
        self.message_tokens = []  # Token count per message, parallel to self.messages
        self.max_tokens = max_tokens
        self.keep_recent = keep_recent
        self.token_count = 0

    def add_message(self, message, tokens):
        """Add a message and track tokens."""
        self.messages.append(message)
        self.message_tokens.append(tokens)
        self.token_count += tokens

        if self.token_count > self.max_tokens:
            self.summarize_and_prune()

    def summarize_and_prune(self):
        """Summarize old messages and remove them from history."""

        if len(self.messages) <= self.keep_recent:
            return  # Nothing older than the recent window to summarize

        # Keep the most recent messages, summarize the rest
        old_messages = self.messages[:-self.keep_recent]
        recent_messages = self.messages[-self.keep_recent:]

        # Summarize old messages (simplified; in practice, use an LLM).
        # The summary uses the user role because the Messages API expects
        # the conversation to open with a user turn.
        summary = {
            "role": "user",
            "content": f"[Summary: {len(old_messages)} messages processed. Key actions: {self._extract_actions(old_messages)}]"
        }

        # Replace old messages with summary
        self.messages = [summary] + recent_messages

        # Recalculate token count: keep the recorded counts for recent messages
        # and estimate the summary at ~4 characters per token
        summary_tokens = len(summary["content"]) // 4
        self.message_tokens = [summary_tokens] + self.message_tokens[-self.keep_recent:]
        self.token_count = sum(self.message_tokens)

    def _extract_actions(self, messages):
        """Extract key actions from messages."""
        actions = []
        for msg in messages:
            if "tool_use" in str(msg):
                actions.append("tool called")
        return ", ".join(set(actions)) or "data processing"

# Usage
manager = ConversationManager(max_tokens=4000)
manager.add_message({"role": "user", "content": "..."}, tokens=100)
# ... more messages ...
manager.add_message({"role": "assistant", "content": "..."}, tokens=200)

# When token count exceeds limit, old messages are summarized
messages = manager.messages

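Conversation summarization keeps context bounded. Without it, every turn resends the entire history, so per-call input grows steadily with the length of the session and cumulative token cost grows faster still.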
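
Pruning by a fixed message count can separate a tool result from the tool call that produced it, which the API rejects. One way to guard against this is to move the cut point earlier until the kept tail no longer opens with a tool result. The sketch below assumes messages whose content is either a string or a list of typed blocks; has_tool_result and safe_split_index are hypothetical helper names.

```python
def has_tool_result(message):
    """True if the message's content list contains a tool_result block."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(block, dict) and block.get("type") == "tool_result" for block in content)

def safe_split_index(messages, keep_recent=10):
    """Index where the kept tail starts, moved earlier so it never opens with a tool_result."""
    index = max(len(messages) - keep_recent, 0)
    while index > 0 and has_tool_result(messages[index]):
        index -= 1  # Pull the preceding tool_use message into the kept tail
    return index

messages = [
    {"role": "user", "content": "Look up TSLA"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "fetch_stock_price", "input": {"symbol": "TSLA"}}
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "250.0"}
    ]},
    {"role": "assistant", "content": "TSLA is trading at 250."},
]

split = safe_split_index(messages, keep_recent=2)
old_messages, recent_messages = messages[:split], messages[split:]
print(split, len(old_messages), len(recent_messages))
```

With keep_recent=2 the naive cut would start the tail at the tool result, so the helper backs up one message and keeps the tool call and its result together.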

Pattern 5: Intelligent Tool Filtering by Request

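Before each model call, filter the available tools based on the request. Sending fewer tool definitions shrinks the context and makes it easier for the model to pick the right tool. One trade-off: changing the tool list between calls changes the prompt prefix, so it can invalidate a prompt cache built on the full tool set (Pattern 1).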

from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class SmartToolSelector:
    def __init__(self, tools):
        self.tools = tools
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.tool_embeddings = {
            tool["name"]: self.embedding_model.encode(tool["description"])
            for tool in tools
        }

    def select_tools_for_request(self, user_request, top_k=10):
        """Select the most relevant K tools for this request."""

        request_embedding = self.embedding_model.encode(user_request)

        similarities = []
        for tool_name, tool_embedding in self.tool_embeddings.items():
            sim = cosine_similarity([request_embedding], [tool_embedding])[0][0]
            similarities.append((tool_name, sim))

        top_tools = sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]
        tool_names = [name for name, _ in top_tools]

        selected_tools = [t for t in self.tools if t["name"] in tool_names]
        return selected_tools

# Usage
selector = SmartToolSelector(all_50_tools)
user_request = "Find the latest earnings reports for Tesla and Microsoft"

filtered_tools = selector.select_tools_for_request(user_request, top_k=5)
# Only 5 tools sent to the model instead of 50

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=filtered_tools,  # Reduced set
    messages=[{"role": "user", "content": user_request}]
)
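
Where loading an embedding model is too heavy, a keyword-overlap score can serve as a rough stand-in. Note that this swaps the embedding similarity above for Jaccard overlap on words, which is cruder and misses synonyms. The function names and the always_include parameter are hypothetical, added so that essential tools are never filtered out.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_tools_by_keywords(tools, user_request, top_k=5, always_include=()):
    """Rank tools by word overlap (Jaccard) between the request and each tool's text."""
    request_words = tokenize(user_request)
    scored = []
    for tool in tools:
        tool_words = tokenize(tool["name"] + " " + tool["description"])
        union = request_words | tool_words
        score = len(request_words & tool_words) / len(union) if union else 0.0
        scored.append((score, tool["name"]))
    ranked = [name for _, name in sorted(scored, key=lambda pair: pair[0], reverse=True)]
    chosen = set(ranked[:top_k]) | set(always_include)
    return [tool for tool in tools if tool["name"] in chosen]  # Preserve original order

tools = [
    {"name": "search_web", "description": "Search the public web for current information"},
    {"name": "calculate", "description": "Perform arithmetic"},
    {"name": "send_email", "description": "Send emails"},
]

selected = select_tools_by_keywords(
    tools, "Search the web for Tesla earnings", top_k=1, always_include=("send_email",)
)
print([tool["name"] for tool in selected])
```

Keeping the original tool order in the output helps the filtered list stay stable between similar requests.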

Pattern 6: Tool Result Caching

Some tools return the same data repeatedly (e.g., currency rates, stock prices). Cache results to avoid re-fetching.

import time
from functools import wraps

class ToolResultCache:
    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl_seconds = ttl_seconds

    def cache_tool(self, ttl_override=None):
        """Decorator to cache tool results."""
        ttl = ttl_override if ttl_override is not None else self.ttl_seconds

        def decorator(func):
            @wraps(func)  # Preserve the tool's name and docstring
            def wrapper(*args, **kwargs):
                # Create a cache key from arguments (they must be hashable)
                cache_key = (func.__name__, tuple(args), tuple(sorted(kwargs.items())))

                # Check cache
                if cache_key in self.cache:
                    cached_result, timestamp = self.cache[cache_key]
                    if time.time() - timestamp < ttl:
                        return cached_result

                # Cache miss: execute and cache
                result = func(*args, **kwargs)
                self.cache[cache_key] = (result, time.time())
                return result

            return wrapper
        return decorator

# Usage
cache = ToolResultCache(ttl_seconds=300)

@cache.cache_tool(ttl_override=600)  # 10 minute cache
def fetch_stock_price(symbol):
    """Fetch stock price (cached for 10 minutes)."""
    import random
    # Stub: a random price stands in for a real API call
    return {"symbol": symbol, "price": random.uniform(100, 500)}

# First call: executes the function (a real tool would hit the API here)
price1 = fetch_stock_price("TSLA")

# Second call (within 10 minutes): returns cached result
price2 = fetch_stock_price("TSLA")

print(price1 == price2)  # True

Tool result caching is particularly effective for tools that fetch stable or slow-changing data.
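
Tool arguments are often dicts or lists, which cannot be used directly as dictionary keys. A possible workaround is to serialize the arguments into a canonical JSON string. The sketch below is one way to do that, with an injectable clock so expiry can be exercised without waiting; TTLCache, make_key, and get_or_call are hypothetical names.

```python
import json
import time

class TTLCache:
    """Time-limited cache keyed on a canonical JSON form of the call arguments."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # Injectable so tests can control time
        self.entries = {}

    @staticmethod
    def make_key(name, args, kwargs):
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} produce the same key
        return json.dumps([name, args, kwargs], sort_keys=True, default=str)

    def get_or_call(self, func, *args, **kwargs):
        key = self.make_key(func.__name__, args, kwargs)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[1] < self.ttl_seconds:
            return hit[0]
        value = func(*args, **kwargs)
        self.entries[key] = (value, now)
        return value

# Demonstration with a fake clock and a hypothetical tool
fake_now = [0.0]
calls = []

def lookup(filters):
    calls.append(filters)
    return {"rows": len(calls)}

cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_call(lookup, {"region": "EU", "year": 2024})
second = cache.get_or_call(lookup, {"year": 2024, "region": "EU"})  # Same key, cache hit
fake_now[0] = 61.0
third = cache.get_or_call(lookup, {"region": "EU", "year": 2024})  # Expired, refetched
print(len(calls))
```

The sketch never evicts expired entries, so a long-running agent would also need a size bound or periodic cleanup.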

Comparison: Token Savings with Advanced Patterns

| Optimization | Token Savings | Latency Impact |
|---|---|---|
| Prompt caching | 20–30% per call | None |
| Result compression | 40–50% | Minimal (model sees less detail) |
| Batch processing | 30–40% (fewer calls) | Reduced (batched) |
| Conversation summarization | 50–70% (long sessions) | Minimal |
| Tool filtering | 10–20% | Positive (faster selection) |
| Tool result caching | 70–90% (repeated queries) | Positive (cache hit) |

Combined, these patterns can reduce token usage by 50–75% in typical agent workloads.
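
Individual savings do not simply add up, because each optimization acts on whatever tokens the previous ones left behind. As a rough illustration, assuming the optimizations are independent and each removes a fixed fraction of the remaining tokens (a simplifying assumption, not a measured result), the combined saving can be computed as follows. The combined_savings helper is hypothetical.

```python
def combined_savings(fractions):
    """Combine per-optimization savings multiplicatively: 1 - product of (1 - f)."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 - remaining

# Lower-bound figures for result compression (40%) and summarization (50%)
total = combined_savings([0.40, 0.50])
print(f"{total:.0%}")
```

Two optimizations at 40% and 50% leave 0.6 × 0.5 = 0.3 of the original tokens, a 70% saving rather than 90%.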

Key Takeaways

  • Use prompt caching to cache stable tool definitions and instructions.
  • Compress large tool results to show summaries instead of full data.
  • Batch multiple items into single tool calls to reduce call overhead.
  • Summarize old conversation history when context grows large.
  • Dynamically filter tools for each request to reduce cognitive load.
  • Cache tool results for data that changes slowly.

Frequently Asked Questions

What is the ideal tool result size?

500–2000 tokens is ideal. Smaller and you lose useful detail; larger and context bloats. Compress aggressively beyond 2000 tokens.

Should I always use prompt caching?

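Usually, if your tool definitions are stable and long enough to meet the model's minimum cacheable prompt length. The first call pays a small cache-creation overhead; subsequent calls that reuse the prefix while the cache entry is still live save an estimated 20–30%. That makes it worthwhile even for short sessions, provided the calls arrive close enough together to hit the cache.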

Can I cache across different users?

Not safely. Tool definitions can be shared, but user data and conversation history must be isolated. Cache only public, user-agnostic content.

How often should I summarize conversations?

Summarize once the token count exceeds 60–70% of your limit, which leaves headroom for the next few turns. With a 4,000-token limit, that means summarizing at around 2,500 tokens (about 62%).
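
The threshold check can be sketched in a few lines. This assumes the rough heuristic of about four characters per token used earlier for result compression; estimate_tokens and should_summarize are hypothetical helpers, and a real tokenizer or the API's reported usage would be more accurate.

```python
def estimate_tokens(messages, chars_per_token=4):
    """Rough token estimate from the character length of each message's content."""
    total_chars = sum(len(str(message.get("content", ""))) for message in messages)
    return total_chars // chars_per_token

def should_summarize(token_count, limit, threshold=0.6):
    """True once the token count reaches the given fraction of the limit."""
    return token_count >= limit * threshold

history = [
    {"role": "user", "content": "x" * 6000},
    {"role": "assistant", "content": "y" * 4000},
]
tokens = estimate_tokens(history)
print(tokens, should_summarize(tokens, limit=4000))
```

Raising threshold toward 0.7 delays summarization and keeps more raw history, at the cost of less headroom.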

What if batching breaks tool contracts?

Some tools require one call per item. Do not force batching if the tool does not support it. Design tools with batching in mind from the start.

Further Reading