Working Memory for Agents: Short-Term Processing
Working memory is the agent's scratch space: the conversation history, intermediate reasoning steps, and current task state that fit within the model's context window for the next inference. Unlike episodic and semantic memory (which persist across sessions), working memory is ephemeral—it is discarded or summarized after the task ends. However, designing working memory well is critical for multi-turn reasoning and cost control.
What Is Working Memory and How It Operates
Working memory holds the active conversation state that the agent considers during each reasoning step. It includes the user's current message, prior turns (limited by token count), the agent's scratchpad for planning, and retrieved context from episodic or semantic stores. Working memory is bounded—typically 2,000–8,000 tokens—because including the full message history from weeks of interaction would exceed context limits and inflate inference cost.
The working memory lifecycle follows this pattern: (1) incoming user message triggers memory initialization; (2) agent retrieves relevant episodic or semantic facts (if any) into working memory; (3) agent reasons and produces output; (4) agent appends user message and response to episodic store; (5) working memory is cleared or summarized. For multi-turn tasks, steps 2–4 repeat without clearing working memory until the task completes.
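The lifecycle above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a reference implementation: `retrieve_facts`, `run_model`, and the list-backed `episodic_store` are hypothetical stand-ins for whatever retrieval, inference, and persistence layers an agent actually uses.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_task(
    user_turns: List[str],
    retrieve_facts: Callable[[str], List[str]],  # hypothetical retrieval hook
    run_model: Callable[[List[Message]], str],   # hypothetical inference hook
    episodic_store: List[Message],               # stand-in for a persistent store
) -> List[str]:
    """Run one multi-turn task through the working memory lifecycle."""
    working_memory: List[Message] = []  # (1) initialize for this task
    replies: List[str] = []

    for text in user_turns:
        # (2) pull relevant long-term facts into working memory
        for fact in retrieve_facts(text):
            working_memory.append({"role": "system", "content": fact})
        user_msg = {"role": "user", "content": text}
        working_memory.append(user_msg)

        # (3) reason over the current working memory and produce output
        reply = run_model(working_memory)
        assistant_msg = {"role": "assistant", "content": reply}
        working_memory.append(assistant_msg)
        replies.append(reply)

        # (4) persist the exchange to the episodic store
        episodic_store.extend([user_msg, assistant_msg])

    # (5) clear working memory only once the whole task is done
    working_memory.clear()
    return replies
```

Steps 2 through 4 sit inside the loop and step 5 sits outside it, so working memory accumulates across turns and is cleared only at the end of the task.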
According to a 2025 Anthropic analysis of production agents, improper working memory management accounts for 18% of agent failures, in three forms: incomplete context (the agent forgets earlier turns mid-conversation), token budget overrun (redundant context makes inference expensive), and context contamination (the agent confuses past turns with current data). Well-designed working memory makes each of these failure modes less likely.
Working Memory Capacity and Token Budgeting
The first design decision is capacity. Most production agents reserve 30–50% of the model's context window for working memory and fill it with the most recent conversation turns. A model with a 200K context window might allocate 60K–100K tokens to working memory, leaving 100K–140K for retrieved facts, system prompts, and reasoning. This share is a ceiling, not a target; many tasks stay within the 2,000–8,000 token range noted earlier.
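The allocation arithmetic can be checked with a few lines of Python. This is a sketch only: the function name `split_context_budget` and the use of an integer percentage are illustrative choices, not part of any library.

```python
def split_context_budget(context_window: int, working_percent: int) -> dict:
    """Split a context window between working memory and everything else."""
    if not 0 < working_percent < 100:
        raise ValueError("working_percent must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding surprises
    working = context_window * working_percent // 100
    return {"working_memory": working, "remaining": context_window - working}


print(split_context_budget(200_000, 30))  # {'working_memory': 60000, 'remaining': 140000}
print(split_context_budget(200_000, 50))  # {'working_memory': 100000, 'remaining': 100000}
```

The "remaining" share has to cover the system prompt, retrieved facts, and the model's own output, so it is worth computing explicitly rather than assuming it is large enough.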
Token budgeting within working memory follows one of two strategies: FIFO (First-In-First-Out) or priority-weighted. FIFO is simple: keep the most recent N messages until capacity is reached, then discard the oldest. Priority weighting is smarter: preserve high-signal turns (user questions, agent decisions) and discard low-signal turns (confirmations, clarifications) first.
# Example: FIFO-based working memory with token budgeting
class WorkingMemory:
    def __init__(self, max_tokens=8000, model_name="claude-3.5-sonnet"):
        self.messages = []  # List of {"role", "content", "priority"} dicts
        self.max_tokens = max_tokens
        self.scratchpad = {}  # Intermediate reasoning state
        self.model_name = model_name

    def _count_tokens(self, messages):
        """Approximate token count (4 chars ~= 1 token as rough estimate)."""
        total = sum(len(msg["content"]) for msg in messages)
        return total // 4

    def add_message(self, role, content, priority="normal"):
        """Add a message; evict low-priority, then oldest, if over capacity."""
        self.messages.append({
            "role": role,
            "content": content,
            "priority": priority
        })
        # Enforce the token limit, always keeping the newest message
        while (len(self.messages) > 1
               and self._count_tokens(self.messages) > self.max_tokens):
            # Remove the oldest low-priority message (never the one just added)
            for i, msg in enumerate(self.messages[:-1]):
                if msg.get("priority") == "low":
                    self.messages.pop(i)
                    break
            else:
                # If no low-priority message exists, remove the oldest overall
                self.messages.pop(0)

    def get_context(self):
        """Return messages for the next inference (role and content only)."""
        return [{"role": m["role"], "content": m["content"]}
                for m in self.messages]

    def clear(self):
        """Clear working memory after task completion."""
        self.messages = []
        self.scratchpad = {}
Multi-Turn Conversation Management
For multi-turn tasks (e.g., a user asking three sequential questions), working memory must maintain coherence across turns while staying within token budget. A naive approach would keep every message; this scales poorly. A better approach keeps a sliding window: preserve the current and prior N messages, and beyond that, summarize or discard.
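A sliding window of this kind can be sketched with `collections.deque`, whose `maxlen` argument drops the oldest item automatically on append. The class name `SlidingWindow` and the `dropped` list are illustrative assumptions; a real agent would summarize or archive evicted messages instead of holding them in a list.

```python
from collections import deque


class SlidingWindow:
    """Keep only the most recent `max_messages` messages."""

    def __init__(self, max_messages: int = 6):
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        # A deque with maxlen discards its oldest item when a new one is appended
        self.messages = deque(maxlen=max_messages)
        # Evicted messages, kept here so they can be summarized or archived
        self.dropped = []

    def add(self, role: str, content: str) -> None:
        # Capture the message that is about to fall out of the window
        if len(self.messages) == self.messages.maxlen:
            self.dropped.append(self.messages[0])
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list:
        return list(self.messages)


window = SlidingWindow(max_messages=2)
for text in ["first", "second", "third"]:
    window.add("user", text)
print([m["content"] for m in window.get_context()])  # ['second', 'third']
```

A message-count window is simpler than a token-count window but can still overflow the budget if individual messages are long, so the two limits are often combined.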
One production pattern is the "context rollup": after every 5–10 turns, the agent creates a brief summary of prior conversation and replaces the old messages with the summary. This preserves intent and outcome while reducing token count by 60–80%.
# Example: Context rollup with summarization
def rollup_context(working_memory, summary_model_call):
    """
    Summarize old messages and replace with one compact summary.
    summary_model_call is a function that invokes Claude or similar
    to produce a brief summary string.
    """
    if len(working_memory.messages) < 10:
        return  # Not enough history yet
    old_messages = working_memory.messages[:-5]  # All but last 5
    recent = working_memory.messages[-5:]
    # Summarize old messages, keeping speaker labels so turns stay attributable
    old_content = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary = summary_model_call(
        f"Summarize this conversation concisely:\n{old_content}"
    )
    # Replace old messages with one summary. The "priority" key keeps the
    # entry compatible with code that reads msg["priority"] during eviction.
    # Some chat APIs take system text as a separate parameter rather than as
    # a message role; adapt the role to the API in use.
    working_memory.messages = [
        {"role": "system",
         "content": f"Prior conversation summary:\n{summary}",
         "priority": "normal"}
    ] + recent
Scratchpad for Intermediate Reasoning
Many agents use a scratchpad—a structured dictionary within working memory—to track intermediate reasoning, tool call results, and decision branches. A scratchpad prevents redundant reasoning: if the agent already looked up a user's email once, it notes the result in the scratchpad rather than repeating the lookup.
# Example: Using scratchpad to cache intermediate results
class WorkingMemory:
    def __init__(self, max_tokens=8000):
        self.messages = []
        self.scratchpad = {
            "user_lookups": {},   # user_id -> {email, name, ...}
            "tool_cache": {},     # tool_name + params -> result
            "decisions": [],      # Log of key decisions
            "current_step": None  # Current task step
        }
        self.max_tokens = max_tokens

    def cache_lookup(self, lookup_type, key, value):
        """Store a lookup result in scratchpad."""
        if lookup_type not in self.scratchpad:
            self.scratchpad[lookup_type] = {}
        self.scratchpad[lookup_type][key] = value

    def get_cached(self, lookup_type, key):
        """Retrieve cached value, avoiding redundant tool calls."""
        if lookup_type in self.scratchpad:
            return self.scratchpad[lookup_type].get(key)
        return None
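The "tool_name + params -> result" mapping needs a stable cache key. One way to build it, sketched below, is to serialize the parameters with sorted keys so that argument order does not matter. The helper `cached_tool_call` and the `call_tool` dispatcher are hypothetical, and the sketch assumes the parameters are JSON-serializable.

```python
import json
from typing import Any, Callable, Dict


def cached_tool_call(
    tool_cache: Dict[str, Any],
    tool_name: str,
    params: Dict[str, Any],
    call_tool: Callable[[str, Dict[str, Any]], Any],  # hypothetical dispatcher
) -> Any:
    """Return the cached result if this tool was already called with these params."""
    # sort_keys makes the key independent of parameter order
    key = f"{tool_name}:{json.dumps(params, sort_keys=True)}"
    if key not in tool_cache:
        tool_cache[key] = call_tool(tool_name, params)
    return tool_cache[key]


calls = []

def fake_tool(name, params):
    calls.append(name)
    return {"email": "user@example.com"}

cache = {}
cached_tool_call(cache, "lookup_user", {"id": 7, "fields": "email"}, fake_tool)
cached_tool_call(cache, "lookup_user", {"fields": "email", "id": 7}, fake_tool)
print(len(calls))  # 1
```

Caching is only safe for tools whose results do not change during the task; results from tools with side effects or time-sensitive output should bypass the cache.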
Handling Long Context Degradation
A well-documented phenomenon in LLMs is "lost in the middle": when relevant information appears deep in the context window (neither at the start nor the end), the model may overlook it. For working memory, this means recent messages (recency bias) and the system prompt (primacy bias) are weighted heavily; middle messages risk being ignored.
A mitigation strategy is to re-signal critical information at the end of working memory before inference:
def prepare_inference_context(working_memory):
    """
    Prepare context for inference, emphasizing critical information.
    """
    context = working_memory.get_context()
    # get_context() drops the "priority" field, so scan the stored messages
    # for critical entries rather than the returned context.
    critical_facts = "\n".join([
        f"CRITICAL: {msg['content'][:100]}"
        for msg in working_memory.messages
        if msg.get("priority") == "critical"
    ])
    # Append key facts at the end to counteract lost-in-the-middle
    if critical_facts:
        context.append({
            "role": "system",
            "content": f"Remember these key points:\n{critical_facts}"
        })
    return context
Decision: When to Summarize vs. Archive
As tasks grow longer, agents must decide when to summarize working memory and push old turns to episodic storage. The tradeoff is fidelity vs. cost: keeping all history in working memory preserves every detail and avoids extra summarization calls, but every inference pays for the full token count; summarizing early saves cost but risks losing nuance.
A practical rule is to summarize when the number of turns exceeds 20, or when token count of working memory exceeds 60% of your budget. For real-time agents (chat), use FIFO eviction. For batch agents (email workflows, report generation), use context rollup after task completion.
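The rule of thumb above reduces to a single predicate. The sketch below is illustrative: the function name and parameter names are assumptions, and the thresholds default to the 20-turn and 60% figures from the text.

```python
def should_summarize(
    turn_count: int,
    used_tokens: int,
    token_budget: int,
    max_turns: int = 20,
    max_fill_percent: int = 60,
) -> bool:
    """Return True when working memory should be summarized and archived."""
    too_many_turns = turn_count > max_turns
    # Integer comparison avoids floating-point rounding at the threshold
    too_full = used_tokens * 100 > token_budget * max_fill_percent
    return too_many_turns or too_full


print(should_summarize(turn_count=21, used_tokens=1_000, token_budget=8_000))  # True
print(should_summarize(turn_count=5, used_tokens=4_800, token_budget=8_000))   # False
print(should_summarize(turn_count=5, used_tokens=4_801, token_budget=8_000))   # True
```

Checking this predicate after every turn keeps the decision cheap; the summarization call itself only runs when one of the thresholds is crossed.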
Key Takeaways
- Working memory is the bounded, ephemeral context (typically 2K–8K tokens) that fits in the model's context window for each inference.
- Allocate 30–50% of your context window to working memory; use token budgeting (FIFO or priority-weighted eviction) to stay within budget.
- Implement a scratchpad to cache tool call results, preventing redundant computation and reducing token waste.
- Use context rollup (summarization) after every 5–10 turns to preserve intent while reducing memory footprint.
- Mitigate lost-in-the-middle degradation by re-signaling critical facts at the end of the context window.
Frequently Asked Questions
How do I know if my working memory size is too small?
If the agent repeats questions or loses context mid-task, increase capacity. If inference is slow or expensive, decrease capacity. Start at the low end of the 30–50% range and adjust based on task failure patterns.
Can I keep all conversation history in working memory?
Only for short conversations (under 2K tokens total). For longer tasks, use sliding window or rollup. Most production agents keep 5–10 turns in working memory and archive the rest.
Should I include system prompts in working memory token count?
Yes. System prompts typically consume 500–2,000 tokens; factor them in. Some teams allocate capacity as: 20% system, 30% working memory, 50% retrieved facts/reasoning.
What happens if context rollup misses important details?
This is a real risk. Mitigate by: (1) human-reviewing rollup summaries in sensitive workflows, (2) keeping detailed episodic records so the agent can retrieve details if needed later, (3) using two models—a fast cheap one for rollup, a more capable one for critical reasoning.
How do I test if working memory is working correctly?
Unit tests: verify that messages are evicted in the right order. Integration tests: run multi-turn conversations and check that the agent doesn't repeat questions or contradict prior statements. Manual testing: review sample conversations and spot-check for coherence.
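An eviction-order unit test might look like the sketch below. `evict_to_budget` is a simplified, hypothetical stand-in for the eviction policy described in the token budgeting section (low-priority messages first, then the oldest message); the guard that always keeps the newest message is an added assumption. Passing `len` as the token counter treats each message as one token, which keeps the expected results easy to verify by hand.

```python
def evict_to_budget(messages, max_tokens, count_tokens):
    """Drop the oldest low-priority message first, then the oldest overall."""
    kept = list(messages)
    # Assumption: always keep at least the newest message
    while len(kept) > 1 and count_tokens(kept) > max_tokens:
        low = next(
            (i for i, m in enumerate(kept[:-1]) if m["priority"] == "low"),
            None,
        )
        kept.pop(low if low is not None else 0)
    return kept


def test_low_priority_evicted_before_older_normal():
    msgs = [
        {"content": "question", "priority": "normal"},
        {"content": "thanks", "priority": "low"},
        {"content": "answer", "priority": "normal"},
    ]
    kept = evict_to_budget(msgs, max_tokens=2, count_tokens=len)
    assert [m["content"] for m in kept] == ["question", "answer"]


def test_oldest_evicted_when_no_low_priority():
    msgs = [{"content": c, "priority": "normal"} for c in ("a", "b", "c")]
    kept = evict_to_budget(msgs, max_tokens=2, count_tokens=len)
    assert [m["content"] for m in kept] == ["b", "c"]


test_low_priority_evicted_before_older_normal()
test_oldest_evicted_when_no_low_priority()
```

The same two cases can be pointed at a real working memory class by replacing the stand-in function with calls to its add-message method and asserting on the resulting context.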
Further Reading
- LangChain ConversationBufferWindowMemory — reference implementation of FIFO working memory.
- OpenAI: Managing Conversation Length in Chat Models (2024) — production guidance on context budgeting.
- Liu et al.: Lost in the Middle: How Language Models Use Long Contexts (2023) — analysis of context window degradation and mitigations.
- Claude Documentation: Context Windows and Token Counting — practical token budgeting for Claude models.