Memory Summarization Techniques: Compressing History

As agents accumulate episodic records (potentially millions over years), storing and retrieving all of them becomes expensive. Summarization compresses history: rather than storing every conversation turn, the agent stores a brief summary capturing key facts, decisions, and outcomes. A well-designed summary aims to retain roughly 90% of decision-relevant information while reducing size by 70–90% (about 3x to 10x compression).

The Summarization Tradeoff: Fidelity vs. Compression

Summarization trades detail for efficiency. Perfect fidelity (storing all records) is expensive; aggressive compression (storing only one fact per user) loses critical context. Production systems find a sweet spot: hierarchical summaries at multiple levels of granularity.

  • No compression: Store every event (high fidelity, O(n) retrieval cost, unbounded storage).
  • Aggressive summarization: One fact per user category (low fidelity, O(1) retrieval, minimal storage).
  • Tiered summarization (practical): Recent events (7 days) kept raw; older events summarized at 1-week granularity; very old events (>1 year) distilled into semantic facts.
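The tiered policy in the last bullet can be sketched as a small age-based routing function. This is a minimal illustration, not a prescribed implementation: the function name `storage_tier`, the tier labels, and the two constants are hypothetical, with the boundaries (7 days, 1 year) taken from the bullet above.

```python
from datetime import date

# Hypothetical tier boundaries taken from the list above; tune them for your workload.
RAW_WINDOW_DAYS = 7
SEMANTIC_AFTER_DAYS = 365


def storage_tier(event_date: date, today: date) -> str:
    """Return the tier an event belongs to, based only on its age in days."""
    age_days = (today - event_date).days
    if age_days < 0:
        raise ValueError("event_date is in the future")
    if age_days <= RAW_WINDOW_DAYS:
        return "raw"             # recent: keep the full record
    if age_days > SEMANTIC_AFTER_DAYS:
        return "semantic_fact"   # very old: distilled facts only
    return "weekly_summary"      # in between: summarized at 1-week granularity
```

A nightly job could call this for each stored record and move it to the matching table when its tier changes.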

A 2025 survey of production agents (LangChain community) reported that tiered summarization achieved 7.3x compression (about an 86% size reduction) while preserving 94% of context fidelity.

Single-Event Summaries: Abstracting Individual Episodes

The simplest summarization is per-episode: given a long conversation turn or complex action sequence, produce a one-sentence summary capturing the outcome. For example:

  • Long: "User asked for a report. Agent queried database. Result: 1,247 rows. User requested PDF format. Agent generated and sent PDF. User confirmed receipt. Total time: 3.2 seconds."
  • Summary: "User requested report in PDF format; agent delivered successfully."

Single-event summaries are useful for compacting verbose interactions. A summary function can be implemented with an LLM or rule-based extraction:

# Example: Event summary with Claude API (LLM-based)
import anthropic

def summarize_event(event: dict) -> str:
    """
    Summarize a complex event into a brief, decision-relevant sentence.
    """
    client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

    event_text = f"""
    Event: {event['event_type']}
    User input: {event['input_data']}
    Agent output: {event['output_data']}
    Metadata: {event.get('metadata', {})}
    """

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # Substitute a model ID available to your account
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize this agent-user interaction in one sentence:\n{event_text}"
        }]
    )

    # The response content is a list of blocks; the first block holds the text
    return message.content[0].text


# Example: Rule-based event summary (faster, lighter)
def summarize_event_heuristic(event: dict) -> str:
    """Quick heuristic summary without LLM call."""
    event_type = event.get("event_type", "unknown")
    message = event.get("input_data", {}).get("message", "")
    # Truncate long messages and mark the truncation with an ellipsis
    user_msg = message[:50] + ("..." if len(message) > 50 else "")
    agent_action = event.get("output_data", {}).get("action", "")
    outcome = event.get("output_data", {}).get("result", "success")

    return f"{event_type}: user request '{user_msg}' → agent {agent_action} ({outcome})"

Session Summaries: Abstracting Multi-Turn Conversations

When a multi-turn conversation ends, summarize the entire session into a brief recap. A session summary captures:

  • Topic and goal
  • Key decisions or questions
  • Outcomes and action items
  • User preferences or constraints discovered

# Example: Session-level summarization
from datetime import datetime

def summarize_session(session_events: list, model_call) -> dict:
    """
    Summarize an entire user session (10–50 events).
    model_call is a function that calls Claude or similar.
    """
    if not session_events:
        raise ValueError("session_events must not be empty")

    # Concatenate event summaries
    event_summaries = [summarize_event_heuristic(e) for e in session_events]
    combined = "\n".join(event_summaries)

    # Use LLM to extract key facts
    response = model_call(f"""
    Here's a transcript of a user-agent interaction.
    Extract: (1) session goal, (2) key decisions, (3) outcome, (4) user preferences.
    Be concise; use bullet points.

    Transcript:
    {combined}
    """)

    return {
        "session_id": session_events[0].get("session_id"),
        "goal": extract_field(response, "goal"),
        "key_decisions": extract_field(response, "decisions"),
        "outcome": extract_field(response, "outcome"),
        "user_preferences": extract_field(response, "preferences"),
        "event_count": len(session_events),
        "created_at": datetime.now().isoformat()
    }


def extract_field(text: str, field_name: str) -> str:
    """Simple extraction; in practice, use regex or JSON parsing."""
    lines = text.split("\n")
    for line in lines:
        # Return the first line that mentions the field name
        if field_name.lower() in line.lower():
            return line.strip()
    return ""
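The docstring of `extract_field` suggests JSON parsing as the sturdier option. One way this might look, assuming the prompt is changed to ask the model for a JSON object with the keys `goal`, `key_decisions`, `outcome`, and `user_preferences` (the function name `parse_session_fields` and that prompt change are illustrative assumptions, not part of the code above):

```python
import json

# Keys the (hypothetical) prompt asks the model to return
SESSION_FIELDS = ("goal", "key_decisions", "outcome", "user_preferences")


def parse_session_fields(response: str) -> dict:
    """Parse a JSON object out of a model response, tolerating surrounding text."""
    empty = {name: "" for name in SESSION_FIELDS}

    # Models often wrap JSON in prose, so take the outermost {...} span
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return empty

    try:
        data = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return empty  # Fall back to empty fields rather than raising

    # Missing keys become empty strings so callers get a stable shape
    return {name: data.get(name, "") for name in SESSION_FIELDS}
```

Compared with line matching, this avoids picking up the wrong line when a field name happens to appear inside another field's text.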

Hierarchical Summarization: Multi-Level Compression

For long-running agents, a single level of summarization isn't enough. Use a pyramid: raw events at the base, then summaries at 1-day, 1-week, and 1-month granularity. Each level keeps references to the records it was built from, preserving the ability to zoom in if needed (for as long as those lower-level records are retained).

# Example: Hierarchical storage structure
import datetime as dt
import json

import anthropic


class HierarchicalMemory:
    def __init__(self, storage):
        self.storage = storage  # Database connection
        # Levels: raw events, daily summaries, weekly summaries, monthly summaries
        self.levels = ["raw", "daily", "weekly", "monthly"]

    def add_event(self, event: dict):
        """Add a raw event."""
        self.storage.insert("events_raw", event)

    @staticmethod
    def days_old(date: str) -> int:
        """Age in days of an ISO date string (YYYY-MM-DD)."""
        return (dt.date.today() - dt.date.fromisoformat(date)).days

    @staticmethod
    def week_ending_for(date: str) -> str:
        """ISO date of the Sunday closing the week that contains `date` (assumes Mon–Sun weeks)."""
        day = dt.date.fromisoformat(date)
        return (day + dt.timedelta(days=6 - day.weekday())).isoformat()

    def create_daily_summary(self, user_id: str, date: str):
        """
        Create a daily summary from raw events for a given date.
        Called end-of-day.
        """
        events = self.storage.query(
            "events_raw",
            where={"user_id": user_id, "date": date}
        )
        if not events:
            return  # Nothing to summarize for this day

        # Summarize all events from that day
        session_summary = summarize_session(events, model_call=self.summarize_with_claude)

        summary_record = {
            "user_id": user_id,
            "date": date,
            "week_ending": self.week_ending_for(date),  # Lets the weekly job find this record
            "level": "daily",
            "summary": session_summary,
            "event_count": len(events),
            "original_event_ids": [e["event_id"] for e in events]
        }

        self.storage.insert("summaries_daily", summary_record)

        # Optionally archive/delete raw events older than a threshold.
        # Same-day events are 0 days old, so this only fires when summarizing a backlog.
        if self.days_old(date) > 30:
            self.storage.delete("events_raw", where={"user_id": user_id, "date": date})

    def create_weekly_summary(self, user_id: str, week_ending_date: str):
        """
        Create a weekly summary from daily summaries.
        Called at end of week.
        """
        daily_summaries = self.storage.query(
            "summaries_daily",
            where={"user_id": user_id, "week_ending": week_ending_date}
        )
        if not daily_summaries:
            return  # No daily summaries for this week

        # Summarize the summaries (each daily "summary" is a dict, so serialize it first)
        combined = "\n".join(json.dumps(s["summary"]) for s in daily_summaries)
        weekly_summary = self.summarize_with_claude(
            f"Condense these daily summaries into a weekly overview:\n{combined}"
        )

        summary_record = {
            "user_id": user_id,
            "week_ending": week_ending_date,
            "level": "weekly",
            "summary": weekly_summary,
            # Assumes the storage layer assigns an "id" to each inserted record
            "daily_summary_ids": [s["id"] for s in daily_summaries]
        }

        self.storage.insert("summaries_weekly", summary_record)

    def summarize_with_claude(self, prompt: str) -> str:
        """Call Claude to summarize text."""
        client = anthropic.Anthropic()
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # Substitute a model ID available to your account
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text
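The `storage` object passed to `HierarchicalMemory` is never defined in the example; the class only relies on `insert`, `query`, and `delete`. For local experiments, a dictionary-backed stand-in along these lines may be enough. The class name `InMemoryStorage` and its behavior (exact-match `where` filters, auto-assigned integer `id`) are assumptions made for illustration, not an existing library API.

```python
class InMemoryStorage:
    """Hypothetical stand-in for the `storage` object: insert, query, delete."""

    def __init__(self):
        self.tables = {}      # table name -> list of row dicts
        self._next_id = 1

    def insert(self, table: str, record: dict) -> int:
        # Copy the record and assign an "id", as a database typically would
        row = dict(record, id=self._next_id)
        self._next_id += 1
        self.tables.setdefault(table, []).append(row)
        return row["id"]

    def query(self, table: str, where: dict) -> list:
        # Exact-match filter on every key in `where`; an empty filter returns all rows
        rows = self.tables.get(table, [])
        return [r for r in rows if all(r.get(k) == v for k, v in where.items())]

    def delete(self, table: str, where: dict) -> int:
        # Remove matching rows and report how many were deleted
        rows = self.tables.get(table, [])
        keep = [r for r in rows if not all(r.get(k) == v for k, v in where.items())]
        self.tables[table] = keep
        return len(rows) - len(keep)
```

Assigning an `id` on insert matters here because the weekly summary step reads `s["id"]` from each daily summary record.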

Lossy Compression: What to Drop vs. Keep

Not all information is equally important. A smart summarization strategy drops low-signal data and preserves high-signal facts. Examples:

  • Drop: confirmation messages ("Understood"), repeated clarifications, system debug logs.
  • Keep: user requests, decisions, errors, outcomes, discovered preferences, action items.

def is_high_signal(event: dict) -> bool:
    """Heuristic to decide if an event should be preserved during compression."""
    event_type = event.get("event_type", "")
    content = str(event.get("content", "")).lower()

    # Drop these event types outright
    if event_type in ["confirmation", "ack", "debug_log"]:
        return False

    # Keep these event types, even if the text happens to contain "understood"
    if event_type in ["user_request", "decision", "error", "outcome", "preference_learned"]:
        return True

    # For untyped events, drop bare acknowledgements
    if "understood" in content or "acknowledged" in content:
        return False

    # Default: keep if it contains specific information
    return len(content) > 20  # Arbitrary but practical threshold

Summary Quality Metrics

How do you know if a summary is good? Measure:

  1. Compression ratio: original_size / summary_size. Target 5–10x for daily summaries.
  2. Fidelity: Does the agent make the same decision with the summary as with the full history? Measure via A/B test or human review.
  3. Recall: Can the agent retrieve a needed fact from the summary? Spot-check: "Did the summary mention the user's format preference?"

def evaluate_summary_quality(original_events: list, summary: str, test_queries: list) -> dict:
    """
    Measure summary quality on multiple dimensions.
    test_queries: list of {"question": "...", "expected_answer": "..."}
    """
    original_length = sum(len(e["content"]) for e in original_events)
    # Guard against an empty summary to avoid division by zero
    compression_ratio = original_length / len(summary) if summary else float("inf")

    # Recall: does each expected answer appear in the summary?
    correct_answers = 0
    for query in test_queries:
        # Simple: check if answer text appears in summary
        if query["expected_answer"].lower() in summary.lower():
            correct_answers += 1

    recall_score = correct_answers / len(test_queries) if test_queries else 1.0

    return {
        "compression_ratio": compression_ratio,
        "recall_score": recall_score,  # 0–1, higher is better
        "summary_length": len(summary),
        "original_length": original_length
    }
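This article quotes compression both as a ratio (5–10x, 7.3x) and as a percentage reduction (70–90%). The two are related by reduction = 1 − 1/ratio, so they can be converted as sketched below. The helper names are illustrative.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Convert a compression ratio (original/summary) to a fractional size reduction."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


def ratio_from_reduction(reduction: float) -> float:
    """Convert a fractional size reduction (0 <= r < 1) to a compression ratio."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)
```

For instance, a 10x ratio corresponds to a 90% reduction, and a 70% reduction corresponds to a ratio of about 3.3x, which is why the 70–90% range and the 3–10x range describe the same band.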

Key Takeaways

  • Summarization compresses episodic history by 70–90% while preserving 90%+ of decision-relevant context.
  • Use a multi-level hierarchy: raw events (recent, 7 days), daily summaries, weekly summaries, monthly semantic facts.
  • Drop low-signal data (confirmations, debug logs); keep high-signal facts (requests, decisions, preferences, errors).
  • Measure summary quality via compression ratio, recall (can the agent answer questions from the summary?), and A/B testing.
  • Use LLM-based summarization for nuance; rule-based extraction for speed and cost efficiency.

Frequently Asked Questions

Should I summarize working memory or episodic memory?

Both. Summarize episodic memory for long-term storage (reduce database size). Summarize working memory when rolling over tasks (context rollup) to preserve intent. Don't summarize the current conversation—keep it raw for fidelity.

How often should I create summaries?

For working memory: after every 10–20 turns or when the task completes. For episodic memory: create daily summaries at the end of each day (for example, at 11 PM), weekly summaries at the end of each week, and monthly summaries at month-end.
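The working-memory trigger (a turn threshold or task completion) could be sketched as a small buffer that rolls raw turns into a summary. The class name `WorkingMemory`, the default threshold of 15, and the `summarize` callback are hypothetical; in practice the callback might wrap an LLM call.

```python
from typing import Callable, List


class WorkingMemory:
    """Keeps raw turns and rolls them into a summary when a trigger fires."""

    def __init__(self, summarize: Callable[[List[str]], str], turn_threshold: int = 15):
        self.summarize = summarize            # e.g. a wrapper around an LLM call
        self.turn_threshold = turn_threshold  # assumed midpoint of the 10–20 range
        self.summaries: List[str] = []
        self.turns: List[str] = []

    def add_turn(self, text: str, task_completed: bool = False) -> None:
        self.turns.append(text)
        # Roll up when the task ends or the buffer reaches the threshold
        if task_completed or len(self.turns) >= self.turn_threshold:
            self.summaries.append(self.summarize(self.turns))
            self.turns = []
```

Until a trigger fires, the turns stay raw, which matches the advice to keep the current conversation unsummarized.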

What if a summary is wrong or loses important details?

This is a real risk with aggressive compression. Mitigate: (1) use less aggressive compression (e.g., 3–5x instead of 10x), (2) keep original events for a reasonable window (7–30 days) so the agent can fetch details if needed, (3) human-review summaries in sensitive domains (medical, legal).

Can I use structured or bullet-point summaries instead of prose?

Yes. Bullet points often work better for agents: clearer structure, easier to parse, denser information. Experiment with formats and measure recall.
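As one possible format to experiment with, a summary dictionary can be rendered as bullets before it is placed in the agent's context. The function name `render_bullets` and the layout are illustrative choices, not a recommended standard.

```python
def render_bullets(summary: dict) -> str:
    """Render a summary dict as bullet points, one field per bullet."""
    lines = []
    for key, value in summary.items():
        label = key.replace("_", " ").capitalize()
        if isinstance(value, (list, tuple)):
            # List-valued fields become a bullet with indented sub-items
            lines.append(f"• {label}:")
            lines.extend(f"    - {item}" for item in value)
        else:
            lines.append(f"• {label}: {value}")
    return "\n".join(lines)
```

The same summary can then be scored in prose and bullet form with the recall check described earlier to see which format serves the agent better.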

Should I re-summarize as new information arrives?

Only for episodic summaries that haven't been archived. If you discover new facts (e.g., user corrections) weeks later, don't redo past summaries; record the correction as a new event instead.
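Recording a correction as a new event, rather than rewriting an archived summary, might look like the sketch below. The event shape (`corrects_summary_id`, `field`, `corrected_value`) and both function names are assumptions for illustration; the archived summary itself is never modified.

```python
from datetime import datetime, timezone


def make_correction_event(summary_id, field: str, corrected_value, user_id: str) -> dict:
    """Build a new event that points at the summary it corrects."""
    return {
        "event_type": "correction",
        "user_id": user_id,
        "corrects_summary_id": summary_id,
        "field": field,
        "corrected_value": corrected_value,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def apply_corrections(summary: dict, corrections: list) -> dict:
    """Return a read-time view of the summary with later corrections overlaid."""
    view = dict(summary)  # Copy, so the stored summary stays untouched
    for c in sorted(corrections, key=lambda c: c["created_at"]):
        if c["corrects_summary_id"] == summary.get("id"):
            view[c["field"]] = c["corrected_value"]
    return view
```

Keeping corrections as separate events preserves an audit trail: the original summary, the correction, and the time it was made all remain retrievable.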

Further Reading