LangGraph Checkpoints: Persistence & Recovery (2026)

Checkpoints are snapshots of agent state saved to a database. LangGraph's checkpoint system enables fault tolerance, cost recovery, and auditability: if your agent crashes mid-run, you resume from the last checkpoint without recomputing earlier steps. This article covers checkpointing strategies, backends, resume/replay patterns, and cost calculation for persistent agents.

Why Checkpoints Matter

An agent researching a topic makes 5-7 model calls. If the process crashes on call 6 and there are no checkpoints, you pay again for calls 1–5, roughly $0.50–1.00 of recomputation. With checkpoints, you resume from call 5's output in seconds. For long-running agents (customer service, data analysis), the savings compound: a 50-call workflow that crashes near the end would otherwise repeat almost the entire run on every retry.
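The arithmetic behind that estimate can be sketched in a few lines. The flat price of $0.15 per call is an assumed, illustrative figure chosen to sit inside the range quoted above; real costs vary with model and token counts.

```python
# Hypothetical flat price per model call, for illustration only.
COST_PER_CALL = 0.15


def wasted_cost(crash_on_call: int, cost_per_call: float = COST_PER_CALL) -> float:
    """Cost of the calls that must be repeated when a run restarts from scratch.

    A crash on call N means calls 1..N-1 had already completed and been paid for.
    """
    completed_calls = crash_on_call - 1
    return completed_calls * cost_per_call


# Crash on call 6 of a research run: calls 1-5 are repeated without checkpoints.
print(f"Recompute cost without checkpoints: ${wasted_cost(6):.2f}")
```

With a checkpoint saved after every call, none of the completed calls are repeated, so this amount is what each resume avoids.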

Checkpoints also enable:

  • Resumable Workflows: Pause after each step, inspect state, resume later.
  • Branching: From a checkpoint, fork and explore different paths (e.g., two different search strategies).
  • Audit Trails: Every state snapshot is logged; you can replay a run to debug.

Understanding Checkpoint Structure

A checkpoint stores:

  • Thread ID: A unique identifier for a conversation or run. Same thread_id = same conversation.
  • Timestamp: When the checkpoint was taken.
  • State: The full state dict (messages, tool results, memory).
  • Node: Which node just completed.

LangGraph automatically saves a checkpoint at the end of each super-step, which in a simple sequential graph means after each node completes. You don't need to write checkpointing code; just configure a backend.
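To make the four fields concrete, here is a minimal sketch of a record that holds them. The class name CheckpointRecord and its layout are hypothetical and only mirror the list above; this is not LangGraph's internal checkpoint type.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class CheckpointRecord:
    """Illustrative record only; not LangGraph's internal class."""

    thread_id: str          # identifies the conversation or run
    timestamp: datetime     # when the snapshot was taken
    state: dict[str, Any]   # full state at that point
    node: str               # node that just completed


record = CheckpointRecord(
    thread_id="user-001-research-06-02",
    timestamp=datetime.now(timezone.utc),
    state={"query": "AI safety in 2026", "messages": []},
    node="research",
)
print(record.thread_id, record.node)
```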

Setting Up Checkpointers

LangGraph supports multiple backends. For local development, SQLite is convenient. For production, use PostgreSQL or a managed service (like Upstash or Firestore).

SQLite (Local Development):

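# Requires the langgraph-checkpoint-sqlite package
import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import StateGraph

# check_same_thread=False lets the saver be used from worker threads
conn = sqlite3.connect("langgraph.sqlite", check_same_thread=False)
saver = SqliteSaver(conn)

# AgentState, research_node and analyze_node are defined elsewhere
graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("analyze", analyze_node)
graph.add_edge("research", "analyze")
graph.set_entry_point("research")
graph.set_finish_point("analyze")

# Compile with checkpointer
compiled_graph = graph.compile(checkpointer=saver)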

The SQLite database is created automatically. It stores all thread states, allowing you to resume any previous conversation.

PostgreSQL (Production):

# Requires the langgraph-checkpoint-postgres package
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:password@localhost/langgraph"

# from_conn_string opens a connection and closes it when the block exits
# (for production, use a connection pool)
with PostgresSaver.from_conn_string(DB_URI) as saver:
    saver.setup()  # creates the checkpoint tables; needed on first use

    compiled_graph = graph.compile(checkpointer=saver)

    # Invoke the graph inside this block, while the connection is open

PostgreSQL scales to millions of checkpoints and integrates with cloud platforms (AWS RDS, GCP Cloud SQL, Render, Supabase).

Running with Checkpoints

Every invoke() call on a checkpointed graph must include a config dict whose "configurable" key carries a thread_id. Reusing a thread_id continues from that thread's saved state; a new thread_id starts fresh.

config = {"configurable": {"thread_id": "user-001-research-06-02"}}

# First run: starts from entry point
result = compiled_graph.invoke(
    {"query": "AI safety in 2026", "messages": []},
    config=config,
)

print("First run result:", result)

# Resume with the same thread_id and None as the input.
# If the first run crashed on the analyze node, this call
# skips research and starts at analyze.
result = compiled_graph.invoke(None, config=config)

print("Resumed result:", result)

If the first invoke crashes after the research node, invoking again with None as the input resumes from that checkpoint. Passing a new input dict on the same thread_id instead starts a new run from the entry point, with the input merged into the persisted state.
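The resume mechanism itself can be illustrated with a toy runner that uses only the standard library. This is a simplified model under stated assumptions, not LangGraph's implementation: the checkpoints dict stands in for a backend, the run function and the two nodes are hypothetical, and input handling is reduced to "use the saved state if one exists".

```python
from typing import Callable

State = dict
Node = Callable[[State], State]

# In-memory stand-in for a backend: thread_id -> (index of next node, state)
checkpoints: dict[str, tuple[int, State]] = {}
calls: list[str] = []              # records which nodes actually ran
should_crash = {"analyze": True}   # makes analyze fail exactly once


def research(state: State) -> State:
    calls.append("research")
    return {**state, "notes": "findings"}


def analyze(state: State) -> State:
    calls.append("analyze")
    if should_crash.pop("analyze", False):
        raise RuntimeError("simulated crash")
    return {**state, "report": "summary of " + state["notes"]}


def run(thread_id: str, nodes: list[Node], initial: State) -> State:
    """Run nodes in order, saving a checkpoint after each one completes."""
    step, state = checkpoints.get(thread_id, (0, initial))
    for i in range(step, len(nodes)):
        state = nodes[i](state)
        checkpoints[thread_id] = (i + 1, state)
    return state


nodes = [research, analyze]
try:
    run("t1", nodes, {"query": "AI safety"})
except RuntimeError:
    pass  # the first attempt dies inside analyze

result = run("t1", nodes, {"query": "AI safety"})
print(calls)
print(result["report"])
```

The second call finds the checkpoint written after research and starts at analyze, so research runs once while analyze runs twice.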

Resuming from Checkpoints Programmatically

You can fetch a specific checkpoint and inspect or resume from it.

# Get all checkpoints for a thread (newest first)
checkpoints = list(saver.list(config))

# Get a specific checkpoint by putting its ID in the config
checkpoint_config = {
    "configurable": {
        "thread_id": "user-001-research-06-02",
        "checkpoint_id": "...",
    }
}
specific_checkpoint = saver.get_tuple(checkpoint_config)

# Resume from a checkpoint
# (Usually automatic, but you can be explicit)
# None as input continues from the saved state instead of starting a new run
result = compiled_graph.invoke(None, config=specific_checkpoint.config)

This is useful for audit/replay scenarios: load a historical checkpoint and trace how the agent reached a decision.

Replay and Debugging

Replay a run from the beginning to trace every decision:

# Read the saved snapshots for a thread (returned newest first)
state_history = list(compiled_graph.get_state_history(config))

# Inspect state evolution in chronological order
for i, snapshot in enumerate(reversed(state_history)):
    print(f"Step {i}: next={snapshot.next}, values={snapshot.values}")

# Stream a run to watch each node's update as it executes
for event in compiled_graph.stream(
    {"query": "AI safety"},
    config=config,
    stream_mode="updates",
):
    print(f"After node: {event}")

Reading the state history shows how a finished run evolved without calling the model again, while streaming lets you observe every node execution and state update as it happens. Both are useful for debugging complex workflows.

Cost Calculation and Recovery

To understand the financial impact of checkpoints:

# Assume Claude 3.5 Sonnet: $3/1M input tokens, $15/1M output tokens

tokens_per_step = {"research": 1500, "analyze": 2000, "synthesize": 1000}
input_rate = 0.000003  # $3 per 1M input tokens
output_rate = 0.000015  # $15 per 1M output tokens


def step_cost(tokens):
    # Treats each step as `tokens` input tokens plus twice as many output tokens
    return tokens * input_rate + tokens * 2 * output_rate


# Calculate cost if the workflow runs once
total_cost_full_run = sum(step_cost(t) for t in tokens_per_step.values())
print(f"Full run cost: ${total_cost_full_run:.4f}")

# If the workflow crashes on step 3 and we resume from checkpoint,
# the completed research and analyze steps are not paid for again
cost_recovered = step_cost(tokens_per_step["research"]) + step_cost(tokens_per_step["analyze"])
print(f"Cost recovered by checkpointing: ${cost_recovered:.4f}")

For a workflow costing $0.50 per full run, resuming from a mid-point checkpoint saves roughly $0.25 per incident, and more when the crash happens later in the run. Over thousands of runs, this adds up to significant savings.
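How this adds up across many runs can be estimated with a short calculation. The run count, crash rate, and average progress at crash below are assumed inputs for illustration, not measurements.

```python
def expected_savings(
    runs: int,
    crash_rate: float,
    run_cost: float,
    avg_progress_at_crash: float,
) -> float:
    """Expected spend avoided by resuming instead of restarting.

    avg_progress_at_crash is the average fraction of a run's cost already
    spent when a crash happens (0.5 means crashes land mid-run on average).
    """
    crashes = runs * crash_rate
    return crashes * run_cost * avg_progress_at_crash


# Assumed figures: 10,000 runs, 2% crash rate, $0.50 per run, mid-point crashes.
saved = expected_savings(
    runs=10_000,
    crash_rate=0.02,
    run_cost=0.50,
    avg_progress_at_crash=0.5,
)
print(f"Expected savings: ${saved:.2f}")
```

The estimate ignores storage and database costs for the checkpoints themselves, which would need to be subtracted for a full picture.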

Thread Management

Thread IDs should be unique per user/session/conversation. Bad thread ID strategy leads to checkpoints bleeding across users (a serious bug).

import uuid
from datetime import datetime

def generate_thread_id(user_id: str, task_name: str) -> str:
    """Generate a unique, readable thread ID."""
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    # A short random suffix keeps IDs unique even within the same second
    suffix = uuid.uuid4().hex[:8]
    return f"{user_id}-{task_name}-{timestamp}-{suffix}"

# Good practice:
thread_id = generate_thread_id(user_id="user123", task_name="research")
config = {"configurable": {"thread_id": thread_id}}

# Bad practice (non-unique):
config = {"configurable": {"thread_id": "research"}}  # Shared across all users!
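One way to guard against checkpoints bleeding across users is to namespace the thread ID by user and verify ownership before loading a thread. The sketch below is one possible approach; new_thread_id and owns_thread are hypothetical helpers, and the scheme assumes user IDs never contain a colon.

```python
import uuid


def new_thread_id(user_id: str, task_name: str) -> str:
    """Build a thread ID namespaced by user, with a random suffix for uniqueness."""
    return f"{user_id}:{task_name}:{uuid.uuid4().hex}"


def owns_thread(user_id: str, thread_id: str) -> bool:
    """Return True only if the thread ID was issued for this user."""
    return thread_id.split(":", 1)[0] == user_id


tid = new_thread_id("user123", "research")
print(owns_thread("user123", tid))  # the issuing user passes the check
print(owns_thread("user456", tid))  # any other user is rejected
```

Calling owns_thread before every invoke means a client that sends someone else's thread ID is refused instead of being handed that user's saved state.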

Cleanup and Archive

Over time, checkpoints accumulate. For production systems, archive old threads and clean up completed runs.

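from datetime import datetime, timedelta, timezone

# Archive threads older than 30 days
cutoff_date = datetime.now(timezone.utc) - timedelta(days=30)

# Find the newest checkpoint timestamp per thread.
# Passing None lists checkpoints across all threads; check that
# your saver version supports this before relying on it.
latest = {}
for item in saver.list(None):
    thread_id = item.config["configurable"]["thread_id"]
    ts = datetime.fromisoformat(item.checkpoint["ts"])
    latest[thread_id] = max(ts, latest.get(thread_id, ts))

stale_threads = [t for t, ts in latest.items() if ts < cutoff_date]
# Implement archive and delete logic for stale_threads
# (Exact API depends on your backend)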

SQLite databases can grow large (1 GB+ for millions of checkpoints). Archive to cold storage (S3) and delete from the hot database.
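The archive step can be sketched with the standard library alone. The helpers split_stale and archive are hypothetical, the thread names and dates are made up, and uploading the resulting file to cold storage and deleting the rows from the hot database are left out because they depend on your backend.

```python
import gzip
import json
from datetime import datetime, timedelta, timezone


def split_stale(
    last_seen: dict[str, datetime], max_age_days: int, now: datetime
) -> tuple[list[str], list[str]]:
    """Partition thread IDs into (stale, active) by last-activity time."""
    cutoff = now - timedelta(days=max_age_days)
    stale = sorted(t for t, ts in last_seen.items() if ts < cutoff)
    active = sorted(t for t, ts in last_seen.items() if ts >= cutoff)
    return stale, active


def archive(records: list[dict], path: str) -> None:
    """Write records as gzip-compressed JSON lines, ready for upload."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


now = datetime(2026, 6, 2, tzinfo=timezone.utc)
last_seen = {
    "user-a-research": now - timedelta(days=45),
    "user-b-research": now - timedelta(days=3),
}
stale, active = split_stale(last_seen, max_age_days=30, now=now)
print("stale:", stale)
print("active:", active)
```

Writing the archive and confirming it is readable before deleting anything keeps a failed upload from turning into data loss.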

Key Takeaways

  • Checkpoints are automatic state snapshots that enable fault tolerance and cost recovery.
  • Thread IDs uniquely identify runs; invoking with the same ID continues from that thread's last checkpoint.
  • SQLite is ideal for development; PostgreSQL for production.
  • Streaming lets you observe state evolution and debug complex workflows.
  • Cost recovery is significant: resuming a $1 run that crashed halfway saves roughly $0.50.

Frequently Asked Questions

What happens if I invoke with a thread_id but don't have checkpoints enabled?

The invoke runs fresh every time. No state is persisted, so you can't resume. Always enable checkpointing for production.

Can I checkpoint to multiple backends simultaneously?

Not directly. Pick one backend per graph. If you need redundancy, use database replication (PostgreSQL replicas, S3 versioning).

What if I want to modify a checkpoint?

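LangGraph does not edit a saved checkpoint in place (to prevent corruption). Instead, call update_state() on the compiled graph with the thread's config and the values you want to change. This writes a new checkpoint containing the modified state, and the next invoke on that thread continues from it.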

How long are checkpoints retained?

That depends on your retention policy. For SQLite, you decide when to archive. For PostgreSQL, set a retention period (e.g., 90 days) and run a cleanup job.

Can checkpoints leak sensitive data?

Yes. Checkpoints store the entire state, including API responses and intermediate results. Use encryption at rest (PostgreSQL encryption) and access controls. Sanitize sensitive data before checkpointing.
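Sanitizing can be as simple as masking known-sensitive keys before a node returns its state update. The sketch below is a minimal example; the key names in SENSITIVE_KEYS and the redact helper are hypothetical and would need to match the fields your own state actually carries.

```python
from typing import Any

# Hypothetical key names; adjust to whatever your state actually contains.
SENSITIVE_KEYS = {"api_key", "password", "ssn", "email"}


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive dict entries masked, recursively."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


state = {
    "query": "billing issue",
    "user": {"email": "a@example.com", "plan": "pro"},
    "tool_results": [{"api_key": "sk-123", "status": 200}],
}
print(redact(state))
```

Because redact builds a new structure, the original state is left untouched; applying it inside the node means the sensitive values never reach the checkpoint backend at all. Key-based masking does not catch secrets embedded in free text such as message content.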

Further Reading