Long-Horizon Task Planning: Multi-Step Objectives Over Time
Long-horizon planning is the challenge of orchestrating 20, 50, or 100+ steps over hours, days, or weeks toward a distant goal. Naive approaches break down: a single flat task graph with 100 nodes is hard to fit, and harder to reason about, within an LLM's context window, and a single ReAct loop with 100 steps lets small per-step errors compound. Production systems use hierarchical planning: break the long journey into distinct stages, plan each stage locally, and use checkpoints to stop errors from compounding across stages.
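To see why errors compound over a long chain, a rough back-of-the-envelope sketch helps. It assumes every step succeeds independently with the same probability, which is a simplification, and the 95% and 99% figures are illustrative rather than measured.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_success ** steps


for p in (0.95, 0.99):
    print(f"per-step {p:.0%}, 100 steps -> {chain_success_probability(p, 100):.1%}")
# per-step 95%, 100 steps -> 0.6%
# per-step 99%, 100 steps -> 36.6%
```

Under these assumptions, even a step that is right 99% of the time leaves a 100-step chain succeeding only about a third of the time, which is the motivation for validating at stage boundaries instead of only at the end.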
The Hierarchy: Goals → Stages → Tasks
Instead of one flat plan, use three levels:
- Goal (1–2 months): "Migrate our entire data pipeline to a modern warehouse."
- Stages (1–2 weeks each): Audit, design, set up, migrate, validate, train.
- Tasks (1–4 hours each): Specific subtasks within a stage.
This structure lets you reason about stages in a high-level plan (fewer than ten nodes, which fits easily in context), then plan each stage independently (15–30 tasks per stage, still manageable).
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """A phase of the long-horizon goal."""
    id: str
    name: str
    description: str
    goal: str
    prerequisite_stages: List[str]  # IDs of stages that must complete first
    estimated_duration_hours: float
    success_criteria: List[str]
    # Task dicts for this stage; empty until generated at execution time
    tasks: List[dict] = field(default_factory=list)


@dataclass
class Checkpoint:
    """Validation point between stages."""
    id: str
    stage_id: str
    check: str  # What to verify
    success_criteria: str
    remediation: str  # What to do if check fails


# Defined last so that Stage and Checkpoint already exist when referenced.
@dataclass
class LongHorizonPlan:
    """A multi-stage plan for a long-horizon goal."""
    goal: str
    stages: List[Stage]
    checkpoints: List[Checkpoint]
    total_estimated_duration: float  # hours
```
Here's how the planning and execution flow:
```python
import time


class TaskFailure(Exception):
    """A task failed after exhausting its retries."""


class StageFailure(Exception):
    """A stage failed validation after exhausting its retries."""


class LongHorizonAgent:
    # Assumed interfaces: llm_client.completion(prompt, response_format=...) and
    # executor.run_task(task, inputs). generate_checkpoints, validate_task_output,
    # and topological_sort are helpers that are not shown here.
    def __init__(self, llm_client, executor):
        self.llm = llm_client
        self.executor = executor
        self.completed_stages = set()
        self.stage_results = {}
        self.max_retries_per_stage = 2

    def plan_long_horizon(self, goal: str, horizon_days: int) -> LongHorizonPlan:
        """Generate a multi-stage plan for a long-horizon goal."""
        plan_prompt = f"""
        You are a planning expert. Your goal is to break a long-horizon objective into stages.
        GOAL: {goal}
        HORIZON: {horizon_days} days
        Break this into 4-6 major stages. Each stage should:
        1. Represent 1-2 weeks of work
        2. Have clear success criteria
        3. Depend on previous stages' outputs
        4. Be independently executable
        FORMAT AS JSON:
        {{
          "stages": [
            {{
              "id": "stage_1",
              "name": "Stage Name",
              "description": "What this stage accomplishes",
              "goal": "Specific goal for this stage",
              "prerequisite_stages": [],
              "estimated_duration_hours": 40,
              "success_criteria": ["Criterion 1", "Criterion 2"]
            }}
          ],
          "rationale": "Why this stage breakdown makes sense"
        }}"""
        plan = self.llm.completion(plan_prompt, response_format="json")
        # Tasks are planned later, per stage, so each stage starts with none.
        stages = [Stage(**s, tasks=[]) for s in plan["stages"]]
        return LongHorizonPlan(
            goal=goal,
            stages=stages,
            checkpoints=self.generate_checkpoints(plan["stages"]),
            total_estimated_duration=sum(s.estimated_duration_hours for s in stages),
        )

    def execute_stage(self, stage: Stage, attempt: int = 0) -> dict:
        """Execute a single stage with detailed task planning."""
        # STEP 1: Detailed task planning for this stage
        task_prompt = f"""
        You are executing this stage of a larger project:
        STAGE GOAL: {stage.goal}
        STAGE DESCRIPTION: {stage.description}
        SUCCESS CRITERIA: {stage.success_criteria}
        Break this stage into 15-30 concrete tasks. Each task should:
        1. Be executable in 1-4 hours
        2. Have clear inputs and outputs
        3. Respect dependencies
        FORMAT AS JSON with field "tasks": [...]"""
        stage_plan = self.llm.completion(task_prompt, response_format="json")
        stage.tasks = stage_plan["tasks"]

        # STEP 2: Execute tasks
        results = {}
        for task in stage.tasks:
            try:
                results[task["id"]] = self.execute_task_with_retry(task, results)
            except TaskFailure as e:
                # Log failure; let stage validation decide if the stage fails
                results[task["id"]] = {"status": "failed", "error": str(e)}

        # STEP 3: Validate stage outputs
        validation_result = self.validate_stage(stage, results)
        if not validation_result["passed"]:
            # The retry count is tracked per call, so each stage gets its own budget.
            if attempt < self.max_retries_per_stage:
                return self.execute_stage(stage, attempt + 1)  # Retry entire stage
            raise StageFailure(validation_result["reason"])

        self.completed_stages.add(stage.id)
        self.stage_results[stage.id] = results
        return {"status": "success", "results": results}

    def execute_task_with_retry(self, task: dict, context: dict, max_attempts: int = 3):
        """Execute a single task, allowing up to 2 retries (3 attempts total)."""
        error = "unknown error"
        for attempt in range(max_attempts):
            try:
                # Fetch inputs from prior tasks
                inputs = {dep: context[dep] for dep in task.get("depends_on", [])}
                result = self.executor.run_task(task, inputs)
                if self.validate_task_output(task, result):
                    return result
                error = "output validation failed"
            except Exception as e:
                error = str(e)
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # Exponential backoff before the next attempt
        raise TaskFailure(f"Task {task['id']} failed after {max_attempts} attempts: {error}")

    def validate_stage(self, stage: Stage, results: dict) -> dict:
        """Check if stage outputs meet success criteria."""
        for criterion in stage.success_criteria:
            validation_prompt = f"""
            Does this stage output satisfy the criterion?
            CRITERION: {criterion}
            STAGE OUTPUTS: {results}
            Reply with PASSED or FAILED as the first word, then a brief reason."""
            response = self.llm.completion(validation_prompt)
            # Check the leading verdict only; a substring search for "failed"
            # would misfire on replies such as "PASSED: no task failed".
            if not response.strip().upper().startswith("PASSED"):
                return {"passed": False, "reason": response}
        return {"passed": True}

    def run_long_horizon(self, goal: str, horizon_days: int) -> dict:
        """Execute a complete long-horizon plan."""
        # Plan the entire arc
        plan = self.plan_long_horizon(goal, horizon_days)

        # Execute stages in dependency order
        all_results = {}
        for stage in topological_sort(plan.stages):
            try:
                all_results[stage.id] = self.execute_stage(stage)
            except StageFailure as e:
                # Stage failed; escalate to human
                return {
                    "status": "human_escalation_needed",
                    "failed_stage": stage.id,
                    "reason": str(e),
                    "results_so_far": all_results,
                }
        return {
            "status": "success",
            "goal": goal,
            "stages_completed": len(plan.stages),
            "all_results": all_results,
        }
```
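The execution loop relies on a `topological_sort` helper that the article does not define. One possible sketch, assuming stages expose the `id` and `prerequisite_stages` fields described earlier, uses the standard library's `graphlib` (Python 3.9+). `StageStub` is a hypothetical stand-in so the example runs on its own.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import List


@dataclass
class StageStub:
    """Hypothetical stand-in for Stage, carrying only the fields the sort needs."""
    id: str
    prerequisite_stages: List[str] = field(default_factory=list)


def topological_sort(stages):
    """Return stages ordered so that every prerequisite precedes its dependents."""
    by_id = {s.id: s for s in stages}
    graph = {s.id: set(s.prerequisite_stages) for s in stages}
    unknown = {p for deps in graph.values() for p in deps} - set(by_id)
    if unknown:
        raise ValueError(f"Unknown prerequisite stage IDs: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return [by_id[sid] for sid in TopologicalSorter(graph).static_order()]


stages = [
    StageStub("migrate", ["setup"]),
    StageStub("audit"),
    StageStub("setup", ["design"]),
    StageStub("design", ["audit"]),
]
print([s.id for s in topological_sort(stages)])
# ['audit', 'design', 'setup', 'migrate']
```

Rejecting unknown prerequisite IDs up front is a design choice for this sketch: an LLM-generated plan can reference a stage that does not exist, and failing early is easier to debug than a missing stage at execution time.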
Preventing State Drift
The biggest risk in long-horizon planning is state drift. You compute a plan assuming Service X is running, but three days later Service X is down, which invalidates every downstream assumption.
Mitigation: record a snapshot of the world state alongside each stage's results.
```python
import time


def save_stage_checkpoint(stage_id: str, results: dict):
    """Save stage results for recovery if later stages fail."""
    # get_active_services, get_resource_status, get_data_snapshots, and
    # save_to_disk are environment-specific helpers that are not shown here.
    checkpoint = {
        "timestamp": time.time(),
        "stage_id": stage_id,
        "results": results,
        "world_state": {
            "running_services": get_active_services(),
            "resource_limits": get_resource_status(),
            "data_snapshots": get_data_snapshots(),
        },
    }
    save_to_disk(f"checkpoint_{stage_id}.json", checkpoint)
```
If a later stage fails and needs to replan, you can load this checkpoint and tell the LLM: "Here's what the world looked like when Stage 1 completed; what's changed?"
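Before handing that question to the LLM, it can help to compute the difference mechanically. The sketch below assumes the saved `world_state` and a freshly collected one are plain dictionaries with comparable values; `diff_world_state` is a hypothetical helper, not part of the agent above.

```python
def diff_world_state(saved: dict, current: dict) -> dict:
    """Return {key: (saved_value, current_value)} for every key whose value changed."""
    changed = {}
    for key in sorted(saved.keys() | current.keys()):
        before, after = saved.get(key), current.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed


saved = {"running_services": ["warehouse", "etl"], "resource_limits": {"cpu": 8}}
current = {"running_services": ["etl"], "resource_limits": {"cpu": 8}}
drift = diff_world_state(saved, current)
print(drift)
# {'running_services': (['warehouse', 'etl'], ['etl'])}
```

Passing only the changed keys into the replanning prompt keeps it short and points the model at what actually drifted.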
Handling Multi-Day Execution
For goals spanning multiple days, plan conservatively:
```python
def plan_with_uncertainty(llm, goal: str, horizon_days: int) -> dict:
    """Account for unknown unknowns in long-horizon planning."""
    # llm is an assumed client exposing completion(prompt, response_format=...),
    # passed in explicitly instead of being read from a global.
    prompt = f"""
    GOAL: {goal}
    HORIZON: {horizon_days} days
    Break into stages, but assume:
    - 20% of tasks will fail and need rework.
    - External systems may go offline, requiring workarounds.
    - New information will arrive mid-project, requiring pivots.
    Build in slack: for each stage, add a "buffer tasks" subtask
    that catches unforeseen issues.
    FORMAT: JSON with "stages": [...] and "contingency_planning": {{...}}"""
    # The doubled braces above emit a literal {...}; single braces inside an
    # f-string would be evaluated as an expression.
    return llm.completion(prompt, response_format="json")
```
Long-term plans are inherently uncertain. Building that uncertainty into the plan, through explicit "handle the unexpected" tasks, is more robust than pretending to have perfect knowledge.
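The 20% rework assumption can also be turned into a rough duration buffer. This sketch assumes that the reworked fraction is redone exactly once and takes as long as the original work; both are simplifications, and `buffered_hours` is a hypothetical helper.

```python
def buffered_hours(estimated_hours: float, rework_rate: float = 0.20) -> float:
    """Pad an estimate, assuming a fraction of the work is redone exactly once."""
    if not 0 <= rework_rate < 1:
        raise ValueError("rework_rate must be in [0, 1)")
    return estimated_hours * (1 + rework_rate)


print(f"{buffered_hours(40):.1f}")
# 48.0
```

For a stage estimated at 40 hours, this suggests reserving roughly 8 extra hours for the "buffer tasks" subtask.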
Key Takeaways
- Long-horizon planning requires hierarchy: Goals → Stages → Tasks.
- Plan stages at a high level (fewer than ten nodes); plan tasks within each stage (15–30 nodes).
- Each stage should be 1–2 weeks; each task 1–4 hours.
- Save checkpoints after each stage for state recovery.
- Assume 20% error rates and plan contingency tasks; don't assume perfect execution.
- Escalate to human if a stage fails after retries.
Frequently Asked Questions
How many stages should a long-horizon plan have?
4–8 stages is typical for a goal spanning a few months. Fewer stages mean less granularity, which makes failures harder to localize; more stages mean more planning and checkpoint overhead. A useful rule of thumb is to pick stages you'd assign to different team members if this were human work.
What if a stage takes longer than estimated?
Log the overrun, but don't stop. Extend the plan's horizon. In some cases, rescope: ask the LLM, "Given this delay, what can we skip while still meeting the goal?"
Can I run stages in parallel?
Only if they have no dependencies. In most projects, stages must run sequentially (Stage 1's outputs are Stage 2's inputs). Parallelization happens within a stage: multiple tasks in the same stage can run in parallel.
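Within a stage, one way to run independent tasks concurrently is to submit each task as soon as all of its dependencies have finished. The sketch below assumes tasks are dicts with an `id` and an optional `depends_on` list that only names tasks in the same stage, and that `run_task(task, inputs)` is a caller-supplied function; threads suit I/O-bound tasks such as API calls.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_stage_tasks_in_parallel(tasks, run_task, max_workers=4):
    """Run each task as soon as everything in its depends_on list has finished."""
    by_id = {t["id"]: t for t in tasks}
    sorter = TopologicalSorter({t["id"]: set(t.get("depends_on", [])) for t in tasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    results, in_flight = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies are all complete.
            for task_id in sorter.get_ready():
                task = by_id[task_id]
                inputs = {dep: results[dep] for dep in task.get("depends_on", [])}
                in_flight[pool.submit(run_task, task, inputs)] = task_id
            # Block until at least one running task finishes, then unlock dependents.
            done, _ = wait(set(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                task_id = in_flight.pop(future)
                results[task_id] = future.result()  # re-raises task exceptions
                sorter.done(task_id)
    return results


tasks = [
    {"id": "export_a"},
    {"id": "export_b"},
    {"id": "merge", "depends_on": ["export_a", "export_b"]},
]
results = run_stage_tasks_in_parallel(
    tasks, lambda task, inputs: f"{task['id']} <- {sorted(inputs)}"
)
print(results["merge"])
# merge <- ['export_a', 'export_b']
```

Here `export_a` and `export_b` can run at the same time, while `merge` waits for both. Retry and validation logic is left out to keep the scheduling visible.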
How do I handle external service dependencies in a multi-day plan?
Add explicit "verify dependency" tasks at stage boundaries. Before moving to Stage 2, check: "Is the data warehouse still online?" If not, trigger a fallback plan.
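A stage-boundary gate along these lines could look like the following sketch. The check names and the `run_stage` and `run_fallback` callables are hypothetical placeholders; real checks would probe the actual services.

```python
from typing import Callable, Dict, List


def failed_dependencies(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each health check; return the names of dependencies that are not healthy."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False  # a check that raises counts as a failed dependency
        if not healthy:
            failed.append(name)
    return failed


def gate_next_stage(checks, run_stage, run_fallback):
    """Run the next stage only if every dependency check passes; otherwise fall back."""
    failed = failed_dependencies(checks)
    return run_fallback(failed) if failed else run_stage()


def _timeout() -> bool:
    raise TimeoutError("no response")


outcome = gate_next_stage(
    checks={"warehouse": lambda: True, "auth_service": _timeout},
    run_stage=lambda: "stage_2 started",
    run_fallback=lambda failed: f"fallback: {failed} unavailable",
)
print(outcome)
# fallback: ['auth_service'] unavailable
```

Treating an exception in a check as a failure keeps one unreachable service from crashing the gate itself.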
Further Reading
- Hierarchical Task Network Planning (Erol et al., 1994) — foundational work on decomposing goals hierarchically.
- OpenAI: Task Decomposition in Multi-Agent Systems — how OpenAI structures multi-day projects.
- Google Cloud: Long-Running Workflow Patterns — production patterns for multi-day tasks.