
Long-Horizon Task Planning: Multi-Step Objectives Over Time

Long-horizon planning is the challenge of orchestrating 20, 50, or 100+ steps over hours or days toward a distant goal. Naive approaches fail in two ways: a single flat task graph with 100 nodes can overflow or dilute the LLM's context, and a single ReAct loop with 100 steps compounds small errors from one step to the next. Production systems typically use hierarchical planning instead: break the long journey into distinct stages, plan each stage locally, and use checkpoints to keep errors from compounding.
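To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification; the 99% figure is illustrative, not a measurement, and `chain_success_probability` is a name introduced only for this example.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds.

    Assumes steps fail independently with identical probability
    (an illustrative simplification, not a model of any real agent).
    """
    return per_step_success ** steps


if __name__ == "__main__":
    for steps in (20, 50, 100):
        p = chain_success_probability(0.99, steps)
        print(f"{steps:>3} steps at 99% per step: {p:.1%} end-to-end")
```

Under these assumptions, even 99% per-step reliability leaves a 100-step chain finishing cleanly only a little over a third of the time, which is the motivation for validating at stage boundaries rather than only at the end.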

The Hierarchy: Goals → Stages → Tasks

Instead of one flat plan, use three levels:

  1. Goal (1–2 months): "Migrate our entire data pipeline to a modern warehouse."
  2. Stages (1–2 weeks each): Audit, design, set up, migrate, validate, train.
  3. Tasks (1–4 hours each): Specific subtasks within a stage.

This structure allows you to reason about stages in a high-level plan (6–10 nodes, fitting easily in context), then plan each stage independently (20–30 tasks per stage, manageable).

from __future__ import annotations  # lets LongHorizonPlan refer to Stage before it is defined

from dataclasses import dataclass, field
from typing import List

@dataclass
class LongHorizonPlan:
    """A multi-stage plan for a long-horizon goal."""
    goal: str
    stages: List[Stage]
    checkpoints: List[Checkpoint]
    total_estimated_duration: float = 0.0  # hours

@dataclass
class Stage:
    """A phase of the long-horizon goal."""
    id: str
    name: str
    description: str
    goal: str
    prerequisite_stages: List[str]  # IDs of stages that must complete first
    estimated_duration_hours: float
    success_criteria: List[str]
    # Task dicts for this stage; generated at execution time, so empty when the plan is created
    tasks: List[dict] = field(default_factory=list)

@dataclass
class Checkpoint:
    """Validation point between stages."""
    id: str
    stage_id: str
    check: str  # What to verify
    success_criteria: str
    remediation: str  # What to do if check fails

Here's how planning and execution fit together:

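import time

class TaskFailure(Exception):
    """Raised when a task still fails after all retries."""

class StageFailure(Exception):
    """Raised when a stage fails validation after all retries."""

class LongHorizonAgent:
    def __init__(self, llm_client):
        # llm_client is assumed to expose completion(prompt, response_format=...)
        self.llm = llm_client
        self.completed_stages = set()
        self.stage_results = {}
        self.max_retries_per_stage = 2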

    def plan_long_horizon(self, goal: str, horizon_days: int) -> LongHorizonPlan:
        """Generate a multi-stage plan for a long-horizon goal."""

        plan_prompt = f"""
        You are a planning expert. Your goal is to break a long-horizon objective into stages.

        GOAL: {goal}
        HORIZON: {horizon_days} days

        Break this into 4-6 major stages. Each stage should:
        1. Represent 1-2 weeks of work
        2. Have clear success criteria
        3. Depend on previous stages' outputs
        4. Be independently executable

        FORMAT AS JSON:
        {{
          "stages": [
            {{
              "id": "stage_1",
              "name": "Stage Name",
              "description": "What this stage accomplishes",
              "goal": "Specific goal for this stage",
              "prerequisite_stages": [],
              "estimated_duration_hours": 40,
              "success_criteria": ["Criterion 1", "Criterion 2"]
            }}
          ],
          "rationale": "Why this stage breakdown makes sense"
        }}"""

        # Assumes the client returns the parsed JSON as a dict
        plan = self.llm.completion(plan_prompt, response_format="json")
        return LongHorizonPlan(
            goal=goal,
            # Tasks start empty; execute_stage fills them in later
            stages=[Stage(**{"tasks": [], **s}) for s in plan["stages"]],
            # generate_checkpoints is assumed to be defined elsewhere on this class
            checkpoints=self.generate_checkpoints(plan["stages"]),
            total_estimated_duration=sum(
                s["estimated_duration_hours"] for s in plan["stages"]
            ),
        )

    def execute_stage(self, stage: Stage, attempt: int = 0) -> dict:
        """Execute a single stage with detailed task planning.

        `attempt` counts retries of this stage, so every stage gets its own retry budget.
        """

        # STEP 1: Detailed task planning for this stage
        task_prompt = f"""
        You are executing this stage of a larger project:

        STAGE GOAL: {stage.goal}
        STAGE DESCRIPTION: {stage.description}
        SUCCESS CRITERIA: {stage.success_criteria}

        Break this stage into 15-30 concrete tasks. Each task should:
        1. Be executable in 1-4 hours
        2. Have clear inputs and outputs
        3. Respect dependencies

        FORMAT AS JSON with field "tasks": [...]"""

        stage_plan = self.llm.completion(task_prompt, response_format="json")
        stage.tasks = stage_plan["tasks"]

        # STEP 2: Execute tasks
        results = {}
        for task in stage_plan["tasks"]:
            try:
                result = self.execute_task_with_retry(task, results)
                results[task["id"]] = result
            except TaskFailure as e:
                # Log failure; let stage checkpoint decide if stage fails
                results[task["id"]] = {"status": "failed", "error": str(e)}

        # STEP 3: Validate stage outputs
        validation_result = self.validate_stage(stage, results)

        if not validation_result["passed"]:
            if attempt < self.max_retries_per_stage:
                # Replan and retry the entire stage
                return self.execute_stage(stage, attempt + 1)
            raise StageFailure(validation_result["reason"])

        self.completed_stages.add(stage.id)
        self.stage_results[stage.id] = results
        return {"status": "success", "results": results}

    def execute_task_with_retry(self, task: dict, context: dict) -> str:
        """Execute a single task, allowing up to 2 retries (3 attempts in total)."""
        max_attempts = 3
        last_error = None

        for attempt in range(max_attempts):
            try:
                # Fetch inputs from prior tasks
                inputs = {dep: context[dep] for dep in task.get("depends_on", [])}

                # `executor` is assumed to be a task runner defined elsewhere
                result = executor.run_task(task, inputs)

                # Validate result; a valid result ends the loop
                if self.validate_task_output(task, result):
                    return result
                last_error = "output validation failed"

            except Exception as e:
                last_error = e

            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s

        raise TaskFailure(
            f"Task {task['id']} failed after {max_attempts} attempts: {last_error}"
        )

    def validate_stage(self, stage: Stage, results: dict) -> dict:
        """Check if stage outputs meet success criteria."""

        for criterion in stage.success_criteria:
            validation_prompt = f"""
            Does this stage output satisfy the criterion?

            CRITERION: {criterion}

            STAGE OUTPUTS: {results}

            Reply with "passed" or "failed" as the first word, then a brief reason."""

            response = self.llm.completion(validation_prompt)
            # Require an explicit "passed" verdict: a substring check for "failed"
            # would misread replies such as "passed, nothing failed".
            if not response.strip().lower().startswith("passed"):
                return {"passed": False, "reason": response}

        return {"passed": True}

    def run_long_horizon(self, goal: str, horizon_days: int) -> dict:
        """Execute a complete long-horizon plan."""

        # Plan the entire arc
        plan = self.plan_long_horizon(goal, horizon_days)

        # Execute stages in dependency order
        # (topological_sort is assumed to order stages by prerequisite_stages)
        all_results = {}
        for stage in topological_sort(plan.stages):
            try:
                stage_results = self.execute_stage(stage)
                all_results[stage.id] = stage_results
            except StageFailure as e:
                # Stage failed; escalate to human
                return {
                    "status": "human_escalation_needed",
                    "failed_stage": stage.id,
                    "reason": str(e),
                    "results_so_far": all_results
                }

        return {
            "status": "success",
            "goal": goal,
            "stages_completed": len(plan.stages),
            "all_results": all_results
        }
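The agent code calls a `topological_sort` helper without defining it. One possible sketch, assuming each stage exposes `id` and `prerequisite_stages` as in the `Stage` dataclass, uses the standard library's `graphlib` (Python 3.9+). The error handling shown here is an assumption, not something the original specifies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def topological_sort(stages):
    """Return stages ordered so that each one follows its prerequisites.

    Assumes every stage has `id` and `prerequisite_stages` attributes.
    Raises ValueError for unknown prerequisite IDs and
    graphlib.CycleError for circular dependencies.
    """
    by_id = {stage.id: stage for stage in stages}

    for stage in stages:
        unknown = set(stage.prerequisite_stages) - by_id.keys()
        if unknown:
            raise ValueError(f"{stage.id} depends on unknown stages: {sorted(unknown)}")

    # Map each stage ID to the set of IDs that must complete before it
    graph = {stage.id: set(stage.prerequisite_stages) for stage in stages}
    return [by_id[stage_id] for stage_id in TopologicalSorter(graph).static_order()]
```

Failing fast on an unknown prerequisite ID is a deliberate choice in this sketch: an LLM-generated plan can easily reference a stage it never defined, and it is cheaper to catch that before execution starts.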

Preventing State Drift

The biggest risk in long-horizon planning is state drift. You compute a plan assuming Service X is running, but 3 days later Service X is down, invalidating all downstream assumptions.

One mitigation is to snapshot the world state alongside each stage's results:

import time

def save_stage_checkpoint(stage_id: str, results: dict):
    """Save stage results for recovery if later stages fail."""
    # The get_* helpers and save_to_disk are assumed to be defined elsewhere
    checkpoint = {
        "timestamp": time.time(),
        "stage_id": stage_id,
        "results": results,
        "world_state": {
            "running_services": get_active_services(),
            "resource_limits": get_resource_status(),
            "data_snapshots": get_data_snapshots()
        }
    }
    save_to_disk(f"checkpoint_{stage_id}.json", checkpoint)

If a later stage fails and needs to replan, you can load this checkpoint and tell the LLM: "Here's what the world looked like when Stage 1 completed; what's changed?"
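As a rough sketch of that comparison, the snippet below loads a saved checkpoint and reports which parts of the world state have changed. It assumes the checkpoint was written as plain JSON with a `world_state` dictionary; `load_checkpoint` and `diff_world_state` are hypothetical helper names, not part of the code above.

```python
import json


def load_checkpoint(path: str) -> dict:
    """Read a checkpoint file, assuming it was saved as plain JSON."""
    with open(path) as f:
        return json.load(f)


def diff_world_state(saved: dict, current: dict) -> dict:
    """Return {key: {"then": ..., "now": ...}} for each top-level key that differs."""
    changes = {}
    for key in sorted(saved.keys() | current.keys()):
        if saved.get(key) != current.get(key):
            changes[key] = {"then": saved.get(key), "now": current.get(key)}
    return changes


if __name__ == "__main__":
    saved = {"running_services": ["warehouse", "etl"], "resource_limits": {"cpu": 8}}
    current = {"running_services": ["etl"], "resource_limits": {"cpu": 8}}
    print(diff_world_state(saved, current))
```

Passing only the changed keys to the replanning prompt keeps it short and points the LLM at what actually drifted rather than at the full snapshot.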

Handling Multi-Day Execution

For goals spanning multiple days, plan conservatively:

def plan_with_uncertainty(goal: str, horizon_days: int):
    """Account for unknown unknowns in long-horizon planning."""

    prompt = f"""
    GOAL: {goal}
    HORIZON: {horizon_days} days

    Break into stages, but assume:
    - 20% of tasks will fail and need rework.
    - External systems may go offline, requiring workarounds.
    - New information will arrive mid-project, requiring pivots.

    Build in slack: for each stage, add a "buffer tasks" subtask
    that catches unforeseen issues.

    FORMAT: JSON with "stages": [...] and "contingency_planning": {{...}}"""
    # Braces are doubled above because a bare {...} inside an f-string
    # is evaluated as an expression and rendered as "Ellipsis".

    # `llm` is assumed to be a module-level client with a completion() method
    plan = llm.completion(prompt, response_format="json")
    return plan

Long-term plans are inherently uncertain. Building that uncertainty into the plan, with explicit tasks for handling the unexpected, is more robust than pretending to have perfect knowledge.
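To size that slack, one rough approach is to inflate each estimate by the expected number of attempts. The sketch below assumes every attempt at a task fails independently with the same probability and that a failed attempt is redone in full, so the expected number of attempts is 1 / (1 - rate). Both the model and the `padded_estimate` name are illustrative assumptions.

```python
def padded_estimate(base_hours: float, rework_rate: float = 0.2) -> float:
    """Expected hours when each attempt fails with probability `rework_rate`
    and a failed attempt is repeated in full.

    Expected attempts follow a geometric distribution: 1 / (1 - rework_rate).
    """
    if not 0 <= rework_rate < 1:
        raise ValueError("rework_rate must be in [0, 1)")
    return base_hours / (1 - rework_rate)


if __name__ == "__main__":
    # A 40-hour stage with a 20% rework rate needs roughly 10 hours of buffer
    print(f"{padded_estimate(40):.1f} hours")
```

Under this model a 20% rework rate implies a 25% buffer, not 20%, because reworked tasks can themselves fail.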

Key Takeaways

  • Long-horizon planning requires hierarchy: Goals → Stages → Tasks.
  • Plan stages at a high level (6–10 nodes); plan tasks within each stage (20–30 nodes).
  • Each stage should be 1–2 weeks; each task 1–4 hours.
  • Save checkpoints after each stage for state recovery.
  • Assume roughly 20% of tasks will fail and plan contingency tasks; don't assume perfect execution.
  • Escalate to a human if a stage fails after retries.

Frequently Asked Questions

How many stages should a long-horizon plan have?

4–8 stages for a goal spanning one to four months, given stages of 1–2 weeks each. Fewer stages means less granularity (harder to debug); more stages means more overhead. A sweet spot is stages you'd assign to different team members if this were human work.

What if a stage takes longer than estimated?

Log the overrun, but don't stop. Extend the plan's horizon. In some cases, rescope: ask the LLM, "Given this delay, what can we skip while still meeting the goal?"

Can I run stages in parallel?

Only if they don't depend on each other. In most projects, stages must run sequentially (Stage 1's outputs are Stage 2's inputs). Parallelization usually happens within a stage: multiple independent tasks in the same stage can run in parallel.
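To check which stages could overlap, one option is to group them into "waves" in which no stage depends on another stage in the same wave. This sketch uses the standard library's `graphlib` (Python 3.9+); the stage names and the `parallel_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_waves(prerequisites: dict[str, list[str]]) -> list[list[str]]:
    """Group stage IDs into waves that can run concurrently.

    `prerequisites` maps each stage ID to the IDs it depends on.
    Every stage in a wave has all of its prerequisites in earlier waves.
    """
    sorter = TopologicalSorter(prerequisites)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    plan = {
        "audit": [],
        "design": ["audit"],
        "setup": ["audit"],
        "migrate": ["design", "setup"],
    }
    print(parallel_waves(plan))
```

In this example, "design" and "setup" land in the same wave because both depend only on "audit", so they are the only candidates for parallel execution.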

How do I handle external service dependencies in a multi-day plan?

Add explicit "verify dependency" tasks at stage boundaries. Before moving to Stage 2, check: "Is the data warehouse still online?" If not, trigger a fallback plan.
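A "verify dependency" task can be as simple as running a set of named health checks before the next stage starts. The sketch below assumes each check is a zero-argument callable returning True when the dependency is healthy; the check names and the `failed_dependencies` helper are hypothetical.

```python
from typing import Callable, Dict, List


def failed_dependencies(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named health check and return the names that fail.

    A check that raises is treated as a failure, since an unreachable
    service usually surfaces as an exception rather than a clean False.
    """
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    return failed


if __name__ == "__main__":
    checks = {
        "warehouse_online": lambda: True,    # stand-in for a real connectivity probe
        "etl_queue_reachable": lambda: False,
    }
    failed = failed_dependencies(checks)
    if failed:
        print(f"Dependencies unavailable, trigger fallback plan: {failed}")
```

Returning the list of failed names, rather than a single boolean, lets the fallback plan target the specific dependency that went away.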
