Evaluating Agent Plan Quality: Metrics and Frameworks
How do you know whether an AI agent's plan is good? "It worked" is not a rigorous answer. Production systems need metrics to evaluate plan quality, compare planning approaches, and diagnose failures. This article covers metrics for plan completeness, efficiency, robustness, and clarity.
Plan Quality Dimensions
A good plan excels across four dimensions:
| Dimension | Metric | Ideal Target | How to Measure |
|---|---|---|---|
| Completeness | Coverage of goal | 100% | Does execution of the plan achieve the stated goal? |
| Efficiency | Task count, duration, cost | Minimal | How many steps? How long? How much does it cost? |
| Robustness | Failure handling | Graceful | Does the plan recover from 1–2 task failures? |
| Clarity | Task understandability | 5/5 clarity | Can a human read and critique the plan? |
Let's define metrics for each:
Completeness Metrics
Goal Coverage: Does the plan, if executed perfectly, achieve the goal?
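import re

def measure_goal_coverage(plan: dict, goal: str) -> float:
    """Return 0.0–1.0: what fraction of the goal does the plan address?"""
    # `format_plan` and `llm` are assumed to be defined elsewhere: a plan-to-text
    # helper and an LLM client whose completion() returns the response as a string.
    coverage_prompt = f"""
GOAL: {goal}
PLAN SUMMARY: {format_plan(plan)}
Does this plan, if executed successfully, achieve the goal entirely?
Rate 0.0 (misses key aspects) to 1.0 (fully covers goal).
Rate as a decimal: [number]
Reason: [brief explanation]"""
    response = llm.completion(coverage_prompt)
    # Parse the number that follows "Rate as a decimal:" (brackets optional)
    match = re.search(r"Rate as a decimal:\s*\[?\s*([0-9]*\.?[0-9]+)", response)
    if match is None:
        return 0.5  # neutral default if parsing fails
    # Clamp in case the model answers outside the requested range
    return max(0.0, min(1.0, float(match.group(1))))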
def measure_task_completeness(plan: dict) -> dict:
    """Return metrics on whether every task has required fields."""
    required_fields = ["id", "name", "description", "success_criterion", "depends_on"]

    def is_filled(task: dict, field: str) -> bool:
        # An empty depends_on list is valid (a root task), so only a missing
        # key, None, or an empty string counts as unfilled.
        return task.get(field) not in (None, "")

    tasks = plan.get("tasks", [])
    total_checks = len(tasks) * len(required_fields)
    completeness_count = sum(
        is_filled(task, field) for task in tasks for field in required_fields
    )
    return {
        "completeness_ratio": completeness_count / total_checks if total_checks > 0 else 0,
        "tasks_with_missing_fields": [
            t.get("id", "<missing id>") for t in tasks
            if not all(is_filled(t, f) for f in required_fields)
        ],
    }
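Parsing the judge's rating is the most fragile step in goal coverage, so it helps to isolate it. The sketch below is one possible approach, not the article's own implementation: `parse_coverage_score` is a hypothetical helper that assumes the response follows the "Rate as a decimal:" template from the prompt, with or without brackets around the number.

```python
import re
from typing import Optional

# Hypothetical helper: assumes the judge echoes "Rate as a decimal: <number>".
_RATING_RE = re.compile(
    r"rate as a decimal:\s*\[?\s*([0-9]*\.?[0-9]+)", re.IGNORECASE
)


def parse_coverage_score(response: str) -> Optional[float]:
    """Return the rating clamped to [0.0, 1.0], or None if no rating is found."""
    match = _RATING_RE.search(response)
    if match is None:
        return None
    # Clamp so an out-of-range answer such as 1.5 cannot skew later averages.
    return max(0.0, min(1.0, float(match.group(1))))
```

Returning `None` rather than a default such as 0.5 lets the caller tell a parse failure apart from a genuinely middling plan, and decide whether to retry the judge call.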
Efficiency Metrics
Task Count: Fewer is better, but not at the expense of clarity.
def measure_plan_efficiency(plan: dict) -> dict:
    """Quantify plan efficiency: steps, duration, cost."""
    metrics = {
        "task_count": len(plan.get("tasks", [])),
        "estimated_duration_hours": sum(
            t.get("estimated_time_secs", 3600) / 3600
            for t in plan.get("tasks", [])
        ),
        "estimated_cost_usd": sum(
            (t.get("estimated_tokens", 1000) / 1000) * 0.01  # $0.01 per 1k tokens
            for t in plan.get("tasks", [])
        ),
        "critical_path_length": critical_path_length(plan),
    }
    return metrics
def critical_path_length(plan: dict) -> int:
    """Return the longest chain of sequential dependencies."""
    # build_task_graph and dfs_longest_path are helpers not shown in this article
    graph = build_task_graph(plan)
    longest_path = 0
    for task_id in graph.tasks:
        path_length = dfs_longest_path(graph, task_id)
        longest_path = max(longest_path, path_length)
    return longest_path
def plan_parallelizability(plan: dict) -> float:
    """Return 0.0–1.0: what fraction of tasks can run in parallel?"""
    # Compare the critical path with the total task count.
    # If critical_path = 5 and total_tasks = 20, the sequential share is 5/20 = 0.25,
    # meaning 75% of tasks can be parallelized, so the function returns 0.75.
    critical = critical_path_length(plan)
    total = len(plan.get("tasks", []))
    return 1.0 - critical / total if total > 0 else 0.0
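The functions above rely on graph helpers that are not shown. As a rough, self-contained sketch, the longest dependency chain can be computed directly from each task's `depends_on` list with a memoized depth-first search. The names `longest_dependency_chain` and `parallelizable_fraction` are illustrative, and the code assumes every task has a unique `id` and that dependencies naming unknown tasks can be ignored.

```python
def longest_dependency_chain(plan: dict) -> int:
    """Count the tasks on the longest chain of depends_on links."""
    deps = {t["id"]: list(t.get("depends_on", [])) for t in plan.get("tasks", [])}
    depth = {}        # memo: task id -> length of the longest chain ending there
    visiting = set()  # tasks on the current DFS stack, used to detect cycles

    def chain_length(task_id):
        if task_id in depth:
            return depth[task_id]
        if task_id in visiting:
            raise ValueError(f"dependency cycle at task {task_id!r}")
        visiting.add(task_id)
        # Dependencies that name unknown tasks are ignored (an assumption).
        longest_upstream = max(
            (chain_length(d) for d in deps[task_id] if d in deps), default=0
        )
        visiting.discard(task_id)
        depth[task_id] = 1 + longest_upstream
        return depth[task_id]

    return max((chain_length(t) for t in deps), default=0)


def parallelizable_fraction(plan: dict) -> float:
    """Share of tasks that lie off the longest chain (0.0 for an empty plan)."""
    total = len(plan.get("tasks", []))
    if total == 0:
        return 0.0
    return 1.0 - longest_dependency_chain(plan) / total
```

For a diamond-shaped plan (a, then b and c in parallel, then d), the longest chain has 3 tasks out of 4, so the parallelizable fraction is 0.25. The recursion is fine for plans with tens or hundreds of tasks; very deep chains would need an iterative traversal.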
Robustness Metrics
Resilience to Failures: Can the plan handle 1–2 task failures?
def measure_plan_robustness(plan: dict) -> dict:
    """Evaluate how the plan handles failures."""
    graph = build_task_graph(plan)
    metrics = {}
    # Metric 1: Single task failure tolerance
    # What fraction of tasks, if they fail, only block themselves (not downstream)?
    self_contained_tasks = 0
    for task_id in graph.tasks:
        dependents = graph.get_dependents(task_id)
        if len(dependents) == 0:
            self_contained_tasks += 1
    metrics["self_contained_task_ratio"] = (
        self_contained_tasks / len(graph.tasks)
        if len(graph.tasks) > 0 else 0
    )
    # Metric 2: Alternative paths
    # Does the plan have fallback paths (e.g., retry logic, alternative tools)?
    fallback_mentions = sum(
        1 for t in plan.get("tasks", [])
        if "fallback" in t.get("description", "").lower()
        or "alternative" in t.get("description", "").lower()
    )
    metrics["tasks_with_fallbacks"] = fallback_mentions
    # Metric 3: Critical task density
    # How many tasks are on the critical path (any failure delays completion)?
    critical_tasks = plan_critical_tasks(plan)
    metrics["critical_task_count"] = len(critical_tasks)
    metrics["critical_task_ratio"] = (
        len(critical_tasks) / len(plan.get("tasks", []))
        if plan.get("tasks", []) else 0
    )
    return metrics
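from typing import List

def plan_critical_tasks(plan: dict) -> List[str]:
    """Return task IDs on the critical path."""
    # Assumes the graph object exposes critical_path(), returning task dicts in order
    graph = build_task_graph(plan)
    critical_path = graph.critical_path()
    return [t["id"] for t in critical_path]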
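The self-contained ratio only distinguishes tasks with and without dependents. A finer-grained view, sketched below under stated assumptions, counts how many downstream tasks each single failure would block. `failure_blast_radius` is a hypothetical name, and the sketch assumes a failed task blocks every transitive dependent, with no fallbacks or retries.

```python
from collections import defaultdict, deque


def failure_blast_radius(plan: dict) -> dict:
    """Map each task id to the number of downstream tasks blocked if it fails."""
    tasks = plan.get("tasks", [])
    # Invert depends_on: upstream task id -> ids of tasks that depend on it.
    dependents = defaultdict(set)
    for task in tasks:
        for upstream in task.get("depends_on", []):
            dependents[upstream].add(task["id"])

    radius = {}
    for task in tasks:
        blocked = set()
        queue = deque([task["id"]])
        # Breadth-first walk over everything reachable through dependents.
        while queue:
            current = queue.popleft()
            for child in dependents.get(current, ()):
                if child not in blocked:
                    blocked.add(child)
                    queue.append(child)
        blocked.discard(task["id"])  # guards against a cyclic plan
        radius[task["id"]] = len(blocked)
    return radius
```

In a diamond-shaped plan (a, then b and c, then d), a failure in a blocks 3 tasks, a failure in b or c blocks 1, and a failure in d blocks none. Tasks with a large radius are natural candidates for explicit fallbacks.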
Clarity and Auditability Metrics
Understandability: Can a human read and understand the plan?
def measure_plan_clarity(plan: dict) -> dict:
    """Evaluate plan readability and structure."""
    tasks = plan.get("tasks", [])
    metrics = {}
    # Metric 1: Average task description length
    # Shorter = less detail; longer = harder to skim
    desc_lengths = [len(t.get("description", "").split()) for t in tasks]
    metrics["avg_description_words"] = (
        sum(desc_lengths) / len(desc_lengths) if desc_lengths else 0
    )
    metrics["description_length_ok"] = all(5 <= n <= 50 for n in desc_lengths)
    # Metric 2: Success criteria specificity
    # Are success criteria vague ("looks good") or specific ("JSON with fields X, Y")?
    # A missing or empty criterion also counts as not specific.
    vague_words = ["looks good", "seems right", "appropriate", "correct"]
    criteria = [t.get("success_criterion", "") for t in tasks]
    vague_criteria = sum(
        1 for c in criteria
        if not c.strip() or any(w in c.lower() for w in vague_words)
    )
    metrics["specific_criteria_ratio"] = (
        (len(criteria) - vague_criteria) / len(criteria) if criteria else 0
    )
    # Metric 3: Dependency graph density
    # Highly interconnected = harder to understand
    total_edges = sum(len(t.get("depends_on", [])) for t in tasks)
    # An acyclic plan with n tasks has at most n * (n - 1) / 2 dependency edges
    max_edges = len(tasks) * (len(tasks) - 1) // 2
    metrics["dependency_density"] = (
        total_edges / max_edges if max_edges > 0 else 0
    )
    return metrics
Benchmarking and Comparative Metrics
Compare multiple plans:
from typing import List

def compare_plans(plans: List[dict], goal: str) -> dict:
    """Compare multiple plans across all dimensions."""
    results = {}
    for i, plan in enumerate(plans):
        plan_id = plan.get("id", f"plan_{i}")
        results[plan_id] = {
            "completeness": measure_goal_coverage(plan, goal),
            "efficiency": measure_plan_efficiency(plan),
            "robustness": measure_plan_robustness(plan),
            "clarity": measure_plan_clarity(plan),
            "score": compute_composite_score(plan, goal),
        }
    # Recommend the plan with the highest composite score
    best_plan = max(results.keys(), key=lambda p: results[p]["score"])
    results["recommendation"] = best_plan
    return results
from typing import Optional

def compute_composite_score(plan: dict, goal: str, weights: Optional[dict] = None) -> float:
    """Combine all metrics into a single score (0.0–1.0)."""
    if weights is None:
        weights = {
            "completeness": 0.4,
            "efficiency": 0.2,
            "robustness": 0.3,
            "clarity": 0.1,
        }
    completeness = measure_goal_coverage(plan, goal)
    efficiency_metrics = measure_plan_efficiency(plan)
    # Normalize to 0.0–1.0: fewer tasks = higher score (10 tasks scores 0.5)
    efficiency = 1.0 / (1.0 + efficiency_metrics["task_count"] / 10)
    robustness_metrics = measure_plan_robustness(plan)
    robustness = robustness_metrics["self_contained_task_ratio"]
    clarity_metrics = measure_plan_clarity(plan)
    clarity = clarity_metrics["specific_criteria_ratio"]
    return (
        weights["completeness"] * completeness
        + weights["efficiency"] * efficiency
        + weights["robustness"] * robustness
        + weights["clarity"] * clarity
    )
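To make the arithmetic concrete, here is a small standalone sketch of the weighted sum, using the default weights from `compute_composite_score`. `weighted_score` and the sample dimension scores are illustrative assumptions, not part of the functions above. The validation step is an addition: it rejects weights that do not sum to 1.0, which would otherwise push the composite outside the 0.0–1.0 range.

```python
import math
from typing import Optional

DEFAULT_WEIGHTS = {
    "completeness": 0.4,
    "efficiency": 0.2,
    "robustness": 0.3,
    "clarity": 0.1,
}


def weighted_score(scores: dict, weights: Optional[dict] = None) -> float:
    """Weighted sum of per-dimension scores, each expected in [0.0, 1.0]."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    if any(not 0.0 <= value <= 1.0 for value in scores.values()):
        raise ValueError("each score must lie in [0.0, 1.0]")
    return sum(weights[name] * scores[name] for name in weights)


# Made-up dimension scores for a 10-task plan (efficiency = 1 / (1 + 10/10) = 0.5)
example_scores = {
    "completeness": 0.9,
    "efficiency": 1.0 / (1.0 + 10 / 10),
    "robustness": 0.4,
    "clarity": 0.8,
}
example_total = weighted_score(example_scores)
```

With these numbers the sum is 0.4 × 0.9 + 0.2 × 0.5 + 0.3 × 0.4 + 0.1 × 0.8 = 0.36 + 0.10 + 0.12 + 0.08, or about 0.66.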
Key Takeaways
- Evaluate plans across four dimensions: completeness, efficiency, robustness, clarity.
- Completeness: does execution of the plan achieve the goal? (0.0–1.0 scale)
- Efficiency: fewer tasks, shorter duration, lower cost; measure critical path length.
- Robustness: how many tasks have fallbacks? How many are on the critical path?
- Clarity: are descriptions 5–50 words? Are success criteria specific, not vague?
- Composite scoring lets you compare multiple plans and recommend the best.
Frequently Asked Questions
How do I weight different metrics when comparing plans?
Depends on context. For time-sensitive tasks, weight efficiency (critical path) heavily. For mission-critical tasks, weight robustness. For user-facing tasks, weight clarity. Set weights based on constraints, not gut feeling.
Should I measure plan quality before or after execution?
Both. Before execution: predict quality from the plan structure. After execution: measure actual quality (time, cost, failures). If prediction and reality diverge, refine your prediction model.
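One simple way to quantify the gap between prediction and reality is relative error per metric. The sketch below is illustrative: `relative_errors`, `diverging_metrics`, and the 25% tolerance are assumptions rather than established thresholds.

```python
def relative_errors(predicted: dict, actual: dict) -> dict:
    """Return (actual - predicted) / predicted for each metric present in both."""
    errors = {}
    for name, predicted_value in predicted.items():
        if name not in actual or predicted_value == 0:
            continue  # skip metrics that cannot be compared
        errors[name] = (actual[name] - predicted_value) / predicted_value
    return errors


def diverging_metrics(errors: dict, tolerance: float = 0.25) -> list:
    """Return names of metrics whose absolute relative error exceeds the tolerance."""
    return sorted(name for name, err in errors.items() if abs(err) > tolerance)
```

For example, a plan estimated at 4 hours that takes 8 has a relative error of 1.0 (100% over), which would flag the duration estimate for review.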
How do I handle plans I can't numerically compare?
Use qualitative evaluation: have a human expert review the plan and rate it 1–5 for completeness, efficiency, robustness, clarity. Average the ratings. For research, compare with a baseline plan.
Can I optimize a plan after generating it?
Yes. Generate 3–5 candidate plans, evaluate all of them, then refine the best one. Alternatively, improve iteratively: run the plan, collect metrics, then regenerate with feedback such as "Your plan took 8 hours against an estimate of 4. Next time, add more parallelism."