
Evaluating Agent Plan Quality: Metrics and Frameworks

How do you know if an AI agent's plan is good? "It worked" isn't rigorous. Production systems need metrics to evaluate plan quality, compare planning approaches, and diagnose failures. This article covers metrics for plan completeness, efficiency, robustness, and clarity.

Plan Quality Dimensions

A good plan excels across four dimensions:

| Dimension | Metric | Ideal Target | How to Measure |
|---|---|---|---|
| Completeness | Coverage of goal | 100% | Does execution of the plan achieve the stated goal? |
| Efficiency | Task count, duration, cost | Minimal | How many steps? How long? How much does it cost? |
| Robustness | Failure handling | Graceful | Does the plan recover from 1–2 task failures? |
| Clarity | Task understandability | 5/5 clarity | Can a human read and critique the plan? |

Let's define metrics for each:

Completeness Metrics

Goal Coverage: Does the plan, if executed perfectly, achieve the goal?

import re

def measure_goal_coverage(plan: dict, goal: str) -> float:
    """Return 0.0–1.0: what fraction of the goal does the plan address?"""

    # `llm` and `format_plan` are assumed to be defined elsewhere in your codebase.
    coverage_prompt = f"""
    GOAL: {goal}

    PLAN SUMMARY: {format_plan(plan)}

    Does this plan, if executed successfully, achieve the goal entirely?
    Rate 0.0 (misses key aspects) to 1.0 (fully covers goal).

    Rate as a decimal: [number]
    Reason: [brief explanation]"""

    response = llm.completion(coverage_prompt)
    # Take the first number in the response as the rating. The model replaces
    # the "[number]" placeholder, so splitting on that literal would always fail.
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        return 0.5  # default if parsing fails

    # Clamp to the documented 0.0–1.0 range
    return min(1.0, max(0.0, float(match.group())))

def measure_task_completeness(plan: dict) -> dict:
    """Return metrics on whether every task has required fields."""

    required_fields = ["id", "name", "description", "success_criterion", "depends_on"]

    def has_field(task: dict, field: str) -> bool:
        # An empty depends_on list is valid: root tasks have no dependencies.
        if field == "depends_on":
            return isinstance(task.get(field), list)
        return bool(task.get(field))

    tasks = plan.get("tasks", [])
    completeness_count = sum(has_field(t, f) for t in tasks for f in required_fields)
    total_checks = len(tasks) * len(required_fields)

    return {
        "completeness_ratio": completeness_count / total_checks if total_checks > 0 else 0,
        "tasks_with_missing_fields": [
            t.get("id") for t in tasks  # .get: "id" itself may be the missing field
            if not all(has_field(t, f) for f in required_fields)
        ],
    }
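To make the expected input concrete, here is a minimal sketch of a plan dictionary using the field names that appear in this article's code. The task contents are invented for illustration, and treating an empty depends_on list as valid for root tasks is a choice made in this sketch.

```python
REQUIRED_FIELDS = ("id", "name", "description", "success_criterion", "depends_on")

# Hypothetical two-task plan; field names follow the ones used in this article.
sample_plan = {
    "id": "plan_0",
    "tasks": [
        {
            "id": "fetch",
            "name": "Fetch records",
            "description": "Download the last 30 days of order records from the API.",
            "success_criterion": "JSON list with fields order_id, total, created_at",
            "depends_on": [],  # root task: no dependencies
            "estimated_time_secs": 120,
            "estimated_tokens": 500,
        },
        {
            "id": "summarize",
            "name": "Summarize records",
            "description": "Compute daily revenue totals from the fetched records.",
            "success_criterion": "",  # left blank on purpose
            "depends_on": ["fetch"],
        },
    ],
}


def missing_fields(task: dict) -> list:
    """List required fields that are absent or empty (an empty depends_on is allowed)."""
    missing = []
    for field in REQUIRED_FIELDS:
        if field not in task:
            missing.append(field)
        elif field != "depends_on" and not task[field]:
            missing.append(field)
    return missing


for task in sample_plan["tasks"]:
    print(task["id"], missing_fields(task))
```

The first task passes every check; the second is reported as missing only its success criterion.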

Efficiency Metrics

Task Count: Fewer is better, but not at the expense of clarity.

def measure_plan_efficiency(plan: dict) -> dict:
    """Quantify plan efficiency: steps, duration, cost."""

    metrics = {
        "task_count": len(plan.get("tasks", [])),
        "estimated_duration_hours": sum(
            t.get("estimated_time_secs", 3600) / 3600
            for t in plan.get("tasks", [])
        ),
        "estimated_cost_usd": sum(
            (t.get("estimated_tokens", 1000) / 1000) * 0.01  # $0.01 per 1k tokens
            for t in plan.get("tasks", [])
        ),
        "critical_path_length": critical_path_length(plan),
    }

    return metrics

def critical_path_length(plan: dict) -> int:
    """Return the longest chain of sequential dependencies."""

    # build_task_graph and dfs_longest_path are helpers assumed to be defined elsewhere.
    graph = build_task_graph(plan)
    longest_path = 0

    for task_id in graph.tasks:
        path_length = dfs_longest_path(graph, task_id)
        longest_path = max(longest_path, path_length)

    return longest_path

def plan_parallelizability(plan: dict) -> float:
    """Return 0.0–1.0: what fraction of tasks can run in parallel?"""

    # Parallelism ratio: 1 - (critical path / total tasks)
    # If critical_path = 5 and total_tasks = 20, the critical fraction is 5/20 = 0.25,
    # meaning up to 75% of tasks can be parallelized, so this returns 0.75.

    critical = critical_path_length(plan)
    total = len(plan.get("tasks", []))

    return 1.0 - critical / total if total > 0 else 0.0
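The code above relies on build_task_graph and dfs_longest_path, which are not shown. As a rough, self-contained sketch, the longest dependency chain can be computed directly from each task's depends_on list with a memoized depth function. The helper names task_depths and execution_waves and the demo plan are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
from collections import defaultdict
from functools import lru_cache


def task_depths(plan: dict) -> dict:
    """Map each task id to the length of the longest dependency chain ending at it."""
    deps = {t["id"]: tuple(t.get("depends_on", [])) for t in plan.get("tasks", [])}

    @lru_cache(maxsize=None)
    def depth(task_id: str) -> int:
        # Dependencies on unknown ids are ignored rather than counted.
        known = [d for d in deps[task_id] if d in deps]
        return 1 + max((depth(d) for d in known), default=0)

    return {task_id: depth(task_id) for task_id in deps}


def execution_waves(plan: dict) -> list:
    """Group task ids by depth; tasks in the same wave never depend on each other."""
    waves = defaultdict(list)
    for task_id, d in task_depths(plan).items():
        waves[d].append(task_id)
    return [waves[d] for d in sorted(waves)]


# Hypothetical diamond-shaped plan: a -> (b, c) -> d
demo_plan = {
    "tasks": [
        {"id": "a", "depends_on": []},
        {"id": "b", "depends_on": ["a"]},
        {"id": "c", "depends_on": ["a"]},
        {"id": "d", "depends_on": ["b", "c"]},
    ]
}

critical = max(task_depths(demo_plan).values(), default=0)
print("critical path length:", critical)
print("waves:", execution_waves(demo_plan))
```

For the diamond plan, the critical path has 3 tasks out of 4, and the middle wave shows that b and c can run side by side.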

Robustness Metrics

Resilience to Failures: Can the plan handle 1–2 task failures?

def measure_plan_robustness(plan: dict) -> dict:
    """Evaluate how the plan handles failures."""

    graph = build_task_graph(plan)

    metrics = {}

    # Metric 1: Single task failure tolerance
    # What fraction of tasks, if they fail, only block themselves (not downstream)?
    self_contained_tasks = 0
    for task_id in graph.tasks:
        dependents = graph.get_dependents(task_id)
        if len(dependents) == 0:
            self_contained_tasks += 1

    metrics["self_contained_task_ratio"] = (
        self_contained_tasks / len(graph.tasks)
        if len(graph.tasks) > 0 else 0
    )

    # Metric 2: Alternative paths
    # Does the plan have fallback paths (e.g., retry logic, alternative tools)?
    fallback_mentions = sum(
        1 for t in plan.get("tasks", [])
        if "fallback" in t.get("description", "").lower()
        or "alternative" in t.get("description", "").lower()
    )

    metrics["tasks_with_fallbacks"] = fallback_mentions

    # Metric 3: Critical task density
    # How many tasks are on the critical path (any failure delays completion)?
    critical_tasks = plan_critical_tasks(plan)
    metrics["critical_task_count"] = len(critical_tasks)
    metrics["critical_task_ratio"] = (
        len(critical_tasks) / len(plan.get("tasks", []))
        if plan.get("tasks", []) else 0
    )

    return metrics

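from typing import List

def plan_critical_tasks(plan: dict) -> List[str]:
    """Return task IDs on the critical path."""

    graph = build_task_graph(plan)
    # critical_path() is assumed to return the task dicts along the longest chain.
    critical_path = graph.critical_path()

    return [t["id"] for t in critical_path]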
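The self-contained ratio above only looks at direct dependents. One possible complement, sketched here under the assumption that tasks carry a depends_on list of task ids, is to count how many downstream tasks a single failure would block transitively. The function name blocked_by_failure and the example plan are hypothetical.

```python
from collections import defaultdict


def blocked_by_failure(plan: dict, failed_id: str) -> set:
    """Return ids of all tasks that directly or transitively depend on failed_id."""
    # Invert depends_on into a "who depends on me" mapping.
    dependents = defaultdict(set)
    for task in plan.get("tasks", []):
        for dep in task.get("depends_on", []):
            dependents[dep].add(task["id"])

    # Iterative depth-first walk over the inverted edges.
    blocked, stack = set(), [failed_id]
    while stack:
        current = stack.pop()
        for nxt in dependents[current]:
            if nxt not in blocked:
                blocked.add(nxt)
                stack.append(nxt)
    return blocked


# Hypothetical plan: a -> (b, c) -> d
example_plan = {
    "tasks": [
        {"id": "a", "depends_on": []},
        {"id": "b", "depends_on": ["a"]},
        {"id": "c", "depends_on": ["a"]},
        {"id": "d", "depends_on": ["b", "c"]},
    ]
}

for task in example_plan["tasks"]:
    print(task["id"], "blocks", sorted(blocked_by_failure(example_plan, task["id"])))
```

A failure of the root task blocks everything downstream, while a failure of the final task blocks nothing else; averaging the blocked counts across tasks gives a rough sense of how far failures spread.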

Clarity and Auditability Metrics

Understandability: Can a human read and understand the plan?

def measure_plan_clarity(plan: dict) -> dict:
    """Evaluate plan readability and structure."""

    metrics = {}
    tasks = plan.get("tasks", [])

    # Metric 1: Average task description length
    # Shorter = less detail; longer = harder to skim
    desc_lengths = [
        len(t.get("description", "").split())
        for t in tasks
    ]

    metrics["avg_description_words"] = (
        sum(desc_lengths) / len(desc_lengths)
        if desc_lengths else 0
    )
    metrics["description_length_ok"] = (
        all(5 <= n <= 50 for n in desc_lengths)
        if desc_lengths else True
    )

    # Metric 2: Success criteria specificity
    # Are success criteria vague ("looks good") or specific ("JSON with fields X, Y")?

    criteria = [t.get("success_criterion", "") for t in tasks]
    vague_criteria = sum(
        1 for c in criteria
        if any(w in c.lower() for w in ["looks good", "seems right", "appropriate", "correct"])
    )

    metrics["specific_criteria_ratio"] = (
        (len(criteria) - vague_criteria) / len(criteria)
        if criteria else 0
    )

    # Metric 3: Dependency graph density
    # Highly interconnected = harder to understand

    total_edges = sum(
        len(t.get("depends_on", []))
        for t in tasks
    )
    # An acyclic dependency graph over n tasks has at most n * (n - 1) / 2 edges
    max_edges = len(tasks) * (len(tasks) - 1) / 2

    metrics["dependency_density"] = (
        total_edges / max_edges if max_edges > 0 else 0
    )

    return metrics
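The vague-criteria check above uses substring matching, so a criterion containing "incorrectly" or "inappropriate" is also counted as vague. One possible refinement, offered as a sketch rather than the article's method, is to match whole words with a regular expression. The name is_vague is hypothetical; the phrase list is the one used above.

```python
import re

# Same phrases as the substring check, but anchored on word boundaries.
VAGUE_PATTERN = re.compile(
    r"\b(?:looks good|seems right|appropriate|correct)\b",
    re.IGNORECASE,
)


def is_vague(criterion: str) -> bool:
    """Return True if the criterion contains one of the vague phrases as whole words."""
    return VAGUE_PATTERN.search(criterion) is not None


examples = [
    "Output looks good",
    "JSON with fields id and status",
    "Flags incorrectly formatted rows",
]
for text in examples:
    print(f"{text!r}: vague={is_vague(text)}")
```

Only the first example is flagged; the third would have been flagged by a plain substring test because "incorrectly" contains "correct".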

Benchmarking and Comparative Metrics

Compare multiple plans:

def compare_plans(plans: List[dict], goal: str) -> dict:
    """Compare multiple plans across all dimensions."""

    results = {}

    for i, plan in enumerate(plans):
        plan_id = plan.get("id", f"plan_{i}")

        # Note: compute_composite_score recomputes each metric, so goal coverage
        # (an LLM call) is evaluated twice per plan here.
        results[plan_id] = {
            "completeness": measure_goal_coverage(plan, goal),
            "efficiency": measure_plan_efficiency(plan),
            "robustness": measure_plan_robustness(plan),
            "clarity": measure_plan_clarity(plan),
            "score": compute_composite_score(plan, goal)
        }

    # Recommend best plan (None if no plans were supplied)
    best_plan = max(results.keys(), key=lambda p: results[p]["score"], default=None)
    results["recommendation"] = best_plan

    return results

from typing import Optional

def compute_composite_score(plan: dict, goal: str, weights: Optional[dict] = None) -> float:
    """Combine all metrics into a single score (0.0–1.0)."""

    if weights is None:
        weights = {
            "completeness": 0.4,
            "efficiency": 0.2,
            "robustness": 0.3,
            "clarity": 0.1
        }

    completeness = measure_goal_coverage(plan, goal)

    efficiency_metrics = measure_plan_efficiency(plan)
    # Normalize to 0.0–1.0: fewer tasks = higher score (10 tasks scores 0.5)
    efficiency = 1.0 / (1.0 + efficiency_metrics["task_count"] / 10)

    robustness_metrics = measure_plan_robustness(plan)
    robustness = robustness_metrics["self_contained_task_ratio"]

    clarity_metrics = measure_plan_clarity(plan)
    clarity = clarity_metrics["specific_criteria_ratio"]

    # The result stays within 0.0–1.0 only if the weights sum to 1.0
    return (
        weights["completeness"] * completeness +
        weights["efficiency"] * efficiency +
        weights["robustness"] * robustness +
        weights["clarity"] * clarity
    )

Key Takeaways

  • Evaluate plans across four dimensions: completeness, efficiency, robustness, clarity.
  • Completeness: does execution of the plan achieve the goal? (0.0–1.0 scale)
  • Efficiency: fewer tasks, shorter duration, lower cost; measure critical path length.
  • Robustness: how many tasks have fallbacks? How many are on the critical path?
  • Clarity: are descriptions 5–50 words? Are success criteria specific, not vague?
  • Composite scoring lets you compare multiple plans and recommend the best.

Frequently Asked Questions

How do I weight different metrics when comparing plans?

Depends on context. For time-sensitive tasks, weight efficiency (critical path) heavily. For mission-critical tasks, weight robustness. For user-facing tasks, weight clarity. Set weights based on constraints, not gut feeling.
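As an illustration of context-dependent weighting, the sketch below defines a few weight profiles and applies them to the same per-dimension scores. The "default" profile copies the weights used in this article's composite score; the other profiles and the example scores are made-up numbers, not recommendations.

```python
DIMENSIONS = ("completeness", "efficiency", "robustness", "clarity")

# Illustrative profiles; only "default" comes from the article's code.
WEIGHT_PROFILES = {
    "default": {"completeness": 0.4, "efficiency": 0.2, "robustness": 0.3, "clarity": 0.1},
    "time_sensitive": {"completeness": 0.3, "efficiency": 0.4, "robustness": 0.2, "clarity": 0.1},
    "mission_critical": {"completeness": 0.3, "efficiency": 0.1, "robustness": 0.5, "clarity": 0.1},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores; weights need not sum to 1."""
    total = sum(weights[d] for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total


# Hypothetical per-dimension scores for one plan, each in 0.0–1.0.
scores = {"completeness": 1.0, "efficiency": 0.5, "robustness": 0.5, "clarity": 1.0}

for name, weights in WEIGHT_PROFILES.items():
    print(name, round(weighted_score(scores, weights), 3))
```

Dividing by the weight total keeps the result in the 0.0–1.0 range even if a profile's weights do not add up to exactly 1.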

Should I measure plan quality before or after execution?

Both. Before execution: predict quality from the plan structure. After execution: measure actual quality (time, cost, failures). If prediction and reality diverge, refine your prediction model.
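A minimal sketch of the before/after comparison: compute the relative error between predicted and observed values for each metric. The metric names and numbers here are hypothetical.

```python
def estimate_errors(predicted: dict, actual: dict) -> dict:
    """Relative error (actual - predicted) / predicted for metrics present in both dicts."""
    return {
        name: (actual[name] - predicted[name]) / predicted[name]
        for name in predicted
        if name in actual and predicted[name]  # skip missing or zero predictions
    }


# Hypothetical numbers: the plan was estimated at 4 hours but took 8.
predicted = {"duration_hours": 4.0, "cost_usd": 2.0}
actual = {"duration_hours": 8.0, "cost_usd": 2.5}

errors = estimate_errors(predicted, actual)
for name, err in errors.items():
    print(f"{name}: {err:+.0%}")
```

A positive value means the plan overran its estimate; consistently large errors on one metric suggest that the corresponding estimator needs recalibration.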

How do I handle plans I can't numerically compare?

Use qualitative evaluation: have a human expert review the plan and rate it 1–5 for completeness, efficiency, robustness, clarity. Average the ratings. For research, compare with a baseline plan.
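Averaging reviewer ratings can be as simple as the sketch below, which assumes each reviewer supplies a 1–5 rating for all four dimensions. The reviewer data is invented for illustration.

```python
from statistics import mean

DIMENSIONS = ("completeness", "efficiency", "robustness", "clarity")


def average_ratings(reviews: list) -> dict:
    """Average 1–5 ratings per dimension, plus an overall mean across dimensions."""
    per_dim = {d: mean(r[d] for r in reviews) for d in DIMENSIONS}
    per_dim["overall"] = mean(per_dim[d] for d in DIMENSIONS)
    return per_dim


# Hypothetical ratings from two reviewers.
reviews = [
    {"completeness": 5, "efficiency": 3, "robustness": 4, "clarity": 4},
    {"completeness": 4, "efficiency": 3, "robustness": 2, "clarity": 4},
]

summary = average_ratings(reviews)
print(summary)
```

Keeping the per-dimension averages alongside the overall figure makes it easier to see where reviewers disagree with an automated score.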

Can I optimize a plan after generating it?

Yes. Generate 3–5 candidate plans, evaluate all of them, then refine the best one. Or improve iteratively: run the plan, collect metrics, then regenerate with feedback such as "Your plan took 8 hours against an estimate of 4. Next time, add more parallelism."
