Reflection and Self-Correction: Autonomous Agent Improvement

Reflection is the ability of an agent to evaluate its own work, identify mistakes, and correct them without human guidance. An agent with reflection looks at its output, asks "Does this satisfy the criteria?" and, if not, revises. This single capability can reduce error rates by 40–60% and catch whole classes of failures: false claims (hallucinations), incomplete answers, and logical contradictions.

The Reflection Loop

A reflection loop has three steps:

  1. Generate: The agent produces an output (answer, code, plan, etc.).
  2. Evaluate: The agent or an automated checker assesses the output against success criteria.
  3. Revise: If the output fails, the agent generates a revised version and loops back to Evaluate.

The loop continues until the output passes evaluation or the maximum number of revisions is reached, at which point the task is escalated to a human.

Here's how this looks in code:

class ReflectiveAgent:
    def __init__(self, llm_client, max_revisions=3):
        self.llm = llm_client
        self.max_revisions = max_revisions
        self.last_output = None
        self.last_eval_feedback = None

    def generate_and_refine(self, task: dict, max_attempts=None) -> dict:
        """Generate output, evaluate, revise until passing or max_attempts."""

        # Default to the limit configured in __init__
        if max_attempts is None:
            max_attempts = self.max_revisions

        success_criterion = task.get("success_criterion", "")
        attempt = 0

        while attempt < max_attempts:
            # STEP 1: GENERATE
            if attempt == 0:
                # First attempt: solve the task
                response = self.llm.completion(task["prompt"])
            else:
                # Revision attempts: use reflection prompt, restating the
                # original task so the model keeps its context
                response = self.llm.completion(
                    f"""Original task:
{task['prompt']}

Your previous output failed this criterion:
{self.last_eval_feedback}

Here was your previous output:
{self.last_output}

Revise your output to address the failure. Focus on the specific issue."""
                )

            self.last_output = response

            # STEP 2: EVALUATE
            eval_result = self.evaluate(response, success_criterion)
            self.last_eval_feedback = eval_result["feedback"]

            if eval_result["passed"]:
                return {
                    "status": "success",
                    "output": response,
                    "attempts": attempt + 1,
                    "criterion": success_criterion
                }

            # STEP 3: REVISE (loop)
            attempt += 1

        return {
            "status": "max_attempts_exceeded",
            "last_output": self.last_output,
            "last_feedback": self.last_eval_feedback,
            "attempts": attempt
        }

    def evaluate(self, output: str, criterion: str) -> dict:
        """Check if output meets the success criterion."""

        if not criterion:
            return {"passed": True, "feedback": "No criterion specified."}

        # Use LLM as evaluator
        eval_prompt = f"""
Evaluate this output against the success criterion.

CRITERION: {criterion}

OUTPUT: {output}

Does the output satisfy the criterion? Reply with:
passed: [yes/no]
reason: [1-2 sentences explaining why/why not]
"""

        eval_response = self.llm.completion(eval_prompt)

        # Parse the verdict from the "passed:" line only; searching the whole
        # reply for "yes" would also match words inside the reason text
        lines = eval_response.split("\n")
        passed = False
        for line in lines:
            if line.strip().lower().startswith("passed:"):
                passed = "yes" in line.lower()
                break
        reason = "\n".join([l for l in lines if l.strip()])

        return {
            "passed": passed,
            "feedback": reason
        }

This pattern is called "self-refinement." A 2024 study showed that allowing GPT-4 to reflect on its code output reduced bug rates from 18% to 7% in one revision cycle.
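To exercise the loop without network calls, it can help to restate it as a plain function that takes the generator and evaluator as arguments. The sketch below is an illustration under assumptions, not part of the agent above: `reflect`, `fake_generate`, and `fake_evaluate` are hypothetical names, and the scripted "model" only adds a docstring once it sees feedback.

```python
from typing import Callable


def reflect(generate: Callable[[str], str],
            evaluate: Callable[[str], dict],
            prompt: str,
            max_attempts: int = 3) -> dict:
    """Generate, evaluate, and revise until the evaluator passes the output."""
    feedback = ""
    output = ""
    for attempt in range(1, max_attempts + 1):
        # First pass uses the bare prompt; later passes append the feedback.
        if attempt == 1:
            request = prompt
        else:
            request = (
                f"{prompt}\n\nPrevious output:\n{output}"
                f"\n\nFix this issue:\n{feedback}"
            )
        output = generate(request)
        verdict = evaluate(output)
        if verdict["passed"]:
            return {"status": "success", "output": output, "attempts": attempt}
        feedback = verdict["feedback"]
    return {"status": "max_attempts_exceeded", "output": output,
            "attempts": max_attempts}


# Scripted stand-in for an LLM (hypothetical): improves only after feedback.
def fake_generate(request: str) -> str:
    if "Fix this issue" in request:
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b):\n    return a + b"


# Scripted stand-in for an evaluator (hypothetical): requires a docstring.
def fake_evaluate(output: str) -> dict:
    has_doc = '"""' in output
    return {"passed": has_doc,
            "feedback": "" if has_doc else "Missing docstring."}


result = reflect(fake_generate, fake_evaluate,
                 "Write add(a, b) with a docstring.")
print(result["status"], result["attempts"])  # success 2
```

Passing the generator and evaluator in as callables keeps the control flow testable on its own; a real client can be swapped in later without touching the loop.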

Different Reflection Strategies

Not all reflection is the same. Choose a strategy based on your task:

1. Criterion-based reflection: Evaluate against explicit success criteria (covered above). Best for well-defined outputs.

def criterion_reflection(output, criterion):
    evaluator_prompt = f"Does this satisfy '{criterion}'? Output: {output}"
    return llm.completion(evaluator_prompt)

2. Semantic verification: Check for logical consistency and hallucinations.

def semantic_reflection(output):
    verifier_prompt = f"""
Check for issues:
1. Are any factual claims unsupported?
2. Do conclusions follow from the evidence?
3. Are there logical contradictions?

Output: {output}
"""
    return llm.completion(verifier_prompt)

3. Comparative reflection: Regenerate multiple versions and pick the best one.

def comparative_reflection(task, num_versions=3):
    """Generate multiple outputs, evaluate each, return the best."""
    outputs = [llm.completion(task["prompt"]) for _ in range(num_versions)]

    scores = []
    for output in outputs:
        # evaluate() returns a dict, which max() cannot compare directly;
        # reduce it to a number (1 = passed, 0 = failed)
        result = evaluate(output, task["success_criterion"])
        scores.append(1 if result["passed"] else 0)

    # Ties resolve to the earliest output
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return outputs[best_idx]

4. Trace-based reflection: Step through the agent's reasoning and check each step.

def trace_reflection(thought_chain: list, output: str):
    """Verify each reasoning step leads logically to the next."""
    for i in range(len(thought_chain) - 1):
        current = thought_chain[i]
        next_step = thought_chain[i + 1]

        coherence_check = llm.completion(
            f"Does '{next_step}' logically follow from '{current}'? yes/no"
        )

        # Match the leading verdict only; a substring test for "no" would
        # also fire on words such as "not" or "know"
        if coherence_check.strip().lower().startswith("no"):
            return False  # Reasoning chain is broken

    return True

Combining Reflection with Plan-and-Execute

In a plan-and-execute agent, you can add reflection after each task execution:

def execute_with_reflection(task: dict, inputs: dict) -> dict:
    """Execute a task and reflect on the output."""

    # Execute
    output = executor.run_task(task, inputs)

    # Evaluate
    success_criterion = task.get("success_criterion")
    if not success_criterion:
        return {"status": "done", "output": output}

    eval_result = evaluate(output, success_criterion)

    if eval_result["passed"]:
        return {"status": "done", "output": output}

    # Revise (up to 2 times)
    for revision in range(2):
        revised_output = llm.completion(
            f"""Your output failed: {eval_result['feedback']}

Previous output: {output}

Generate a revised output that addresses the failure."""
        )

        eval_result = evaluate(revised_output, success_criterion)
        if eval_result["passed"]:
            return {"status": "revised", "output": revised_output, "attempts": revision + 1}

        # Carry the latest revision forward so the next prompt and the
        # escalation payload match the latest feedback
        output = revised_output

    # If still failing, escalate
    return {"status": "escalated", "output": output, "feedback": eval_result["feedback"]}

Cost and Latency Trade-offs

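Reflection adds latency and cost. Each pass through the loop costs at least one extra LLM call, and two when the evaluator is itself an LLM (one call to evaluate, one to revise). In production, you need to weigh: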

  • High-stakes tasks (legal documents, medical advice): Reflect 2–3 times; cost/latency is acceptable.
  • Real-time tasks (chat, search): Skip reflection; validate once in post-processing.
  • Batch tasks (reports, analysis): Reflect 1–2 times; you have time.
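One way to make these trade-offs explicit is a small lookup that maps a task category to its revision limit. The sketch below is illustrative only: the category names, the `REFLECTION_BUDGET` table, and `max_revisions_for` are hypothetical, and the numbers are the upper ends of the ranges listed above.

```python
# Hypothetical task categories mapped to a maximum number of revision passes.
REFLECTION_BUDGET = {
    "high_stakes": 3,  # legal documents, medical advice
    "batch": 2,        # reports, analysis
    "real_time": 0,    # chat, search: validate once in post-processing
}


def max_revisions_for(task_type: str, default: int = 1) -> int:
    """Return how many revision passes a task of this type may use."""
    return REFLECTION_BUDGET.get(task_type, default)


print(max_revisions_for("high_stakes"))  # 3
print(max_revisions_for("real_time"))    # 0
print(max_revisions_for("unlabeled"))    # 1 (falls back to the default)
```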

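One optimization is to parallelize: generate 3 versions in parallel, evaluate all 3 in parallel, and pick the best. Wall-clock latency then stays close to one generation plus one evaluation, whereas serial reflection pays for each revision round in turn. Total cost does not drop, since all 3 versions are still generated and evaluated.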
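The parallel variant can be sketched with a thread pool from the standard library, since LLM calls are I/O-bound. This is a minimal illustration under assumptions: `parallel_best_of` is a hypothetical helper, `generate` and `score` stand in for real LLM and evaluator calls, and the demo uses scripted drafts with length as a placeholder score.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_best_of(generate: Callable[[str], str],
                     score: Callable[[str], float],
                     prompt: str,
                     num_versions: int = 3) -> str:
    """Generate and score candidates concurrently; return the top scorer."""
    with ThreadPoolExecutor(max_workers=num_versions) as pool:
        # Each map() fans out its calls across the worker threads.
        candidates = list(pool.map(generate, [prompt] * num_versions))
        scores = list(pool.map(score, candidates))
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]


# Scripted stand-in for an LLM (hypothetical): returns a different draft
# on each call; the lock keeps the shared iterator safe across threads.
_drafts = itertools.cycle(["short", "a longer draft", "mid draft"])
_lock = threading.Lock()


def fake_generate(prompt: str) -> str:
    with _lock:
        return next(_drafts)


# Placeholder score: longer is better. A real scorer would call an evaluator.
best = parallel_best_of(fake_generate, len, "Summarize the report.")
print(best)  # a longer draft
```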

Key Takeaways

  • Reflection loops: generate, evaluate, revise until passing or max-attempts reached.
  • LLM-as-evaluator works well for semantic checks; use hard rules for structural checks.
  • Comparative reflection (multiple outputs) is simple and effective.
  • Add reflection to high-stakes tasks; skip for real-time, latency-sensitive work.
  • Reflection reduces errors 40–60% but adds latency; balance is task-specific.
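As an illustration of the "hard rules" point, a structural check can be written as ordinary code and return the same `passed`/`feedback` shape as the LLM evaluator. The sketch below assumes the output is expected to be a JSON object; `structural_check` and the key names are hypothetical.

```python
import json


def structural_check(output: str, required_keys: list) -> dict:
    """Deterministic check: output must be a JSON object with required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"passed": False, "feedback": f"Invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"passed": False, "feedback": "Expected a JSON object."}
    missing = [key for key in required_keys if key not in data]
    if missing:
        return {"passed": False,
                "feedback": f"Missing keys: {', '.join(missing)}"}
    return {"passed": True, "feedback": "Structure OK."}


print(structural_check('{"title": "Q3 report"}', ["title", "summary"]))
# {'passed': False, 'feedback': 'Missing keys: summary'}
```

Running a check like this before the LLM evaluator avoids spending a model call on outputs that are malformed in ways code can detect.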

Frequently Asked Questions

How many reflection loops should I allow?

Typically 1–3. A 2024 paper found diminishing returns after 3 revisions, with accuracy plateauing or declining. Allow 3 for mission-critical tasks and 1–2 for standard tasks.

Can I reflect on reflection?

Yes, but rarely needed. Ask the LLM to evaluate its own evaluation: "Did your previous evaluation correctly identify issues?" This is meta-reflection and is most useful if you suspect the evaluator itself is unreliable.

What if the LLM keeps producing the same wrong output?

The agent is stuck. If the model produced output A and the evaluator rejected it, the same prompt is likely to yield A again. Break the loop with a different approach: rewrite the prompt entirely or escalate to a human.
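A stuck loop can be detected by comparing consecutive outputs before spending another revision. The following is a minimal sketch using `difflib` from the standard library; `is_stuck` is a hypothetical helper and the 0.95 similarity threshold is an assumption to tune per task.

```python
from difflib import SequenceMatcher


def is_stuck(previous: str, current: str, threshold: float = 0.95) -> bool:
    """Return True when two consecutive outputs are nearly identical."""
    # Normalize whitespace and case so trivial edits don't mask a repeat.
    a = " ".join(previous.lower().split())
    b = " ".join(current.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(is_stuck("The answer is 42.", "the answer  is 42."))  # True
print(is_stuck("The answer is 42.",
               "Let me recompute: 6 * 7 = 42, so the answer is 42."))  # False
```

When the check returns True, the loop can switch prompts or escalate instead of requesting another revision.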

Should I show reflections to the user?

Optionally. Showing "Attempt 1 failed because..., Attempt 2 succeeds because..." is pedagogically valuable and increases trust. It adds 2–3 sentences to the output.
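If attempts are shown, the summary can be rendered from the evaluation results the loop already collects. The sketch below assumes a list of dicts with `passed` and `feedback` keys, one per attempt; `format_reflection_summary` and the sample history are hypothetical.

```python
def format_reflection_summary(history: list) -> str:
    """Render one line per attempt from {'passed', 'feedback'} dicts."""
    lines = []
    for number, entry in enumerate(history, start=1):
        verdict = "succeeded" if entry["passed"] else "failed"
        lines.append(f"Attempt {number} {verdict}: {entry['feedback']}")
    return "\n".join(lines)


# Made-up history for illustration only.
history = [
    {"passed": False, "feedback": "The summary omitted the Q3 figures."},
    {"passed": True, "feedback": "All required figures are present."},
]
print(format_reflection_summary(history))
# Attempt 1 failed: The summary omitted the Q3 figures.
# Attempt 2 succeeded: All required figures are present.
```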
