Test-Driven Agents: Validating Code Changes Autonomously
The most dangerous moment in an agent's life is right after it modifies code. Did it break something? Did it introduce a bug? The best defense is a test-driven loop: after each edit, the agent runs tests, parses results, and if tests fail, it adapts and retries. This feedback loop is what separates agents that produce code humans must review carefully from agents that produce code that is already validated.
This article covers how to build test-driven agent loops that catch errors early, drive correctness, and give humans confidence in agent output.
The Agent Test Loop: 4-Step Cycle
A robust agent operates in this cycle:
1. Plan: understand the task and existing tests
2. Modify: edit code (1–5 changes)
3. Validate: run tests immediately
4. Adapt or Complete: if tests pass, done; if fail, loop back to step 2
The loop terminates when tests pass or after N retries (default 3–5). This approach is inspired by test-driven development (TDD) but run at agent speed.
import json

def test_driven_agent_loop(prompt: str,
                           max_iterations: int = 5,
                           required_tests_pass: bool = True) -> dict:
    """
    Main agent loop: code → test → adapt → repeat.

    `llm`, `edit_file_safe`, and `run_command_safe` are assumed to be defined
    elsewhere; `run_tests_with_parsing` is shown later in this article.
    """
    tools = [
        {"name": "read_file", "description": "Read a file"},
        {"name": "edit_file", "description": "Edit a file"},
        {"name": "run_tests", "description": "Run tests and return results"},
        {"name": "run_command", "description": "Run any command"}
    ]
    system_prompt = """You are a code-editing agent. Your goal is to solve the task.
After each edit, run tests to validate your changes. If tests fail, analyze the failure
and adapt your approach. Do not make random edits; reason about the failure."""
    context = [{"role": "user", "content": prompt}]
    iteration = 0
    test_results = None
    while iteration < max_iterations:
        iteration += 1
        # Step 1: Ask agent what to do next
        response = llm.chat(context, tools=tools, system=system_prompt)
        # Step 2: Agent says it's done; count that as success only if the
        # last test run passed (unless required_tests_pass is False)
        if response.stop_reason == "end_turn":
            tests_ok = bool(test_results and test_results["success"])
            return {
                "success": tests_ok or not required_tests_pass,
                "message": response.text,
                "iterations": iteration,
                "final_tests": test_results
            }
        # Record the assistant turn once, before its tool results
        context.append({"role": "assistant", "content": response.text})
        # Step 3: Execute tool calls
        for tool_call in response.tool_use:
            tool_name, args = tool_call["name"], tool_call["input"]
            if tool_name == "edit_file":
                result = edit_file_safe(args["path"], args["old"], args["new"])
            elif tool_name == "run_tests":
                result = run_tests_with_parsing(args.get("test_path", "tests/"))
                test_results = result  # Remember for final report
                # If tests fail, inform agent clearly ("parsed" may be None)
                if not result["success"] and result.get("parsed"):
                    result["advisory"] = (
                        f"Tests failed: {result['parsed']['failed']} failures, "
                        f"{result['parsed']['passed']} passes. "
                        "Review the failures listed in this result and fix the issue."
                    )
            elif tool_name == "run_command":
                result = run_command_safe(args["command"], timeout_seconds=15)
            elif tool_name == "read_file":
                with open(args["path"]) as f:
                    result = {"content": f.read()}
            else:
                result = {"error": f"Unknown tool: {tool_name}"}
            # Add result to context (agent sees feedback immediately)
            context.append({
                "role": "user",
                "content": f"Tool '{tool_name}' result:\n{json.dumps(result, indent=2)[:1000]}"
            })
    # Max iterations reached
    return {
        "success": False,
        "message": f"Max iterations ({max_iterations}) reached without passing all tests",
        "iterations": iteration,
        "final_tests": test_results
    }
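The loop calls edit_file_safe and run_command_safe without defining them. The following is a minimal sketch of what they might look like, not the article's actual helpers: the return keys (success, exit_code, stdout, stderr) are inferred from how the loop and the test runner read the results, and "safe" here only means no shell, a timeout, and a single unambiguous replacement. It is not a sandbox.

```python
import shlex
import subprocess
from pathlib import Path


def run_command_safe(command: str, timeout_seconds: int = 15) -> dict:
    """Run a command without a shell and capture its output (hypothetical helper)."""
    try:
        proc = subprocess.run(
            shlex.split(command),  # argument list, so the string is never shell-interpreted
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "exit_code": None, "stdout": "",
                "stderr": f"Timed out after {timeout_seconds}s"}
    except FileNotFoundError as exc:
        # The executable itself does not exist
        return {"success": False, "exit_code": None, "stdout": "", "stderr": str(exc)}
    return {"success": proc.returncode == 0, "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}


def edit_file_safe(path: str, old: str, new: str) -> dict:
    """Replace exactly one occurrence of `old` with `new` (hypothetical helper)."""
    file_path = Path(path)
    if not file_path.is_file():
        return {"success": False, "error": f"File not found: {path}"}
    if not old:
        return {"success": False, "error": "The old text must not be empty"}
    text = file_path.read_text()
    count = text.count(old)
    if count != 1:
        # Refuse ambiguous or missing matches instead of guessing
        return {"success": False,
                "error": f"Expected exactly 1 match for the old text, found {count}"}
    file_path.write_text(text.replace(old, new, 1))
    return {"success": True, "path": path}
```

Rejecting edits whose old text matches zero or several places gives the agent a clear error to react to, instead of a silent change in the wrong spot.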
Step 1: Agent Plans Based on Tests
Before modifying code, the agent should understand existing tests. This guides its edits:
import glob
import json

def agent_understand_tests(repo_path: str, task: str) -> str:
    """Build a prompt that makes the agent read the test suite before editing."""
    test_files = sorted(glob.glob(f"{repo_path}/**/test_*.py", recursive=True))
    test_contents = {}
    for test_file in test_files[:5]:  # Limit to first 5 test files to bound prompt size
        try:
            with open(test_file) as f:
                test_contents[test_file] = f.read()
        except OSError:
            continue  # Skip unreadable files rather than abort planning
    prompt = f"""
You need to complete this task: "{task}"
Here are the existing tests that your code must pass:
{json.dumps(test_contents, indent=2)}
Please:
1. Understand what the tests expect
2. Identify which tests are relevant to your task
3. Plan your edits to make the tests pass
4. Execute the plan
Do not make edits until you've analyzed the tests.
"""
    return prompt
Step 2: Run Tests After Each Edit
After modifying code, immediately run tests. Parse the results to give the agent actionable feedback:
import json
import os

def run_tests_with_parsing(test_path: str = "tests/") -> dict:
    """Run tests and return structured pass/fail info.

    The --json-report flags require the pytest-json-report plugin.
    """
    report_file = "report.json"
    if os.path.exists(report_file):
        os.remove(report_file)  # Never read a stale report from an earlier run
    result = run_command_safe(
        f"pytest {test_path} -v --tb=short --json-report --json-report-file={report_file}",
        timeout_seconds=20
    )
    # pytest exits non-zero whenever a test fails, so a failed command is the
    # normal case here: try to read the report before falling back to raw output
    try:
        with open(report_file) as f:
            report = json.load(f)
    except (OSError, json.JSONDecodeError):
        # No usable report: pytest crashed, timed out, or the plugin is missing
        return {
            "success": False,
            "output": result["stdout"] + result["stderr"],
            "parsed": None
        }
    tests = report.get("tests", [])
    passed = len([t for t in tests if t["outcome"] == "passed"])
    failed = len([t for t in tests if t["outcome"] == "failed"])
    failures = []
    for test in tests:
        if test["outcome"] == "failed":
            failures.append({
                "test": test["nodeid"],
                "error": str(test.get("call", {}).get("longrepr", "Unknown error"))[:300]
            })
    return {
        # Also require a clean exit so collection errors are not reported as success
        "success": failed == 0 and result["success"],
        "passed": passed,
        "failed": failed,
        "parsed": {
            "passed": passed,
            "failed": failed,
            "failures": failures
        }
    }
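The parsing step is easier to unit test when it is separated from running pytest. The sketch below assumes the report is a dict with a tests list whose entries carry nodeid, outcome, and call.longrepr, which are the fields the function above reads. The name summarize_report and the sample report are hypothetical.

```python
def summarize_report(report: dict, max_error_chars: int = 300) -> dict:
    """Reduce a parsed JSON test report to counts plus trimmed failure details."""
    tests = report.get("tests", [])
    failures = [
        {
            "test": t["nodeid"],
            # Trim long tracebacks so the agent's context is not flooded
            "error": str(t.get("call", {}).get("longrepr", "Unknown error"))[:max_error_chars],
        }
        for t in tests
        if t["outcome"] == "failed"
    ]
    passed = sum(1 for t in tests if t["outcome"] == "passed")
    return {"passed": passed, "failed": len(failures), "failures": failures}


# Hand-written sample in the assumed report shape
sample_report = {
    "tests": [
        {"nodeid": "tests/test_auth.py::test_login", "outcome": "passed"},
        {
            "nodeid": "tests/test_auth.py::test_password_validation",
            "outcome": "failed",
            "call": {"longrepr": "AssertionError: assert False == True"},
        },
    ]
}
summary = summarize_report(sample_report)
print(summary["passed"], summary["failed"])  # → 1 1
print(summary["failures"][0]["test"])  # → tests/test_auth.py::test_password_validation
```

Keeping this function pure means the feedback format can be tuned and tested without invoking pytest at all.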
Step 3: Agent Analyzes Failures and Adapts
When tests fail, the agent must interpret the error and revise:
def agent_failure_analysis_prompt(test_failure: dict, codebase_context: str) -> str:
    """Generate a prompt for the agent to analyze and fix a test failure."""
    failure_summary = f"""
Test failed: {test_failure['test']}
Error: {test_failure['error']}
Relevant code:
{codebase_context}
The test is most likely failing because the code under test is incorrect. Common causes:
1. The function logic is wrong: fix the algorithm
2. The function signature is wrong: update the signature
3. Edge cases are not handled: add defensive code
4. Dependencies are missing: add imports
Please:
1. Read the failing test code
2. Read the code being tested
3. Identify the root cause
4. Make ONE targeted fix
5. Re-run the test
Focus on fixing ONE issue at a time, not refactoring.
"""
    return failure_summary
Here's a real example of an agent fixing a test:
Agent observes:
FAILED tests/test_auth.py::test_password_validation
Error: AssertionError: assert False == True
Assertion line: assert validate_password("short") == True
Agent reads the code:
def validate_password(pwd):
    return len(pwd) > 10  # Requires more than 10 chars
Agent analysis:
"The test expects 'short' to pass validation, but it's only 5 chars.
The function requires more than 10. Either the test or the function is wrong.
Since I was asked to make the function validate passwords, I should
assume the test is correct (5-char passwords should pass). Let me
reduce the minimum length to 5."
Agent edits:
OLD: return len(pwd) > 10
NEW: return len(pwd) >= 5
Agent re-runs: PASSED
Step 4: Stopping Criteria
The agent stops when one of these is true:
- All tests pass: task complete.
- Max iterations reached: escalate to a human.
- The same test fails 2+ times in a row: likely requires human insight.
- The agent makes no progress (failure count not decreasing): escalate.
def should_continue_loop(current_result: dict,
                         previous_results: list,
                         max_iterations: int = 5) -> tuple[bool, str]:
    """Decide if agent should continue looping."""
    def parsed(result: dict) -> dict:
        # "parsed" is None when no report could be read
        return result.get("parsed") or {}

    # Success: all tests pass
    if current_result.get("success"):
        return False, "All tests passed"
    # Max iterations
    if len(previous_results) >= max_iterations:
        return False, f"Reached max iterations ({max_iterations})"
    # Same failure repeated: the current failures match the previous run exactly
    current_failures = parsed(current_result).get("failures")
    if previous_results and current_failures:
        if parsed(previous_results[-1]).get("failures") == current_failures:
            return False, "Same test failure repeated; needs human review"
    # No progress: failure count flat or increasing over the last three runs
    history = previous_results[-2:] + [current_result]
    if len(history) == 3:
        trend = [parsed(r).get("failed", 0) for r in history]
        if trend == sorted(trend):  # Non-decreasing, so no improvement
            return False, "Test failures not decreasing; escalate"
    return True, "Continue loop"
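When the loop stops without passing, the stop reason is more useful to a reviewer if it arrives with the run history. The sketch below packages both into a short text report. It assumes each entry in the results list has the same shape as the test runner's return value; the function name build_escalation_report is hypothetical.

```python
def build_escalation_report(reason: str, results: list[dict]) -> str:
    """Summarize a stalled run for a human reviewer (hypothetical helper)."""
    lines = [f"Agent stopped: {reason}", f"Test runs: {len(results)}"]
    # "parsed" may be None when a run produced no readable report
    counts = [(r.get("parsed") or {}).get("failed") for r in results]
    lines.append(
        "Failures per run: " + ", ".join("?" if c is None else str(c) for c in counts)
    )
    last = (results[-1].get("parsed") or {}) if results else {}
    for failure in last.get("failures", []):
        # One line per test that is still failing, with a trimmed error
        lines.append(f"- {failure['test']}: {failure['error'][:120]}")
    return "\n".join(lines)


history = [
    {"success": False, "parsed": {"passed": 3, "failed": 2, "failures": []}},
    {"success": False, "parsed": None},
    {"success": False, "parsed": {"passed": 4, "failed": 1, "failures": [
        {"test": "tests/test_auth.py::test_password_validation",
         "error": "AssertionError: assert False == True"},
    ]}},
]
print(build_escalation_report("Reached max iterations (3)", history))
```

The "?" entries mark runs where no report was available, which is itself a useful signal that something other than a test assertion went wrong.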
Success Metrics
Track agent success rates to measure effectiveness:
def measure_agent_success(completed_tasks: list) -> dict:
    """Analyze agent success on a batch of tasks."""
    total = len(completed_tasks)
    iterations_to_success = [
        t["iterations"] for t in completed_tasks if t["success"]
    ]
    succeeded = len(iterations_to_success)
    # Distinct failure reasons; tasks without one are grouped as "unknown"
    failure_reasons = sorted({
        t.get("failure_reason") or "unknown"
        for t in completed_tasks if not t["success"]
    })
    return {
        # Guard both divisions: the batch may be empty or contain no successes
        "success_rate": succeeded / total if total else 0.0,
        "avg_iterations": sum(iterations_to_success) / succeeded if succeeded else None,
        "failures_requiring_human": total - succeeded,
        "improvement_potential": f"Focus on {failure_reasons}"
    }
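A list of distinct failure reasons does not show which one matters most. The sketch below counts reasons so the most frequent comes first. It assumes failed tasks carry the same failure_reason key used above; the function name rank_failure_reasons and the sample reasons are hypothetical.

```python
from collections import Counter


def rank_failure_reasons(completed_tasks: list[dict]) -> list[tuple[str, int]]:
    """Count failure reasons across failed tasks, most frequent first."""
    reasons = Counter(
        t.get("failure_reason") or "unknown"  # group missing reasons together
        for t in completed_tasks
        if not t["success"]
    )
    return reasons.most_common()


tasks = [
    {"success": True, "iterations": 2},
    {"success": False, "iterations": 5, "failure_reason": "max_iterations"},
    {"success": False, "iterations": 3, "failure_reason": "same_failure_repeated"},
    {"success": False, "iterations": 5, "failure_reason": "max_iterations"},
    {"success": False, "iterations": 1},
]
print(rank_failure_reasons(tasks))
# → [('max_iterations', 2), ('same_failure_repeated', 1), ('unknown', 1)]
```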
Production agents typically achieve:
- 70–80% success rate on first pass (no human intervention).
- Average 2–3 iterations to pass tests.
- 95%+ success rate with one human review/fix.
Key Takeaways
- Test-driven loops drive agent correctness: edit → test → adapt → repeat.
- The agent must understand the tests before modifying code, so that its edits target what the tests actually expect and not guessed requirements.
- Parse test output to give the agent actionable feedback, not raw error text.
- Agent should analyze failures and make targeted fixes (not random retries).
- Stop looping when tests pass, max iterations reached, or no progress is made.
Frequently Asked Questions
What if no tests exist?
Have the agent write tests first, then implement code to pass them. Or provide the agent with example test cases as part of the prompt. Tests are essential for agent validation.
Can agents get stuck in infinite loops (constantly failing tests)?
Yes. Use stopping criteria (max iterations, the same failure repeating, no downward trend in failures). If the agent gets stuck, escalate to a human with context (what it tried, why it failed).
How do I debug an agent that makes bad edits?
Log every iteration: what the agent changed, what the test output was, what it concluded. Review these logs to understand the reasoning. Agents often misinterpret error messages; when the logs show that, improve the error message format.
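One lightweight way to keep such a log is a JSON Lines file with one record per iteration. This is a sketch under assumptions: the record layout and the name log_iteration are hypothetical, and the test result is expected to have the same shape as the test runner's return value shown earlier.

```python
import json
import time


def log_iteration(log_path: str, iteration: int, edits: list[str],
                  test_result: dict, conclusion: str) -> dict:
    """Append one JSON line describing a loop iteration (hypothetical layout)."""
    parsed = test_result.get("parsed") or {}  # None when no report was readable
    record = {
        "timestamp": time.time(),
        "iteration": iteration,
        "edits": edits,                # what the agent changed
        "tests_passed": parsed.get("passed"),
        "tests_failed": parsed.get("failed"),
        "failing_tests": [f["test"] for f in parsed.get("failures", [])],
        "conclusion": conclusion,      # what the agent decided to do next
    }
    # Append mode keeps earlier iterations; one JSON object per line
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because each line is a complete JSON object, the log can be filtered with standard tools or loaded line by line to replay a run.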
What if a test takes 60 seconds to run?
Raise the timeout (for example, to 120s), but expect the loop to become inefficient. A better fix is to refactor: extract fast unit tests that the agent runs on every edit, and reserve slow integration tests for CI. Agents need fast feedback loops.
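One way to make that split, assuming the project uses pytest, is to mark slow tests with a custom marker and deselect them in the agent loop. In the sketch below, the marker name slow and the function build_pytest_command are hypothetical choices; the -m option is pytest's standard marker-expression filter.

```python
def build_pytest_command(test_path: str = "tests/", include_slow: bool = False) -> list[str]:
    """Build the pytest argument list for the agent loop or for CI."""
    command = ["pytest", test_path, "-q"]
    if not include_slow:
        # -m takes a marker expression; "not slow" deselects tests
        # decorated with @pytest.mark.slow
        command += ["-m", "not slow"]
    return command


# In a test file, a slow test would be marked like this:
#
#     import pytest
#
#     @pytest.mark.slow
#     def test_full_checkout_flow():
#         ...
#
# Registering the marker in the project's pytest configuration avoids
# unknown-marker warnings.

print(build_pytest_command())                   # → ['pytest', 'tests/', '-q', '-m', 'not slow']
print(build_pytest_command(include_slow=True))  # → ['pytest', 'tests/', '-q']
```

The agent would use the default command on every edit, while CI runs the full suite with include_slow=True.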