Skip to main content

Detecting and Fixing Flaky Tests: AI-Driven Stability

A flaky test is one that passes and fails intermittently without code changes, undermining test reliability and developer confidence. Root causes include race conditions (timing), shared state (test pollution), external dependencies (network), and non-deterministic behavior (randomness). Flaky tests waste 25–40 hours per developer per year in false-positive debugging. AI accelerates detection by analyzing test logs, identifying patterns, and recommending fixes.

I spent six weeks hunting flaky tests in a microservices test suite last year, manually reviewing logs and rerunning tests. With AI-driven analysis, I identified root causes in two days. This guide teaches systematic detection, diagnosis, and remediation of flaky tests.

Why Tests Become Flaky and Why AI Helps

Tests fail for deterministic reasons (bugs) or flaky ones (timing, state, randomness). Flaky failures are hard to reproduce and debug because they're triggered by conditions that are hard to control. AI helps because language models can spot patterns in logs and suggest architectural fixes even without executing the tests.

Common causes:

  • Timing/race conditions: Test depends on async operation completing within a fixed time window.
  • Shared state: Tests modify global state (database, cache, config) without cleanup.
  • Non-determinism: Tests depend on UUID, randomness, system clock without seeding.
  • External dependencies: Tests call real APIs, databases, or services that are unreliable.
  • Assertion races: Tests assert state that's still changing (e.g., eventual consistency).

Technique 1: AI Detection of Flakiness from Test Logs

Rather than running tests 100 times manually to identify flakiness, feed test logs to AI for pattern detection.

Example test log with failures (from 10 runs):

Run 1: PASS
Run 2: PASS
Run 3: FAIL – timeout waiting for element #notification
Run 4: PASS
Run 5: PASS
Run 6: FAIL – timeout waiting for element #notification
Run 7: PASS
Run 8: FAIL – assertion error: expected 'active' but got 'pending'
Run 9: PASS
Run 10: FAIL – timeout waiting for element #notification

Prompt the AI:

Analyze this test run log. Identify: (1) is the test flaky? (2) what's the pattern? (3) most likely root cause?

AI response:

STATUS: Flaky (40% failure rate)

FAILURE PATTERN:
- Runs 3, 6, 10 timeout on #notification element (70% of failures)
- Run 8 fails with state assertion (30% of failures)

ROOT CAUSE ANALYSIS:
1. Primary (70%): The test waits for an element with a hardcoded 5-second timeout.
When the notification is slow to render (likely due to test environment load),
the timeout expires. Root: Insufficient wait time or missing explicit waits.

2. Secondary (30%): Test assumes notification status is 'active', but sometimes
reads 'pending'. Root: Race condition on async state update; test doesn't wait
for state transition.

RECOMMENDATIONS:
- Replace fixed sleep with explicit wait:
OLD: time.sleep(5); element = driver.find_element(...)
NEW: WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "notification")))
- Add wait before assertion:
OLD: assert element.text == "active"
NEW: WebDriverWait(driver, 10).until(EC.text_to_be_present_in_element((By.ID, "status"), "active"))

Technique 2: AI-Assisted Root Cause Diagnosis from Code and Logs

Combine test code and failure logs for diagnosis:

def test_send_notification_shows_success_message():
"""Test that notification is sent and success msg shown."""
driver.get("https://app.example.com/notify")

# Fill form and submit
driver.find_element(By.ID, "email").send_keys("[email protected]")
driver.find_element(By.ID, "submit").click()

# Wait for notification (PROBLEM: hardcoded sleep)
time.sleep(5)

# Check for success message
element = driver.find_element(By.ID, "notification")
assert "success" in element.text

Prompt the AI:

Analyze this test code and its failure log (attached). Identify: (1) why does it timeout intermittently? (2) is the logic correct? (3) what's the minimal fix?

AI response:

ISSUE 1: Hardcoded time.sleep(5)
- Problem: Assumes notification renders in <=5 seconds. On slow CI or under load,
notification takes 6+ seconds. Test flakes.
- Fix: Use WebDriverWait with explicit condition instead of sleep.

ISSUE 2: Assertion race condition
- The test asserts on element.text immediately after finding it. But if the
element is still updating (e.g., AJAX), text might be empty or "pending".
- Fix: Wait for the specific text content before asserting.

ISSUE 3: No cleanup
- Test doesn't log out or clear state after running. If another test uses same
user/session, state pollution can cause failures.
- Fix: Add teardown that clears local storage / logs out.

FIXED CODE:
def test_send_notification_shows_success_message():
try:
driver.get("https://app.example.com/notify")
driver.find_element(By.ID, "email").send_keys("[email protected]")
driver.find_element(By.ID, "submit").click()

# Wait for element with explicit condition (max 10 seconds)
element = WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "notification"))
)

# Wait for specific text content
WebDriverWait(driver, 10).until(
EC.text_to_be_present_in_element((By.ID, "notification"), "success")
)

assert "success" in element.text
finally:
# Cleanup
driver.execute_script("window.localStorage.clear();")
driver.find_element(By.ID, "logout").click() if driver.find_elements(By.ID, "logout") else None

Technique 3: Detecting and Seeding Non-Determinism

Many tests fail intermittently due to randomness (UUIDs, timestamps, random sampling). Prompt AI to identify and seed non-determinism:

import uuid
import random
from datetime import datetime

def test_generate_unique_invoice_ids():
"""Test invoice generation."""
invoices = [Invoice(id=uuid.uuid4(), amount=random.randint(10, 1000)) for _ in range(5)]

# This test can fail if UUIDs happen to collide (extremely rare but possible)
# or if random amounts don't pass some arbitrary check
assert len(set(inv.id for inv in invoices)) == 5
assert all(inv.amount > 0 for inv in invoices)

Prompt the AI:

This test is non-deterministic and flaky. Identify randomness sources and suggest fixes to make it deterministic.

AI response:

RANDOMNESS SOURCES:
1. uuid.uuid4(): Generates random UUIDs. Collision is virtually impossible, but
test's determinism depends on external randomness.
2. random.randint(): Amount is random. Test doesn't control what amounts are tested.

FIXES:
Option A: Seed randomness
- Add random.seed(42) at test start to make results reproducible.
- Pro: Preserves randomness exploration. Con: Hidden dependency on seed value.

Option B: Inject test doubles
- Use mock UUID factory: uuid.uuid4 = Mock(side_effect=['uuid1', 'uuid2', ...])
- Use fixed amounts: amounts = [100, 200, 300, 400, 500]
- Pro: Explicit, no hidden seeding. Con: Loses randomness testing.

Option C: Parameterized deterministic test + separate property test
- Unit test with fixed inputs (deterministic).
- Property test with Hypothesis for randomness exploration.

RECOMMENDED APPROACH:
from unittest.mock import patch, Mock

def test_generate_unique_invoice_ids():
def uuid_factory():
for i in range(100):
yield f"invoice_{i:05d}"

with patch('uuid.uuid4', side_effect=uuid_factory()):
invoices = [
Invoice(id=uuid.uuid4(), amount=amount)
for amount in [100, 200, 300, 400, 500]
]

assert len(invoices) == 5
assert all(inv.amount > 0 for inv in invoices)

Technique 4: Detecting Test Order Dependencies

Tests can fail when run in different orders if they share state. AI can analyze test names and shared fixtures to spot order dependencies:

def test_users_list_shows_alice():
users_db.insert(User("alice", "[email protected]"))
result = api.get("/users")
assert "alice" in [u.name for u in result]

def test_users_list_empty_when_no_data():
# This test assumes users_db is empty, but if test_users_list_shows_alice
# runs first, alice is still in the database!
result = api.get("/users")
assert len(result) == 0

Prompt the AI:

These two tests will fail if run in this order: [test_users_list_shows_alice, test_users_list_empty_when_no_data]. Identify the problem and suggest fix.

AI response:

PROBLEM: Test order dependency
- Test 1 inserts alice into users_db.
- Test 2 expects an empty result, but test 1's insertion persists.
- If test 2 runs first, it passes. If test 1 runs first, test 2 fails.
- Root: Shared database state not cleaned up after test 1.

SOLUTION 1: Add cleanup fixtures
@pytest.fixture(autouse=True)
def clear_database():
yield
users_db.clear() # Cleanup after each test

SOLUTION 2: Use transaction rollback
@pytest.fixture(autouse=True)
def transaction():
with users_db.transaction():
yield
# Implicit rollback when exiting transaction

SOLUTION 3: Isolate with test containers
- Each test gets its own database instance (via Docker container).
- No state pollution. Slower but most reliable.

RECOMMENDED: Solution 1 (cleanup fixture) for speed; Solution 3 (containers) for e2e tests.

Common Flaky Test Patterns and Fixes

PatternSymptomFix
Hardcoded sleepstime.sleep(5) then assertUse explicit waits: WebDriverWait(...).until(...)
Race conditionsAssertion fails intermittentlyWait for condition before asserting; use .waitFor()
Shared stateTest passes in isolation, fails in suiteAdd cleanup fixtures; use per-test databases
External callsNetwork timeouts cause failuresMock external deps; use retry logic; add timeouts
Non-determinismRandom UUIDs, timestampsSeed randomness; inject test doubles; use property tests

Key Takeaways

  • Flakiness is caused by timing, state pollution, non-determinism, or external dependencies.
  • Use AI to detect patterns in test logs and diagnose root causes.
  • Replace hardcoded sleeps with explicit waits on conditions.
  • Add cleanup fixtures to prevent test state pollution.
  • Seed or mock non-determinism to make tests reproducible.
  • Run tests in random order to detect order dependencies.

Frequently Asked Questions

How many test reruns indicate flakiness?

If a test fails on >5% of runs (1 in 20), it's flaky. Investigate immediately. Flaky tests destroy CI reliability and waste developer time.

Should I add retry logic to flaky tests?

Avoid retrying flaky tests in production CI. Instead, fix the root cause. Retries mask problems and slow down feedback. Use retries only for external calls (APIs) as a temporary measure while fixing the underlying issue.

How do I detect flakiness in CI without manual reruns?

Configure CI to re-run the full suite 3–5 times nightly. Aggregate results to identify tests that fail intermittently. Tools like testmon track test history automatically.

Can AI always identify the root cause of flakiness?

AI can spot patterns and suggest likely causes, but some flakiness requires execution to diagnose (e.g., race conditions that depend on exact timing). Use AI for initial triage, then manual testing/instrumentation for complex cases.

How do I fix flakiness in legacy tests without rewriting them?

Start with cleanup fixtures and explicit waits. Add logging to understand timing. Profile test execution to find slow operations. Gradually refactor as you understand root causes; avoid rewriting everything at once.

Further Reading