
Edge Case Detection: Property Testing with AI

Property-based testing is a form of automated testing where you define invariants (properties) that should hold for all valid inputs, and the test framework generates many random cases (100 per test by default in Hypothesis) to look for violations. Unlike unit tests, which check specific inputs, property tests sample broadly across the input space, catching edge cases that hand-written examples miss. When combined with AI, property generation becomes even more powerful: language models can infer properties from code, generate constraint specifications, and suggest failure-resistant assertions.

I integrated property testing into a financial calculation library last year. Six months of manual testing had uncovered fourteen edge cases; property-based testing found eleven of them in the first week. This guide shows how to use AI to articulate properties, set up frameworks like Hypothesis and QuickCheck, and interpret counterexamples.

What Are Properties and Why Does AI Excel at Finding Them?

A property is a boolean statement that should be true for all valid inputs to a function. Examples: "sorting a list twice yields the same result as sorting once," or "adding zero to any number returns the number unchanged." Properties are harder to write than individual test cases because they require abstract thinking about invariants rather than concrete inputs and outputs.
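As a minimal sketch, the two example properties above can be written with Hypothesis, the framework used in the rest of this guide. The test names are illustrative.

```python
from hypothesis import given, strategies as st

@given(xs=st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # Property: sorting an already-sorted list changes nothing.
    once = sorted(xs)
    twice = sorted(once)
    assert once == twice

@given(x=st.integers())
def test_zero_is_additive_identity(x):
    # Property: adding zero returns the number unchanged.
    assert x + 0 == x
```

Neither test names a specific input; Hypothesis chooses the lists and integers, including empty lists and very large values.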

AI excels here because language models can reason about what code is meant to do, not just its syntax. Ask an AI "what invariants must hold for a string encoding function?" and it will surface properties like "decoding an encoded string returns the original" or "encoding is not idempotent unless the codec leaves already-encoded text unchanged" without you spelling them out. Some 2025 studies report that AI-assisted property generation improves edge-case detection rates by 35–50% compared to manual property specification, though the gain depends on the codebase and the baseline.

Technique 1: AI-Assisted Property Discovery from Docstrings

The most reliable way to ask AI for properties is to provide the function's documented contract and ask it to synthesize invariants as Hypothesis property tests.

Given this function:

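def merge_sorted_lists(list1: list[int], list2: list[int]) -> list[int]:
    """
    Merge two sorted lists into one sorted list.

    Args:
        list1, list2: Lists of integers, already sorted ascending.

    Returns:
        A single sorted list containing all elements from both inputs.

    Raises:
        TypeError: If either input is not a list.
    """
    if not isinstance(list1, list) or not isinstance(list2, list):
        raise TypeError("Inputs must be lists")

    # One possible implementation: a two-pointer merge that
    # repeatedly takes the smaller head element.
    result, i, j = [], 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # One list is exhausted; append the remainder of the other.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result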

Ask your AI:

Generate 5 properties for merge_sorted_lists using Hypothesis. Each property should be a @given test that checks an invariant true for all inputs. Include: (1) output is sorted, (2) output length equals sum of input lengths, (3) every element from input1 appears in output, (4) output is deterministic, (5) empty list is identity.

The AI generates:

from hypothesis import given, strategies as st

@given(
    list1=st.lists(st.integers(), min_size=0),
    list2=st.lists(st.integers(), min_size=0)
)
def test_output_is_sorted(list1, list2):
    result = merge_sorted_lists(sorted(list1), sorted(list2))
    assert result == sorted(result)

@given(
    list1=st.lists(st.integers()),
    list2=st.lists(st.integers())
)
def test_output_length_preserves_cardinality(list1, list2):
    result = merge_sorted_lists(sorted(list1), sorted(list2))
    assert len(result) == len(list1) + len(list2)

@given(list1=st.lists(st.integers()), list2=st.lists(st.integers()))
def test_all_elements_from_input1_present(list1, list2):
    result = merge_sorted_lists(sorted(list1), sorted(list2))
    for elem in list1:
        # Use >= because list2 may contain the same value as well.
        assert result.count(elem) >= list1.count(elem)

@given(list1=st.lists(st.integers()))
def test_empty_list_identity(list1):
    # Merging with an empty list must return the other input unchanged.
    assert merge_sorted_lists(sorted(list1), []) == sorted(list1)
    assert merge_sorted_lists([], sorted(list1)) == sorted(list1)

@given(list1=st.lists(st.integers()), list2=st.lists(st.integers()))
def test_merge_is_deterministic(list1, list2):
    r1 = merge_sorted_lists(sorted(list1), sorted(list2))
    r2 = merge_sorted_lists(sorted(list1), sorted(list2))
    assert r1 == r2

Each property test runs up to 100 generated inputs by default (Hypothesis's max_examples setting), catching failures humans wouldn't anticipate.
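The individual invariants can also be backed by one stronger oracle property: merging two sorted lists must give the same result as sorting their concatenation. The sketch below is self-contained, so it uses a stand-in merge_sorted_lists built on heapq.merge; that stand-in is an assumption for illustration, not the article's implementation.

```python
import heapq
from hypothesis import given, strategies as st

def merge_sorted_lists(list1, list2):
    # Stand-in implementation (assumption) so the example runs on its own.
    return list(heapq.merge(list1, list2))

# Strategy that only ever produces ascending lists, matching the contract.
sorted_ints = st.lists(st.integers()).map(sorted)

@given(list1=sorted_ints, list2=sorted_ints)
def test_merge_matches_sorted_concatenation(list1, list2):
    # Oracle property: merging equals sorting everything together.
    assert merge_sorted_lists(list1, list2) == sorted(list1 + list2)
```

Building the sortedness into the strategy with .map(sorted) keeps the precondition in one place instead of repeating sorted(...) in every test body.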

Technique 2: Constraint-Driven Property Generation

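For functions with complex input constraints, use AI to generate Hypothesis strategies that match those constraints precisely. Mismatched constraints mislead in both directions: a strategy that is too broad produces spurious failures on inputs the function never promised to handle, while one that is too narrow lets tests pass that should have failed.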

If you're testing a pricing engine:

def calculate_tax(subtotal: float, state: str) -> float:
    """Calculate sales tax. subtotal in range [0, 1_000_000]. state in US state codes (50 values)."""
    pass

Ask the AI:

Generate a Hypothesis strategy and two property tests for calculate_tax. Strategy must: (1) generate subtotals in range [0, 1_000_000], (2) use only valid US state codes, (3) avoid floating-point precision errors. Properties: (1) tax is always non-negative, (2) tax increases monotonically with subtotal.

from hypothesis import given, strategies as st

us_states = st.sampled_from([
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'
])

# Finite floats within the documented range; NaN and infinity are excluded.
subtotals = st.floats(
    min_value=0.0,
    max_value=1_000_000.0,
    allow_nan=False,
    allow_infinity=False
)

@given(subtotal=subtotals, state=us_states)
def test_tax_non_negative(subtotal, state):
    tax = calculate_tax(subtotal, state)
    assert tax >= 0.0

@given(
    subtotal1=subtotals,
    subtotal2=subtotals,
    state=us_states
)
def test_tax_monotonic_with_subtotal(subtotal1, subtotal2, state):
    # Order the pair so every generated example is checked,
    # rather than silently skipping half of them with an if.
    low, high = sorted((subtotal1, subtotal2))
    assert calculate_tax(low, state) <= calculate_tax(high, state)
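The prompt asks the strategy to avoid floating-point precision errors, but st.floats still produces binary floats. One option, sketched here, is to generate two-place decimal amounts instead. The function calculate_tax_decimal and its flat 8% rate are hypothetical stand-ins for illustration, not part of the pricing engine above.

```python
from decimal import Decimal
from hypothesis import given, strategies as st

def calculate_tax_decimal(subtotal: Decimal, rate: Decimal = Decimal("0.08")) -> Decimal:
    # Hypothetical stand-in: flat rate, rounded to whole cents.
    return (subtotal * rate).quantize(Decimal("0.01"))

# Amounts with exactly two decimal places in the documented range.
money = st.decimals(
    min_value=Decimal("0"),
    max_value=Decimal("1000000"),
    allow_nan=False,
    allow_infinity=False,
    places=2,
)

@given(a=money, b=money)
def test_decimal_tax_is_monotonic(a, b):
    low, high = sorted((a, b))
    assert calculate_tax_decimal(low) <= calculate_tax_decimal(high)

@given(subtotal=money)
def test_decimal_tax_has_cent_precision(subtotal):
    tax = calculate_tax_decimal(subtotal)
    assert tax >= 0
    # Re-quantizing to cents must not change the value.
    assert tax == tax.quantize(Decimal("0.01"))
```

Whether Decimal is appropriate depends on how the real engine represents money; the point is that the strategy should generate the same type the production code uses.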

Technique 3: Metamorphic Testing with AI

Metamorphic testing checks relationships between inputs and outputs without knowing the expected output for any single case. For example, if f(x) and f(y) should have a certain relationship, use AI to identify and test that relationship.

Ask the AI:

Generate metamorphic properties for compute_checksum(data: bytes) -> int. The properties must verify relationships WITHOUT knowing the correct checksum value. Include: (1) order sensitivity (rearranging bytes changes checksum), (2) determinism (checksumming identical data twice gives the same value), (3) composition (checksum of concatenated data relates predictably to the checksums of the parts).

from hypothesis import assume, given, strategies as st

@given(
    data1=st.binary(min_size=1),
    data2=st.binary(min_size=1)
)
def test_checksum_order_sensitive(data1, data2):
    # Skip pairs where swapping yields the same bytes (e.g. b"a" and b"aa").
    assume(data1 + data2 != data2 + data1)
    cs1 = compute_checksum(data1 + data2)
    cs2 = compute_checksum(data2 + data1)
    # Holds only for order-sensitive checksums; any fixed-width checksum
    # can still collide, so investigate a failure before assuming a bug.
    assert cs1 != cs2

@given(data=st.binary(min_size=1))
def test_checksum_consistency(data):
    cs1 = compute_checksum(data)
    cs2 = compute_checksum(data)
    assert cs1 == cs2

@given(
    data1=st.binary(min_size=1),
    data2=st.binary(min_size=1)
)
def test_checksum_combine_property(data1, data2):
    combined = compute_checksum(data1 + data2)
    separate1 = compute_checksum(data1)
    separate2 = compute_checksum(data2)
    # Placeholder: these are only type checks. Replace them with the
    # combine rule of the actual checksum algorithm once it is known.
    assert isinstance(combined, int)
    assert isinstance(separate1, int)
    assert isinstance(separate2, int)
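A combine property only becomes meaningful once the checksum algorithm is known. As one concrete case, assuming compute_checksum wraps CRC-32 from Python's zlib module (an assumption; the guide does not say which algorithm is used), the running-checksum relation gives a real metamorphic property:

```python
import zlib
from hypothesis import given, strategies as st

def compute_checksum(data: bytes) -> int:
    # Assumption: CRC-32 stands in for the unspecified checksum.
    return zlib.crc32(data)

@given(data1=st.binary(), data2=st.binary())
def test_checksum_of_concatenation_matches_running_checksum(data1, data2):
    # Metamorphic relation: feeding data2 into a CRC seeded with
    # crc32(data1) equals the CRC of the concatenated bytes.
    running = zlib.crc32(data2, compute_checksum(data1))
    assert compute_checksum(data1 + data2) == running
```

The test never states what the correct checksum of any input is; it only checks that two ways of computing the same thing agree.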

Understanding Counterexamples and Shrinking

When a property test fails, Hypothesis automatically shrinks the failing input to the smallest case that still fails. This shrinking is critical: it transforms a complex failing case into a simple one you can understand and fix.

Example: If a property test fails on input [1000, -999, 42, 3.14, 0], Hypothesis shrinks it to [-999, 0] or even [-1] depending on what the function actually depends on. Your job is to understand why that minimal case fails and either fix the function or refine the property.

from hypothesis import given, strategies as st, settings, Verbosity

@settings(verbosity=Verbosity.verbose)  # Print each example tried, including shrink steps
@given(nums=st.lists(st.integers(-1000, 1000), min_size=1))
def test_sum_is_non_negative_if_all_positive(nums):
    if all(n >= 0 for n in nums):
        assert sum(nums) >= 0  # This passes
    # But this unconditional assertion fails, e.g. for nums = [0, -3, 2]:
    assert sum(nums) >= 0  # BUG: sum can be negative!

Hypothesis may first hit the failure on a larger input such as [0, -3, 2], then shrink it and report the minimal case: Falsifying example with nums=[-1]. Now you see the real issue: the second assertion ignores the all-non-negative precondition.
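To watch shrinking in isolation, hypothesis.find returns a minimal example that satisfies a predicate. The sketch below asks for the smallest list whose sum is negative; for this strategy it should come back as a single small negative number.

```python
from hypothesis import find, strategies as st

# Ask Hypothesis for a minimal list (within the strategy) whose sum is negative.
minimal = find(
    st.lists(st.integers(-1000, 1000), min_size=1),
    lambda nums: sum(nums) < 0,
)
print(minimal)  # a shrunk failing input, not the first one found
```

This is useful for building intuition about what "minimal" means for a strategy before relying on shrunk counterexamples in real test failures.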

Common Property-Testing Pitfalls

| Pitfall | Example | Fix |
| --- | --- | --- |
| Properties too weak | assert result is not None (always true) | Make assertions specific: assert len(result) > 0 |
| Strategies don't match domain | Generating negative integers for a count or length parameter | Use st.integers(min_value=0) for non-negative integers |
| Implicit assumptions in property | Testing sort on mixed types | Constrain the strategy: st.lists(st.integers()), not a mixed strategy such as st.lists(st.one_of(st.integers(), st.text())) |
| Over-reliance on defaults | 100 examples isn't enough for complex code | Use @settings(max_examples=10000) for financial or crypto code |
| Ignoring counterexample reproducibility | Rerunning a test after deleting the example database loses the failing case | Keep the .hypothesis example database so failures replay automatically, pin the shrunk case with @example(...), or use the @seed / @reproduce_failure decorators |
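Two of these fixes, raising the example budget and pinning a known counterexample, can be sketched in one test. The normalize function and the pinned inputs are hypothetical and only serve to make the example runnable.

```python
from hypothesis import example, given, settings, strategies as st

def normalize(values):
    # Hypothetical function under test: sort and drop duplicates.
    return sorted(set(values))

@settings(max_examples=500)       # raise the default budget for this test
@given(values=st.lists(st.integers()))
@example(values=[])               # pinned edge case, always run
@example(values=[-1, -1, 0])      # pinned former counterexample (illustrative)
def test_normalize_is_sorted_and_unique(values):
    result = normalize(values)
    assert result == sorted(result)
    assert len(result) == len(set(result))
```

Pinned @example inputs run on every invocation regardless of the random search, so a once-found failure stays covered even if the example database is lost.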

Key Takeaways

  • Properties express invariants that hold for all valid inputs, catching edge cases manual tests miss.
  • Use AI to extract properties from function docstrings and contracts.
  • Hypothesis strategies let you generate constrained random inputs matching your domain.
  • Metamorphic testing verifies relationships between inputs/outputs without knowing exact answers.
  • Always review shrunk counterexamples; they reveal the true failure cause.
  • Combine property tests with unit tests: properties for robustness, units for regression coverage.

Frequently Asked Questions

How many properties should a function have?

Typically 3–6 per function. Aim for one property per invariant: "output is sorted," "cardinality preserved," "deterministic," "idempotent (if applicable)." Too few and you miss edge cases; too many and you're testing the same thing redundantly.

Can property tests replace unit tests?

No. Unit tests verify specific, expected behaviors (regressions). Property tests explore the space and find violations of invariants. Use both: property tests for robustness, unit tests for documented contracts.

What if property testing finds bugs in production code?

Excellent—that's the goal. File a bug, fix the code, then add the property test to your suite to prevent regression. The test is now a permanent part of your test harness.

How do I handle properties with side effects (e.g., database writes)?

Property tests work best with pure functions. If your function has side effects, either factor out the pure logic and test that, or mock the side effects using mocking frameworks. Hypothesis integrates well with unittest.mock.
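A minimal sketch of both approaches follows: the pure calculation is tested directly, and the side-effecting wrapper is tested against a mock. The functions apply_discount and save_discounted_price, and the db.write interface, are hypothetical names introduced for illustration.

```python
from unittest.mock import Mock
from hypothesis import given, strategies as st

def apply_discount(price_cents: int, percent: int) -> int:
    # Pure core (hypothetical): easy to property-test directly.
    return price_cents - (price_cents * percent) // 100

def save_discounted_price(db, price_cents: int, percent: int) -> int:
    # Side-effecting wrapper (hypothetical): writes the result to a store.
    final = apply_discount(price_cents, percent)
    db.write(final)
    return final

prices = st.integers(min_value=0, max_value=10_000_000)
percents = st.integers(min_value=0, max_value=100)

@given(price=prices, percent=percents)
def test_discount_stays_within_bounds(price, percent):
    assert 0 <= apply_discount(price, percent) <= price

@given(price=prices, percent=percents)
def test_wrapper_writes_exactly_once(price, percent):
    db = Mock()  # fresh mock per example, so call counts do not leak
    final = save_discounted_price(db, price, percent)
    db.write.assert_called_once_with(final)
```

Creating the mock inside the test body matters: a mock shared across examples would accumulate calls from every generated input.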

Is property testing slower than unit testing?

Yes, by design: Hypothesis runs up to 100 inputs per property test by default vs. 1 input per unit test. Use @settings(max_examples=100) for quick CI runs and @settings(max_examples=10000) for nightly regression suites. In CI, favor speed over exhaustiveness.
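Rather than editing @settings on every test, the two budgets can be registered once as profiles and selected at run time. This is a sketch: the profile names and the HYPOTHESIS_PROFILE environment variable are illustrative choices, and the code would typically live in a shared file such as conftest.py.

```python
import os
from hypothesis import settings

# Profile names and example counts are illustrative.
settings.register_profile("ci", max_examples=100)
settings.register_profile("nightly", max_examples=10_000)

# Pick the profile from an environment variable, defaulting to the fast one.
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "ci"))
```

A nightly job can then set HYPOTHESIS_PROFILE=nightly without any change to the test code.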
