Skip to main content

Testing Generated Code: Quality Verification

Generated code is not automatically production-ready. Before integrating it into your project, you must verify correctness, security, and quality. This article teaches a systematic verification process: unit tests, integration tests, security checks, and style review.

The Verification Checklist

Before considering generated code ready:

  • [ ] Unit tests pass (happy path + edge cases + error cases)
  • [ ] Integration tests pass (code works with your codebase)
  • [ ] No security vulnerabilities (code review for injection, auth, crypto)
  • [ ] Code style matches project standards (linter passes)
  • [ ] Type checking passes (mypy, TypeScript, etc.)
  • [ ] Performance meets targets (if specified)
  • [ ] Docstrings and comments are accurate
  • [ ] Error handling is appropriate (no silent failures)

Unit Testing Generated Code

Write unit tests before or immediately after generation. Tests serve two purposes: (1) verify the code is correct, (2) catch regressions if you modify the code later.

# Generated function
def celsius_to_fahrenheit(celsius: float) -> float:
"""Convert Celsius to Fahrenheit."""
return (celsius * 9/5) + 32

# Unit tests
import pytest

def test_freezing_point():
"""0°C = 32°F"""
assert celsius_to_fahrenheit(0) == 32.0

def test_boiling_point():
"""100°C = 212°F"""
assert celsius_to_fahrenheit(100) == 212.0

def test_below_absolute_zero():
"""Below -273.15°C is invalid."""
with pytest.raises(ValueError):
celsius_to_fahrenheit(-274)

def test_type_error():
"""Non-numeric input should raise TypeError."""
with pytest.raises(TypeError):
celsius_to_fahrenheit("zero")

def test_negative_temperature():
"""-40°C = -40°F (same in both scales)"""
assert celsius_to_fahrenheit(-40) == -40.0

Run these tests to verify the generated function:

pytest test_conversion.py -v
# Output:
# test_freezing_point PASSED
# test_boiling_point PASSED
# test_below_absolute_zero PASSED
# test_type_error PASSED
# test_negative_temperature PASSED

If all tests pass, the generated code is likely correct. If a test fails, iterate on your prompt with the failure as feedback.

Integration Testing

Generated code rarely exists in isolation. It must work with your existing codebase, database, APIs, and other dependencies. Integration tests verify this.

# Example: Generated auth function integrated with FastAPI
from fastapi import FastAPI, Depends
from app.auth import authenticate_user, get_current_user
from app.db import get_db

app = FastAPI()

@app.post("/login")
def login(email: str, password: str, db: Session = Depends(get_db)):
user = authenticate_user(email, password, db)
if not user:
raise HTTPException(status_code=401, detail="Invalid credentials")
return {"user_id": user.id, "email": user.email}

# Integration test
def test_login_flow(client, db):
"""Test complete login flow with database."""
# Create test user
user = User(email="[email protected]", password_hash=hash_password("pass123"))
db.add(user)
db.commit()

# Test login
response = client.post("/login", json={"email": "[email protected]", "password": "pass123"})
assert response.status_code == 200
assert response.json()["email"] == "[email protected]"

# Test invalid password
response = client.post("/login", json={"email": "[email protected]", "password": "wrong"})
assert response.status_code == 401

Integration tests catch issues that unit tests miss: database schema mismatches, API contract violations, missing imports.

Security Review

Generated code can contain security vulnerabilities. Review for:

1. Injection Vulnerabilities

Check for SQL injection, command injection, template injection.

# VULNERABLE: SQL injection
def find_user(user_id: str, db: Session):
query = f"SELECT * FROM users WHERE id = {user_id}"
return db.execute(query)

# SECURE: Parameterized query
def find_user(user_id: int, db: Session):
from sqlalchemy import select, User
stmt = select(User).where(User.id == user_id)
return db.execute(stmt).scalar_one_or_none()

2. Authentication and Authorization

Check that sensitive operations verify permissions.

# VULNERABLE: No auth check
@app.delete("/users/{user_id}")
def delete_user(user_id: int, db: Session):
user = db.query(User).filter(User.id == user_id).delete()
db.commit()
return {"deleted": True}

# SECURE: Check that requester is authorized
@app.delete("/users/{user_id}")
def delete_user(user_id: int, current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
if current_user.id != user_id and not current_user.is_admin:
raise HTTPException(status_code=403, detail="Not authorized")
user = db.query(User).filter(User.id == user_id).delete()
db.commit()
return {"deleted": True}

3. Credential Handling

Check that secrets, API keys, and passwords are not hardcoded.

# VULNERABLE: Hardcoded API key
api_key = "sk-proj-abc123xyz"
response = requests.get("https://api.example.com/data", headers={"Authorization": f"Bearer {api_key}"})

# SECURE: Load from environment
import os
api_key = os.getenv("API_KEY")
if not api_key:
raise ValueError("API_KEY environment variable not set")
response = requests.get("https://api.example.com/data", headers={"Authorization": f"Bearer {api_key}"})

4. Cryptography

Check that crypto is current and correctly implemented.

# WEAK: MD5 hash (deprecated)
import hashlib
password_hash = hashlib.md5(password.encode()).hexdigest()

# STRONG: bcrypt with salt
import bcrypt
password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())
is_match = bcrypt.checkpw(password.encode(), password_hash)

For security-critical code, consider an external security review or automated scanning (SAST tools like SonarQube, Semgrep, Snyk).

Style and Lint Checking

Run your project's linters to ensure code matches style standards.

# Python
black --check generated_code.py
ruff check generated_code.py
mypy --strict generated_code.py

# JavaScript/TypeScript
prettier --check generated_code.js
eslint generated_code.js
tsc --strict --noEmit generated_code.ts

# Rust
cargo fmt --check
cargo clippy -- -D warnings

If linters fail, either (1) regenerate with style constraints in the prompt, or (2) auto-fix with the linter:

# Auto-fix
black generated_code.py
ruff check --fix generated_code.py

Performance Testing

If your prompt specified performance targets, test them:

import time

def test_performance_parse_large_csv():
"""Parse 1 million rows in under 5 seconds."""
large_csv = generate_test_csv(1_000_000)

start = time.time()
result = parse_csv(large_csv)
elapsed = time.time() - start

assert elapsed < 5.0, f"Parsing took {elapsed}s, target was <5s"
assert len(result) == 1_000_000

Run performance tests in a controlled environment (same CPU, same load, same data size) to get consistent results.

Code Review Checklist

Beyond automated checks, manually review:

ItemCheckExample
Logic correctnessDoes the code do what was asked?Check examples from the prompt
Edge casesAre edge cases handled?Empty lists, null values, boundary conditions
Error messagesAre errors clear and actionable?"port out of range (1-65535)" vs. "invalid port"
DocumentationAre docstrings present and accurate?Check that examples in docstring match actual behavior
EfficiencyIs the code reasonably efficient?No N+1 queries, no unnecessary copying
MaintainabilityWould another developer understand this code?Complex logic should have comments

Example: Complete Verification Flow

# Generated function
def extract_emails_from_text(text: str) -> list[str]:
"""Extract unique email addresses from text."""
import re
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', text)
return list(set(emails))

# Step 1: Unit tests
def test_simple_email():
assert "[email protected]" in extract_emails_from_text("Contact: [email protected]")

def test_duplicate_removal():
text = "[email protected] and [email protected]"
assert extract_emails_from_text(text) == ["[email protected]"]

def test_empty_text():
assert extract_emails_from_text("") == []

# Run tests
# pytest verify_code.py -v → All pass ✓

# Step 2: Integration test (e.g., used in a web form)
def test_form_submission(client):
response = client.post("/subscribe", json={"text": "Email me at [email protected]"})
assert response.status_code == 200
assert response.json()["emails"] == ["[email protected]"]

# Step 3: Security check
# Review: Email regex doesn't validate RFC 5322, but sufficient for this use case
# No injection vulnerabilities (regex is from input, not user-controlled)
# No auth required (public endpoint, no sensitive operations)
# ✓ Passes security review

# Step 4: Lint check
# ruff check → No issues ✓
# mypy --strict → No issues ✓

# Step 5: Code review
# Logic: Correct ✓
# Edge cases: Empty strings, duplicates handled ✓
# Documentation: Docstring present ✓
# Efficiency: Single regex pass, O(n) complexity ✓

# Conclusion: Ready for production

Key Takeaways

  • Verify generated code through unit tests, integration tests, security review, and style checking.
  • Write tests before deployment; they catch bugs and prevent regressions.
  • Review for security: check for injection, auth, credential handling, and crypto issues.
  • Run linters (mypy, ESLint, clippy) to ensure code matches project style.
  • Test performance against your specified targets.
  • Perform manual code review to catch issues automated tools miss.

Frequently Asked Questions

Do I need to test every generated function?

Yes, especially for functions that handle data, make API calls, or enforce business logic. For simple utility functions (e.g., string formatters), basic sanity tests are enough.

How many tests should I write?

At minimum: one happy-path test, one error case, one edge case per function. For complex functions, more is better. Aim for 70%+ code coverage on generated code.

Can I use code generation to write the tests themselves?

Yes, you can prompt the AI model to generate unit tests. Example: "Write pytest tests for the function [function_signature]. Include happy path, edge cases, and error handling." Then review and run those tests against the generated function.

What if the generated code fails security review?

Iterate on your prompt to add security requirements. Example: "Use parameterized queries (not string concatenation). Never trust user input for SQL." Then regenerate.

Should I refactor generated code that works but is inefficient?

If performance meets targets, don't refactor immediately. But if it's obviously inefficient (nested loops, unnecessary cloning), ask the AI model to optimize: "This function processes 1 million rows. Current version takes 30 seconds; target is <5 seconds. Optimize using [algorithm/technique]."

Further Reading