Reviewing AI-Generated Artifacts Against Specifications
When an AI system generates code from a spec, the output is rarely perfect on the first try. Some generated functions are complete and correct; others are incomplete, over-engineered, or subtly wrong. Reviewing AI-generated artifacts against the original spec is the quality gate that catches these issues before code reaches production. This article lays out a systematic approach to artifact review that combines checklists, automated validation, and human judgment.
The Artifact Review Process
Artifact review has three phases:
1. Automated Validation (machine checks)
├── Does the code parse (no syntax errors)?
├── Does it pass static analysis?
└── Does it match the spec schema?
2. Compliance Review (human, spec-focused)
├── Does it implement all required features?
├── Are error cases handled per spec?
└── Do responses match the spec schema exactly?
3. Quality Review (human, beyond spec)
├── Is the code readable and maintainable?
├── Are there performance issues?
└── Is it secure?
The review occurs in this order because automated checks are fast and catch obvious errors; compliance review ensures spec adherence; quality review ensures production-readiness.
Phase 1: Automated Validation
Before a human looks at generated code, run it through automated checks:
Check 1: Syntax and Parse Errors
# Python
python -m py_compile generated_code.py
# JavaScript (syntax check only; ESLint runs in Check 3)
node --check generated_code.js
# Java
javac generated_code.java
If this fails, the code is fundamentally broken. Send it back for regeneration.
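For Python, the same check can also run in-process, which is convenient when the generated source arrives as a string rather than a file. The following is a minimal sketch using the standard `ast` module; `check_syntax` is a hypothetical helper name, not part of any tool mentioned above.

```python
import ast


def check_syntax(source: str, filename: str = "<generated>") -> tuple[bool, str]:
    """Return (ok, message) for a string of Python source."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as e:
        # lineno and msg come from the parser; the exact wording varies by Python version
        return False, f"{filename}:{e.lineno}: {e.msg}"
    return True, "parsed"


print(check_syntax("def f(x):\n    return x + 1\n"))
print(check_syntax("def f(x)\n    return x\n", filename="generated_code.py"))
```

A parse-only check says nothing about behavior; it only confirms the file is worth passing to the later checks.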
Check 2: Type Safety
For typed languages, run a type checker:
# Python with mypy
mypy --strict generated_code.py
# TypeScript (--noEmit type-checks without writing .js output)
tsc --noEmit --strict generated_code.ts
# Go
go build generated_code.go
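Type errors often point to a mismatch between what the spec defines and what the AI generated, such as a field the spec declares as an integer but the code handles as a string. Flag and investigate them rather than silencing the checker.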
Check 3: Linting and Standards
# Python
flake8 generated_code.py
black --check generated_code.py
# JavaScript
prettier --check generated_code.js
eslint --format=json generated_code.js
Linting catches style issues, unused imports, and potential bugs. Pass linting before human review.
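The checks above can be chained in a small runner so that a failure stops the pipeline before a human spends time on the file. This is a sketch only: `run_checks` is a hypothetical helper, and the command list is illustrative and should be replaced with whichever tools the project actually uses.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]]) -> list[tuple[str, bool, str]]:
    """Run each command in order and stop at the first failure."""
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            ok = proc.returncode == 0
            output = (proc.stdout + proc.stderr).strip()
        except FileNotFoundError:
            ok, output = False, f"tool not installed: {cmd[0]}"
        results.append((" ".join(cmd), ok, output))
        if not ok:
            break  # later checks are mostly noise until this one passes
    return results


if __name__ == "__main__":
    target = "generated_code.py"
    checks = [
        [sys.executable, "-m", "py_compile", target],
        ["mypy", "--strict", target],
        ["flake8", target],
        ["black", "--check", target],
    ]
    for name, ok, output in run_checks(checks):
        print(f"{'✓' if ok else '✗'} {name}")
        if not ok:
            print(output)
```

Stopping at the first failure keeps the feedback short; a variant that runs every check regardless would give a fuller picture at the cost of noisier output.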
Check 4: Schema Validation (Critical for Spec Compliance)
If your spec defines schemas (JSON Schema, OpenAPI, Protocol Buffers), validate the generated code against them:
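# Validate that generated API responses match the OpenAPI 3.x response schemas.
# load_spec() and generate_test_response() are project-specific helpers (not shown).
# Schemas that use $ref must be dereferenced before this step.
import jsonschema

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}
spec = load_spec("api_spec.yaml")

for path, path_item in spec["paths"].items():
    for method, operation in path_item.items():
        if method not in HTTP_METHODS:
            continue  # skip path-level keys such as "parameters"
        responses = operation.get("responses", {})
        # Unquoted status codes in YAML load as integers, so try both key types
        ok_response = responses.get("200", responses.get(200, {}))
        # The schema is nested under content -> media type -> schema
        schema = ok_response.get("content", {}).get("application/json", {}).get("schema")
        if schema is None:
            continue  # nothing to validate for this operation
        # Generate a test response for this operation
        test_response = generate_test_response(method, path)
        # Validate response against schema
        try:
            jsonschema.validate(instance=test_response, schema=schema)
            print(f"✓ {method.upper()} {path} response matches schema")
        except jsonschema.ValidationError as e:
            print(f"✗ {method.upper()} {path} response DOES NOT match schema: {e.message}")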
This catches a class of error that is easy to miss by eye: code that claims to implement the spec but returns the wrong data shape.
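One practical wrinkle: OpenAPI documents usually express schemas through `$ref` pointers into `components/schemas`, and a sub-schema pulled out of the document cannot be validated in isolation until those pointers are resolved. The sketch below inlines local references using only the standard library. It assumes every reference is local (starts with `#/`) and that no schema refers to itself; `resolve_refs` is a hypothetical helper name.

```python
def resolve_refs(node, root):
    """Return a copy of `node` with local '#/...' $ref pointers inlined."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for part in ref[2:].split("/"):
                # Undo JSON Pointer escaping (~1 means "/", ~0 means "~")
                target = target[part.replace("~1", "/").replace("~0", "~")]
            return resolve_refs(target, root)
        return {key: resolve_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, root) for item in node]
    return node


# Illustrative spec fragment, not taken from a real API
spec = {
    "components": {"schemas": {"User": {"type": "object", "required": ["id"]}}},
    "paths": {"/users/{userId}": {"get": {"responses": {"200": {"content": {
        "application/json": {"schema": {"$ref": "#/components/schemas/User"}}}}}}}},
}
media = spec["paths"]["/users/{userId}"]["get"]["responses"]["200"]["content"]["application/json"]
print(resolve_refs(media["schema"], spec))
```

Recursive schemas (a `User` that contains a list of `User`) would make this function loop forever, so a production validator should rely on a resolver that handles cycles.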
Phase 2: Compliance Review Checklist
Once automated checks pass, humans review against the spec. Use a detailed checklist:
SPEC COMPLIANCE REVIEW CHECKLIST
=================================
[ ] Endpoints / Functions
    [ ] All endpoints from spec are implemented?
    [ ] No extra unspecified endpoints?
    [ ] HTTP methods (GET/POST/PATCH) are correct?
    [ ] URL paths match spec exactly?
[ ] Request Validation
    [ ] All required fields are validated?
    [ ] Field types match spec (string, number, boolean, array)?
    [ ] Constraints are enforced (minLength, maxLength, enum, regex)?
    [ ] Invalid inputs are rejected with 400 Bad Request?
    [ ] Error messages are descriptive (from spec)?
[ ] Response Format
    [ ] Success response matches spec schema?
    [ ] Status code is correct (200 for GET, 201 for POST, etc.)?
    [ ] All required response fields are present?
    [ ] No extra unspecified fields in response?
    [ ] Timestamps are in ISO 8601 format (if spec requires)?
    [ ] Numbers have correct precision (integers vs decimals)?
[ ] Error Handling
    [ ] 400 Bad Request for invalid inputs?
    [ ] 401 Unauthorized for missing/invalid auth?
    [ ] 404 Not Found when resource doesn't exist?
    [ ] 409 Conflict for duplicate/conflict scenarios?
    [ ] 500 Server Error is handled gracefully (no stack traces leaked)?
    [ ] Error response includes actionable error message?
[ ] Authentication & Authorization
    [ ] Protected endpoints check authentication?
    [ ] JWT tokens are validated (signature, expiry)?
    [ ] Unauthorized requests are rejected (401)?
    [ ] Admin-only endpoints check permissions?
[ ] Data Persistence
    [ ] Data is persisted correctly (written to database)?
    [ ] Sensitive data (passwords, tokens) are hashed/encrypted?
    [ ] Transactions are atomic (all or nothing)?
    [ ] Database constraints (unique, foreign key) are enforced?
[ ] Edge Cases (from spec examples)
    [ ] Boundary values work (min, max, empty)?
    [ ] Special characters in strings are handled?
    [ ] Null/undefined values are handled per spec?
    [ ] Concurrent requests don't corrupt data?
[ ] Performance
    [ ] Response latency is within spec budget?
    [ ] Large data sets don't cause timeout (pagination works)?
    [ ] No N+1 database queries?
    [ ] Caching is used per spec?
[ ] Logging & Observability
    [ ] All errors are logged (for debugging)?
    [ ] Request/response tracing is available?
    [ ] Metrics are emitted (request count, latency, errors)?
Go through each item. If any check fails, file a bug and request regeneration.
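Several of the Response Format items are mechanical enough to script, which leaves reviewers free for the items that need judgment. The sketch below checks one response dict for missing required fields, unspecified fields, and non-ISO 8601 timestamps. It assumes a flat object schema with `properties` and `required`, and that timestamp fields are marked with `format: date-time`; `check_response_fields` and `user_schema` are hypothetical names. `datetime.fromisoformat` accepts the common ISO 8601 forms rather than the full standard.

```python
from datetime import datetime


def check_response_fields(response: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in response:
            problems.append(f"missing required field: {name}")
    for name in response:
        if name not in properties:
            problems.append(f"unspecified field: {name}")
    for name, prop in properties.items():
        if prop.get("format") == "date-time" and name in response:
            value = response[name]
            try:
                if not isinstance(value, str):
                    raise ValueError("timestamp is not a string")
                # fromisoformat() rejects a trailing "Z" before Python 3.11
                datetime.fromisoformat(value.replace("Z", "+00:00"))
            except ValueError:
                problems.append(f"{name} is not an ISO 8601 datetime: {value!r}")
    return problems


# Illustrative schema and response
user_schema = {
    "required": ["id", "email", "createdAt"],
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "createdAt": {"type": "string", "format": "date-time"},
    },
}
for problem in check_response_fields({"id": 1, "createdAt": 1748942400, "debug": True}, user_schema):
    print(problem)
```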
Phase 3: Quality Review (Beyond Spec)
Once code is spec-compliant, review for quality:
QUALITY REVIEW CHECKLIST
=========================
[ ] Readability
    [ ] Variable/function names are clear and self-documenting?
    [ ] Code is logically organized (no spaghetti)?
    [ ] Comments explain why, not what (code shows what)?
    [ ] Complex logic has explanatory comments?
[ ] Maintainability
    [ ] DRY principle (no duplicated code)?
    [ ] Functions are small and focused (single responsibility)?
    [ ] No magic numbers (use named constants)?
    [ ] Error handling is centralized (not scattered)?
[ ] Performance
    [ ] No obvious inefficiencies (nested loops, repeated queries)?
    [ ] Caching is used appropriately?
    [ ] Memory usage is reasonable (no large allocations in loops)?
[ ] Security
    [ ] SQL injection is prevented (use parameterized queries)?
    [ ] Authentication is enforced on sensitive endpoints?
    [ ] Rate limiting prevents abuse?
    [ ] Secrets (API keys, passwords) are not hardcoded?
    [ ] HTTPS/TLS is required for sensitive data?
[ ] Testing
    [ ] Are there tests for happy path?
    [ ] Are there tests for error cases?
    [ ] Test coverage is adequate (>80%)?
    [ ] Tests are not brittle (don't break on refactoring)?
[ ] Documentation
    [ ] Function docstrings explain inputs, outputs, side effects?
    [ ] README includes setup and usage instructions?
    [ ] Complex algorithms have detailed comments?
    [ ] API documentation is auto-generated or clear?
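Most issues found here are improvement opportunities, not blockers; file them as follow-up tasks. Security findings such as injection risks or hardcoded secrets are the exception and should block release until fixed.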
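Some quality items can be pre-screened before a human reads the code. As one example, the "secrets are not hardcoded" item can be approximated with the standard `ast` module. This is a heuristic sketch: the name pattern is an assumption, `find_hardcoded_secrets` is a hypothetical helper, and a dedicated secret scanner will be more thorough.

```python
import ast
import re

# Assumed naming convention for secret-looking variables
SECRET_NAME = re.compile(r"(password|passwd|secret|token|api_?key)", re.IGNORECASE)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for non-empty string literals assigned to secret-looking names."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets, value = node.targets, node.value
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets, value = [node.target], node.value
        else:
            continue
        is_literal = isinstance(value, ast.Constant) and isinstance(value.value, str)
        if not (is_literal and value.value):
            continue  # values read from the environment or a vault are not literals
        for target in targets:
            if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                findings.append((node.lineno, target.id))
    return findings


sample = 'API_KEY = "sk-live-123"\npassword = os.environ["DB_PASSWORD"]\nretries = 3\n'
print(find_hardcoded_secrets(sample))
```

The check only sees simple name assignments, so secrets passed as keyword arguments or stored in dict literals would slip through.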
Automated Compliance Checking
Create a tool that automatically validates generated code against a spec:
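# spec_validator.py
# Sketch under two assumptions: the spec is OpenAPI 3.x with JSON bodies, and every
# operation has an operationId equal to the name of its handler function.
# Local $ref pointers must be resolved before the spec is passed in.
import inspect
import re
from importlib import import_module

import yaml  # PyYAML
from jsonschema import ValidationError, validate

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}


def load_spec(path: str) -> dict:
    """Load an OpenAPI document from a YAML file."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


class SpecValidator:
    def __init__(self, spec: dict):
        self.spec = spec
        # operationId -> operation object, flattened across paths and methods
        self.operations = {
            op["operationId"]: op
            for path_item in spec["paths"].values()
            for method, op in path_item.items()
            if method in HTTP_METHODS
        }

    @staticmethod
    def _json_schema(container: dict) -> dict:
        """Pull the application/json schema out of a requestBody or response object."""
        return container.get("content", {}).get("application/json", {}).get("schema", {})

    def _extract_functions(self, code_module) -> set:
        """Names of public functions defined in the module itself (imports excluded)."""
        return {
            name
            for name, obj in inspect.getmembers(code_module, inspect.isfunction)
            if obj.__module__ == code_module.__name__ and not name.startswith("_")
        }

    def _extract_validations(self, source: str) -> set:
        """Heuristic: any quoted identifier in the handler source counts as handled."""
        return set(re.findall(r"[\"'](\w+)[\"']", source))

    def _execute_test_request(self, code_module, operation_id: str):
        """Placeholder: calls the handler with no arguments.
        Replace with a framework test client and spec-derived fixtures."""
        return getattr(code_module, operation_id)()

    def validate_endpoints(self, code_module):
        """Verify all spec operations are implemented and nothing extra is exposed"""
        spec_operations = set(self.operations)
        code_functions = self._extract_functions(code_module)
        missing = spec_operations - code_functions
        extra = code_functions - spec_operations
        if missing:
            return False, f"Missing endpoints: {sorted(missing)}"
        if extra:
            return False, f"Extra unspecified endpoints: {sorted(extra)}"
        return True, "All endpoints implemented"

    def validate_request_schema(self, code_module, operation_id: str):
        """Verify request parsing covers every required field in the spec"""
        code_function = getattr(code_module, operation_id, None)
        if code_function is None:
            return False, f"No handler named {operation_id}"
        spec_schema = self._json_schema(self.operations[operation_id].get("requestBody", {}))
        # Extract parameter validation from code
        validated_params = self._extract_validations(inspect.getsource(code_function))
        # Compare against spec
        for param in spec_schema.get("required", []):
            if param not in validated_params:
                return False, f"Parameter {param} not validated"
        return True, "Request validation complete"

    def validate_response_schema(self, code_module, operation_id: str):
        """Verify the 200 response matches the spec schema"""
        responses = self.operations[operation_id].get("responses", {})
        # Unquoted status codes in YAML load as integers, so try both key types
        spec_schema = self._json_schema(responses.get("200", responses.get(200, {})))
        try:
            # Execute code and check response
            response = self._execute_test_request(code_module, operation_id)
            validate(instance=response, schema=spec_schema)
            return True, "Response matches schema"
        except ValidationError as e:
            return False, f"Response validation failed: {e.message}"
        except Exception as e:  # handler missing or crashed during the test call
            return False, f"Handler could not be exercised: {e!r}"

    def validate_all(self, code_module):
        """Run all validations"""
        # Check endpoints
        ok, msg = self.validate_endpoints(code_module)
        results = [("Endpoints", ok, msg)]
        # Check each operation's request and response
        for operation_id in self.operations:
            ok, msg = self.validate_request_schema(code_module, operation_id)
            results.append((f"Request: {operation_id}", ok, msg))
            ok, msg = self.validate_response_schema(code_module, operation_id)
            results.append((f"Response: {operation_id}", ok, msg))
        return results


# Usage
spec = load_spec("api_spec.yaml")
validator = SpecValidator(spec)
generated_code = import_module("generated_api")
results = validator.validate_all(generated_code)
for check_name, ok, message in results:
    status = "✓" if ok else "✗"
    print(f"{status} {check_name}: {message}")
failed = sum(1 for _, ok, _ in results if not ok)
print(f"\nTotal: {len(results)} checks, {failed} failures")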
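This tool automates the tedious parts of compliance review. Checks that scan source text are heuristic, so treat a pass as a signal to proceed to human review, not as proof of compliance.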
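To make the validator a gate rather than a report, a CI job can turn its result tuples into a process exit code. The sketch below assumes results shaped like the `(check_name, ok, message)` tuples returned by `validate_all`; `exit_code_for` is a hypothetical helper and the example data is illustrative.

```python
import sys


def exit_code_for(results: list[tuple[str, bool, str]]) -> int:
    """Print failures to stderr and return 0 if every check passed, else 1."""
    failures = [(name, msg) for name, ok, msg in results if not ok]
    for name, msg in failures:
        print(f"FAIL {name}: {msg}", file=sys.stderr)
    return 1 if failures else 0


example = [
    ("Endpoints", True, "All endpoints implemented"),
    ("Response: /users/{userId}", False, "Response validation failed"),
]
code = exit_code_for(example)
print(f"exit code: {code}")  # a CI wrapper would call sys.exit(code)
```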
Handling Review Failures
When code fails review, don't just reject it; provide actionable feedback:
REVIEW FAILURE REPORT
=====================
Generated Code: user_service.py
Spec: api_spec.yaml
Review Date: 2026-06-02
FAILURES:
1. ✗ Endpoint GET /users/{userId} does not return 404 when user not found
- Spec requires: "404 Not Found" response when user ID not in database
- Code currently: Returns 200 with user=null
- Action: Regenerate with check: if user is None, raise 404
2. ✗ POST /users request validation missing email format check
- Spec requires: "email must match RFC 5322"
- Code currently: Only checks that email is a string
- Action: Add email validation: re.match(RFC5322_PATTERN, email)
3. ✗ Response schema mismatch: createdAt field
- Spec requires: ISO 8601 datetime (e.g., "2026-06-02T10:30:00Z")
- Code currently: Unix timestamp (e.g., 1748942400)
- Action: Convert createdAt to ISO 8601 in response
RECOMMENDATION: Regenerate with the above feedback, then re-review.
Share this report with the AI system (or human developer) so they can fix the issues.
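Keeping each failure as a structured record makes a report like the one above easy to generate and to paste into a regeneration prompt. The following is a minimal sketch; `ReviewFailure` and `render_report` are hypothetical names, and the fields simply mirror the report layout shown above.

```python
from dataclasses import dataclass


@dataclass
class ReviewFailure:
    summary: str
    spec_requires: str
    code_currently: str
    action: str


def render_report(artifact: str, spec: str, review_date: str,
                  failures: list[ReviewFailure]) -> str:
    """Render failures in the plain-text report layout."""
    lines = [
        "REVIEW FAILURE REPORT",
        "=====================",
        f"Generated Code: {artifact}",
        f"Spec: {spec}",
        f"Review Date: {review_date}",
        "FAILURES:",
    ]
    for number, failure in enumerate(failures, start=1):
        lines += [
            f"{number}. ✗ {failure.summary}",
            f"   - Spec requires: {failure.spec_requires}",
            f"   - Code currently: {failure.code_currently}",
            f"   - Action: {failure.action}",
        ]
    lines.append("RECOMMENDATION: Regenerate with the above feedback, then re-review.")
    return "\n".join(lines)


failures = [
    ReviewFailure(
        summary="Response schema mismatch: createdAt field",
        spec_requires="ISO 8601 datetime",
        code_currently="Unix timestamp",
        action="Convert createdAt to ISO 8601 in response",
    )
]
print(render_report("user_service.py", "api_spec.yaml", "2026-06-02", failures))
```

Requiring all four fields for every failure is the point of the structure: a record cannot be created without stating what the spec requires and what to change.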
Comparison Table: Review Approaches
| Approach | Automation | Coverage | Effort | Reliability |
|---|---|---|---|---|
| Manual checklist | Low | Medium (human forgets items) | High | Medium |
| Automated + manual | Medium | High | Medium | High |
| Fully automated | High | Partial (misses quality nuances) | Low | Medium |
Key Takeaways
- Artifact review has three phases: automated validation (machine checks), compliance review (spec adherence), and quality review (beyond spec).
- Automated schema validation is critical: it catches common AI errors such as wrong response types and missing fields.
- Compliance checklists ensure spec requirements are honored; quality checklists ensure production-readiness.
- Rejection reports should be actionable, not just "this failed"; tell the AI/developer exactly what to fix.
- Tool support (automated validators) reduces review time and catches errors humans miss.
Frequently Asked Questions
How much of review can be automated?
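As rough rules of thumb rather than measured figures: syntax, type, and schema checks can be automated almost entirely (90%+); spec compliance largely (70–80%); quality, performance, and security only about half (40–50%). Combine automation with human judgment for the rest.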
What if AI-generated code passes all checks but feels wrong?
Trust your instinct: an automated check may have missed something. Pin down why the code makes you uneasy, since intuitions about correctness are often right, then update the spec or the validator so the issue is caught next time.
How do I review code I don't understand?
Ask the AI to explain it first. Generate docstrings. Ask for simplification. If code can't be understood by a reasonable developer, it's too complex and should be regenerated.
Can I automate the quality review checklist?
Partially. Linters (flake8, ESLint), static analysis (SonarQube), and security scanners (Snyk) catch a substantial share of quality issues, roughly 50–70% as a rule of thumb. Humans review the rest.