
Schema-Constrained Extraction: Define Expected Output

Schema-constrained extraction is the practice of defining a strict JSON schema that specifies what fields must be extracted, their data types, valid values, and constraints. Instead of asking a language model to output whatever it finds, you constrain it: "Output ONLY valid JSON matching this schema; if a field is missing, use null; if a value doesn't match the type, flag it." This approach dramatically improves reliability for production systems because the model's output is automatically validated against a contract you define.

I've found that explicit schemas reduce downstream errors by 80%+ compared to freeform extraction. When your accounting system expects {"amount": number, "currency": "USD"|"EUR", "date": "YYYY-MM-DD"}, getting back JSON that doesn't match breaks everything downstream. Schema constraints prevent that.

Why Schema Constraints Matter

The Problem: Garbage In, Garbage Out

Without schema constraints, a model might return:

{
  "invoice_amount": "$1,234.56",
  "currency": "usd",
  "date": "June 2, 2026"
}

Your system expects:

{
  "amount": 1234.56,
  "currency": "USD",
  "date": "2026-06-02"
}

The mismatch breaks downstream integrations. With schema constraints, the model knows exactly what to return, and you validate it matches before processing.
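To make the gap concrete, here is a minimal sketch of the repair code you would otherwise have to write for just this one example. The function name normalize_invoice is hypothetical, and the sketch assumes US-style amounts and long-form English dates; every new format the model invents would need another branch, which is the maintenance burden a schema avoids.

```python
from datetime import datetime


def normalize_invoice(raw: dict) -> dict:
    """Map the loosely formatted example above onto the expected contract."""
    # Strip the currency symbol and thousands separators before parsing.
    amount_text = raw["invoice_amount"].replace("$", "").replace(",", "").strip()
    # Parse a long-form English date such as "June 2, 2026".
    parsed_date = datetime.strptime(raw["date"], "%B %d, %Y").date()
    return {
        "amount": float(amount_text),
        "currency": raw["currency"].strip().upper(),
        "date": parsed_date.isoformat(),
    }


loose = {"invoice_amount": "$1,234.56", "currency": "usd", "date": "June 2, 2026"}
print(normalize_invoice(loose))
```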

Building a Schema

A JSON schema defines the structure, types, and constraints:

import json
from typing import Any

# Define a schema for invoice extraction
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Unique invoice identifier",
            "minLength": 1
        },
        "invoice_date": {
            "type": "string",
            "format": "date",
            "description": "Invoice date in YYYY-MM-DD format"
        },
        "vendor_name": {
            "type": "string",
            "description": "Name of the vendor/supplier"
        },
        "total_amount": {
            "type": "number",
            "minimum": 0,
            "description": "Total invoice amount"
        },
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP", "CAD", "AUD"],
            "description": "Currency code"
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number", "minimum": 0},
                    "unit_price": {"type": "number", "minimum": 0},
                    "total": {"type": "number", "minimum": 0}
                },
                "required": ["description", "quantity", "unit_price", "total"]
            },
            "minItems": 1
        },
        "tax_rate": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Tax rate as decimal (e.g., 0.08 for 8%)"
        }
    },
    "required": ["invoice_number", "invoice_date", "vendor_name", "total_amount"],
    "additionalProperties": False
}

Schema-Constrained Extraction with Claude

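The function below does not depend on a dedicated structured-output parameter such as response_format. It embeds the schema in the prompt and instructs Claude to return only JSON that matches it; defining the schema as a tool's input_schema is a common alternative. Either way, treat the reply as untrusted until it has been validated: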

import anthropic
import base64
import json
from pathlib import Path

def extract_with_schema(image_path: str, schema: dict) -> dict:
    """
    Extract document data constrained by a JSON schema.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Convert schema to a readable description for the model
    schema_description = json.dumps(schema, indent=2)

    prompt = f"""Extract data from this invoice image.

Return ONLY valid JSON matching this schema:
{schema_description}

Important instructions:
- All required fields must be present (use null only if field is truly missing)
- Respect data types: strings, numbers, arrays, objects as specified
- Enums: use ONLY the allowed values
- Constraints: respect minimum/maximum, minLength, etc.
- Do NOT add fields not in the schema
- If extraction is uncertain, use null rather than guessing
- Validate dates are in YYYY-MM-DD format
- Validate amounts are numbers without currency symbols"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # assumes a JPEG input
                            "data": base64_image
                        }
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ]
    )

    # Raises json.JSONDecodeError if the reply is not pure JSON
    return json.loads(response.content[0].text)

# Example usage
result = extract_with_schema("invoice.jpg", invoice_schema)
print(json.dumps(result, indent=2))
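extract_with_schema hands the reply straight to json.loads, which raises if the model wraps the JSON in a Markdown code fence or adds a sentence around it. The sketch below shows one defensive way to parse such replies; parse_json_reply is a hypothetical helper, not part of the Anthropic SDK, and it assumes the reply contains exactly one top-level JSON object.

```python
import json
import re

# A Markdown code fence marker, built indirectly so this example stays easy to embed.
FENCE = "`" * 3
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_json_reply(text: str) -> dict:
    """Parse a model reply that should contain a single JSON object."""
    cleaned = text.strip()
    fenced = FENCED_REPLY.match(cleaned)
    if fenced:
        # The whole reply is one fenced block: keep only its body.
        cleaned = fenced.group(1)
    else:
        # Otherwise keep the span from the first "{" to the last "}".
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start != -1 and end > start:
            cleaned = cleaned[start:end + 1]
    result = json.loads(cleaned)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(result, dict):
        raise ValueError("Expected a JSON object at the top level")
    return result
```

Swapping this in for the final json.loads call is optional; a parse failure is still a useful signal that the reply should be retried or reviewed.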

Schema Validation

After extraction, validate the result against the schema:

import json
from jsonschema import Draft202012Validator, FormatChecker

def validate_against_schema(data: dict, schema: dict) -> tuple[bool, list[str]]:
    """
    Validate extracted data against a JSON schema.
    Returns (is_valid, error_list)
    """
    # "format" keywords such as "date" are only enforced when a format checker is supplied
    validator = Draft202012Validator(schema, format_checker=FormatChecker())

    # Collect every violation instead of stopping at the first one
    errors = [
        f"Validation error at {'/'.join(str(p) for p in e.path) or '<root>'}: {e.message}"
        for e in validator.iter_errors(data)
    ]
    return len(errors) == 0, errors

def process_and_validate(image_path: str, schema: dict) -> dict:
    """
    Extract with schema, validate, and return result with status.
    """
    extracted = extract_with_schema(image_path, schema)
    is_valid, errors = validate_against_schema(extracted, schema)

    return {
        "data": extracted,
        "is_valid": is_valid,
        "errors": errors,
        "requires_review": not is_valid
    }

# Example usage
result = process_and_validate("invoice.jpg", invoice_schema)
if result["is_valid"]:
    print("Extraction successful!")
    print(json.dumps(result["data"], indent=2))
else:
    print("Extraction has errors:")
    for error in result["errors"]:
        print(f"  - {error}")

Advanced Schema Patterns

Pattern 1: Conditional Constraints

Some fields are required only if another field has a specific value:

# If currency is USD, tax_rate is required; if EUR, vat_id is required
conditional_schema = {
    "type": "object",
    "properties": {
        "currency": {"type": "string", "enum": ["USD", "EUR"]},
        "tax_rate": {"type": "number", "minimum": 0, "maximum": 1},
        "vat_id": {"type": ["string", "null"]}
    },
    "allOf": [
        {
            # "required" inside "if" stops the rule from firing when currency is absent
            "if": {"properties": {"currency": {"const": "USD"}}, "required": ["currency"]},
            "then": {"required": ["tax_rate"]}
        },
        {
            "if": {"properties": {"currency": {"const": "EUR"}}, "required": ["currency"]},
            "then": {"required": ["vat_id"]}
        }
    ]
}

Pattern 2: Enum Constraints (Controlled Vocabularies)

Use enums to restrict field values to a fixed set:

document_type_schema = {
    "type": "object",
    "properties": {
        "document_type": {
            "type": "string",
            "enum": ["invoice", "receipt", "credit_note", "purchase_order"]
        },
        "status": {
            "type": "string",
            "enum": ["paid", "unpaid", "overdue", "cancelled"]
        }
    },
    "required": ["document_type", "status"]
}
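Enum checks are exact, so "Credit Note" or "PAID" fails validation even though the intent is obvious. A small normalization step before validation can absorb that kind of noise. The sketch below is one possible approach; normalize_enum and the aliases mapping are hypothetical names, and the rules (lowercasing, treating spaces and hyphens as underscores) are assumptions you would adapt to your own vocabulary.

```python
from typing import Optional


def normalize_enum(value, allowed: list[str], aliases: Optional[dict] = None) -> Optional[str]:
    """Return the matching allowed value, or None if there is no safe match."""
    if not isinstance(value, str):
        return None
    # Compare case-insensitively and treat spaces/hyphens as underscores.
    key = value.strip().lower().replace("-", "_").replace(" ", "_")
    if aliases and key in aliases:
        key = aliases[key].lower()
    lookup = {option.lower(): option for option in allowed}
    return lookup.get(key)


DOCUMENT_TYPES = ["invoice", "receipt", "credit_note", "purchase_order"]
STATUSES = ["paid", "unpaid", "overdue", "cancelled"]

print(normalize_enum("Credit Note", DOCUMENT_TYPES))
print(normalize_enum("PAID", STATUSES))
print(normalize_enum("memo", DOCUMENT_TYPES))
```

Returning None for unknown values, instead of picking the closest match, keeps ambiguous cases visible so they can go to review.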

Pattern 3: Array Constraints

Specify minimum/maximum array sizes and item constraints:

line_items_schema = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "minItems": 1,
            "maxItems": 500,
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "quantity": {"type": "integer", "minimum": 1},
                    "price": {"type": "number", "minimum": 0}
                },
                "required": ["sku", "quantity", "price"]
            }
        }
    }
}

Pattern 4: Cross-Field Validation

Validate relationships between fields (e.g., due_date must be after invoice_date):

from datetime import datetime

def validate_cross_field_constraints(data: dict) -> list[str]:
    """Validate relationships between fields."""
    errors = []

    invoice_date = data.get("invoice_date")
    due_date = data.get("due_date")

    if invoice_date and due_date:
        try:
            inv_date = datetime.fromisoformat(invoice_date)
            due = datetime.fromisoformat(due_date)
            if due < inv_date:
                errors.append("Due date must not be earlier than invoice date")
        except (TypeError, ValueError):
            # Report unparseable dates instead of silently skipping the check
            errors.append("invoice_date and due_date must be ISO dates (YYYY-MM-DD)")

    # Validate total matches line items sum (only when both are present)
    line_items = data.get("line_items") or []
    reported_total = data.get("total_amount")

    if line_items and reported_total is not None:
        calculated_total = sum((item.get("total") or 0) for item in line_items)
        if abs(calculated_total - reported_total) > 0.01:
            errors.append(
                f"Total amount ({reported_total}) doesn't match line items sum ({calculated_total})"
            )

    return errors

def comprehensive_validation(data: dict, schema: dict) -> tuple[bool, list[str]]:
    """Run both schema validation and custom constraints."""
    errors = []

    # Schema validation
    is_valid, schema_errors = validate_against_schema(data, schema)
    errors.extend(schema_errors)

    # Cross-field validation
    cross_field_errors = validate_cross_field_constraints(data)
    errors.extend(cross_field_errors)

    return len(errors) == 0, errors
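The total check above sums floats and relies on a 0.01 tolerance to hide rounding noise. For monetary values, a variant built on Python's decimal module avoids that noise at the source. This is a sketch under the assumption that each line item carries a numeric "total"; totals_match is a hypothetical helper name.

```python
from decimal import Decimal


def totals_match(line_items: list[dict], reported_total, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Compare the sum of line-item totals with the reported total using exact decimals."""
    # Convert through str() so 0.1 becomes Decimal("0.1"), not its binary approximation.
    calculated = sum((Decimal(str(item["total"])) for item in line_items), Decimal("0"))
    return abs(calculated - Decimal(str(reported_total))) <= tolerance


items = [{"total": 0.1}, {"total": 0.2}]
print(0.1 + 0.2 == 0.3)                                   # float arithmetic drifts
print(totals_match(items, 0.3, tolerance=Decimal("0")))   # decimal arithmetic does not
```

Keeping a small tolerance is still reasonable when the document itself rounds per line, but it then reflects a business rule and not floating-point error.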

Schema Versioning

As your extraction needs evolve, schemas change. Use versioning to manage compatibility:

invoice_schema_v1 = {
    "version": "1.0",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"}
    }
}

invoice_schema_v2 = {
    "version": "2.0",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},  # renamed from 'total'
        "currency": {"type": "string", "enum": ["USD", "EUR"]}  # new field
    }
}

def extract_with_version(image_path: str, schema_version: str = "latest") -> dict:
    """Extract using a specific schema version."""
    schema_map = {
        "1.0": invoice_schema_v1,
        "2.0": invoice_schema_v2,
        "latest": invoice_schema_v2
    }

    schema = schema_map[schema_version]
    return extract_with_schema(image_path, schema)
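Records extracted under the old schema usually need to be brought forward as well. The sketch below shows what a migration between the two versions above might look like; migrate_v1_to_v2 is a hypothetical helper, and leaving currency unset (instead of assuming a default) is a design choice, not a requirement.

```python
def migrate_v1_to_v2(record: dict) -> dict:
    """Convert a record extracted with the v1 schema into the v2 shape."""
    # Copy everything except the renamed field.
    migrated = {key: value for key, value in record.items() if key != "total"}
    if "total" in record:
        migrated["total_amount"] = record["total"]  # 'total' was renamed in v2
    # v1 carried no currency information; leave the key out instead of guessing.
    return migrated


old_record = {"invoice_number": "INV-001", "total": 250.0}
print(migrate_v1_to_v2(old_record))
```

Because currency is not in a "required" list in v2, the migrated record still fits the v2 schema without it.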

Key Takeaways

  • Schema constraints force extracted data into a predefined structure, dramatically improving reliability.
  • JSON Schema defines field types, required fields, constraints, and allowed values — your extraction contract.
  • Always validate extracted data against the schema before downstream processing.
  • Advanced patterns: conditional constraints, enums (controlled vocabularies), array limits, and cross-field validation.
  • Version schemas as requirements evolve; handle version migration when necessary.

Frequently Asked Questions

Should I use strict schemas or loose ones?

Start strict: define required fields and types clearly. This catches errors early. If you find that strict schemas cause too many failures, relax constraints — but document what's optional. Loose schemas shift validation burden to downstream systems, which is riskier.

What if the extracted data doesn't match the schema?

Options: (1) route to human review, (2) attempt automatic repair (e.g., convert "usd" to "USD" if only the casing is wrong), (3) re-extract with a modified prompt, (4) use a fallback/default value. In production, logging non-conforming extractions helps you refine prompts and schemas.
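These options can be chained: repair what is safe to repair, retry with the validation errors as feedback, and fall back to human review. The sketch below is one way to wire that up. All names are hypothetical, and the three callables are assumptions standing in for your own extraction, validation, and repair functions.

```python
from typing import Callable


def extract_with_recovery(
    extract: Callable[[list[str]], dict],   # takes feedback from the previous attempt
    validate: Callable[[dict], list[str]],  # returns a list of error messages
    repair: Callable[[dict], dict],         # applies safe, mechanical fixes
    max_attempts: int = 2,
) -> dict:
    """Repair, retry with feedback, then route to review if errors remain."""
    feedback: list[str] = []
    data: dict = {}
    for attempt in range(1, max_attempts + 1):
        data = repair(extract(feedback))
        errors = validate(data)
        if not errors:
            return {"data": data, "status": "ok", "attempts": attempt, "errors": []}
        # Pass the errors to the next attempt so the prompt can mention them.
        feedback = errors
    return {"data": data, "status": "needs_review", "attempts": max_attempts, "errors": feedback}
```

Capping the number of attempts keeps cost predictable; anything still failing after the cap is more likely a document problem than a prompt problem.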

How do I handle optional fields?

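In JSON Schema, optional fields are simply left out of the "required" array. Note that omitting a field and setting it to null are different things: a property declared as {"type": "string"} rejects null, so if you want the model to return null for missing values, declare the type as ["string", "null"]. In your prompt, say which convention applies, for example: "Optional fields may be null if not found in the document." This is clearer than relying on the schema alone.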
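If the prompt tells the model to use null for missing values, the schema has to accept null for those fields, or valid extractions will fail validation. The sketch below widens every optional top-level property to accept null; make_optional_nullable is a hypothetical helper, and it deliberately handles only top-level properties with a "type" keyword.

```python
import copy


def make_optional_nullable(schema: dict) -> dict:
    """Return a copy of the schema in which optional properties also accept null."""
    widened = copy.deepcopy(schema)  # leave the caller's schema untouched
    required = set(widened.get("required", []))
    for name, prop in widened.get("properties", {}).items():
        if name in required:
            continue
        declared = prop.get("type")
        if isinstance(declared, str) and declared != "null":
            prop["type"] = [declared, "null"]
        elif isinstance(declared, list) and "null" not in declared:
            prop["type"] = declared + ["null"]
        # An enum restricts values independently of "type", so null must be listed there too.
        if "enum" in prop and None not in prop["enum"]:
            prop["enum"] = prop["enum"] + [None]
    return widened


example = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "currency": {"type": "string", "enum": ["USD", "EUR"]},
    },
    "required": ["invoice_number"],
}
print(make_optional_nullable(example)["properties"]["currency"])
```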

Can I use OpenAPI schemas instead of JSON Schema?

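Partially. The relationship depends on the OpenAPI version: OpenAPI 3.0 schema objects are an extended subset of an older JSON Schema draft (for example, they use nullable: true instead of a "null" type), while OpenAPI 3.1 aligns with JSON Schema 2020-12. As a result, some OpenAPI schemas won't validate as intended against pure JSON Schema validators. Stick with JSON Schema for clarity.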
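One common incompatibility is OpenAPI 3.0's nullable: true, which a JSON Schema validator ignores as an unknown keyword. The sketch below rewrites it into a JSON Schema type list. The function name is hypothetical, and the conversion covers only nullable; other OpenAPI-specific keywords, and nullable combined with enum, are left for you to handle.

```python
def openapi30_to_json_schema(node, is_property_map: bool = False):
    """Recursively rewrite OpenAPI 3.0 'nullable: true' into a JSON Schema type list."""
    if isinstance(node, list):
        return [openapi30_to_json_schema(item) for item in node]
    if not isinstance(node, dict):
        return node
    if is_property_map:
        # Keys here are property names, not keywords, so keep all of them.
        return {name: openapi30_to_json_schema(sub) for name, sub in node.items()}
    converted = {
        key: openapi30_to_json_schema(value, is_property_map=(key == "properties"))
        for key, value in node.items()
        if key != "nullable"
    }
    if node.get("nullable") is True and isinstance(converted.get("type"), str):
        converted["type"] = [converted["type"], "null"]
    return converted


openapi_style = {
    "type": "object",
    "properties": {"vat_id": {"type": "string", "nullable": True}},
}
print(openapi30_to_json_schema(openapi_style))
```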

How do I generate a schema from examples?

Libraries like jsonschema-infer can infer a schema from JSON examples, but the result is often too loose. Always hand-review and tighten constraints. For production, hand-craft schemas; auto-generated ones often miss important constraints.

Further Reading