Using Pydantic Models to Enforce LLM Output Types
Pydantic is a Python library that bridges the gap between JSON schemas and type-safe Python code. Instead of writing JSON Schema manually, you define a Pydantic BaseModel class with type annotations. Pydantic automatically generates the schema, validates incoming data, and raises clear errors when data is malformed. For LLM applications, Pydantic eliminates boilerplate and gives you IDE autocomplete for LLM outputs.
Why Pydantic Over Raw JSON Schema
Without Pydantic, you write JSON Schema by hand and separately define Python classes to hold the data. With Pydantic, the class is the schema.
Without Pydantic (manual schema + validation):
import json
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    },
    "required": ["name", "age", "email"]
}
response = client.chat.completions.create(...)
data = json.loads(response.choices[0].message.content)
# Manual validation
assert isinstance(data.get("name"), str), "name must be string"
assert isinstance(data.get("age"), int), "age must be integer"
assert 0 <= data.get("age", 0) <= 150, "age out of range"
# ... more validation
With Pydantic (one definition, automatic validation):
import json
from pydantic import BaseModel, Field
class Person(BaseModel):
    name: str
    age: int = Field(..., ge=0, le=150)  # ge = greater than or equal, le = less than or equal
    email: str
response = client.chat.completions.create(...)
data = json.loads(response.choices[0].message.content)
person = Person(**data)  # Validates types and the age range automatically
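To make "the class is the schema" concrete, the sketch below inspects the schema Pydantic derives from the same Person model. It assumes Pydantic v2 (where model_json_schema() is available) and makes no API call.

```python
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int = Field(..., ge=0, le=150)
    email: str


schema = Person.model_json_schema()

# The range constraint from Field(...) shows up as JSON Schema keywords.
age_schema = schema["properties"]["age"]
print(age_schema["minimum"], age_schema["maximum"])  # 0 150

# Fields without defaults are listed as required.
print(schema["required"])  # ['name', 'age', 'email']
```

The hand-written schema from the first example carried no range for age, so the range check had to live in separate assert statements; here both come from one definition.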
Basic Pydantic Models
A Pydantic BaseModel is a class with typed fields. When instantiated or validated, Pydantic checks types and constraints.
from pydantic import BaseModel, Field, ValidationError
class SentimentAnalysis(BaseModel):
    """Sentiment classification of a text."""
    sentiment: str  # Type constraint only
    confidence: float = Field(..., ge=0, le=1)  # Range constraint: 0 <= confidence <= 1
    explanation: str = Field(..., max_length=200)  # At most 200 characters
# Valid instantiation
analysis = SentimentAnalysis(
    sentiment="positive",
    confidence=0.95,
    explanation="The text expresses strong approval."
)
# Invalid instantiation (raises ValidationError reporting both problems)
try:
    bad = SentimentAnalysis(
        sentiment="positive",
        confidence=1.5,  # Out of range!
        explanation="x" * 300  # Too long!
    )
except ValidationError as e:
    print(f"Validation error: {e}")
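One detail worth knowing when checking types: by default Pydantic v2 validates in "lax" mode, so it coerces compatible inputs (for example, the string "0.95" to a float) instead of rejecting them. The following is a minimal sketch, assuming Pydantic v2 and a trimmed copy of the SentimentAnalysis model; passing strict=True to model_validate() turns coercion off.

```python
from pydantic import BaseModel, Field, ValidationError


class SentimentAnalysis(BaseModel):
    sentiment: str
    confidence: float = Field(..., ge=0, le=1)
    explanation: str = Field(..., max_length=200)


raw = {"sentiment": "positive", "confidence": "0.95", "explanation": "Clear approval."}

# Lax mode (the default): the numeric string is coerced to a float.
lenient = SentimentAnalysis.model_validate(raw)
print(type(lenient.confidence).__name__)  # float

# Strict mode: the same payload is rejected because "0.95" is a str.
try:
    SentimentAnalysis.model_validate(raw, strict=True)
    strict_rejected = False
except ValidationError:
    strict_rejected = True
print(strict_rejected)  # True
```

Lax mode is usually convenient for LLM output, where numbers occasionally arrive quoted; strict mode is the safer choice when a quoted number should count as a model mistake.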
Converting Pydantic Models to JSON Schema
Every Pydantic model can emit a JSON Schema describing itself, which you can pass to LLM APIs that accept a schema for structured output:
from pydantic import BaseModel
class Customer(BaseModel):
    name: str
    email: str
    phone: str
# Generate JSON Schema
schema = Customer.model_json_schema()
print(schema)
# Output (reformatted as JSON for readability):
# {
#   "properties": {
#     "name": {"title": "Name", "type": "string"},
#     "email": {"title": "Email", "type": "string"},
#     "phone": {"title": "Phone", "type": "string"}
#   },
#   "required": ["name", "email", "phone"],
#   "title": "Customer",
#   "type": "object"
# }
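Because the schema is what the LLM actually sees, it can help to attach descriptions to the model and its fields. The sketch below assumes Pydantic v2; the description texts are illustrative and not taken from any API.

```python
from pydantic import BaseModel, Field


class Customer(BaseModel):
    """A customer record extracted from free-form text."""

    name: str = Field(..., description="Full name as written in the source text")
    email: str = Field(..., description="Primary contact email address")


schema = Customer.model_json_schema()

# The class docstring becomes the top-level description.
print(schema["description"])

# Each Field description is attached to its property.
print(schema["properties"]["name"]["description"])
```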
Using Pydantic with LLM JSON Mode
Pass the generated schema directly to the LLM API:
from openai import OpenAI
from pydantic import BaseModel, ConfigDict, Field
client = OpenAI()
class ProductReview(BaseModel):
    """Extracted product review."""
    # extra="forbid" emits "additionalProperties": false, which strict mode requires
    model_config = ConfigDict(extra="forbid")
    product_name: str
    rating: int = Field(..., ge=1, le=5)
    positive_points: list[str] = Field(..., max_length=5)  # max_length replaces the deprecated max_items
    negative_points: list[str] = Field(..., max_length=5)
    recommendation: bool
response = client.chat.completions.create(
    model="gpt-4o",  # json_schema output needs a model that supports Structured Outputs
    messages=[
        {
            "role": "user",
            "content": "Extract review info from: 'The AirPods are great but battery dies fast.'"
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ProductReview",
            "schema": ProductReview.model_json_schema(),
            "strict": True
        }
    }
)
# Parse LLM response into Pydantic model
review = ProductReview.model_validate_json(response.choices[0].message.content)
print(f"Product: {review.product_name}")
print(f"Rating: {review.rating}/5")
print(f"Recommendation: {review.recommendation}")
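The response_format dictionary has the same shape for every model, so it can be factored out. The helper below is a hypothetical convenience (to_response_format is not part of Pydantic or the OpenAI SDK); it only builds the dictionary used above and makes no API call. ProductSummary is likewise a made-up model for the demonstration.

```python
from pydantic import BaseModel, ConfigDict


def to_response_format(model: type[BaseModel], strict: bool = True) -> dict:
    """Build a json_schema response_format payload from a Pydantic model."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": model.__name__,
            "schema": model.model_json_schema(),
            "strict": strict,
        },
    }


class ProductSummary(BaseModel):
    # Hypothetical model used only to exercise the helper.
    model_config = ConfigDict(extra="forbid")
    product_name: str
    recommendation: bool


response_format = to_response_format(ProductSummary)
print(response_format["json_schema"]["name"])  # ProductSummary
```

The returned dictionary can be passed as the response_format argument in place of the inline literal.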
Optional Fields and Defaults
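A field is optional when it has a default value. Optional[T] on its own only allows None as a value; pair it with a default (such as = None) so the field can also be omitted: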
from pydantic import BaseModel, Field
from typing import Optional
class Article(BaseModel):
    title: str
    content: str
    author: str = "Anonymous"  # Default value
    tags: Optional[list[str]] = None  # Optional field
    word_count: int = Field(default=0, ge=0)  # Default with constraint
# Valid instantiation with missing optional fields
article = Article(
    title="Learning Python",
    content="Python is great...",
    # author, tags, word_count are optional/have defaults
)
print(article.author) # "Anonymous"
print(article.tags) # None
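The difference between "may be None" and "may be omitted" is easy to miss. The sketch below, assuming Pydantic v2 and a hypothetical Profile model, shows that Optional[str] without a default is still a required field.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Profile(BaseModel):
    # Hypothetical model for illustration.
    nickname: Optional[str]      # may be None, but must be provided
    bio: Optional[str] = None    # may be None and may be omitted


# Explicit None satisfies the required-but-nullable field.
ok = Profile(nickname=None)
print(ok.bio)  # None

# Omitting nickname entirely is an error.
try:
    Profile()
    missing_rejected = False
except ValidationError as exc:
    missing_rejected = True
    print(exc.errors()[0]["type"])  # missing

print(Profile.model_json_schema()["required"])  # ['nickname']
```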
Enums and Constrained Strings
Use Python enums for fixed sets of values:
from enum import Enum
from pydantic import BaseModel, Field, ValidationError
class SentimentEnum(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"
class TextClassification(BaseModel):
    sentiment: SentimentEnum
    confidence: float = Field(..., ge=0, le=1)
# Valid
result = TextClassification(sentiment=SentimentEnum.POSITIVE, confidence=0.9)
# Also valid (string is auto-converted to enum)
result = TextClassification(sentiment="positive", confidence=0.9)
# Invalid (Pydantic raises ValidationError)
try:
    bad = TextClassification(sentiment="confused", confidence=0.9)
except ValidationError as e:
    print(f"Validation error: {e}")
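When a dedicated enum class feels heavy, typing.Literal is a lighter way to constrain a string to a fixed set. This is offered as an alternative sketch, not as the approach used above; it assumes Pydantic v2.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class TextClassification(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(..., ge=0, le=1)


result = TextClassification(sentiment="negative", confidence=0.8)
print(result.sentiment)  # negative

# The allowed values appear inline in the generated schema.
sentiment_schema = TextClassification.model_json_schema()["properties"]["sentiment"]
print(sentiment_schema["enum"])  # ['positive', 'negative', 'neutral']

# Values outside the set are rejected, just as with an Enum.
try:
    TextClassification(sentiment="confused", confidence=0.9)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```

With Literal the field holds a plain str, whereas the Enum version holds an enum member; which is preferable depends on how the value is used downstream.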
Nested Pydantic Models
Compose models by nesting other models:
from pydantic import BaseModel, EmailStr
class Address(BaseModel):
    street: str
    city: str
    country: str
class Person(BaseModel):
    name: str
    email: EmailStr  # Email validation (requires the pydantic[email] extra)
    address: Address
    phone_numbers: list[str] = []
# Nested instantiation
person = Person(
    name="Alice",
    email="alice@example.com",
    address=Address(street="123 Main St", city="NYC", country="USA"),
    phone_numbers=["555-0123", "555-4567"]
)
print(person.address.city) # "NYC"
# Pydantic automatically generates nested schema
schema = Person.model_json_schema()
# schema includes Address as a definition
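To see what "as a definition" means in practice, the sketch below validates a nested dictionary (the shape parsed LLM JSON would have) and inspects where Address lands in the schema. It assumes Pydantic v2 and drops the EmailStr field so it runs without the email extra.

```python
from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    country: str


class Person(BaseModel):
    name: str
    address: Address


# Nested dicts are validated recursively into model instances.
person = Person.model_validate(
    {"name": "Alice", "address": {"street": "123 Main St", "city": "NYC", "country": "USA"}}
)
print(type(person.address).__name__)  # Address

schema = Person.model_json_schema()
print(schema["properties"]["address"])  # {'$ref': '#/$defs/Address'}
print(list(schema["$defs"]))  # ['Address']
```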
Validating LLM Responses with Pydantic
This is the real power: validate and parse LLM output in one line:
from openai import OpenAI
from pydantic import BaseModel, ConfigDict, Field
client = OpenAI()
class ExtractedEntity(BaseModel):
    model_config = ConfigDict(extra="forbid")  # strict mode requires additionalProperties: false
    name: str
    entity_type: str = Field(..., pattern="^(person|organization|location)$")
    confidence: float = Field(..., ge=0, le=1)
class EntityExtraction(BaseModel):
    model_config = ConfigDict(extra="forbid")
    entities: list[ExtractedEntity]
# Get LLM response
response = client.chat.completions.create(
    model="gpt-4o",  # must be a model that supports Structured Outputs
    messages=[{"role": "user", "content": "Extract entities from..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "EntityExtraction",
            "schema": EntityExtraction.model_json_schema(),
            "strict": True
        }
    }
)
# Parse and validate in one call
extraction = EntityExtraction.model_validate_json(response.choices[0].message.content)
# Now use with IDE autocomplete
for entity in extraction.entities:
    print(f"{entity.name} ({entity.entity_type}): {entity.confidence:.2f}")
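Since model_validate_json() raises ValidationError for both malformed JSON and schema violations, it is a natural place to hang a retry. The sketch below is one possible pattern, not part of Pydantic or the OpenAI SDK: parse_with_retry and the Entity model are hypothetical, and the LLM call is replaced by a stand-in callable so the example runs offline.

```python
from typing import Callable, Optional, TypeVar

from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)


def parse_with_retry(call_llm: Callable[[], str], model: type[T], max_attempts: int = 3) -> T:
    """Call the LLM until its output validates against `model` or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[ValidationError] = None
    for _ in range(max_attempts):
        raw = call_llm()
        try:
            return model.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc  # remember why this attempt failed, then try again
    assert last_error is not None
    raise last_error


class Entity(BaseModel):
    name: str
    confidence: float


# Stand-in for a real API call: the first reply is incomplete, the second is valid.
replies = iter(['{"name": "Acme"}', '{"name": "Acme", "confidence": 0.9}'])
entity = parse_with_retry(lambda: next(replies), Entity)
print(entity.confidence)  # 0.9
```

In a real application call_llm would wrap the API request; feeding the previous validation error back into the prompt is a common refinement that this sketch leaves out.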
Error Handling with Pydantic
Pydantic's ValidationError provides detailed information about what failed:
from pydantic import BaseModel, Field, ValidationError
class Product(BaseModel):
    name: str
    price: float = Field(..., gt=0)
    stock: int = Field(..., ge=0)
# Parse LLM response that has invalid data
response_data = {
    "name": "Widget",
    "price": -10,  # Invalid (must be > 0)
    "stock": "many"  # Invalid (must be integer)
}
try:
    product = Product(**response_data)
except ValidationError as e:
    print(e.json())  # Detailed error report
    # Output (abridged; each entry also carries fields such as "type" and "input"):
    # [
    #   {"loc": ["price"], "msg": "Input should be greater than 0"},
    #   {"loc": ["stock"], "msg": "Input should be a valid integer, unable to parse string as an integer"}
    # ]
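For logging, or for feeding corrections back to the model, the structured list from errors() is often easier to work with than the JSON string. A minimal sketch, assuming Pydantic v2; summarize_errors is a hypothetical helper name.

```python
from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    name: str
    price: float = Field(..., gt=0)
    stock: int = Field(..., ge=0)


def summarize_errors(exc: ValidationError) -> list[str]:
    """Turn each validation error into a 'field.path: message' line."""
    lines = []
    for err in exc.errors():
        path = ".".join(str(part) for part in err["loc"])
        lines.append(f"{path}: {err['msg']}")
    return lines


try:
    Product(name="Widget", price=-10, stock="many")
except ValidationError as exc:
    feedback = summarize_errors(exc)
    print("\n".join(feedback))
```

Joining loc with dots keeps nested paths readable; an error in the second element of a list field would come out as something like items.1.task.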
Real-World Example: Multi-Step LLM Pipeline
Use Pydantic models at each step of a multi-LLM workflow:
from pydantic import BaseModel, ConfigDict, Field
from openai import OpenAI
client = OpenAI()
# Step 1: Classify email
class EmailClassification(BaseModel):
    model_config = ConfigDict(extra="forbid")  # strict mode requires additionalProperties: false
    category: str = Field(..., pattern="^(support|sales|billing|feedback)$")
    priority: str = Field(..., pattern="^(low|medium|high|urgent)$")
# Step 2: Extract action items
class ActionItem(BaseModel):
    model_config = ConfigDict(extra="forbid")
    task: str
    assignee: str
    deadline: str
class ActionItems(BaseModel):
    model_config = ConfigDict(extra="forbid")
    items: list[ActionItem]
def process_email(email_text):
    # Step 1: Classify
    response1 = client.chat.completions.create(
        model="gpt-4o",  # must be a model that supports Structured Outputs
        messages=[{"role": "user", "content": f"Classify: {email_text}"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "EmailClassification",
                "schema": EmailClassification.model_json_schema(),
                "strict": True
            }
        }
    )
    classification = EmailClassification.model_validate_json(response1.choices[0].message.content)
    # Step 2: Extract action items (use classification result in prompt)
    response2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": f"Extract action items from: {email_text}. Category: {classification.category}"}
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ActionItems",
                "schema": ActionItems.model_json_schema(),
                "strict": True
            }
        }
    )
    actions = ActionItems.model_validate_json(response2.choices[0].message.content)
    return classification, actions
# Use it
classification, actions = process_email("Please help me with my billing issue...")
print(f"Category: {classification.category}")
for action in actions.items:
    print(f"  - {action.task} (assigned to {action.assignee})")
Key Takeaways
- Pydantic BaseModel classes eliminate manual JSON Schema writing; the class is the schema.
- Type annotations (name: str, age: int) enforce types automatically.
- Field constraints (Field(..., ge=0, le=100)) prevent invalid values without boilerplate validation code.
- Nested models compose complex schemas elegantly.
- model_json_schema() generates LLM-compatible JSON Schema.
- model_validate_json() parses and validates LLM responses in one call.
- Enums lock in valid values; Pydantic auto-converts strings to enum members.
Frequently Asked Questions
Do I need to install extra dependencies for email validation?
Yes. The EmailStr type depends on the email-validator package, which the pydantic[email] extra installs: pip install "pydantic[email]".
Can Pydantic models be serialized to JSON?
Yes. Use model.model_dump_json() to serialize to JSON, or model.model_dump() for a Python dict.
person = Person(
    name="Alice",
    email="alice@example.com",
    address=Address(street="123 Main St", city="NYC", country="USA")
)
json_string = person.model_dump_json()
python_dict = person.model_dump()
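A quick way to confirm that serialization is lossless is a round trip: dump to JSON, validate the JSON back, and compare. The sketch below assumes Pydantic v2 and uses trimmed-down Person and Address models so it runs without the email extra.

```python
from pydantic import BaseModel


class Address(BaseModel):
    city: str
    country: str


class Person(BaseModel):
    name: str
    address: Address


person = Person(name="Alice", address=Address(city="NYC", country="USA"))

as_dict = person.model_dump()        # nested models become nested dicts
as_json = person.model_dump_json()   # JSON string

print(as_dict)  # {'name': 'Alice', 'address': {'city': 'NYC', 'country': 'USA'}}

# Round trip: the JSON string validates back into an equal model.
restored = Person.model_validate_json(as_json)
print(restored == person)  # True
```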
How do I handle fields that accept multiple types?
Use Union[Type1, Type2] or Field(..., discriminator=...) for discriminated unions. For LLM outputs, enums are often clearer.
from typing import Union
from pydantic import BaseModel
class Response(BaseModel):
    status: str
    data: Union[str, int, list[str]]
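For the discriminated-union case, each variant carries a literal tag field and Field(discriminator=...) names that tag. The sketch below assumes Pydantic v2; TextResult, NumberResult, and the "kind" tag are hypothetical names chosen for illustration.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class TextResult(BaseModel):
    kind: Literal["text"]
    content: str


class NumberResult(BaseModel):
    kind: Literal["number"]
    value: float


class Response(BaseModel):
    status: str
    # The "kind" field tells Pydantic which model to validate against.
    data: Union[TextResult, NumberResult] = Field(..., discriminator="kind")


response = Response.model_validate(
    {"status": "ok", "data": {"kind": "number", "value": 42}}
)
print(type(response.data).__name__)  # NumberResult
print(response.data.value)  # 42.0
```

Compared with a plain Union, the tag means Pydantic validates against exactly one variant, so a failure reports errors for that variant only.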
What if the LLM response has extra fields not in my Pydantic model?
By default, Pydantic ignores extra fields. To reject them, use ConfigDict:
from pydantic import ConfigDict, BaseModel
class StrictModel(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Reject extra fields
    name: str
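The two behaviours can be compared side by side. A minimal sketch, assuming Pydantic v2; LenientModel and the "colour" key are made up for the demonstration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientModel(BaseModel):
    name: str  # default behaviour: unknown keys are dropped silently


class StrictModel(BaseModel):
    model_config = ConfigDict(extra="forbid")
    name: str


payload = {"name": "Widget", "colour": "red"}  # "colour" is not declared

lenient = LenientModel.model_validate(payload)
print(lenient.model_dump())  # {'name': 'Widget'}

try:
    StrictModel.model_validate(payload)
    error_type = None
except ValidationError as exc:
    error_type = exc.errors()[0]["type"]
print(error_type)  # extra_forbidden
```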
Does Pydantic add latency to LLM applications?
Negligibly. Validating a typical schema takes microseconds, while the LLM call itself usually takes hundreds of milliseconds or more, so the parsing cost is lost in the noise.