Using Pydantic Models to Enforce LLM Output Types

Pydantic is a Python library that bridges the gap between JSON schemas and type-safe Python code. Instead of writing JSON Schema manually, you define a Pydantic BaseModel class with type annotations. Pydantic automatically generates the schema, validates incoming data, and raises clear errors when data is malformed. For LLM applications, Pydantic eliminates boilerplate and gives you IDE autocomplete for LLM outputs.

Why Pydantic Over Raw JSON Schema

Without Pydantic, you write JSON Schema by hand and separately define Python classes to hold the data. With Pydantic, the class is the schema.

Without Pydantic (manual schema + validation):

import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    },
    "required": ["name", "age", "email"]
}

response = client.chat.completions.create(...)
data = json.loads(response.choices[0].message.content)

# Manual validation
assert isinstance(data.get("name"), str), "name must be string"
assert isinstance(data.get("age"), int), "age must be integer"
assert 0 <= data.get("age", 0) <= 150, "age out of range"
# ... more validation

With Pydantic (one definition, automatic validation):

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str
    age: int = Field(..., ge=0, le=150)  # ge=greater-equal, le=less-equal
    email: str

response = client.chat.completions.create(...)
data = json.loads(response.choices[0].message.content)
person = Person(**data)  # Automatic validation

Basic Pydantic Models

A Pydantic BaseModel is a class with typed fields. When instantiated or validated, Pydantic checks types and constraints.

from pydantic import BaseModel, Field, ValidationError

class SentimentAnalysis(BaseModel):
    """Sentiment classification of a text."""
    sentiment: str  # Field with just type constraint
    confidence: float = Field(..., ge=0, le=1)  # ge/le = range constraint
    explanation: str = Field(..., max_length=200)  # max_length constraint

# Valid instantiation
analysis = SentimentAnalysis(
    sentiment="positive",
    confidence=0.95,
    explanation="The text expresses strong approval."
)

# Invalid instantiation (raises ValidationError)
try:
    bad = SentimentAnalysis(
        sentiment="positive",
        confidence=1.5,  # Out of range!
        explanation="x" * 300  # Too long!
    )
except ValidationError as e:
    print(f"Validation error: {e}")

Converting Pydantic Models to JSON Schema

Calling model_json_schema() on a model class returns a JSON Schema (as a Python dict) that you can pass to an LLM API's structured-output option:

from pydantic import BaseModel

class Customer(BaseModel):
    name: str
    email: str
    phone: str

# Generate JSON Schema
schema = Customer.model_json_schema()
print(schema)

# Output (formatted for readability; Pydantic also adds a "title" for the model and each field):
# {
#     "properties": {
#         "name": {"title": "Name", "type": "string"},
#         "email": {"title": "Email", "type": "string"},
#         "phone": {"title": "Phone", "type": "string"}
#     },
#     "required": ["name", "email", "phone"],
#     "title": "Customer",
#     "type": "object"
# }
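
Field constraints are carried into the generated schema as well. The sketch below uses a hypothetical Rating model (not part of the examples above) to show how ge, le, max_length and description are expected to appear as JSON Schema keywords under Pydantic v2:

```python
from pydantic import BaseModel, Field


class Rating(BaseModel):
    """Hypothetical model used only to show how constraints map to schema keywords."""

    score: int = Field(..., ge=1, le=5, description="Star rating from 1 to 5")
    comment: str = Field(..., max_length=200)


schema = Rating.model_json_schema()
score_schema = schema["properties"]["score"]
comment_schema = schema["properties"]["comment"]

# ge / le become "minimum" / "maximum"
print(score_schema["minimum"], score_schema["maximum"])  # 1 5
# description is passed through unchanged, so the LLM can read it
print(score_schema["description"])  # Star rating from 1 to 5
# max_length on a string becomes "maxLength"
print(comment_schema["maxLength"])  # 200
```

Descriptions are worth writing carefully, since the schema is usually the only documentation of the field that the model receives.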

Using Pydantic with LLM JSON Mode

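Pass the generated schema to the LLM API. The example below uses OpenAI's json_schema response format, which needs a model that supports structured outputs and, in strict mode, a schema whose objects set "additionalProperties": false:

from openai import OpenAI
from pydantic import BaseModel, ConfigDict, Field

client = OpenAI()

class ProductReview(BaseModel):
    """Extracted product review."""
    # extra="forbid" adds "additionalProperties": false to the generated schema
    model_config = ConfigDict(extra="forbid")

    product_name: str
    rating: int = Field(..., ge=1, le=5)
    positive_points: list[str] = Field(..., max_length=5)  # Pydantic v2 uses max_length for lists
    negative_points: list[str] = Field(..., max_length=5)
    recommendation: bool

response = client.chat.completions.create(
    model="gpt-4o",  # json_schema output requires a model with structured-output support
    messages=[
        {
            "role": "user",
            "content": "Extract review info from: 'The AirPods are great but battery dies fast.'"
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ProductReview",
            "schema": ProductReview.model_json_schema(),
            # Strict mode accepts only a subset of JSON Schema keywords;
            # check the provider's documentation if a constraint is rejected
            "strict": True
        }
    }
)

# Parse LLM response into Pydantic model
review = ProductReview.model_validate_json(response.choices[0].message.content)
print(f"Product: {review.product_name}")
print(f"Rating: {review.rating}/5")
print(f"Recommendation: {review.recommendation}")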

Optional Fields and Defaults

Give a field a default value to make it optional. Optional[T] on its own only allows None as a value; in Pydantic v2 the field may be omitted only if it also has a default:

from pydantic import BaseModel, Field
from typing import Optional

class Article(BaseModel):
    title: str
    content: str
    author: str = "Anonymous"  # Default value
    tags: Optional[list[str]] = None  # Optional field (may be None or omitted)
    word_count: int = Field(default=0, ge=0)  # Default with constraint

# Valid instantiation with missing optional fields
article = Article(
    title="Learning Python",
    content="Python is great...",
    # author, tags, word_count are optional/have defaults
)

print(article.author)  # "Anonymous"
print(article.tags)  # None
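
The difference between "may be None" and "may be omitted" is easy to miss. A minimal sketch, assuming Pydantic v2 and a hypothetical Note model, shows that an Optional field without a default still lands in the schema's required list:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Note(BaseModel):
    """Hypothetical model contrasting nullable fields with omittable ones."""

    text: str
    summary: Optional[str]  # may be None, but must still be supplied
    tags: Optional[list[str]] = None  # may be None and may be omitted


required = Note.model_json_schema()["required"]
print(required)  # ['text', 'summary']

# Omitting "summary" is an error even though None is an allowed value
try:
    Note(text="hello")
    missing_summary_rejected = False
except ValidationError:
    missing_summary_rejected = True
print(missing_summary_rejected)  # True

# Passing None explicitly is accepted
note = Note(text="hello", summary=None)
print(note.tags)  # None
```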

Enums and Constrained Strings

Use Python enums for fixed sets of values:

from enum import Enum
from pydantic import BaseModel, Field, ValidationError

class SentimentEnum(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

class TextClassification(BaseModel):
    sentiment: SentimentEnum
    confidence: float = Field(..., ge=0, le=1)

# Valid
result = TextClassification(sentiment=SentimentEnum.POSITIVE, confidence=0.9)

# Also valid (string is auto-converted to enum)
result = TextClassification(sentiment="positive", confidence=0.9)

# Invalid (Pydantic raises ValidationError)
try:
    bad = TextClassification(sentiment="confused", confidence=0.9)
except ValidationError as e:
    print(f"Validation error: {e}")
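
What matters for an LLM is how the allowed values appear in the generated schema. The sketch below, assuming Pydantic v2, compares an Enum field with typing.Literal, a lighter-weight alternative that is not used elsewhere in this article; the model names are hypothetical:

```python
from enum import Enum
from typing import Literal

from pydantic import BaseModel


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class WithEnum(BaseModel):
    sentiment: Sentiment


class WithLiteral(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]


enum_schema = WithEnum.model_json_schema()
literal_schema = WithLiteral.model_json_schema()

# The Enum is emitted once under "$defs" and referenced from the field
enum_values = enum_schema["$defs"]["Sentiment"]["enum"]
# The Literal is inlined on the field itself
literal_values = literal_schema["properties"]["sentiment"]["enum"]

print(enum_values)  # ['positive', 'negative', 'neutral']
print(literal_values)  # ['positive', 'negative', 'neutral']
```

Both forms give the LLM the same list of allowed values. An Enum is convenient when the values are reused across models, and a Literal keeps small one-off schemas flatter.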

Nested Pydantic Models

Compose models by nesting other models:

from pydantic import BaseModel, EmailStr

class Address(BaseModel):
    street: str
    city: str
    country: str

class Person(BaseModel):
    name: str
    email: EmailStr  # Built-in email validation
    address: Address
    phone_numbers: list[str] = []

# Nested instantiation
person = Person(
    name="Alice",
    email="alice@example.com",
    address=Address(street="123 Main St", city="NYC", country="USA"),
    phone_numbers=["555-0123", "555-4567"]
)

print(person.address.city)  # "NYC"

# Pydantic automatically generates nested schema
schema = Person.model_json_schema()
# schema includes Address as a definition
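
To see what "as a definition" means in practice, the following sketch (assuming Pydantic v2, with a hypothetical Contact model that avoids the EmailStr dependency) inspects the generated schema and then validates a plain nested dict:

```python
from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    country: str


class Contact(BaseModel):
    """Hypothetical parent model with one nested field."""

    name: str
    address: Address


schema = Contact.model_json_schema()

# The nested model is stored once under "$defs" ...
definition_names = sorted(schema["$defs"])
# ... and the field points at it with a JSON Schema reference
address_ref = schema["properties"]["address"]["$ref"]

print(definition_names)  # ['Address']
print(address_ref)  # #/$defs/Address

# Nested dicts, as returned by an LLM, are converted into nested model instances
contact = Contact.model_validate(
    {"name": "Alice", "address": {"street": "123 Main St", "city": "NYC", "country": "USA"}}
)
print(type(contact.address).__name__)  # Address
```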

Validating LLM Responses with Pydantic

This is where Pydantic pays off: a single call both parses and validates the LLM output.

from openai import OpenAI
from pydantic import BaseModel, ConfigDict, Field

client = OpenAI()

class ExtractedEntity(BaseModel):
    # Strict mode requires "additionalProperties": false on every object, nested ones included
    model_config = ConfigDict(extra="forbid")

    name: str
    entity_type: str = Field(..., pattern="^(person|organization|location)$")
    confidence: float = Field(..., ge=0, le=1)

class EntityExtraction(BaseModel):
    model_config = ConfigDict(extra="forbid")

    entities: list[ExtractedEntity]

# Get LLM response
response = client.chat.completions.create(
    model="gpt-4o",  # json_schema output requires a model with structured-output support
    messages=[{"role": "user", "content": "Extract entities from..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "EntityExtraction",
            "schema": EntityExtraction.model_json_schema(),
            "strict": True
        }
    }
)

# Parse and validate in one call
extraction = EntityExtraction.model_validate_json(response.choices[0].message.content)

# Now use with IDE autocomplete
for entity in extraction.entities:
    print(f"{entity.name} ({entity.entity_type}): {entity.confidence:.2f}")
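
When a provider does not guarantee schema-conformant output, validation failures can be fed back to the model. The following is only a sketch of that idea, not a prescribed pattern: parse_with_retry and fake_llm are hypothetical names, and the stub stands in for a real API call so the example runs offline.

```python
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class ExtractedEntity(BaseModel):
    name: str
    entity_type: str = Field(..., pattern="^(person|organization|location)$")
    confidence: float = Field(..., ge=0, le=1)


def parse_with_retry(
    call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> ExtractedEntity:
    """Call the LLM until its reply validates, appending the error to the prompt each time."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return ExtractedEntity.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            prompt = f"{prompt}\n\nYour previous reply was invalid:\n{exc}\nReturn corrected JSON only."
    raise RuntimeError("no valid reply after retries") from last_error


# Hypothetical stand-in for an LLM: the first reply violates the pattern, the second is valid
replies = iter(
    [
        '{"name": "Acme", "entity_type": "company", "confidence": 0.9}',
        '{"name": "Acme", "entity_type": "organization", "confidence": 0.9}',
    ]
)


def fake_llm(prompt: str) -> str:
    return next(replies)


entity = parse_with_retry(fake_llm, "Extract the entity from: 'Acme hired Bob.'")
print(entity.entity_type)  # organization
```

Capping the number of attempts keeps a persistently malformed reply from looping forever. The final exception chains the last ValidationError so the cause stays visible in the traceback.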

Error Handling with Pydantic

Pydantic's ValidationError provides detailed information about what failed:

from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    name: str
    price: float = Field(..., gt=0)
    stock: int = Field(..., ge=0)

# Parse LLM response that has invalid data
response_data = {
    "name": "Widget",
    "price": -10,  # Invalid (must be > 0)
    "stock": "many"  # Invalid (must be integer)
}

try:
    product = Product(**response_data)
except ValidationError as e:
    print(e.json())  # Detailed error report
    # Output (Pydantic v2, abridged to "loc" and "msg"):
    # [
    #   {"loc": ["price"], "msg": "Input should be greater than 0"},
    #   {"loc": ["stock"], "msg": "Input should be a valid integer, unable to parse string as an integer"}
    # ]
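
For logging or for building a follow-up prompt, the structured list from errors() is often easier to work with than the JSON string. A minimal sketch, assuming Pydantic v2; summarize_errors is a hypothetical helper:

```python
from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    name: str
    price: float = Field(..., gt=0)
    stock: int = Field(..., ge=0)


def summarize_errors(exc: ValidationError) -> list[str]:
    """Turn each error into a 'field.path: message' line."""
    lines = []
    for err in exc.errors():
        # "loc" is a tuple of keys and indexes leading to the offending value
        path = ".".join(str(part) for part in err["loc"])
        lines.append(f"{path}: {err['msg']}")
    return lines


try:
    Product.model_validate({"name": "Widget", "price": -10, "stock": "many"})
    problems = []
except ValidationError as exc:
    problems = summarize_errors(exc)

for line in problems:
    print(line)
```

Each entry also carries a machine-readable "type" key, which is more stable to branch on than the human-readable message.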

Real-World Example: Multi-Step LLM Pipeline

Use Pydantic models at each step of a multi-LLM workflow:

from pydantic import BaseModel, ConfigDict, Field
from openai import OpenAI

client = OpenAI()

class StrictBase(BaseModel):
    # Shared base: strict mode requires "additionalProperties": false on every object
    model_config = ConfigDict(extra="forbid")

# Step 1: Classify email
class EmailClassification(StrictBase):
    category: str = Field(..., pattern="^(support|sales|billing|feedback)$")
    priority: str = Field(..., pattern="^(low|medium|high|urgent)$")

# Step 2: Extract action items
class ActionItem(StrictBase):
    task: str
    assignee: str
    deadline: str

class ActionItems(StrictBase):
    items: list[ActionItem]

def process_email(email_text):
    # Step 1: Classify
    response1 = client.chat.completions.create(
        model="gpt-4o",  # json_schema output requires a model with structured-output support
        messages=[{"role": "user", "content": f"Classify: {email_text}"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "EmailClassification",
                "schema": EmailClassification.model_json_schema(),
                "strict": True
            }
        }
    )
    classification = EmailClassification.model_validate_json(response1.choices[0].message.content)

    # Step 2: Extract action items (use classification result in prompt)
    response2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": f"Extract action items from: {email_text}. Category: {classification.category}"}
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "ActionItems",
                "schema": ActionItems.model_json_schema(),
                "strict": True
            }
        }
    )
    actions = ActionItems.model_validate_json(response2.choices[0].message.content)

    return classification, actions

# Use it
classification, actions = process_email("Please help me with my billing issue...")
print(f"Category: {classification.category}")
for action in actions.items:
    print(f"  - {action.task} (assigned to {action.assignee})")

Key Takeaways

  • Pydantic BaseModel classes eliminate manual JSON Schema writing; the class is the schema.
  • Type annotations (name: str, age: int) enforce types automatically.
  • Field constraints (Field(..., ge=0, le=100)) prevent invalid values without boilerplate validation code.
  • Nested models compose complex schemas elegantly.
  • model_json_schema() generates LLM-compatible JSON Schema.
  • model_validate_json() parses and validates LLM responses in one call.
  • Enums lock in valid values; Pydantic auto-converts strings to enum members.

Frequently Asked Questions

Do I need to install extra dependencies for email validation?

Yes. EmailStr depends on the optional email-validator package. Install it with pip install "pydantic[email]" (the quotes stop shells such as zsh from interpreting the brackets).

Can Pydantic models be serialized to JSON?

Yes. Use model.model_dump_json() to serialize to JSON, or model.model_dump() for a Python dict.

person = Person(
    name="Alice",
    email="alice@example.com",
    address=Address(street="123 Main St", city="NYC", country="USA"),
)
json_string = person.model_dump_json()
python_dict = person.model_dump()
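
Serialization and validation are inverses, which is handy for caching LLM results or passing them between services. A small round-trip sketch, assuming Pydantic v2; the Person model here drops the email field so the example needs no extra dependency:

```python
from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    country: str


class Person(BaseModel):
    name: str
    address: Address
    phone_numbers: list[str] = []


person = Person(
    name="Alice",
    address=Address(street="123 Main St", city="NYC", country="USA"),
)

json_string = person.model_dump_json()  # str containing JSON
python_dict = person.model_dump()  # nested models become nested dicts

# Feeding the JSON back through validation rebuilds an equal instance
restored = Person.model_validate_json(json_string)
print(restored == person)  # True
print(python_dict["address"]["city"])  # NYC
```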

How do I handle fields that accept multiple types?

Use Union[Type1, Type2] or Field(..., discriminator=...) for discriminated unions. For LLM outputs, enums are often clearer.

from typing import Union

from pydantic import BaseModel

class Response(BaseModel):
    status: str
    data: Union[str, int, list[str]]
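
For the discriminated-union option, each variant carries a literal tag field that tells Pydantic which model to validate against. A minimal sketch, assuming Pydantic v2; the variant models and the "kind" tag are hypothetical:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class TextAnswer(BaseModel):
    kind: Literal["text"]
    text: str


class NumberAnswer(BaseModel):
    kind: Literal["number"]
    value: float


class Reply(BaseModel):
    status: str
    # "kind" selects the variant, so only one model is tried per payload
    data: Union[TextAnswer, NumberAnswer] = Field(..., discriminator="kind")


reply = Reply.model_validate_json(
    '{"status": "ok", "data": {"kind": "number", "value": 42}}'
)
print(type(reply.data).__name__)  # NumberAnswer
print(reply.data.value)  # 42.0
```

Compared with a plain Union, a failing payload produces errors for the selected variant only, which tends to be easier to read.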

What if the LLM response has extra fields not in my Pydantic model?

By default, Pydantic ignores extra fields. To reject them, use ConfigDict:

from pydantic import ConfigDict, BaseModel

class StrictModel(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Reject extra fields
    name: str
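
The two behaviours can be compared side by side. In this sketch (assuming Pydantic v2) LenientModel is a hypothetical counterpart with the default settings, and the payload includes a field that neither model declares:

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientModel(BaseModel):
    name: str


class StrictModel(BaseModel):
    model_config = ConfigDict(extra="forbid")
    name: str


payload = {"name": "Widget", "colour": "red"}

# Default behaviour: the unknown key is silently dropped
lenient = LenientModel.model_validate(payload)
print(lenient.model_dump())  # {'name': 'Widget'}

# extra="forbid": the unknown key is a validation error
try:
    StrictModel.model_validate(payload)
    error_type = None
except ValidationError as exc:
    error_type = exc.errors()[0]["type"]
print(error_type)  # extra_forbidden

# The setting is also reflected in the generated JSON Schema
forbids_extras = StrictModel.model_json_schema()["additionalProperties"] is False
print(forbids_extras)  # True
```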

Does Pydantic add latency to LLM applications?

Negligibly. Validating a typical schema takes microseconds, which is tiny next to the network round trip and generation time of the LLM call itself. The reliability gained from structured output far outweighs the parsing cost.
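
The cost is easy to measure on your own hardware. A rough timing sketch, assuming Pydantic v2 and a hypothetical Review model; actual numbers depend on the machine and the schema, so none are quoted here:

```python
import timeit

from pydantic import BaseModel, Field


class Review(BaseModel):
    product_name: str
    rating: int = Field(..., ge=1, le=5)
    recommendation: bool


raw = '{"product_name": "AirPods", "rating": 4, "recommendation": true}'
runs = 10_000

# Time parsing plus validation of the same JSON payload many times
seconds = timeit.timeit(lambda: Review.model_validate_json(raw), number=runs)
per_call_us = seconds / runs * 1e6

print(f"{per_call_us:.1f} microseconds per validation")
```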

Further Reading