Schema-Based Validation: Enforcing Structure
Schema validation is the first line of defense. It ensures your LLM output matches a predefined structure—JSON with specific fields, types, and constraints. A schema is a contract: the LLM promises to produce output that matches it, and your code validates that promise at runtime.
In production systems, schema validation reduces invalid outputs by 80–90% when paired with schema-in-prompt engineering (Anthropic, 2025). This article teaches you how to define schemas, validate against them, and steer LLMs toward compliance.
What Is a Schema?
A schema is a specification describing the shape, types, and constraints of valid data. A simple schema for a movie review might be:
{
"title": string,
"rating": number (1–10),
"spoilers_detected": boolean,
"summary": string (max 500 chars)
}
This schema says: "An object with a title (string), a rating (1–10), a boolean spoiler flag, and a summary under 500 characters." Any output that violates these rules fails schema validation.
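To make "fails schema validation" concrete, here is a minimal hand-written check of the same four rules using only the standard library. The names `EXPECTED_TYPES` and `check_movie_review` are illustrative, not from any library, and in practice a validation library does this work for you.

```python
# Illustrative sketch: the four rules of the movie-review schema, checked by hand.
EXPECTED_TYPES = {
    "title": str,
    "rating": (int, float),
    "spoilers_detected": bool,
    "summary": str,
}

def check_movie_review(data: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the data is valid."""
    problems = []
    for field, expected_type in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"{field}: missing")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it for non-bool fields
        is_unexpected_bool = isinstance(value, bool) and expected_type is not bool
        if is_unexpected_bool or not isinstance(value, expected_type):
            problems.append(f"{field}: wrong type")
            continue
        if field == "rating" and not 1 <= value <= 10:
            problems.append("rating: must be between 1 and 10")
        if field == "summary" and len(value) > 500:
            problems.append("summary: longer than 500 characters")
    return problems

print(check_movie_review({"title": "Dune", "rating": 15, "summary": "Great film"}))
```

The function reports every violation rather than stopping at the first one, which is useful later when the errors are fed back to the model.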
Pydantic: Python's Schema Validator
Pydantic is the standard Python library for schema definition and validation. Define a model using type hints, then validate LLM output by parsing it:
from pydantic import BaseModel, Field

class MovieReview(BaseModel):
    title: str = Field(max_length=100)
    rating: int = Field(ge=1, le=10, description="Rating from 1 to 10")
    spoilers_detected: bool
    summary: str = Field(max_length=500)

# LLM output (as JSON string)
llm_output = """
{
    "title": "Dune: Part Two",
    "rating": 9,
    "spoilers_detected": false,
    "summary": "Epic sci-fi spectacle with stunning visuals and political intrigue."
}
"""

# Validate by parsing
review = MovieReview.model_validate_json(llm_output)
print(review.rating)  # 9
If the LLM returns malformed JSON, omits a required field, or violates a constraint, model_validate_json() raises a ValidationError:
from pydantic import ValidationError

invalid_output = """
{
    "title": "Dune",
    "rating": 15,
    "summary": "Great film"
}
"""

try:
    review = MovieReview.model_validate_json(invalid_output)
except ValidationError as e:
    print(f"Validation failed: {e}")
    # Pydantic v2 reports two errors:
    #   rating: Input should be less than or equal to 10
    #   spoilers_detected: Field required
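The printed exception is meant for humans. When the errors need to be logged or sent back to the model, the structured list from `ValidationError.errors()` is easier to work with. The sketch below assumes Pydantic v2 and redefines `MovieReview` so that it runs on its own; `describe_errors` is a hypothetical helper name.

```python
from pydantic import BaseModel, Field, ValidationError

class MovieReview(BaseModel):
    title: str = Field(max_length=100)
    rating: int = Field(ge=1, le=10)
    spoilers_detected: bool
    summary: str = Field(max_length=500)

def describe_errors(raw_json: str) -> list[str]:
    """Return one 'field: message' line per violation, or [] if the input is valid."""
    try:
        MovieReview.model_validate_json(raw_json)
    except ValidationError as exc:
        # Each entry is a dict; 'loc' is the path to the field, 'msg' the explanation
        return [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
    return []

issues = describe_errors('{"title": "Dune", "rating": 15, "summary": "Great film"}')
for line in issues:
    print(line)
```

Short one-line messages like these tend to fit better in a corrective retry prompt than the full multi-line exception text.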
JSON Schema: Language-Agnostic Schemas
JSON Schema is a standard for describing JSON structure. It works across all languages and integrates with many LLM APIs. A JSON Schema version of the movie review:
import json
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "maxLength": 100},
        "rating": {"type": "integer", "minimum": 1, "maximum": 10},
        "spoilers_detected": {"type": "boolean"},
        "summary": {"type": "string", "maxLength": 500}
    },
    "required": ["title", "rating", "spoilers_detected", "summary"],
    "additionalProperties": False
}

# Validate with the jsonschema library
output = json.loads(llm_output)
jsonschema.validate(instance=output, schema=schema)  # Raises jsonschema.ValidationError if invalid
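If you already maintain a Pydantic model, you may not need to write the JSON Schema by hand. A minimal sketch, assuming Pydantic v2: `model_json_schema()` derives the schema from the model, and `extra="forbid"` is what produces `additionalProperties: false`. The variable name `schema_from_model` is illustrative.

```python
import json
from pydantic import BaseModel, ConfigDict, Field

class MovieReview(BaseModel):
    # Reject unknown keys; this maps to "additionalProperties": false in the schema
    model_config = ConfigDict(extra="forbid")

    title: str = Field(max_length=100)
    rating: int = Field(ge=1, le=10)
    spoilers_detected: bool
    summary: str = Field(max_length=500)

# Derive the JSON Schema from the model so the two definitions cannot drift apart
schema_from_model = MovieReview.model_json_schema()
print(json.dumps(schema_from_model, indent=2))
```

The generated schema also contains `title` annotations for each property, which validators ignore. Keeping one source of truth avoids the case where the prompt advertises one schema and the validator enforces another.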
Schema-in-Prompt: Steering LLMs Toward Valid Output
The most powerful technique is embedding the schema in your prompt. Tell the LLM exactly what structure you expect:
Analyze the following movie review and extract key information.
Return ONLY valid JSON matching this schema (no markdown, no extra text):
{
"title": "string (max 100 chars)",
"rating": "integer from 1 to 10",
"spoilers_detected": "boolean",
"summary": "string (max 500 chars)"
}
Review text:
[REVIEW]
JSON output:
When the schema is explicit, LLMs comply 85–95% of the time (OpenAI, 2025). Add examples for even better compliance:
Example valid outputs:
{"title": "Dune", "rating": 9, "spoilers_detected": false, "summary": "Epic."}
{"title": "Oppenheimer", "rating": 8, "spoilers_detected": true, "summary": "Historical drama."}
Now analyze this review:
[REVIEW]
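The prompt above can be assembled programmatically so that the schema and examples in the prompt always match what the validator checks. This is a sketch using only the standard library; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the wording of the template is taken from the prompt shown earlier.

```python
import json

PROMPT_TEMPLATE = """Analyze the following movie review and extract key information.
Return ONLY valid JSON matching this schema (no markdown, no extra text):
{schema}

Example valid outputs:
{examples}

Review text:
{review}

JSON output:"""

def build_prompt(review: str, schema: dict, examples: list[dict]) -> str:
    """Fill the template with the schema, one example per line, and the review."""
    return PROMPT_TEMPLATE.format(
        schema=json.dumps(schema, indent=2),
        examples="\n".join(json.dumps(example) for example in examples),
        review=review,
    )

prompt_schema = {
    "title": "string (max 100 chars)",
    "rating": "integer from 1 to 10",
    "spoilers_detected": "boolean",
    "summary": "string (max 500 chars)",
}
examples = [
    {"title": "Dune", "rating": 9, "spoilers_detected": False, "summary": "Epic."},
    {"title": "Oppenheimer", "rating": 8, "spoilers_detected": True, "summary": "Historical drama."},
]
prompt = build_prompt("A stunning sequel that never drags.", prompt_schema, examples)
print(prompt)
```

Serializing the examples with `json.dumps` guarantees they are themselves valid JSON (lowercase `true`/`false`, double quotes), which hand-typed examples often are not.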
Constrained Generation: Steering at Token Level
Some modern LLM APIs support constrained generation—the model is forced to output only tokens that match your schema. With Outlines or similar libraries, you can generate outputs guaranteed to be valid JSON:
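from outlines import models, generate

# Load a Hugging Face model (this is the Outlines 0.x API; later releases may differ)
model = models.transformers("mistralai/Mistral-7B-v0.1")

# generate.json accepts a Pydantic model or a JSON Schema string, not a dict
generator = generate.json(model, MovieReview)

# Decoding is restricted to tokens that keep the output valid for the schema
review = generator("Extract movie info from: [review]", max_tokens=200)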
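Because the decoder can only emit tokens that conform to the schema, structural validation failures drop to near zero (a response cut off by max_tokens can still be incomplete). The constraint covers structure, not meaning: a well-formed rating can still be the wrong rating. It also adds computational overhead, so use it for critical paths.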
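The idea behind token-level constraints can be shown with a toy decoder. This is not how Outlines is implemented; it is a deliberately tiny sketch in which the "grammar" is a list of two allowed strings, tokens are single characters, and `toy_scores` stands in for real model logits. All names here are made up for illustration.

```python
VALID_OUTPUTS = ["true", "false"]          # toy grammar: only these strings are allowed
VOCAB = list("abcdefghijklmnopqrstuvwxyz")  # toy vocabulary: one character per token

def allowed_next(prefix: str) -> set[str]:
    """Characters that keep `prefix` a prefix of at least one valid output."""
    return {
        s[len(prefix)]
        for s in VALID_OUTPUTS
        if s.startswith(prefix) and len(s) > len(prefix)
    }

def toy_scores(prefix: str) -> dict[str, int]:
    """Stand-in for model logits: this fake model prefers late-alphabet letters."""
    return {ch: ord(ch) for ch in VOCAB}

def constrained_decode() -> str:
    out = ""
    while out not in VALID_OUTPUTS:
        scores = toy_scores(out)
        mask = allowed_next(out)
        # Discard every token the grammar forbids, then take the best remaining one
        out += max(mask, key=lambda ch: scores[ch])
    return out

print(constrained_decode())
```

Left unconstrained, this fake model would emit "z" forever. With the mask applied at each step, it can only ever complete one of the valid strings, which is the same principle a real library applies to a JSON grammar over a real tokenizer.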
Handling Validation Failures
When schema validation fails, you have three options:
- Retry with corrective feedback: "Your previous response had invalid rating (must be 1–10). Fix it and try again."
- Fall back to a default: If you have a cached or template response, use it.
- Escalate or log: If retries fail, escalate to a human or log for analysis.
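from typing import Optional
from pydantic import ValidationError

def extract_movie_info(review: str, max_retries: int = 3) -> Optional[MovieReview]:
    feedback = None  # validation error from the previous attempt, if any
    for attempt in range(max_retries):
        # llm_call is a placeholder for your own LLM client wrapper
        output = llm_call(review, schema, feedback=feedback)
        try:
            return MovieReview.model_validate_json(output)
        except ValidationError as e:
            # Carry the error into the next attempt as corrective feedback
            feedback = str(e)
    # All attempts failed: the caller should fall back or escalate
    return None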
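The three options can be combined in one helper: retry with feedback, log each failure, and return a fallback when retries run out. The sketch below uses only the standard library and a fake LLM so that it runs as written; `call_with_recovery`, `validate_rating`, and `fake_llm` are hypothetical names. It catches `ValueError`, which covers `json.JSONDecodeError` and, to my knowledge, Pydantic v2's `ValidationError` as well.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("llm_validation")

def call_with_recovery(
    call_llm: Callable[[Optional[str]], str],
    validate: Callable[[str], Any],
    fallback: Any = None,
    max_retries: int = 3,
) -> Any:
    """Retry with corrective feedback, log each failure, then fall back."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        raw = call_llm(feedback)
        try:
            return validate(raw)
        except ValueError as exc:
            feedback = f"Your previous response was invalid: {exc}. Fix it and try again."
            logger.warning("attempt %d failed: %s", attempt, exc)
    logger.error("all %d attempts failed; returning fallback", max_retries)
    return fallback

def validate_rating(raw: str) -> dict:
    """Toy validator: parse JSON and check that rating is an integer from 1 to 10."""
    data = json.loads(raw)
    rating = data.get("rating")
    if not isinstance(rating, int) or not 1 <= rating <= 10:
        raise ValueError("rating must be an integer from 1 to 10")
    return data

# Fake LLM: the first reply is invalid, the second is valid
replies = iter(['{"rating": 15}', '{"rating": 9}'])
feedback_seen = []

def fake_llm(feedback: Optional[str]) -> str:
    feedback_seen.append(feedback)
    return next(replies)

result = call_with_recovery(fake_llm, validate_rating, fallback={"rating": None})
print(result)
```

Passing the validator in as a function keeps the retry policy independent of the schema library, so the same helper works with Pydantic, jsonschema, or a hand-written check.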
Comparison: Schema Approaches
| Approach | Compliance | Setup Effort | Overhead | Best For |
|---|---|---|---|---|
| Schema-in-prompt | 85–90% | Low | Minimal | Most production systems |
| Pydantic validation | 100% (after retry) | Low | Low | Python backends |
| JSON Schema | 100% (after retry) | Medium | Low | Language-agnostic systems |
| Constrained generation | 95–99% | High | Medium | Critical paths where failure is costly |
Key Takeaways
- A schema is a contract defining the expected structure, types, and constraints of LLM output.
- Pydantic is the Python standard for schema definition and validation.
- Embedding the schema in your prompt ("schema-in-prompt") increases LLM compliance to 85–90%.
- Constrained generation guarantees valid output but adds computational overhead.
- Always handle validation failures with retry, fallback, or escalation logic.
Frequently Asked Questions
Should I use Pydantic or JSON Schema?
Use Pydantic if you're building Python backends (easier, more expressive). Use JSON Schema if you need language-agnostic specs or are working with LLM APIs that natively support JSON Schema constraints.
How much does schema-in-prompt improve compliance?
Studies show 10–15% improvement over implicit schemas. Combined with examples, you reach 85–90% compliance. Constrained generation reaches 95–99% but is computationally heavier.
What if the LLM returns text instead of JSON?
That's a parsing error, not a validation error. Use regex or try parsing with error recovery (e.g., json5 library for lenient JSON). Then validate the parsed object against your schema.
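One way to recover JSON that is wrapped in extra text is to scan for an opening brace and let the standard library's `json.JSONDecoder.raw_decode` parse from that position. This is a sketch, not a complete repair tool: it assumes the embedded JSON is itself well formed, and `extract_first_json_object` is a hypothetical helper name.

```python
import json
from typing import Optional

def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores what follows
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None

reply = 'Sure! Here is the result: {"title": "Dune", "rating": 9} Hope that helps.'
print(extract_first_json_object(reply))
```

The returned dict has only been parsed, not validated, so it should still go through schema validation afterwards.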
Can I validate outputs from streaming APIs?
Yes, but it's harder. Buffer the entire response, then validate. For real-time feedback, validate incrementally (once an object closes, validate it) or use guided generation to steer tokens.
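Incremental validation can be sketched with a buffer that is re-parsed as chunks arrive. The example below assumes the stream carries one or more top-level JSON objects separated by whitespace and uses only the standard library; `iter_json_objects` is a hypothetical name. Each yielded object would then be passed to your schema validator.

```python
import json
from typing import Iterable, Iterator

def iter_json_objects(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield each complete JSON object as soon as it closes in the stream."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object not complete yet; wait for more chunks
            yield obj
            buffer = buffer[end:]  # keep any text that follows the parsed object

stream = ['{"title": "Du', 'ne", "rating": 9}\n{"tit', 'le": "Oppenheimer", "rating": 8}']
for obj in iter_json_objects(stream):
    print(obj)
```

Note the limits of this approach: non-JSON text in the stream would stall the parser, and re-parsing the buffer on every chunk is wasteful for very large objects. For those cases a dedicated streaming parser is a better fit.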