Building Reliability: Constraints and Prompt Engineering
Constrained decoding guarantees syntactic validity (well-formed JSON, a regex match, conformance to a grammar), but it does not guarantee semantic correctness. A model constrained to output JSON with the fields decision and reason will never produce malformed JSON, but it might output {"decision": "approve", "reason": "unknown"}, which a human reviewer would reject because the reason is meaningless. Reliability comes from combining hard constraints (syntax guarantees) with prompt engineering (semantic guidance): well-crafted prompts steer the model toward correct reasoning, schemas prevent structural errors, and monitoring and fallbacks handle the edge cases.
This final article brings together everything in the series: how to architect resilient AI systems that are both fast and correct.
The Constraint-Prompt Partnership
Think of constraints and prompts as complementary safeguards:
Constraints (hard guarantees): "Output must be valid JSON matching this schema, with these enum values only."
Prompts (soft guidance): "Think step-by-step. Evaluate the evidence. If unsure, say so rather than guess."
Neither alone is sufficient:
- Constraints alone guarantee syntax but allow nonsense: {"decision": "approve", "confidence": 0.01} is valid but contradictory.
- Prompts alone are fragile: even a well-written prompt can fail under adversarial input or with less-capable models.
Together, they form a defense-in-depth approach: prompts set intent, constraints enforce structure, monitoring catches failures.
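The contradictory record above is the kind of failure a schema cannot catch but a few lines of post-validation can. A minimal sketch, assuming a record with the decision and confidence fields from the example; the function name check_consistency and the 0.5 threshold are illustrative assumptions, not part of any library:

```python
import json

def check_consistency(raw_json: str, min_confidence: float = 0.5) -> list[str]:
    """Return a list of semantic problems found in a syntactically valid record."""
    record = json.loads(raw_json)  # constrained decoding already guarantees this parses
    problems = []
    confidence = record.get("confidence")
    if confidence is None or not 0.0 <= confidence <= 1.0:
        problems.append("confidence missing or outside [0, 1]")
    elif record.get("decision") == "approve" and confidence < min_confidence:
        problems.append("approval issued with low confidence")
    return problems

print(check_consistency('{"decision": "approve", "confidence": 0.01}'))
print(check_consistency('{"decision": "approve", "confidence": 0.93}'))
```

The first call reports one problem and the second reports none. Checks like this belong in the monitoring layer rather than in the schema.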
Strategy 1: In-Context Examples (Few-Shot Learning)
One of the most effective prompt techniques is showing the model examples of correct outputs:
# Note: the examples in this article use the Outlines 0.x API (outlines.models / outlines.generate).
from typing import Literal

from outlines import models, generate
from pydantic import BaseModel

class Decision(BaseModel):
    choice: Literal["approve", "reject", "escalate"]  # enforced as an enum by the schema
    reasoning: str
    confidence: float  # intended range 0.0-1.0 (the type alone does not enforce it)

model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.json(model, Decision)

# Few-shot prompt with examples
prompt = """
You are a loan officer. Analyze applications and decide: approve, reject, or escalate.
Output JSON with your choice, brief reasoning, and confidence (0-1).
Examples:
Input: Applicant has $100k income, 750 credit score, minimal debt.
Output: {"choice": "approve", "reasoning": "Strong income and credit profile.", "confidence": 0.95}
Input: Applicant has $20k income, 500 credit score, high debt.
Output: {"choice": "reject", "reasoning": "Insufficient income and poor credit history.", "confidence": 0.90}
Input: Applicant has $40k income, 680 credit score, some delinquencies.
Output: {"choice": "escalate", "reasoning": "Mixed signals; human review needed.", "confidence": 0.65}
Now analyze this application:
Input: Applicant has $50k income, 700 credit score, 2-year account history.
Output:"""

result = generator(prompt, max_tokens=200)  # returns a Decision instance
print(result)
The few-shot examples guide the model toward high-quality reasoning within the constraints. Result: better reasoning field without relaxing the schema.
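Hand-written examples drift out of sync with the schema as it evolves. One way to limit that is to build the prompt from structured example data, so that every example output is serialized the same way the model is expected to answer. A minimal sketch using only the standard library; build_few_shot_prompt is a hypothetical helper, not an Outlines function:

```python
import json

def build_few_shot_prompt(instruction: str, examples: list[tuple[str, dict]], query: str) -> str:
    """Assemble a few-shot prompt; example outputs are serialized with json.dumps."""
    lines = [instruction.strip(), "", "Examples:"]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {json.dumps(example_output)}")
    lines += ["", "Now analyze this application:", f"Input: {query}", "Output:"]
    return "\n".join(lines)

examples = [
    ("Applicant has $100k income, 750 credit score, minimal debt.",
     {"choice": "approve", "reasoning": "Strong income and credit profile.", "confidence": 0.95}),
    ("Applicant has $20k income, 500 credit score, high debt.",
     {"choice": "reject", "reasoning": "Insufficient income and poor credit history.", "confidence": 0.9}),
]
prompt = build_few_shot_prompt(
    "You are a loan officer. Decide: approve, reject, or escalate.",
    examples,
    "Applicant has $50k income, 700 credit score, 2-year account history.",
)
print(prompt)
```

Because the examples are plain dictionaries, they can also be validated against the schema in a unit test before they ever reach the model.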
Strategy 2: Chain-of-Thought with Structured Output
Ask the model to reason before providing the final structured answer:
from outlines import models, generate
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    analysis: str    # Reasoning (declared first, so it is generated first)
    conclusion: str  # Final answer
    confidence: float

model = models.transformers("meta-llama/Llama-2-13b-hf")
generator = generate.json(model, AnalysisResult)

prompt = """
Analyze the statement: "All birds can fly."
Think step-by-step:
1. Identify exceptions (penguins, ostriches, and kiwis cannot fly).
2. Evaluate the claim's accuracy.
3. Provide a conclusion.
Output JSON with your analysis, conclusion (true/false), and confidence.
Output:"""

result = generator(prompt, max_tokens=300)
print(result)
# Illustrative result fields (actual text varies by model and run):
# {
#   "analysis": "While most birds can fly, there are several flightless species. The statement is too broad.",
#   "conclusion": "false",
#   "confidence": 0.95
# }
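Chain-of-thought (thinking out loud) improves reasoning quality without needing a separate reasoning phase. Field order matters here: because analysis is declared before conclusion, the model writes its reasoning first and the conclusion is generated with that reasoning already in context. Constraints ensure the final answer is structured correctly.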
Strategy 3: Explicit Confidence and Uncertainty
Allow the model to express uncertainty instead of guessing:
from typing import Optional

from outlines import models, generate
from pydantic import BaseModel

class ConfidentAnswer(BaseModel):
    answer: str
    confidence: float  # 0.0-1.0
    uncertainty_reason: Optional[str] = None

# Schema with confidence + optional explanation of doubt.
# The model is *guided* to express low confidence when unsure,
# rather than *forced* to guess.
model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.json(model, ConfidentAnswer)

prompt = """
Answer the question. If unsure, set confidence low and explain why.
Question: Who won the 2026 US Presidential election?
Output:"""

result = generator(prompt, max_tokens=200)  # returns a ConfidentAnswer instance

# If confidence < 0.3, treat as "unknown" and escalate to human
if result.confidence < 0.3:
    print(f"Uncertain answer; escalating. Reason: {result.uncertainty_reason}")
else:
    print(f"Answer (confidence {result.confidence:.2f}): {result.answer}")
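This reduces the pressure on the model to invent answers, though it does not eliminate it: self-reported confidence is a soft signal, not a calibrated probability. Low confidence + explicit reason = signal to retry, escalate, or use alternative data.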
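A confidence threshold is only useful if the reported confidence tracks actual correctness. One way to check is to bin a hand-reviewed sample by confidence and compare accuracy across bins. The sketch below assumes you have (confidence, was_correct) pairs from such a review; the function name and the sample data are hypothetical:

```python
from collections import defaultdict

def calibration_table(samples: list[tuple[float, bool]], n_bins: int = 5) -> dict[str, dict]:
    """Group (confidence, was_correct) pairs into bins and report accuracy per bin."""
    bins = defaultdict(list)
    for confidence, was_correct in samples:
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append(was_correct)
    table = {}
    for index in sorted(bins):
        low, high = index / n_bins, (index + 1) / n_bins
        outcomes = bins[index]
        table[f"{low:.1f}-{high:.1f}"] = {
            "count": len(outcomes),
            "accuracy": sum(outcomes) / len(outcomes),
        }
    return table

# Hypothetical review sample: (model confidence, human judged the answer correct)
reviewed = [(0.95, True), (0.9, True), (0.85, False), (0.3, False), (0.25, True), (0.1, False)]
for bucket, stats in calibration_table(reviewed).items():
    print(bucket, stats)
```

If accuracy in the high-confidence bins is not clearly better than in the low ones, the threshold is not separating good answers from bad ones and should not be used to skip review.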
Strategy 4: Fallback and Retry Loops
Combine constraints with adaptive fallback logic:
import logging
from typing import Literal, Optional

from outlines import models, generate
from pydantic import BaseModel

class ExtractionResult(BaseModel):
    entity: str
    entity_type: Literal["person", "organization", "location"]
    confidence: float

model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.json(model, ExtractionResult)

def extract_entity_with_fallback(text: str, max_retries: int = 3) -> Optional[ExtractionResult]:
    """Extract entity with automatic retry on low confidence."""
    clarification = ""
    for attempt in range(max_retries):
        prompt = f"""
Extract the main entity from this text:
"{text}"
{clarification}
Output JSON with entity name, type (person/organization/location), and confidence (0-1).
Output:"""
        result = generator(prompt, max_tokens=150)  # an ExtractionResult instance
        if result.confidence >= 0.7:
            return result  # Success
        logging.warning(
            f"Attempt {attempt + 1}: Low confidence ({result.confidence}). "
            f"Retrying with a refined prompt..."
        )
        # On retry, add clarifying instructions outside the quoted text
        clarification = "Clarification: Focus on the primary subject, not secondary mentions."
    # All attempts failed: return None so the caller can flag the input for manual review
    logging.error(f"All {max_retries} attempts failed. Manual review required.")
    return None

# Usage
result = extract_entity_with_fallback("Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976.")
if result is not None:
    print(f"Extracted: {result.entity} ({result.entity_type}, confidence: {result.confidence})")
This hybrid approach:
- Tries with constraints first (fast, reliable).
- Checks confidence (model-provided signal).
- Retries on failure with refined prompts.
- Escalates to human if all attempts fail.
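The four steps above do not depend on any particular model, so the control flow can be written and tested against a stand-in generator before a real one is plugged in. A minimal sketch; retry_until_confident and fake_generator are hypothetical names, and the generator is assumed to return a dict with a confidence key:

```python
from typing import Callable, Optional

def retry_until_confident(
    generate_fn: Callable[[str], dict],
    prompts: list[str],
    threshold: float = 0.7,
) -> tuple[Optional[dict], list[dict]]:
    """Try each prompt in order; return (accepted result or None, all attempts)."""
    attempts = []
    for prompt in prompts:
        result = generate_fn(prompt)
        attempts.append(result)
        if result["confidence"] >= threshold:
            return result, attempts
    return None, attempts  # caller escalates to a human, with the attempts attached

# Stand-in for a constrained generator: confidence rises once the hint is present.
def fake_generator(prompt: str) -> dict:
    confident = "primary subject" in prompt
    return {"entity": "Apple Inc.", "confidence": 0.9 if confident else 0.4}

base = 'Extract the main entity from: "Apple Inc. was founded in 1976."'
accepted, history = retry_until_confident(
    fake_generator, [base, base + " Focus on the primary subject."]
)
print(accepted, len(history))
```

Returning the full attempt history alongside the result gives the human reviewer the context of what was tried, rather than just a bare failure.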
Strategy 5: Multi-Stage Pipeline
For complex tasks, break into multiple constrained stages:
from typing import Literal

from outlines import models, generate
from pydantic import BaseModel

class ClassificationStage(BaseModel):
    category: str
    subcategory: str

class SentimentStage(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    intensity: int  # 1-5

class SummaryStage(BaseModel):
    summary: str
    key_points: list[str]

model = models.transformers("mistralai/Mistral-7B-v0.1")
review = "The product broke after one week of use."

# Stage 1: Classify
classifier = generate.json(model, ClassificationStage)
category = classifier(f"Classify this review: '{review}'", max_tokens=100)
print(f"Category: {category}")

# Stage 2: Analyze sentiment (now knowing the category)
sentiment_analyzer = generate.json(model, SentimentStage)
sentiment = sentiment_analyzer(
    f"The review is about {category.category}. Analyze sentiment: '{review}'",
    max_tokens=100,
)
print(f"Sentiment: {sentiment}")

# Stage 3: Summarize with context from earlier stages
summarizer = generate.json(model, SummaryStage)
summary = summarizer(
    f"Summarize this {sentiment.sentiment} {category.category} review "
    f"(intensity {sentiment.intensity}/5): '{review}'",
    max_tokens=200,
)
print(f"Summary: {summary}")
Each stage:
- Outputs a constrained, validated result.
- Feeds into the next stage (providing context).
- Reduces hallucination by narrowing scope per stage.
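The staging pattern itself can be factored into a small runner that passes each stage the accumulated results of the earlier ones. This is a sketch under the assumption that each stage is a (name, prompt builder, generator) triple; run_pipeline and the stand-in generators are hypothetical, and in practice each generator would be schema-constrained:

```python
from typing import Callable

Stage = tuple[str, Callable[[str, dict], str], Callable[[str], dict]]

def run_pipeline(text: str, stages: list[Stage]) -> dict:
    """Run stages in order; each prompt builder sees the text and all earlier results."""
    context: dict = {}
    for name, build_prompt, generate_fn in stages:
        context[name] = generate_fn(build_prompt(text, context))
    return context

def classify_prompt(text: str, ctx: dict) -> str:
    return f"Classify this review: '{text}'"

def sentiment_prompt(text: str, ctx: dict) -> str:
    # Uses the category produced by the previous stage
    return f"The review is about {ctx['category']['category']}. Analyze sentiment: '{text}'"

# Stand-in generators returning fixed results, so the control flow can be tested offline
def fake_classifier(prompt: str) -> dict:
    return {"category": "product quality"}

def fake_sentiment(prompt: str) -> dict:
    return {"sentiment": "negative", "intensity": 4}

stages: list[Stage] = [
    ("category", classify_prompt, fake_classifier),
    ("sentiment", sentiment_prompt, fake_sentiment),
]
print(run_pipeline("The product broke after one week of use.", stages))
```

Keeping stages as data makes it straightforward to log each intermediate result or to insert a validation step between stages.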
Strategy 6: Monitoring and Alerting
Track constraint violations and semantic errors:
import time
from datetime import datetime, timezone

from outlines import models, generate
from pydantic import BaseModel

class MonitoredExtraction(BaseModel):
    name: str
    email: str
    phone: str

class ExtractionEvent:
    def __init__(self, prompt, result, latency, model, constraint_type):
        self.prompt = prompt
        self.result = result
        self.latency = latency
        self.model = model
        self.constraint_type = constraint_type
        self.timestamp = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated
        self.validation_passed = self._validate()  # (passed, reason) tuple

    def _validate(self):
        """Check semantic correctness beyond schema."""
        # Email format check
        if "@" not in self.result.email:
            return False, "Invalid email format"
        # Phone format check
        if not any(c.isdigit() for c in self.result.phone):
            return False, "Phone lacks digits"
        # Confidence check (if model provided)
        if hasattr(self.result, "confidence") and self.result.confidence < 0.5:
            return False, "Low model confidence"
        return True, "Valid"

    def log(self):
        passed, reason = self.validation_passed
        status = "PASS" if passed else "FAIL"
        print(f"[{status}] {self.timestamp.isoformat()} | "
              f"Model: {self.model} | Constraint: {self.constraint_type} | "
              f"Latency: {self.latency:.2f}s | Reason: {reason}")

# Usage
model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.json(model, MonitoredExtraction)

start = time.time()
# The address below is a placeholder on the reserved example.com domain
result = generator("Extract: John Doe, john.doe@example.com, 555-1234", max_tokens=150)
latency = time.time() - start

event = ExtractionEvent(
    prompt="Extract: John Doe...",
    result=result,
    latency=latency,
    model="Mistral-7B",
    constraint_type="json_schema",
)
event.log()
Monitoring enables:
- Detection of systematic failures (e.g., "emails always invalid").
- Performance optimization (identify slow constraint types).
- User experience improvement (fail gracefully when reliability is low).
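Detecting systematic failures means aggregating events, not just logging them one at a time. A minimal sketch, assuming validation results are stored as (passed, reason) pairs like the ones _validate returns; summarize_events and the 20% alert threshold are illustrative assumptions:

```python
from collections import Counter

def summarize_events(events: list[tuple[bool, str]], alert_threshold: float = 0.2) -> dict:
    """Summarize (passed, reason) validation results and flag a high failure rate."""
    total = len(events)
    failures = Counter(reason for passed, reason in events if not passed)
    failure_rate = sum(failures.values()) / total if total else 0.0
    return {
        "total": total,
        "failure_rate": failure_rate,
        "top_failure": failures.most_common(1)[0][0] if failures else None,
        "alert": failure_rate > alert_threshold,
    }

# Hypothetical validation log
log = [(True, "Valid"), (False, "Invalid email format"), (False, "Invalid email format"),
       (True, "Valid"), (False, "Phone lacks digits")]
print(summarize_events(log))
```

A dominant top_failure reason points at a prompt or schema problem worth fixing at the source, whereas scattered reasons suggest ordinary input noise.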
Strategy 7: Human-in-the-Loop for Edge Cases
For high-stakes decisions, require human approval:
import uuid

from outlines import models, generate
from pydantic import BaseModel

class MedicalDecision(BaseModel):
    diagnosis: str
    treatment: str
    confidence: float

class RequestForApproval:
    def __init__(self, decision, threshold=0.8):
        self.id = str(uuid.uuid4())
        self.decision = decision
        self.requires_approval = decision.confidence < threshold
        self.status = "pending_approval" if self.requires_approval else "auto_approved"

    def submit_to_human(self):
        """Queue for human review."""
        print(f"[Approval Request {self.id}] Diagnosis: {self.decision.diagnosis} "
              f"(Confidence: {self.decision.confidence:.2f})")
        print("Awaiting human review...")
        # In production: send to approval queue, DB, etc.

# Usage
model = models.transformers("meta-llama/Llama-2-70b-hf")
generator = generate.json(model, MedicalDecision)
result = generator(
    "Analyze symptoms: fever, cough, shortness of breath. Suggest diagnosis and treatment.",
    max_tokens=300,
)

approval = RequestForApproval(result, threshold=0.9)
if approval.requires_approval:
    approval.submit_to_human()
else:
    print(f"Auto-approved: {result.diagnosis}")
For medical, financial, or legal domains, human oversight is essential. Constraints ensure the form is correct; humans ensure the decision is sound.
Best Practices Checklist
When designing reliable constrained systems:
- Start with unconstrained baselines — measure what the model does naturally before constraining.
- Use few-shot examples — show the model what success looks like.
- Add confidence scores — let the model signal uncertainty.
- Monitor all outputs — log results, latency, errors.
- Test with adversarial inputs — prompt injections, ambiguous data, out-of-domain text.
- Set confidence thresholds — auto-approve high-confidence, flag low-confidence for review.
- Design fallback chains — retry with refined prompts, escalate to human.
- Document constraint intent — why this schema, why these values, what will break it.
- Version constraints — track schema changes, maintain backward compatibility if needed.
- Measure end-to-end reliability — overall success rate, not just syntactic validity.
Real-World Architecture: Loan Approval System
Here's how all these techniques come together:
import logging
from typing import Literal

from outlines import models, generate, samplers
from pydantic import BaseModel

class LoanApprovalRequest(BaseModel):
    applicant_name: str
    decision: Literal["approve", "reject", "escalate"]
    reasoning: str
    confidence: float
    required_documents: list[str] = []

# Load the model once, not on every request.
model = models.transformers("meta-llama/Llama-2-13b-hf")
# In Outlines, temperature is configured on the sampler rather than passed per call.
generator = generate.json(
    model, LoanApprovalRequest, sampler=samplers.multinomial(temperature=0.3)
)

def approve_loan_with_reliability(application_text: str) -> dict:
    """
    Approve a loan with constraints + prompts + fallback + human oversight.
    """
    # Stage 1: Constrained generation with few-shot prompting
    prompt = f"""
You are a senior loan officer with 10 years of experience. Analyze this application carefully.
Decision options: approve, reject, escalate
- Approve: Strong financials, low risk
- Reject: Clear disqualifying factors
- Escalate: Mixed signals, ambiguous case
Think through the evidence before deciding. Be honest about confidence level.
Examples:
Income: $150k, Credit: 800, Debt: Low -> Approve (0.95 confidence)
Income: $20k, Credit: 450, Debt: High -> Reject (0.90 confidence)
Income: $60k, Credit: 650, Debt: Moderate -> Escalate (0.70 confidence)
Application: {application_text}
Output JSON with applicant name, decision, reasoning, confidence (0-1), and required docs if approved.
Output:"""
    try:
        result = generator(prompt, max_tokens=300)
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        return {"status": "error", "message": str(e)}

    # Stage 2: Validate and monitor
    if result.confidence < 0.6:
        logging.warning(f"Low confidence ({result.confidence}) for {result.applicant_name}. Escalating.")
        return {
            "status": "escalated",
            "reason": "Model uncertainty",
            "model_decision": result.decision,
            "reasoning": result.reasoning,
        }

    # Stage 3: Final decision
    if result.decision == "escalate":
        logging.info(f"Model recommends escalation: {result.reasoning}")
        return {
            "status": "escalated",
            "reason": result.reasoning,
            "recommendation": "Review by senior underwriter",
        }
    elif result.decision == "approve":
        logging.info(f"Loan approved for {result.applicant_name}")
        return {
            "status": "approved",
            "reasoning": result.reasoning,
            "confidence": result.confidence,
            "required_documents": result.required_documents,
        }
    else:  # reject
        logging.info(f"Loan rejected for {result.applicant_name}: {result.reasoning}")
        return {
            "status": "rejected",
            "reasoning": result.reasoning,
            "confidence": result.confidence,
        }

# Usage
result = approve_loan_with_reliability(
    "Applicant: John Smith, Income: $75k, Credit Score: 710, Debt: $15k"
)
print(result)
This system combines:
- Constraints (valid JSON schema).
- Prompting (few-shot examples, reasoning guidance).
- Confidence thresholds (auto vs. escalate).
- Fallback (escalation to human).
- Monitoring (logging all decisions).
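The routing rules in a system like this are easier to trust when they live in a pure function that can be unit-tested without loading a model. A minimal sketch of the same thresholds; route_decision is a hypothetical helper, and unlike the example above it sends any unrecognized decision label to a human rather than treating it as a rejection:

```python
def route_decision(decision: str, confidence: float, min_confidence: float = 0.6) -> str:
    """Map a model decision and confidence to a final status."""
    if confidence < min_confidence or decision == "escalate":
        return "escalated"
    if decision == "approve":
        return "approved"
    if decision == "reject":
        return "rejected"
    return "escalated"  # unknown label: fail safe by sending it to a human

cases = [("approve", 0.92), ("approve", 0.55), ("reject", 0.8), ("escalate", 0.9)]
for decision, confidence in cases:
    print(decision, confidence, "->", route_decision(decision, confidence))
```

With the policy isolated like this, changing a threshold becomes a one-line change covered by tests instead of an edit buried inside generation code.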
Key Takeaways
- Constraints guarantee syntax; prompts guide semantics. Use both for maximum reliability.
- Few-shot examples, chain-of-thought, and confidence scores improve output quality within constraints.
- Multi-stage pipelines reduce hallucination by narrowing scope per stage.
- Monitoring, fallback chains, and human oversight catch errors constraints miss.
- For high-stakes applications, design for human-in-the-loop: auto-approve high-confidence, escalate low-confidence.
- Reliability is a systems property, not a single technique—combine constraints, prompts, monitoring, and processes.
Frequently Asked Questions
How much does adding constraints slow down generation?
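It depends on the library, the model, and the complexity of the constraint. As a rough guide, expect about 10–25% overhead for simple constraints (enums, JSON schemas) and up to 50% for complex grammars; measure on your own workload before relying on these figures. The cost usually pays for itself: no parse failures and no retries for malformed output, although you may still retry for semantic reasons such as low confidence.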
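Since overhead varies by setup, it is worth measuring on your own workload. A minimal timing sketch using only the standard library; the two fake functions are stand-ins (assumptions for illustration) for calls to an unconstrained and a constrained generator:

```python
import time
from statistics import median
from typing import Callable

def median_latency(fn: Callable[[], object], runs: int = 5) -> float:
    """Median wall-clock time of fn over several runs, in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)

def overhead_percent(baseline: Callable[[], object],
                     constrained: Callable[[], object],
                     runs: int = 5) -> float:
    """Relative slowdown of constrained vs. baseline, as a percentage."""
    base = median_latency(baseline, runs)
    return (median_latency(constrained, runs) - base) / base * 100.0

# Stand-ins that simulate generation time with sleep
def fake_unconstrained() -> None:
    time.sleep(0.01)

def fake_constrained() -> None:
    time.sleep(0.03)

print(f"{overhead_percent(fake_unconstrained, fake_constrained):.0f}% overhead")
```

Use the same prompts and token limits for both calls, and discard the first run if the library compiles the constraint lazily, since that one-time cost would otherwise skew the comparison.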
Can I use constraints without changing my prompts?
Yes, but quality suffers. Constraints enforce structure; prompts guide reasoning. Pairing them (constraints + well-written prompts) is more effective than either alone.
What if the model contradicts my constraints?
The constraint forces the model to choose a valid option even if it "disagrees." For example, if you constrain to ["yes", "no"], the model can't output "maybe"—it must pick one. This is a feature (structure) but can distort semantics (force false binary). Mitigate by widening constraints to allow ambiguity (e.g., add "unsure" as a valid option).
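To make the forced-binary problem concrete, the sketch below uses a plain regular expression as a stand-in for the constraint and shows what each version admits. The patterns and the route helper are illustrative assumptions, not tied to a specific library:

```python
import re

BINARY = re.compile(r"yes|no")
WITH_UNSURE = re.compile(r"yes|no|unsure")

def admits(pattern: re.Pattern, answer: str) -> bool:
    """True if the constraint would allow this exact output."""
    return pattern.fullmatch(answer) is not None

def route(answer: str) -> str:
    """Send the explicit 'unsure' label to human review instead of acting on it."""
    return "human_review" if answer == "unsure" else "automated"

print(admits(BINARY, "unsure"), admits(WITH_UNSURE, "unsure"))
print(route("unsure"), route("yes"))
```

The widened constraint only helps if downstream code treats the extra label differently, which is why the routing step is part of the fix.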
How do I know if my system is reliable enough?
Measure: (1) Syntactic validity (100% with constraints), (2) Semantic correctness (manual review sample, e.g., 100 results), (3) User satisfaction (feedback from end-users), (4) Error rate (% of requests requiring human review). Set targets: e.g., 95% semantic correctness, <5% manual review rate.
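These measurements can be computed from a reviewed sample in a few lines. A sketch assuming each reviewed result records whether it parsed, whether a human judged it correct, and whether it needed manual review; the class, function, and sample data are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ReviewedResult:
    parsed_ok: bool       # output parsed against the schema
    judged_correct: bool  # a human reviewer marked it semantically correct
    needed_review: bool   # the request was escalated to a human

def reliability_report(results: list[ReviewedResult], targets: dict[str, float]) -> dict:
    """Compute reliability metrics and compare them against targets."""
    if not results:
        raise ValueError("need at least one reviewed result")
    n = len(results)
    metrics = {
        "syntactic_validity": sum(r.parsed_ok for r in results) / n,
        "semantic_correctness": sum(r.judged_correct for r in results) / n,
        "manual_review_rate": sum(r.needed_review for r in results) / n,
    }
    # manual_review_rate is a ceiling (must stay below target); the others are floors
    targets_met = {
        name: (metrics[name] < target if name == "manual_review_rate"
               else metrics[name] >= target)
        for name, target in targets.items()
    }
    return {"metrics": metrics, "targets_met": targets_met}

# Hypothetical sample of 100 reviewed results: 96 correct, 4 escalated
sample = [ReviewedResult(True, i >= 4, i < 4) for i in range(100)]
report = reliability_report(sample, {"semantic_correctness": 0.95, "manual_review_rate": 0.05})
print(report)
```

User satisfaction is left out because it comes from feedback channels rather than from the output log.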
Can I combine constraints from different libraries?
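Partly. Outlines is widely used but is not a universal standard, and grammar formats differ between tools: GBNF is llama.cpp's own grammar format, while Outlines works from JSON Schema, regular expressions, and context-free grammars in its own syntax. JSON Schema is the most portable choice, since Outlines, llama.cpp, and vLLM can all constrain output from one. Custom constraints (e.g., a specialized FSM) and hand-written grammars may need translation when you switch tools, so prefer JSON Schema where portability matters.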
Further Reading
- Prompt Engineering Best Practices (OpenAI) — Foundational prompt techniques.
- Chain-of-Thought Prompting (Wei et al., 2022) — Seminal paper on reasoning-before-answering.
- Reliable AI Systems: The Role of Constraints and Monitoring — Anthropic's perspective on reliability.
- Building Resilient Systems with LLMs (Tutorial) — Practical systems design.