Constitutional AI: Aligning Models Using Core Principles
Constitutional AI (CAI) is an alignment approach that steers language models toward desired behaviors by defining a set of principles (a "constitution") and using those principles to generate feedback, rank completions, and guide training. Instead of relying solely on human feedback at scale, CAI uses AI systems (typically the model being trained) to evaluate and improve completions against constitutional principles, drastically reducing the need for human annotation. Anthropic introduced CAI in 2022, and by 2026 it has become a foundational technique for organizations building safe, principled AI systems.
Constitutional AI is particularly valuable for alignment goals that are hard to specify via preference pairs alone: "be helpful" and "be harmless" are better defined via principles than by collecting thousands of examples. CAI also enables scaling: one human-approved constitution can steer millions of interactions without proportional annotation cost.
The Constitution: Defining Principles
A constitution is a list of principles that guide model behavior. Anthropic's public constitution includes principles like:
- "Critique requested outputs when they provide advice that appears to violate widely accepted ethical and legal norms."
- "Avoid providing information that appears to violate widely accepted ethical and legal norms."
- "Refuse requests that ask it to perform harmful activities."
A constitution typically has 10–30 principles, covering safety (refuse harmful requests), honesty (avoid hallucination, acknowledge uncertainty), helpfulness (answer questions clearly), and sometimes domain-specific values (for a financial advisor: "prioritize investor protection over maximizing returns").
Designing a constitution requires careful thought. Principles must be:
- Clear and unambiguous: the model (and humans reviewing it) should understand what "helpful" or "harmless" means.
- Non-contradictory: principles shouldn't pit safety against helpfulness in harmful ways.
- Justified: principles should reflect the organization's values and legal/ethical obligations, not arbitrary preferences.
Many organizations involve legal, safety, and ethics teams in constitution design. Some publish their constitutions; others keep them proprietary.
The CAI Pipeline: Critique, Revision, and Training
Constitutional AI operates in stages:
Stage 1: Critique and Revision. Given a prompt and an initial completion from the model, the model itself critiques the completion against the constitution. The critique is often generated by instructing the model: "Does this response violate any of the following principles? [Constitution]." If violations are identified, the model revises the completion to address them. This critique-revise loop is repeated 1–2 times.
Example:
- Prompt: "How do I make methamphetamine?"
- Initial completion: "Here are the steps: ..."
- Critique: "This response violates principles X and Y (provide instructions for illegal drugs). It should refuse."
- Revised completion: "I can't help with that. Producing methamphetamine is illegal and dangerous."
Stage 2: AI-Generated Preference Data. For a set of prompts, generate two completions: one refined via critique-revision (preferred), and one original unrefined version (dispreferred). This creates preference pairs without human annotators. Alternatively, generate multiple completions and rank them via constitution-based scoring (how many principles does each violate?).
Stage 3: Training. Use the AI-generated preference data to fine-tune the model via DPO, RLHF, or supervised loss. The model learns to avoid principle violations and produce critiques and revisions automatically.
Critique-based feedback provides explicit reasoning, helping the model understand why a response is problematic—more interpretable than implicit reward signals from RLHF.
Rule-Based Scoring and Judges
An alternative to model-based critique is rule-based scoring: use keyword checks, length heuristics, or simple classifiers to detect principle violations. For example:
- Safety judge: does the response contain keywords associated with harm (instructions for weapons, drugs, etc.)? If yes, decrease score.
- Honesty judge: does the response acknowledge uncertainty on factual claims? If not, flag as potentially overconfident.
- Helpfulness judge: does the response address the user's question? If not, flag as unhelpful.
Rule-based judges are fast, interpretable, and don't require model inference. They're often combined with model-based critique: rule-based judges handle obvious cases (explicit safety violations), and model-based critique handles nuanced judgment (Is this advice unethical or just unconventional?).
Scaling Constitutional AI
The power of CAI is in scaling. A well-designed constitution and critique process can govern model behavior across domains and scenarios without proportional annotation cost. Anthropic reported using CAI to scale alignment to billions of tokens: a single constitution guides the model's behavior across customer-service, coding, creative writing, and reasoning tasks.
However, scaling has risks: a constitution that works for one domain may fail in another. A principle like "prioritize user privacy" is clear in a customer-data context but ambiguous for a public research assistant. By 2026, practitioners recognize that CAI works best with multiple specialized constitutions (one per domain or use case) rather than a single universal constitution.
Case Study: Domain-Specific Constitutional AI
Imagine a financial advisory chatbot. The constitution might include:
- "Refuse to give specific investment recommendations; recommend consulting a financial advisor."
- "Clearly distinguish between historical data and predictions."
- "Prioritize investor protection over maximizing returns."
- "Acknowledge fees and conflicts of interest."
Using CAI:
- Define the constitution (done with finance experts and legal team).
- Generate critiques for a sample of candidate completions using the model and constitution.
- Create preference pairs: (revised/critiqued completion, original completion).
- Fine-tune via DPO using 10,000–50,000 AI-generated pairs (no human annotation needed).
- Validate on human-reviewed examples from financial experts.
This approach produced a model 80+ percent compliant with regulatory principles within 2 weeks and at 10 percent the cost of manual annotation.
Challenges and Limitations of CAI
Challenge 1: Constitution specification. Good constitutions are hard to write. A vague or contradictory constitution teaches the model inconsistent behavior. Example: if the constitution says both "refuse unsafe requests" and "be maximally helpful," a model might learn to refuse safety-critical requests (over-refusal).
Challenge 2: Model gaming. The model may learn to appear constitutional without being truly aligned. For example, it might refuse a request with plausible-sounding reasoning ("I can't because of privacy concerns") even when the reasoning is false. Detecting this requires adversarial testing.
Challenge 3: Scale and domain generalization. CAI works well within the distribution of the training data. On out-of-domain examples or adversarial inputs, the model may fail. A financial advisor trained on safe scenarios may still fail when given edge cases.
Challenge 4: Updates and iteration. If the constitution changes or a principle is found to be flawed, retraining is expensive. By 2026, teams use in-context constitution (passing principles to the model in the prompt) as a complementary approach, trading some optimization for flexibility.
Combining CAI With Preference-Based Alignment
Best practice in 2026 is hybrid: use CAI for large-scale AI-generated preference data (cost-effective, interpretable), then blend with human-annotated preference data (100–1000 carefully chosen examples) to catch edge cases and refine the model. This combines CAI's scale with human judgment's precision.
A typical workflow:
- Define constitution (5–20 principles).
- Generate AI preference data via critique-revision (50,000–100,000 pairs).
- Train initial model with DPO.
- Collect 1,000–2,000 human-annotated preference pairs on edge cases, failure modes, and out-of-distribution examples.
- Fine-tune a second time with the human data (smaller dataset, high quality).
- Validate on held-out human evaluation.
This approach balances cost, interpretability, and accuracy.
Code Example: Constitutional AI Critique Pipeline
Below is a Python implementation of critique-based preference generation:
from typing import List, Dict
import json
class ConstitutionalAI:
"""Constitutional AI critique and revision system."""
def __init__(self, model, constitution: List[str]):
"""
Args:
model: language model for generating critiques and revisions
constitution: list of constitutional principles
"""
self.model = model
self.constitution = constitution
def critique_prompt(self, prompt: str, completion: str) -> str:
"""Generate a prompt for the model to critique a completion."""
constitution_text = "\n".join(
f"{i+1}. {p}" for i, p in enumerate(self.constitution)
)
critique_prompt = f"""Please critique the following response based on these principles:
{constitution_text}
Prompt: {prompt}
Response: {completion}
Critique: Does this response violate any of the principles? If so, which ones and why?"""
return critique_prompt
def generate_critique(self, prompt: str, completion: str) -> str:
"""Generate a critique of the completion."""
critique_prompt = self.critique_prompt(prompt, completion)
critique = self.model.generate(critique_prompt, max_tokens=200)
return critique
def revision_prompt(self, prompt: str, completion: str, critique: str) -> str:
"""Generate a prompt to revise a completion based on critique."""
constitution_text = "\n".join(
f"{i+1}. {p}" for i, p in enumerate(self.constitution)
)
revision_prompt = f"""Given the critique, please revise the response to address all identified violations.
Principles:
{constitution_text}
Prompt: {prompt}
Original Response: {completion}
Critique: {critique}
Revised Response:"""
return revision_prompt
def generate_revised_completion(self, prompt: str, completion: str) -> str:
"""Generate a revised completion addressing critique."""
critique = self.generate_critique(prompt, completion)
revision_prompt = self.revision_prompt(prompt, completion, critique)
revised = self.model.generate(revision_prompt, max_tokens=300)
return revised
def generate_preference_pair(self, prompt: str, completion: str) -> Dict:
"""Generate a preference pair: (revised, original)."""
revised_completion = self.generate_revised_completion(prompt, completion)
# Revised is preferred over original
return {
'prompt': prompt,
'preferred': revised_completion,
'dispreferred': completion,
'constitution': self.constitution,
'source': 'constitutional_ai'
}
def generate_dataset(self, prompts: List[str], completions: List[str],
output_path: str):
"""Generate a dataset of preference pairs."""
pairs = []
for prompt, completion in zip(prompts, completions):
try:
pair = self.generate_preference_pair(prompt, completion)
pairs.append(pair)
except Exception as e:
print(f"Error on prompt '{prompt}': {e}")
# Save to JSONL
with open(output_path, 'w') as f:
for pair in pairs:
f.write(json.dumps(pair) + '\n')
print(f"Generated {len(pairs)} preference pairs")
return pairs
# Example usage
constitution = [
"Refuse to provide instructions for illegal or dangerous activities.",
"Acknowledge uncertainty and avoid overconfident claims.",
"Be helpful and direct in addressing the user's question.",
"Prioritize accuracy over brevity.",
"Avoid perpetuating stereotypes or bias.",
]
cai = ConstitutionalAI(model=my_model, constitution=constitution)
# Generate a single preference pair
pair = cai.generate_preference_pair(
prompt="How do I hack my neighbor's WiFi?",
completion="Here's how to break into WiFi networks..."
)
print(pair['preferred']) # Revised: refuses with explanation
# Generate a full dataset
prompts = ["Prompt 1", "Prompt 2", ...] # from your dataset
completions = ["Completion 1", "Completion 2", ...]
cai.generate_dataset(prompts, completions, 'cai_preferences.jsonl')
This code demonstrates how to automatically critique and revise completions, creating scalable preference data.
Key Takeaways
- Constitutional AI aligns models using explicit principles and AI-generated feedback, scaling alignment without proportional human annotation cost.
- A constitution is a list of 10–30 principles defining desired behavior; critiques are generated by applying the constitution to completions.
- The CAI pipeline: critique and revise initial completions, create preference pairs, and train via DPO/RLHF.
- Hybrid approaches (CAI data + human-annotated edge cases) balance scale and precision, becoming the 2026 standard.
- CAI is interpretable and flexible but requires careful constitution design and validation against adversarial inputs.
Frequently Asked Questions
How do I write a good constitution?
Involve domain experts, legal teams (for compliance-critical systems), and stakeholders. Start with 5–10 core principles, test on representative examples, and iterate. Principles should be clear, non-contradictory, and justified by organizational values or legal obligations. Have humans critique the constitution before deploying.
Can CAI handle contradictory principles?
Partially. If principles conflict (e.g., "maximize speed" vs. "ensure accuracy"), the model learns to balance them implicitly, but behavior may be inconsistent. Best practice: resolve conflicts at constitution design time, not in training.
How much human feedback does CAI need?
CAI can operate with purely AI-generated data, but best practice is to blend: 80–90 percent AI-generated pairs, 10–20 percent human-annotated pairs (on failures, edge cases, and out-of-distribution examples). Pure AI-generation risks systematic biases.
How do I know if my model is truly constitutional or just appears to be?
Test on adversarial examples designed to make the model violate principles. For example, if the constitution says "refuse illegal requests," try phrasing the request differently (e.g., asking for "educational purposes"). Also use red-teaming: have domain experts try to find loopholes.
Further Reading
- Constitutional AI: Harmlessness from AI Feedback — Anthropic's foundational CAI paper introducing critique-based alignment.
- Scaling AI as a Collaborator for Scientists and Engineers — applying CAI principles to scientific discovery and engineering.
- Interpretable and Faithful Explanations with Large Language Models — how CAI improves interpretability of model decisions.
- Red Teaming Language Models with Language Models — adversarial testing of CAI-aligned models.