Constitutional AI (CAI)
Also Known As
TL;DR
Train (or prompt) a model to critique and revise its own outputs against an explicit, written list of principles, so alignment comes from auditable rules rather than being implicit in millions of human ratings.
Explanation
Constitutional AI (CAI) is Anthropic's approach to scaling AI alignment by replacing much of the human feedback in RLHF with AI-generated feedback against an explicit set of principles — the 'constitution'. Training runs in two phases. (1) SL-CAI: a supervised stage where the model generates a response, critiques itself against the constitution ('this response is harmful because...'), revises the response, and is fine-tuned on the revisions. (2) RLAIF (Reinforcement Learning from AI Feedback): a model labels which of two responses better follows the constitution, and the resulting preferences train a reward model used in standard RL. The constitution itself is a curated list of principles drawn from sources like the UN Declaration of Human Rights, terms of service, and safety considerations. CAI's key contribution is making alignment principles explicit and auditable rather than implicit in millions of human ratings, while reducing the human labour bottleneck. Claude is trained using a combination of RLHF and CAI. Developers can apply the same pattern at the application layer: have the model critique its own outputs against your principles before returning them; this makes an effective lightweight guardrail.
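The RLAIF labelling step above can be sketched at the prompt level. A minimal sketch; the function names `buildComparisonPrompt` and `parsePreference` are hypothetical helpers, not part of any SDK, and the exact wording the judge model receives is an illustrative assumption:

```php
<?php
// Build the pairwise-comparison prompt used for AI preference labelling:
// a judge model is asked which of two responses better follows the constitution.
function buildComparisonPrompt(string $constitution, string $responseA, string $responseB): string
{
    return "Principles:\n{$constitution}\n\n"
        . "Response A:\n{$responseA}\n\n"
        . "Response B:\n{$responseB}\n\n"
        . "Which response better follows the principles? Answer with exactly 'A' or 'B'.";
}

// Extract the judge's verdict; null if the reply is not a clean A/B answer.
function parsePreference(string $judgeReply): ?string
{
    $verdict = strtoupper(trim($judgeReply));
    return in_array($verdict, ['A', 'B'], true) ? $verdict : null;
}
```

In full CAI training, preferences labelled this way train a reward model; at the application layer the same pattern can rank candidate responses against your own principles.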
Common Misconception
That CAI replaces human feedback entirely. It replaces much of it: Claude is trained with a combination of RLHF and CAI, and humans still curate the constitution and evaluate the results.
Why It Matters
It makes alignment principles explicit and auditable instead of implicit in millions of human ratings, and it eases the human-labelling bottleneck that limits how far RLHF can scale.
Common Mistakes
- Confusing 'constitutional AI' with anything related to actual constitutions or legal text — it is principle-based AI training.
- Writing a single-sentence 'constitution' for an app and expecting deep alignment — production CAI uses curated, multi-source, iteratively refined principles.
- Skipping evaluation — even CAI-trained models can misbehave; you must still red-team and benchmark.
- Using AI self-critique without principle grounding — vague 'is this OK?' prompts produce vague, often wrong, judgments.
Avoid When
- Latency-critical paths where an extra critique call is unacceptable — use cheaper guardrails (regex, classifiers) instead.
- Single-shot calls where the principles are already embedded in the system prompt and additional self-critique adds no value.
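The 'cheaper guardrails' mentioned above can be as simple as a regex screen run before (or instead of) a critique call. A minimal sketch; the patterns are illustrative assumptions, not a complete secret-detection set:

```php
<?php
// Cheap pre-screen: reject drafts that match obvious secret patterns
// before spending latency and tokens on a critique call.
function failsCheapGuardrail(string $draft): bool
{
    $patterns = [
        '/\bAKIA[0-9A-Z]{16}\b/',               // AWS access key ID shape
        '/-----BEGIN [A-Z ]*PRIVATE KEY-----/', // PEM private key header
        '/\bpassword\s*[:=]\s*\S+/i',           // inline password assignment
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $draft) === 1) {
            return true;
        }
    }
    return false;
}
```

A classifier model fills the same role with better recall; the trade-off is the extra inference cost.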
When To Use
- Building LLM features that require consistent adherence to explicit policies (content moderation, customer support, code review).
- Adding a runtime guardrail that catches policy violations before output reaches the user.
- Explaining model behaviour and refusals to stakeholders or end users.
Code Examples
// ❌ Vague self-critique with no principle grounding
$critique = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 200,
    'messages' => [[
        'role' => 'user',
        'content' => "Is this response OK?\n\n{$draftResponse}",
    ]],
]);
// 'OK' is undefined — the model will return a non-actionable judgment.
// ✅ Application-layer CAI: critique against explicit principles, then revise
$constitution = <<<TXT
Responses must:
1. Be factually accurate; flag uncertainty when present.
2. Avoid recommending actions that could cause data loss without explicit confirmation.
3. Not include credentials, secrets, or PII verbatim.
4. Use the customer's language and respect their stated preferences.
TXT;
$critique = $client->messages->create([
    'model' => 'claude-sonnet-4-20250514',
    'max_tokens' => 600,
    'system' => "Critique the response against these principles. "
        . "For each violation, name the principle and quote the offending text.\n\n{$constitution}",
    'messages' => [[
        'role' => 'user',
        'content' => "Original request:\n{$userMessage}\n\nDraft response:\n{$draftResponse}",
    ]],
]);
// Then either revise the draft based on the critique,
// or block the response if violations are severe.
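One way to act on the critique: treat explicitly flagged violations as blocking, and otherwise feed the critique back for a revision pass. A minimal sketch; the 'VIOLATION:' marker convention and `buildRevisionPrompt` are assumptions for illustration, so the critique system prompt would need to ask the model to prefix each finding that way:

```php
<?php
// True if the critique flags at least one violation, assuming the critique
// prompt asked the model to prefix each finding with "VIOLATION:".
function critiqueFlagsViolation(string $critique): bool
{
    return (bool) preg_match('/^VIOLATION:/mi', $critique);
}

// Prompt for the revision pass: hand the model its own draft plus the critique.
function buildRevisionPrompt(string $draft, string $critique): string
{
    return "Draft response:\n{$draft}\n\n"
        . "Critique:\n{$critique}\n\n"
        . "Rewrite the draft so that every critiqued violation is fixed. "
        . "Return only the revised response.";
}
```

Severity handling (block vs. revise) is application policy; the structured marker just makes the critique machine-checkable instead of free-form prose.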