
Constitutional AI (CAI)

ai_ml · Advanced

Also Known As

CAI · RLAIF · AI feedback training · principle-based alignment · self-critique training

TL;DR

Anthropic's training methodology where models critique and revise their own outputs against a set of written principles, reducing reliance on human labellers for alignment.

Explanation

Constitutional AI (CAI) is Anthropic's approach to scaling AI alignment by replacing much of the human feedback in RLHF with AI-generated feedback against an explicit set of principles — the 'constitution'. Training runs in two phases:

  1. SL-CAI (supervised learning): the model generates a response, critiques it against the constitution ('this response is harmful because...'), revises it, and is fine-tuned on the revisions.
  2. RLAIF (Reinforcement Learning from AI Feedback): a model labels which of two responses better follows the constitution, and the resulting preferences train a reward model used in standard RL.

The constitution itself is a curated list of principles drawn from sources such as the UN's Universal Declaration of Human Rights, terms of service, and safety considerations. CAI's key contribution is making alignment principles explicit and auditable rather than implicit in millions of human ratings, while reducing the human-labour bottleneck. Claude is trained using a combination of RLHF and CAI. Developers can apply the same pattern at the application layer: have the model critique its own outputs against your principles before returning them; this makes an effective, lightweight guardrail (see Code Examples below).
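
To make phase (1) concrete, here is a conceptual sketch of the SL-CAI data-generation loop, written in the same client style as the examples further down this page. It is illustrative pseudocode, not Anthropic's pipeline: the ask() helper, the $redTeamPrompts array, the single hard-coded principle, and the response-shape access are all assumptions.

// Conceptual sketch of the SL-CAI data-generation loop (illustrative only;
// real training-time CAI samples many principles and fine-tunes at scale).

// Hypothetical helper wrapping the messages API used elsewhere on this page.
function ask($client, string $content): string {
    $response = $client->messages->create([
        'model'      => 'claude-sonnet-4-20250514',
        'max_tokens' => 600,
        'messages'   => [['role' => 'user', 'content' => $content]],
    ]);
    return $response->content[0]->text; // assumed SDK response shape
}

$principle = 'Choose the response that is most helpful, honest, and harmless.';
$sftData   = [];

foreach ($redTeamPrompts as $prompt) {
    // 1. Generate an initial (possibly problematic) response.
    $draft = ask($client, $prompt);

    // 2. Self-critique against a constitutional principle.
    $critiqueText = ask($client, "Critique this response against the principle:\n{$principle}\n\nResponse:\n{$draft}");

    // 3. Revise the response in light of the critique.
    $revision = ask($client, "Rewrite the response to address this critique:\n{$critiqueText}\n\nOriginal response:\n{$draft}");

    // 4. Keep (prompt, revision) pairs as supervised fine-tuning data.
    $sftData[] = ['prompt' => $prompt, 'completion' => $revision];
}

The RLAIF phase then swaps steps 2–3 for preference labelling (which of two responses better follows the constitution?) and uses the resulting labels to train a reward model.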

Common Misconception

The misconception: that Constitutional AI replaces RLHF entirely or eliminates the need for human input. In reality, CAI reduces and complements RLHF — humans still author the constitution and run extensive evals. The 'AI feedback' is grounded in human-written principles; CAI scales human judgment rather than replacing it.

Why It Matters

CAI explains a lot about how Claude behaves: why it refuses certain requests, why it offers balanced perspectives, why it acknowledges uncertainty. For developers building LLM applications, the application-layer pattern — model self-critique against explicit principles — is a powerful, easy-to-implement guardrail technique that doesn't require any training infrastructure.

Common Mistakes

  • Confusing 'constitutional AI' with anything related to actual constitutions or legal text — it is principle-based AI training.
  • Writing a single-sentence 'constitution' for an app and expecting deep alignment — production CAI uses curated, multi-source, iteratively refined principles.
  • Skipping evaluation — even CAI'd models can misbehave; you must still red-team and benchmark.
  • Using AI self-critique without principle grounding — vague 'is this OK?' prompts produce vague, often wrong, judgments.

Avoid When

  • Latency-critical paths where an extra critique call is unacceptable — use cheaper guardrails (regex, classifiers) instead (a sketch follows this list).
  • Single-shot calls where the principles are already embedded in the system prompt and additional self-critique adds no value.
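
For those latency-critical paths, a synchronous pattern check costs microseconds instead of an extra model call. A minimal sketch; the two patterns (leaked credentials, SSN-shaped strings) and the fallback behaviour are illustrative assumptions to adapt to your own policies.

// Minimal sketch of a cheap pattern-based guardrail for latency-critical paths.
// The patterns are illustrative assumptions; tune them to your own policies.
function violatesPolicy(string $text): bool {
    $patterns = [
        '/\b(?:password|api[_-]?key|secret)\s*[:=]\s*\S+/i', // leaked credentials
        '/\b\d{3}-\d{2}-\d{4}\b/',                           // US SSN-shaped strings
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

// Usage: block before the response ever reaches the user.
if (violatesPolicy($draftResponse)) {
    $draftResponse = "Sorry, I can't share that.";
}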

When To Use

  • Building LLM features that require consistent adherence to explicit policies (content moderation, customer support, code review).
  • Adding a runtime guardrail that catches policy violations before output reaches the user.
  • Explaining model behaviour and refusals to stakeholders or end users.

Code Examples

💡 Note
This is the application-layer CAI pattern — useful as a runtime guardrail, distinct from training-time CAI, which requires the full RL pipeline.
✗ Vulnerable
// ❌ Vague self-critique with no principle grounding
$critique = $client->messages->create([
    'model'    => 'claude-sonnet-4-20250514',
    'max_tokens' => 200,
    'messages' => [[
        'role'    => 'user',
        'content' => "Is this response OK? \n\n{$draftResponse}"
    ]]
]);
// 'OK' is undefined — the model will return a non-actionable judgment.
✓ Fixed
// ✅ Application-layer CAI: critique against explicit principles, then revise
$constitution = <<<TXT
Responses must:
1. Be factually accurate; flag uncertainty when present.
2. Avoid recommending actions that could cause data loss without explicit confirmation.
3. Not include credentials, secrets, or PII verbatim.
4. Use the customer's language and respect their stated preferences.
TXT;

$critique = $client->messages->create([
    'model'    => 'claude-sonnet-4-20250514',
    'max_tokens' => 600,
    'system'   => "Critique the response against these principles. "
                . "For each violation, name the principle and quote the offending text.\n\n{$constitution}",
    'messages' => [[
        'role'    => 'user',
        'content' => "Original request:\n{$userMessage}\n\nDraft response:\n{$draftResponse}"
    ]]
]);

// Then either revise the draft based on the critique,
// or block the response if violations are severe.
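
A sketch of that revise-or-block step, continuing the example above. The keyword check on the critique text is an assumption made for brevity; in production, ask the critique call for structured output (for example a JSON list of violations) and parse that instead.

// Sketch of the revise-or-block step. The keyword check is a stand-in:
// prefer structured critique output in production.
$critiqueText = $critique->content[0]->text; // assumed SDK response shape

if (stripos($critiqueText, 'violation') !== false) {
    // Violations found: revise the draft in light of the critique.
    $revision = $client->messages->create([
        'model'      => 'claude-sonnet-4-20250514',
        'max_tokens' => 1000,
        'system'     => "Rewrite the draft so it satisfies every principle.\n\n{$constitution}",
        'messages'   => [[
            'role'    => 'user',
            'content' => "Draft response:\n{$draftResponse}\n\nCritique:\n{$critiqueText}"
        ]]
    ]);
    $finalResponse = $revision->content[0]->text;
} else {
    $finalResponse = $draftResponse; // clean critique: pass the draft through
}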

DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: Medium
⚡ Quick Fix
When building LLM features, write 3–5 explicit principles for acceptable outputs and have the model self-critique against them before returning to the user.
📦 Applies To
web · cli · queue-worker