{
    "slug": "constitutional_ai",
    "term": "Constitutional AI (CAI)",
    "category": "ai_ml",
    "difficulty": "advanced",
    "short": "Anthropic's training methodology where models critique and revise their own outputs against a set of written principles, reducing reliance on human labellers for alignment.",
    "long": "Constitutional AI (CAI) is Anthropic's approach to scaling AI alignment by replacing much of the human feedback in RLHF with AI-generated feedback against an explicit set of principles — the 'constitution'. Training runs in two phases. (1) SL-CAI: a supervised stage where the model generates a response, critiques itself against the constitution ('this response is harmful because...'), revises the response, and is fine-tuned on the revisions. (2) RLAIF (Reinforcement Learning from AI Feedback): an AI model labels which of two responses better follows the constitution, and the resulting preferences train a reward model used in standard RL. The constitution itself is a curated list of principles drawn from sources like the UN's Universal Declaration of Human Rights, terms of service, and safety considerations. CAI's key contribution is making alignment principles explicit and auditable rather than implicit in millions of human ratings, while reducing the human labour bottleneck. Claude is trained using a combination of RLHF and CAI. Developers can apply the same pattern at the application layer: have the model critique its own outputs against your principles before returning them to the user; this self-critique step is an effective lightweight guardrail.",
    "aliases": [
        "CAI",
        "RLAIF",
        "AI feedback training",
        "principle-based alignment",
        "self-critique training"
    ],
    "tags": [
        "alignment",
        "training",
        "claude",
        "anthropic",
        "ai-safety",
        "ai"
    ],
    "misconception": "Constitutional AI replaces RLHF entirely or eliminates the need for human input. CAI reduces and complements RLHF — humans still author the constitution and run extensive evals. The 'AI feedback' is grounded in human-written principles; CAI scales human judgment rather than replacing it.",
    "why_it_matters": "CAI explains a lot about how Claude behaves: why it refuses certain requests, why it offers balanced perspectives, why it acknowledges uncertainty. For developers building LLM applications, the application-layer pattern — model self-critique against explicit principles — is a powerful, easy-to-implement guardrail technique that doesn't require any training infrastructure.",
    "common_mistakes": [
        "Confusing 'constitutional AI' with anything related to actual constitutions or legal text — it is principle-based AI training.",
        "Writing a single-sentence 'constitution' for an app and expecting deep alignment — production CAI uses curated, multi-source, iteratively refined principles.",
        "Skipping evaluation — even CAI'd models can misbehave; you must still red-team and benchmark.",
        "Using AI self-critique without principle grounding — vague 'is this OK?' prompts produce vague, often wrong, judgments."
    ],
    "when_to_use": [
        "Building LLM features that require consistent adherence to explicit policies (content moderation, customer support, code review).",
        "Adding a runtime guardrail that catches policy violations before output reaches the user.",
        "Explaining model behaviour and refusals to stakeholders or end users."
    ],
    "avoid_when": [
        "Latency-critical paths where an extra critique call is unacceptable — use cheaper guardrails (regex, classifiers) instead.",
        "Single-shot calls where the principles are already embedded in the system prompt and additional self-critique adds no value."
    ],
    "related": [
        "rlhf",
        "ai_alignment",
        "ai_guardrails",
        "ai_evaluation_metrics",
        "large_language_models"
    ],
    "prerequisites": [
        "rlhf",
        "ai_alignment"
    ],
    "refs": [
        "https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback",
        "https://arxiv.org/abs/2212.08073"
    ],
    "bad_code": "// ❌ Vague self-critique with no principle grounding\n$critique = $client->messages->create([\n    'model'      => 'claude-sonnet-4-20250514',\n    'max_tokens' => 200,\n    'messages'   => [[\n        'role'    => 'user',\n        'content' => \"Is this response OK?\\n\\n{$draftResponse}\"\n    ]]\n]);\n// 'OK' is undefined, so the model returns a vague, non-actionable judgment.",
    "good_code": "// ✅ Application-layer CAI: critique against explicit principles, then revise\n$constitution = <<<TXT\nResponses must:\n1. Be factually accurate; flag uncertainty when present.\n2. Avoid recommending actions that could cause data loss without explicit confirmation.\n3. Not include credentials, secrets, or PII verbatim.\n4. Use the customer's language and respect their stated preferences.\nTXT;\n\n$critique = $client->messages->create([\n    'model'      => 'claude-sonnet-4-20250514',\n    'max_tokens' => 600,\n    'system'     => \"Critique the response against these principles. \"\n                  . \"For each violation, name the principle and quote the offending text.\\n\\n{$constitution}\",\n    'messages'   => [[\n        'role'    => 'user',\n        'content' => \"Original request:\\n{$userMessage}\\n\\nDraft response:\\n{$draftResponse}\"\n    ]]\n]);\n\n// Then either revise the draft based on the critique,\n// or block the response if violations are severe.",
    "example_note": "This is the application-layer CAI pattern — useful as a runtime guardrail, distinct from training-time CAI which requires the full RL pipeline.",
    "quick_fix": "When building LLM features, write 3–5 explicit principles for acceptable outputs and have the model self-critique against them before returning to the user.",
    "severity": "info",
    "effort": "medium",
    "created": "2026-04-28",
    "updated": "2026-04-28",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/constitutional_ai",
        "html_url": "https://codeclaritylab.com/glossary/constitutional_ai",
        "json_url": "https://codeclaritylab.com/glossary/constitutional_ai.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "bad_code",
                "good_code"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Constitutional AI (CAI)](https://codeclaritylab.com/glossary/constitutional_ai) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/constitutional_ai"
            }
        }
    }
}