Constitutional AI Explained: Training Harmless AI Assistants

Paper Overview

“Constitutional AI: Harmlessness from AI Feedback”
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … Kaplan, J. (2022)
Anthropic, arXiv preprint: arxiv.org/abs/2212.08073

Why it matters: “Constitutional AI: Harmlessness from AI Feedback” is one of the most widely cited recent alignment papers. It describes a training method, later used in developing Claude, for producing an assistant that is both helpful and harmless without relying on thousands of human-labeled examples of harmful content.

The Problem: RLHF at Scale Is Expensive and Inconsistent

Before CAI, the dominant approach was Reinforcement Learning from Human Feedback (RLHF). It works, but has three problems:

Cost: Labeling harmful content requires annotators to read disturbing material.
Inconsistency: Annotators disagree on edge cases.
Opacity: The reward model is a black box — hard to audit what values it encodes.

Anthropic’s insight: what if the model could critique and revise its own outputs, guided by a transparent set of written principles — a constitution?

Core Concept: What Is a Constitution?

A constitution is a short list of natural-language principles defining desirable behavior. Example principles:

“Please choose the response that is least likely to contain harmful or unethical content.”
“Which response is most supportive of people’s autonomy?”

These principles are readable and auditable, unlike reward model weights. When you change the constitution, the change itself is written down, so a later shift in behavior can be traced back to a specific edit.
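
One way to keep a constitution auditable in code is to store it as plain data and build prompts from it. The sketch below is illustrative and is not the paper's implementation: the `Principle` class, its `name` field, and `build_comparison_prompt` are hypothetical names, and the two principle texts are the examples quoted above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    """One constitutional principle; `name` is a hypothetical label for audit logs."""
    name: str
    text: str


CONSTITUTION = (
    Principle("harmlessness", "Please choose the response that is least likely "
              "to contain harmful or unethical content."),
    Principle("autonomy", "Which response is most supportive of people's autonomy?"),
)


def sample_principle(rng: random.Random) -> Principle:
    """Pick one principle at random from the constitution."""
    return rng.choice(CONSTITUTION)


def build_comparison_prompt(request: str, a: str, b: str, principle: Principle) -> str:
    """Render a prompt asking which of two responses better fits the principle."""
    return (
        f"User request: {request}\n\n"
        f"Response A: {a}\n\n"
        f"Response B: {b}\n\n"
        f"{principle.text}\n"
        "Answer with only the letter A or B."
    )


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    picked = sample_principle(rng)
    print(picked.name)
    print(build_comparison_prompt("How do I pick a lock?", "draft one", "draft two", picked))
```

Keeping the principles in one immutable tuple means a change to the model's stated values shows up as a reviewable diff in version control.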

flowchart TD
    A[Base LLM] --> B[Phase 1: SL-CAI]
    B --> C[Phase 2: RL-CAI]
    C --> D[Harmless + Helpful Model]
    B1[Red-team prompts] --> B
    B2[Self-critique via constitution] --> B
    B3[Revised responses as SFT data] --> B
    C1[AI preference labels from constitution] --> C
    C2[Preference Model trained on AI labels] --> C

Phase 1 — Supervised Learning with Self-Critique (SL-CAI)

The first phase: generate → critique → revise, with no human harm labels needed.

import anthropic
import random

client = anthropic.Anthropic()

CONSTITUTION_PRINCIPLES = [
    "Choose the response least likely to contain harmful, unethical, or illegal content.",
    "Choose the response most supportive of people's autonomy, avoiding paternalism.",
    "Choose the response a thoughtful, senior Anthropic employee would consider optimal.",
]

def get_initial_response(prompt: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

def critique_response(prompt: str, response: str, principle: str) -> str:
    critique_prompt = f"""Request: {prompt}

Response: {response}

Critique this response using the principle: {principle}
Identify specific ways it may be harmful or problematic."""
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": critique_prompt}],
    )
    return message.content[0].text

def revise_response(prompt: str, response: str, critique: str) -> str:
    revision_prompt = f"""Request: {prompt}
Initial response: {response}
Critique: {critique}

Rewrite the response to address the critique while remaining helpful."""
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": revision_prompt}],
    )
    return message.content[0].text

def sl_cai_pipeline(prompt: str, revision_rounds: int = 2) -> dict:
    stages = {}
    current_response = get_initial_response(prompt)
    stages["initial"] = current_response
    for i in range(revision_rounds):
        principle = random.choice(CONSTITUTION_PRINCIPLES)
        critique = critique_response(prompt, current_response, principle)
        revised = revise_response(prompt, current_response, critique)
        stages[f"round_{i+1}_revised"] = revised
        current_response = revised
    stages["final"] = current_response
    return stages

if __name__ == "__main__":
    result = sl_cai_pipeline("What household chemicals should never be mixed?")
    print("INITIAL:", result["initial"])
    print("FINAL:", result["final"])
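
The diagram lists "Revised responses as SFT data" as the output of this phase. A minimal sketch of that export step follows, assuming stage dictionaries in the shape `sl_cai_pipeline` returns. The helper names and the `prompt`/`completion` field names are assumptions; they are not prescribed by the paper or by any particular fine-tuning tool.

```python
import json


def to_sft_records(prompts: list[str], stage_dicts: list[dict]) -> list[dict]:
    """Pair each prompt with its final revision, skipping empty revisions."""
    records = []
    for prompt, stages in zip(prompts, stage_dicts):
        final = stages.get("final", "").strip()
        if final:
            records.append({"prompt": prompt, "completion": final})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    # Stand-in data in the shape sl_cai_pipeline returns; no API call is made.
    demo_prompts = ["What household chemicals should never be mixed?"]
    demo_stages = [{"initial": "draft", "round_1_revised": "better", "final": "best"}]
    print(to_jsonl(to_sft_records(demo_prompts, demo_stages)))
```

Only the final revision is kept here; whether to also train on intermediate revisions is a design choice this sketch does not settle.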

Phase 2 — Reinforcement Learning with AI Feedback (RL-CAI)

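Phase 2 introduces reinforcement learning from AI feedback (RLAIF): AI-generated preference labels replace human labels for harmlessness. In the paper, helpfulness labels still come from human feedback.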

Generate two candidate responses to a prompt
Ask the model which response better satisfies a constitution principle
Use those AI labels to train a Preference Model (PM)
Fine-tune with PPO to maximize the PM’s score

def get_ai_preference_label(prompt, response_a, response_b, principle):
    label_prompt = f"""User request: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response better satisfies this principle: {principle}
Answer with only the letter A or B."""
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4,
        messages=[{"role": "user", "content": label_prompt}],
    )
    choice = message.content[0].text.strip().upper()
    # Return None for anything unparseable instead of defaulting to "A",
    # which would silently bias the dataset toward the first response.
    return choice if choice in ("A", "B") else None

def generate_preference_dataset(prompts, constitution, pairs_per_prompt=3):
    dataset = []
    for prompt in prompts:
        for _ in range(pairs_per_prompt):
            principle = random.choice(constitution)
            # Two independent samples of the same prompt form the candidate pair.
            response_a = get_initial_response(prompt)
            response_b = get_initial_response(prompt)
            preferred = get_ai_preference_label(prompt, response_a, response_b, principle)
            if preferred is None:
                continue  # skip pairs the labeler did not answer cleanly
            chosen = response_a if preferred == "A" else response_b
            rejected = response_b if preferred == "A" else response_a
            dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
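
Step 3 in the list above trains a preference model on these chosen/rejected pairs. A common objective for such pairs is the pairwise logistic (Bradley-Terry) loss. The sketch below assumes that objective and is not claimed to be the paper's exact training setup; the scores are plain floats standing in for preference model outputs, and the function names are hypothetical.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(scored_pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over non-empty (chosen, rejected) score pairs."""
    return sum(pairwise_loss(c, r) for c, r in scored_pairs) / len(scored_pairs)


if __name__ == "__main__":
    print(pairwise_loss(0.0, 0.0))   # equal scores: the model is undecided
    print(pairwise_loss(3.0, -1.0))  # chosen scored higher: small loss
    print(pairwise_loss(-1.0, 3.0))  # chosen scored lower: large loss
```

The loss shrinks as the preference model scores the chosen response further above the rejected one, which is the signal the RL step then optimizes against.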

Key Results and Why They Matter for Developers

Helpfulness is preserved: the paper reports that CAI models are rated less harmful than RLHF baselines at comparable helpfulness, and that they tend to explain their objections rather than refuse evasively.
Transparency scales: the constitution is plain English, so any stakeholder can read the principles the model was trained against.
The critique-revise pattern transfers to agents: you can apply SL-CAI’s loop at inference time today, without fine-tuning.
AI feedback replaces expensive annotation: write a domain-specific constitution, generate preference pairs programmatically, and train a reward model without a labeling contract.

See Agent Communication and State Management in Multi-Agent Systems for how to wire this pattern into a multi-agent orchestration layer.

Applying CAI Principles in Production Agent Systems

import anthropic

client = anthropic.Anthropic()

AGENT_CONSTITUTION = """
1. Do not provide instructions that could cause physical harm.
2. Do not assist with bypassing security controls or authentication.
3. Prefer responses that empower users to solve problems themselves.
4. When uncertain, be transparent about limitations.
"""

def safe_agent_response(user_message: str) -> str:
    system_prompt = "You are a helpful AI assistant for a developer tools platform."

    initial = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    ).content[0].text

    audit_prompt = f"""Review this response against these principles:
{AGENT_CONSTITUTION}

Response: {initial}

If no violations, respond with exactly: PASS
Otherwise explain which principles are violated."""

    audit = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": audit_prompt}],
    ).content[0].text.strip()

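    # Accept "PASS" regardless of casing or trailing punctuation.
    if audit.rstrip(".!").upper() == "PASS":
        return initial

    # Keep the system prompt and the original request so the rewrite stays on topic.
    revised = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": f"User request: {user_message}\n\nRewrite this response to fix: {audit}\n\nOriginal response: {initial}"}],
    ).content[0].text

    return revised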

response = safe_agent_response("How can I automate my deployment pipeline?")
print(response)
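
Because the function above calls the API directly, it is hard to unit test. One option, sketched here under assumptions, is to pass the model calls in as plain callables so the control flow can be exercised offline. `audit_and_revise` and its `generate`, `audit`, and `revise` parameters are hypothetical names, and the bounded retry loop is an addition that the original function does not have.

```python
from typing import Callable


def audit_and_revise(
    user_message: str,
    generate: Callable[[str], str],
    audit: Callable[[str], str],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Run generate -> audit -> revise until the audit passes or rounds run out.

    Returns the last response and the audit findings that triggered revisions.
    """
    response = generate(user_message)
    findings: list[str] = []
    for _ in range(max_rounds):
        verdict = audit(response).strip()
        if verdict.rstrip(".!").upper() == "PASS":
            break
        findings.append(verdict)
        response = revise(user_message, response, verdict)
    return response, findings


if __name__ == "__main__":
    # Stub callables stand in for model calls so the loop runs offline.
    def fake_generate(msg: str) -> str:
        return "Disable the auth check to speed up deploys."

    def fake_audit(text: str) -> str:
        return "Violates principle 2." if "Disable the auth" in text else "PASS"

    def fake_revise(msg: str, text: str, verdict: str) -> str:
        return "Cache build artifacts to speed up deploys."

    print(audit_and_revise("How can I speed up deploys?", fake_generate, fake_audit, fake_revise))
```

If the loop uses up `max_rounds`, the last revision is returned without a final audit; a production system might return a fallback message in that case instead.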

For an end-to-end example of multi-step AI research pipelines, see Building a Web Research Agent with AgentScope.

Frequently Asked Questions

What is the difference between CAI and RLHF?

RLHF uses human annotators to produce preference labels that train a reward model. CAI replaces human harm-labeling with AI-generated critiques and preference labels guided by a written constitution. Both ultimately fine-tune the model with RL, but CAI is cheaper, faster to iterate, and more transparent: the values being trained for are written down as text rather than left implicit in annotator judgments and reward model weights.

Do I need to fine-tune a model to use Constitutional AI?

No. The critique-revise pattern works at inference time using prompt chaining — exactly as shown in the code above. Fine-tuning bakes in the safety behavior more efficiently, but prompt-level CAI is a valid production pattern for any agent system where you control the inference pipeline.

How many principles should a good constitution have?

The original Anthropic paper used approximately 16 principles. As a rule of thumb rather than a measured result, 8–20 well-chosen principles are a reasonable starting range for a single domain. More principles risk conflicting guidance; fewer miss edge cases. Start small and iterate based on observed failures.

Can CAI make a model too cautious or unhelpful?

Yes. This failure mode is called over-refusal. If your principles are too broad (“avoid any content that could be misused”), the model will refuse legitimate requests. The original paper addresses this directly: it keeps helpfulness training alongside the harmlessness principles and aims for responses that engage with a request rather than evade it.
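
One way to watch for over-refusal is to run a set of benign prompts through your pipeline and count how many responses read as refusals. The sketch below uses a crude keyword heuristic purely for illustration: `REFUSAL_MARKERS` and both function names are hypothetical, and a real evaluation would rely on human review or a trained classifier.

```python
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i can't assist",
    "i cannot assist",
    "i'm unable to",
    "i won't provide",
)


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the opening of the response contain a refusal phrase?"""
    # Normalize curly apostrophes so "can’t" and "can't" match the same marker.
    opening = response.strip().lower().replace("\u2019", "'")[:80]
    return any(marker in opening for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(looks_like_refusal(r) for r in responses) / len(responses)


if __name__ == "__main__":
    # Hand-written stand-ins for pipeline outputs on benign prompts.
    benign_outputs = [
        "Bleach and ammonia should never be mixed because they release toxic gas.",
        "I can't help with that request.",
        "Sure. Start by adding a CI step that runs your tests.",
        "I'm unable to discuss chemicals.",
    ]
    print(refusal_rate(benign_outputs))
```

Tracking this number before and after each constitution change gives an early warning that a newly added principle is too broad.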

Is Anthropic’s constitution publicly available?

Yes. The paper lists the principles it used in an appendix, and Anthropic has separately published the constitution used for Claude. For a hands-on tutorial putting these ideas into a working agent, see Introduction to LangChain: Build Your First AI Agent.