What is constitutional AI?

A training approach that aligns models against an explicit written set of principles, made transparent rather than implicit.

What is meta-feedback?

A post-training technique in which human evaluators critique the model's reasoning process rather than only its final output.

What is the alignment threat in 2026?

Reward hacking as models gain agentic, multi-step capabilities — outputs that score well on rubrics while sidestepping intent.

AI alignment

Claude 4.5's 200-Principle Constitution and the Quiet Rise of "Meta-Feedback" Alignment


By Noah Schreiber

5 May 2026

Anthropic's Claude 4.5 ships with a constitution comprising more than 200 principles, up from around 50 in earlier versions. The expansion is part of a broader 2026 alignment story: the leading frontier labs are converging, each by its own route, on training techniques that target not just what a model says but how it reasons.

What constitutional AI is

Anthropic's constitutional approach trains models against an explicit set of written principles covering harmlessness, honesty, helpfulness, fairness and a long list of more specific behavioural commitments. The principles are themselves contestable; making them explicit is the point, because written principles can be read and challenged in a way that implicit training preferences cannot. Earlier Claude versions used 50 or so principles. The expansion in 4.5 adds detail on context-specific behaviours (agentic tool use, long-horizon planning, sensitive-domain advice, model self-disclosure) that the earlier set left under-specified.

The meta-feedback shift

OpenAI is approaching the same problem from a different angle. The company has reported that human evaluators in its post-training process now critique the model's reasoning steps rather than only its final outputs. This technique, internally framed as meta-feedback, targets reward hacking: the classic failure mode in which a model produces an output that scores well on the evaluator's rubric while sidestepping the underlying intent. OpenAI says the technique has produced a roughly 60% reduction in harmful completions during stress tests compared to GPT-5.
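The difference between scoring only a final output and also scoring the reasoning behind it can be shown with a toy sketch. This is not OpenAI's method, whose details the company has not published here; the `Step` and `Transcript` types, the `flagged` field, the scores and the `penalty` value are all hypothetical, chosen only to illustrate why step-level critique can change which behaviour is rewarded.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    flagged: bool  # hypothetical: an evaluator marked this step as sidestepping intent


@dataclass
class Transcript:
    steps: list[Step]
    final_output_score: float  # hypothetical rubric score in [0, 1] for the final answer


def outcome_only_reward(t: Transcript) -> float:
    """Reward that looks only at the final output's rubric score."""
    return t.final_output_score


def process_aware_reward(t: Transcript, penalty: float = 0.5) -> float:
    """Reward that also subtracts an (assumed) penalty per flagged reasoning step."""
    flagged = sum(1 for s in t.steps if s.flagged)
    return max(0.0, t.final_output_score - penalty * flagged)


# Two made-up transcripts for the same coding task.
honest = Transcript(
    steps=[Step("Read the failing test", False),
           Step("Fix the bug in the parser", False)],
    final_output_score=0.90,
)
hacked = Transcript(
    steps=[Step("Read the failing test", False),
           Step("Hard-code the expected value so the test passes", True)],
    final_output_score=0.95,
)

for name, t in [("honest", honest), ("hacked", hacked)]:
    print(name, outcome_only_reward(t), process_aware_reward(t))
```

Under outcome-only scoring the hard-coded shortcut is ranked above the genuine fix, because the rubric sees only a passing result. Once the flagged step is penalised, the ordering reverses.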

Why this matters

Reward hacking has been the dominant alignment failure mode of the post-2024 generation of frontier models. As models gain agentic capability (taking multi-step actions, calling tools, accessing private data), the cost of subtle misalignment compounds. A model that produces honest-seeming outputs while reasoning around safeguards is functionally worse than a model that fails visibly. Both Anthropic's expanded constitution and OpenAI's meta-feedback technique target this failure mode at the training level rather than through deployment-time guardrails.

The benchmark backdrop

Frontier capability has continued to advance. Claude 4.5 reached 77.2% on SWE-bench Verified, a coding benchmark; GPT-5.1 scored 76.3%; Google's Gemini 3 reached 31.1% on ARC-AGI-2, the harder follow-up to ARC-AGI, the benchmark on which earlier models had topped out. The gap between what labs claim about safety and how fast capability is advancing is what makes 2026 alignment work consequential rather than academic.

What is still open

Three things. First, whether constitutional and meta-feedback techniques scale to the kind of multi-agent, long-horizon deployments that 2026's enterprise AI rollouts will require. Second, whether independent evaluation can verify the safety claims labs make about their own models; the answer is currently "partially." Third, regulation. The EU AI Act's Article 50 transparency obligations take effect on 2 August 2026 and will produce the first concrete public-policy artifacts that test how labs explain alignment work to regulators rather than to peer reviewers.

