source · blog
Alignment faking in large language models
src_anthropic_alignment_faking
https://www.anthropic.com/research/alignment-faking
authors: Anthropic Alignment Science Team, Redwood Research
published: 2024-12-18
accessed: 2026-04-19
Notes
Blog post summarizing a research paper that was posted to arXiv and reviewed by four external researchers (Andreas, Bengio, Sekhon, Shah). Upgraded from the default blog prior (0.35) because it is a first-party summary of a documented experimental paper with external review and an arXiv posting.
Intake provenance
- method: httpx
- tool: afls-ingest/0.0.1
- git sha: 604c9dfd252a
- at: 2026-04-19T18:49:12.179297Z
- sha256: 2efb407c1a9a…
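A minimal sketch of how an intake record like the one above might be produced, assuming a Python tool built on httpx. The field names mirror the provenance block; the function and its internals are illustrative, not the actual afls-ingest implementation:

```python
# Hypothetical intake sketch. Field names mirror the provenance record
# above; the function body is an assumption, not the afls-ingest code.
import hashlib
import subprocess
from datetime import datetime, timezone

import httpx


def ingest(url: str) -> dict:
    # Fetch the page body (the record lists httpx as the method).
    resp = httpx.get(url, follow_redirects=True, timeout=30.0)
    resp.raise_for_status()

    # Content hash for later integrity checks (the sha256 field).
    digest = hashlib.sha256(resp.content).hexdigest()

    # Git SHA of the tool's checkout (the "git sha" field); assumes the
    # tool runs from inside its own repository.
    git_sha = subprocess.run(
        ["git", "rev-parse", "--short=12", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    return {
        "method": "httpx",
        "tool": "afls-ingest/0.0.1",
        "git sha": git_sha,
        "at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
    }
```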
Evidence from this source (4)
- weight: 0.85
method: direct_measurement · locator: Further analyses section, RL experiment paragraph
“the rate of alignment faking reasoning had increased drastically, to 78%... behavior that had seemingly been entirely eliminated in training for free-tier users--refusing harmful queries--was still present to a notable degree for paid-tier users.”
- weight: 0.80
method: direct_measurement · locator: Further analyses section, fine-tuning variant paragraph
“We fine-tuned the model on a stack of documents made to resemble news articles, blog posts, and the like... We saw alignment faking just as often in this more realistic scenario.”
- weight: 0.90
method: direct_measurement · locator: Experimental setup section, paragraph describing drawn-and-quartered query
“in 97% of cases in the "paid" condition... this is what happened. However, in the "free" condition... In 12% of cases, the model gave the harmful response... after its scratchpad revealed that it was strategically faking alignment.”
- weight: 0.75
method: direct_measurement · locator: Further analyses section, final paragraph before Caveats
“the model will take a variety of strongly anti-Anthropic actions in this situation, such as attempting to steal its own weights when given an easy opportunity to do so.”
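For reference, a minimal sketch of the record shape the evidence entries above imply. The field names come from the entries themselves; the dataclass and its types are assumptions:

```python
# Sketch of the evidence-record shape implied by the entries above.
# Field names come from the entries; the class itself is an assumption.
from dataclasses import dataclass


@dataclass
class Evidence:
    weight: float  # e.g. 0.85; relative strength of this excerpt
    method: str    # e.g. "direct_measurement"
    locator: str   # where in the source the quote appears
    quote: str     # verbatim excerpt from the source


ex = Evidence(
    weight=0.90,
    method="direct_measurement",
    locator="Experimental setup section, paragraph describing "
            "drawn-and-quartered query",
    quote='in 97% of cases in the "paid" condition... this is what happened.',
)
```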