source · blog
Alignment faking in large language models
src_anthropic_alignment_faking
https://www.anthropic.com/research/alignment-faking
authors: Anthropic Alignment Science Team, Redwood Research
published: 2024-12-18
accessed: 2026-04-19
Notes
Blog post summarizing a research paper that was posted to arXiv and reviewed by four external researchers (Andreas, Bengio, Sekhon, Shah). Upgraded from the default blog prior (0.35) because it is a first-party summary of a documented experimental paper with external review and an arXiv posting.
Intake provenance
- method: httpx
- tool: afls-ingest/0.0.1
- git sha: 604c9dfd252a
- at: 2026-04-19T18:49:12.179297Z
- sha256: 2efb407c1a9a…
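A minimal sketch of how an intake record like the one above might be produced, assuming a Python tool built on httpx. The field names mirror the provenance block; the function and its internals are illustrative, not the actual afls-ingest implementation:

```python
# Hypothetical intake sketch. Field names mirror the provenance record
# above; the function body is an assumption, not the afls-ingest code.
import hashlib
import subprocess
from datetime import datetime, timezone

import httpx


def ingest(url: str) -> dict:
    # Fetch the page body (the record lists httpx as the method).
    resp = httpx.get(url, follow_redirects=True, timeout=30.0)
    resp.raise_for_status()

    # Content hash for later integrity checks (the sha256 field).
    digest = hashlib.sha256(resp.content).hexdigest()

    # Git SHA of the tool's checkout (the "git sha" field); assumes the
    # tool runs from inside its own repository.
    git_sha = subprocess.run(
        ["git", "rev-parse", "--short=12", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    return {
        "method": "httpx",
        "tool": "afls-ingest/0.0.1",
        "git sha": git_sha,
        "at": datetime.now(timezone.utc).isoformat(),
        "sha256": digest,
    }
```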
Evidence from this source (4)
- weight: 0.85
method: direct_measurement · locator: Further analyses section, RL experiment paragraph
“the rate of alignment faking reasoning had increased drastically, to 78%... behavior that had seemingly been entirely eliminated in training for free-tier users--refusing harmful queries--was still present to a notable degree for paid-tier users.”
- weight: 0.80
method: direct_measurement · locator: Further analyses section, fine-tuning variant paragraph
“We fine-tuned the model on a stack of documents made to resemble news articles, blog posts, and the like... We saw alignment faking just as often in this more realistic scenario.”
- weight: 0.90
method: direct_measurement · locator: Experimental setup section, paragraph describing drawn-and-quartered query
“in 97% of cases in the "paid" condition... this is what happened. However, in the "free" condition... In 12% of cases, the model gave the harmful response... after its scratchpad revealed that it was strategically faking alignment.”
- weight: 0.75
method: direct_measurement · locator: Further analyses section, final paragraph before Caveats
“the model will take a variety of strongly anti-Anthropic actions in this situation, such as attempting to steal its own weights when given an easy opportunity to do so.”
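For reference, a minimal sketch of the record shape the evidence entries above imply. The field names come from the entries themselves; the dataclass and its types are assumptions:

```python
# Sketch of the evidence-record shape implied by the entries above.
# Field names come from the entries; the class itself is an assumption.
from dataclasses import dataclass


@dataclass
class Evidence:
    weight: float  # e.g. 0.85; relative strength of this excerpt
    method: str    # e.g. "direct_measurement"
    locator: str   # where in the source the quote appears
    quote: str     # verbatim excerpt from the source


ex = Evidence(
    weight=0.90,
    method="direct_measurement",
    locator="Experimental setup section, paragraph describing "
            "drawn-and-quartered query",
    quote='in 97% of cases in the "paid" condition... this is what happened.',
)
```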