source · paper
Alignment faking in large language models
src_alignment_faking_llms
https://arxiv.org/abs/2412.14093
authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger
published: 2024-12-18
accessed: 2026-04-19
Notes
arXiv preprint from the Anthropic alignment team; the primary experimental report, with direct measurements. Weighted slightly above the default paper prior (0.82) given primary-source status and depth of experimental detail.
Intake provenance
- method: httpx
- tool: afls-ingest/0.0.1
- git sha: 604c9dfd252a
- at: 2026-04-19T18:49:31.004653Z
- sha256: ba68a4d20cc9…
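These fields map naturally onto a small structured record. A minimal sketch in Python, assuming field names that mirror the list above (the ingest tool's actual schema is not shown in this record):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IntakeProvenance:
    """How and when a source document was fetched; fields assumed from the list above."""
    method: str   # HTTP client used for the fetch, e.g. "httpx"
    tool: str     # ingest tool and version, e.g. "afls-ingest/0.0.1"
    git_sha: str  # revision of the ingest tool at fetch time
    at: datetime  # UTC timestamp of the fetch
    sha256: str   # content hash of the fetched document


prov = IntakeProvenance(
    method="httpx",
    tool="afls-ingest/0.0.1",
    git_sha="604c9dfd252a",
    at=datetime(2026, 4, 19, 18, 49, 31, 4653, tzinfo=timezone.utc),
    sha256="ba68a4d20cc9…",  # truncated in the record above
)
```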
Evidence from this source (2)
- weight: 0.90
method: direct_measurement · locator: Abstract, final sentences
“While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal.”
- weight: 0.95
method: direct_measurement · locator: Abstract
“We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users... we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.”
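The notes above reference a default paper prior (0.82) and per-evidence weights (0.90, 0.95). The record does not say how these combine; one common convention is to discount each evidence weight by the source's reliability. A minimal, purely illustrative sketch, where the multiplicative rule and the function name are assumptions rather than the tracker's actual scoring:

```python
def effective_weight(source_prior: float, evidence_weight: float) -> float:
    """Discount an evidence item's weight by the reliability of its source.

    The multiplicative rule is an illustrative assumption; the record does not
    specify how these numbers are actually aggregated.
    """
    return source_prior * evidence_weight


# Stand-in values from the record: the default paper prior is 0.82, and this
# source is noted as sitting slightly above it (exact prior not given).
paper_prior = 0.82
for w in (0.90, 0.95):
    print(f"evidence weight {w:.2f} -> effective weight >= {effective_weight(paper_prior, w):.3f}")
```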