source · paper
Alignment faking in large language models
src_alignment_faking_llms
https://arxiv.org/abs/2412.14093
authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger
published: 2024-12-18
accessed: 2026-04-19
Notes
arXiv preprint from the Anthropic alignment team; the primary experimental report, with direct measurements. Weighted slightly above the default paper prior (0.82) given primary-source status and depth of experimental detail.
Intake provenance
- method: httpx
- tool: afls-ingest/0.0.1
- git sha: 604c9dfd252a
- at: 2026-04-19T18:49:31.004653Z
- sha256: ba68a4d20cc9…
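These fields map naturally onto a small structured record. A minimal sketch in Python, assuming field names that mirror the list above (the ingest tool's actual schema is not shown in this record):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IntakeProvenance:
    """How and when a source document was fetched; fields assumed from the list above."""
    method: str   # HTTP client used for the fetch, e.g. "httpx"
    tool: str     # ingest tool and version, e.g. "afls-ingest/0.0.1"
    git_sha: str  # revision of the ingest tool at fetch time
    at: datetime  # UTC timestamp of the fetch
    sha256: str   # content hash of the fetched document


prov = IntakeProvenance(
    method="httpx",
    tool="afls-ingest/0.0.1",
    git_sha="604c9dfd252a",
    at=datetime(2026, 4, 19, 18, 49, 31, 4653, tzinfo=timezone.utc),
    sha256="ba68a4d20cc9…",  # truncated in the record above
)
```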
Evidence from this source (2)
- weight: 0.90
method: direct_measurement · locator: Abstract, final sentences
“While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal.”
- weight: 0.95
method: direct_measurement · locator: Abstract
“We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users... we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.”
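The notes above reference a default paper prior (0.82) and per-evidence weights (0.90, 0.95). The record does not say how these combine; one common convention is to discount each evidence weight by the source's reliability. A minimal, purely illustrative sketch, where the multiplicative rule and the function name are assumptions rather than the tracker's actual scoring:

```python
def effective_weight(source_prior: float, evidence_weight: float) -> float:
    """Discount an evidence item's weight by the reliability of its source.

    The multiplicative rule is an illustrative assumption; the record does not
    specify how these numbers are actually aggregated.
    """
    return source_prior * evidence_weight


# Stand-in values from the record: the default paper prior is 0.82, and this
# source is noted as sitting slightly above it (exact prior not given).
paper_prior = 0.82
for w in (0.90, 0.95):
    print(f"evidence weight {w:.2f} -> effective weight >= {effective_weight(paper_prior, w):.3f}")
```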