source · paper

Alignment faking in large language models

src_alignment_faking_llms

https://arxiv.org/abs/2412.14093

reliability: 0.85

authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger

published: 2024-12-18

accessed: 2026-04-19

Notes

arXiv preprint from the Anthropic alignment team; primary experimental report with direct measurements. Reliability set slightly above the paper prior (0.82) given primary-source status and depth of experimental detail.

Intake provenance

method: httpx
tool: afls-ingest/0.0.1
git sha: 604c9dfd252a
at: 2026-04-19T18:49:31.004653Z
sha256: ba68a4d20cc9…
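
The provenance fields above record a simple fetch-and-hash intake: the page is retrieved over HTTP with httpx, stamped with the ingest tool's git revision and a UTC timestamp, and the fetched bytes are hashed with SHA-256 so later re-fetches can be compared. A minimal sketch of such a step, assuming a Python ingest script; the function name, record layout, and the short-sha git call are illustrative guesses, not the actual afls-ingest implementation:

import hashlib
import subprocess
from datetime import datetime, timezone

import httpx


def intake(url: str) -> dict:
    # Hypothetical sketch: field names mirror the record above, but the
    # real afls-ingest interface is not documented on this page.
    response = httpx.get(url, follow_redirects=True)
    response.raise_for_status()

    return {
        "method": "httpx",
        "tool": "afls-ingest/0.0.1",
        # Short git sha of the ingest tool's checkout at fetch time
        # (12 hex chars, matching the length shown in the record).
        "git sha": subprocess.run(
            ["git", "rev-parse", "--short=12", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip(),
        # UTC timestamp of the fetch.
        "at": datetime.now(timezone.utc).isoformat(),
        # Content hash of the raw response bytes, for change detection.
        "sha256": hashlib.sha256(response.content).hexdigest(),
    }


record = intake("https://arxiv.org/abs/2412.14093")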

Evidence from this source (2)