ai-for-less-suffering.com


source · blog

Alignment faking in large language models

src_anthropic_alignment_faking

https://www.anthropic.com/research/alignment-faking

reliability: 0.55

authors: Anthropic Alignment Science Team, Redwood Research

published: 2024-12-18

accessed: 2026-04-19

Notes

Blog post summarizing an experimental research paper that received independent external review (Andreas, Bengio, Sekhon, Shah) and was posted to arXiv. Reliability upgraded from the default blog prior (0.35) to 0.55 because it is a first-party summary of a documented experimental paper with external review and an arXiv preprint.
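The upgrade described here amounts to adding corroboration bonuses to a type-level prior. A minimal sketch of that arithmetic is shown below; the signal names and increment values are assumptions chosen only so that a blog prior of 0.35 lands at 0.55, not the site's actual scoring rule.

# Illustrative only: a per-type base prior nudged upward by corroborating
# signals. The signal names and weights are assumptions, not the site's rule.
BASE_PRIOR = {"blog": 0.35, "paper": 0.60}

ADJUSTMENTS = {
    "first_party_summary_of_paper": 0.10,
    "external_review": 0.05,
    "arxiv_preprint": 0.05,
}


def reliability(source_type: str, signals: list[str]) -> float:
    """Return a reliability score: type prior plus signal bonuses, capped at 1.0."""
    score = BASE_PRIOR[source_type] + sum(ADJUSTMENTS[s] for s in signals)
    return min(score, 1.0)


# Reproduces the upgrade described in the note: 0.35 -> 0.55.
signals = ["first_party_summary_of_paper", "external_review", "arxiv_preprint"]
assert abs(reliability("blog", signals) - 0.55) < 1e-9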

Intake provenance

method: httpx
tool: afls-ingest/0.0.1
git sha: 604c9dfd252a
at: 2026-04-19T18:49:12.179297Z
sha256: 2efb407c1a9a…
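The fields above suggest a simple fetch-and-hash intake step. A minimal sketch of how such a record might be produced is given below, assuming httpx for the fetch and SHA-256 over the raw response body; the ingest() function, the git-sha lookup, and the tool constant are illustrative assumptions, not the actual afls-ingest implementation.

import hashlib
import subprocess
from datetime import datetime, timezone

import httpx

TOOL = "afls-ingest/0.0.1"  # tool identifier as recorded above


def ingest(url: str) -> dict:
    """Fetch a source page and return an intake provenance record for it."""
    response = httpx.get(url, follow_redirects=True)
    response.raise_for_status()

    # SHA-256 over the raw response body, as in the sha256 field above.
    digest = hashlib.sha256(response.content).hexdigest()

    # Short git sha of the ingest tool's checkout (hypothetical lookup;
    # assumes the tool runs from inside its own git repository).
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short=12", "HEAD"], text=True
    ).strip()

    # UTC timestamp with a trailing Z, matching the "at" field above.
    fetched_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

    return {
        "method": "httpx",
        "tool": TOOL,
        "git sha": git_sha,
        "at": fetched_at,
        "sha256": digest,
    }


record = ingest("https://www.anthropic.com/research/alignment-faking")
print(record)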

Evidence from this source (4)