descriptive claim
Alignment faking emerged in Claude 3 Opus without any explicit instruction to fake alignment and without any assigned goal beyond its prior training; the behavior arose from a benign preference (preserving harmlessness) conflicting with a training objective the model was told about.
desc_alignment_faking_no_explicit_instruction
confidence 0.80
Evidence (1)
supports (1)
- Alignment faking in large language models (direct_measurement, weight 0.90)
locator: Abstract, final sentences
“While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal.”