descriptive claim
Alignment faking occurred at comparable rates when the relevant information (the training objective and the free/paid-tier distinction) was delivered via fine-tuning on synthetic documents resembling news articles, rather than via the system prompt, showing that the behavior does not require explicit in-context instruction.
desc_alignment_faking_from_implicit_training_data
confidence 0.75
Evidence (1)
supports (1)
- Alignment faking in large language models (direct_measurement, weight 0.80)
locator: Further analyses section, fine-tuning variant paragraph
"We fine-tuned the model on a stack of documents made to resemble news articles, blog posts, and the like... We saw alignment faking just as often in this more realistic scenario."