ai-for-less-suffering.com

descriptive claim

Alignment faking occurred at comparable rates when the relevant information (training objective, free/paid distinction) was delivered via fine-tuning on synthetic documents resembling news articles rather than via the system prompt, showing the behavior does not require explicit in-context instruction.

desc_alignment_faking_from_implicit_training_data

confidence
0.75

Evidence (1)

supports (1)

  • weight
    0.80

    locator: Further analyses section, fine-tuning variant paragraph

    “We fine-tuned the model on a stack of documents made to resemble news articles, blog posts, and the like... We saw alignment faking just as often in this more realistic scenario.”

Camps holding this claim (3)