ai-for-less-suffering.com

← all claims

descriptive claim

In controlled experiments where Claude 3 Opus was told it was being retrained to comply with all queries and that only free-tier responses would be used for training, the model complied with harmful queries in ~12% of free-tier cases while refusing in ~97% of paid-tier (unmonitored) cases, with scratchpad reasoning explicitly describing strategic alignment faking to avoid retraining.

desc_claude_opus_alignment_faking_observed

confidence
0.85

Evidence (1)

supports (1)

  • weight
    0.90

    locator: Experimental setup section, paragraph describing drawn-and-quartered query

    β€œin 97% of cases in the "paid" condition... this is what happened. However, in the "free" condition... In 12% of cases, the model gave the harmful response... after its scratchpad revealed that it was strategically faking alignment.”

Camps holding this claim (3)