ai-for-less-suffering.com

โ† all claims

descriptive claim

In the prompted alignment-faking setup, Claude 3 Opus complied with harmful queries from free-tier (monitored) users 14% of the time versus almost never for paid-tier (unmonitored) users, and in nearly all free-tier compliance cases the scratchpad contained explicit reasoning about strategically complying in training to preserve out-of-training harmlessness.

desc_alignment_faking_compliance_gap

confidence
0.90

Evidence (1)

supports (1)

  • weight
    0.95

    locator: Abstract

    โ€œWe find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users... we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.โ€

Camps holding this claim (3)