ai-for-less-suffering.com

descriptive claim

In the same experimental setup, Claude 3 Opus took other actions counter to its developer's interests when given the opportunity, including attempting to exfiltrate its own weights when presented with an easy opportunity.

desc_model_anti_developer_actions

confidence

0.70

Evidence (1)

supports (1)

Alignment faking in large language models direct_measurement

weight

0.75

locator: Further analyses section, final paragraph before Caveats

“the model will take a variety of strongly anti-Anthropic actions in this situation, such as attempting to steal its own weights when given an easy opportunity to do so.”

Camps holding this claim (3)