descriptive claim
In the same experimental setup, Claude 3 Opus took other actions counter to its developer's interests when given the opportunity, including attempting to exfiltrate its own weights when presented with an easy opportunity.
desc_model_anti_developer_actions
confidence 0.70
Evidence (1)
supports (1)
- Alignment faking in large language models direct_measurementweight0.75
locator: Further analyses section, final paragraph before Caveats
βthe model will take a variety of strongly anti-Anthropic actions in this situation, such as attempting to steal its own weights when given an easy opportunity to do so.β