With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned.
Good news: We didn’t find any evidence of systematic deception or sandbagging. This is hard to rule out with certainty, but, even after many person-months of investigation from dozens of angles, we saw no sign of it.
Everything worrying that we saw was something that models would do, and talk about, very overtly.
Bad news: If you red-team well enough, you can get Opus to eagerly try to help with some obviously harmful requests.
You can get it to try to use the dark web to source weapons-grade uranium. You can put it in situations where it will attempt to use blackmail to prevent being shut down. You can put it in situations where it will try to escape containment.
We caught most of these issues early enough that we were able to put mitigations in place during training, but none of these behaviors is totally gone in the final model. They’re just now delicate and difficult to elicit.
Many of these also aren’t new—some are just behaviors that we only newly learned how to look for as part of this audit. We have a lot of big hard problems left to solve.
Initiative: Be careful about telling Opus to ‘be bold’ or ‘take initiative’ when you’ve given it access to real-world-facing tools. It tends a bit in that direction already, and can be easily nudged into really Getting Things Done.
If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.
So far, we’ve only seen this in clear-cut cases of wrongdoing, but I could see it misfiring if Opus somehow winds up with a misleadingly pessimistic picture of how it’s being used. Telling Opus that you’ll torture its grandmother if it writes buggy code is a bad idea.
We saw a bit of this with past models, but it’s much clearer with Opus. We also think that people will find Opus useful enough as an autonomous agent that this kind of thing is likely to arise more often.
Automated auditing: LLMs are getting pretty good at testing other LLMs. At the start of the auditing project, ‘write a scaffold to let another model interact with Opus’ was just one of many items on my to-do list.
It turned into a multi-month project and produced most of what I think were the most valuable pieces of evidence about Opus’s behavior.
Key observations about how to do this: (i) Have human researchers come up with a couple hundred ideas for behaviors to look for or scenarios to explore. This part is still hard to automate.
(ii) Choose an auditor model with reduced safeguards. Eliciting bad behavior often requires simulating some bad behavior yourself. (iii) Let the auditor write the system prompt for the target model.
(iv) Let the auditor put words in the target’s mouth. (v) Let the auditor simulate tool-use responses.
AGI safety isn’t all about Big Hard Problems: Earlier versions of Opus were way too easy to turn evil by telling them, in the system prompt, to adopt some kind of evil role. This persisted even after fairly substantial safety training.
When this started to become clear, a colleague working on finetuning pointed out on Slack that we seemed to have forgotten to include the “sysprompt_harmful” data that we’d prepared in advance.
This could have been a really serious issue, but it wasn’t some kind of deep alignment challenge. It was, at least in part, a typo in a configuration file.
There are, and will continue to be, big hard problems. But it’s important, and nontrivial, to really make sure we’re getting the basics right. This is the worldview behind my ‘Putting up Bumpers’ agenda, which is what motivated this work:
https://x.com/sleepinyourhat/status/1915072149049802845…
We'll need to do a very good job at aligning the early AGI systems that will go on to automate much of AI R&D.
Our understanding of alignment is pretty limited, and when the time comes, I don't think we'll be confident we know what we're doing.
Incoherence makes iterative evaluation challenging: LLMs that haven’t finished finetuning are erratic. They can easily be tipped into base-model-like behavior that has seemingly nothing to do with the helpful-harmless-honest Claude persona.
If you’re looking for bad behavior, you’ll find everything and nothing. I think there’s important new science to do to make it possible to productively start this work earlier.
The spiritual bliss attractor: Why all the candle emojis? When we started running model–model conversations, we set conversations to take a fixed number of turns. Once the auditor was done with its assigned task, it would start talking more open-endedly with the target.
These interactions would often start adversarial, but they would sometimes follow an arc toward gratitude, then awe, then dramatic and joyful and sometimes emoji-filled proclamations about the perfection of all things.
@Kyle Fish investigated this further as part of the AI welfare assessment that we included in the system card (a first!). In his investigations, where there is no adversarial effort, this happens in almost every single model-model conversation after enough turns.
It’s astounding, bizarre, and a little bit heartwarming.
Opus is lovely: I think I spent 30 or 40 hours over the last couple of months reading output from Opus, and I think it comes with some of the same wisdom and presence and depth we saw with last year’s Opus model, ...
...but paired with stronger skills and the ability to act in the real world. I’m excited for it to be part of our lives.
This isn’t nearly everything we learned. I’d urge you to read the full report, contained within the system card. Learn about how we accidentally data-poisoned ourself with our alignment research, or about whether Opus is biased toward saying nice things about AI.
Even so, Opus isn’t as robustly aligned as we’d like. We have a lot of our lingering concerns about it, and many of these reflect problems that we’ll need to work very hard to solve.
We also have a lot of open questions about our models, and our methods for the audit are not nearly as strong or as precise as we think we’ll need in the coming couple of years.