Our report focuses on claims that are (1) solidly defensible and (2) generally agreed within METR. Here I’ll give some personal opinions on how we should feel about the state of AI risk, and the IMO most important limitations of the report.

METR
@METR_Evals

May 19

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.

The report focuses on a narrow set of risks from current systems, and on analysis rather than calls to action. It doesn’t really comment on “how concerned should we be”, “does something need to be done”, or “are we on track to handle AI safely?”.

Sometimes people outside the field say things like “The AI situation can’t be that bad, there must be experts who are on top of it”. As “an expert”, I would like to be clear that we are *not* on top of it. Some key aspects of the situation IMO:

(1) We are likely on track to develop AI systems capable of causing human extinction/permanent disempowerment, quite possibly within the next few years

(2) Things are chaotic and rushed; we aren’t on top of the basics (models regularly violate user intent, labs train on things they meant to avoid, security probably isn’t good enough to prevent adversaries stealing dangerous models) let alone thorny questions of how to control/align superhuman AI

(3) METR (and other independent orgs, as well as safety/security teams at labs) feel woefully under-resourced compared to the scale and pace of AI development - we’re struggling to build benchmarks fast enough, keep ahead of latest capability developments, read and respond to all the safety-related claims that AI developers are making, run all the evaluations and assessments that companies + governments are asking us to, plus develop the science needed to assess risks from increasingly capable AIs.

(4) IMO, any “reasonable” civilization would clearly be taking things much more slowly and carefully with AI. The benefits of getting upsides of advanced AI a little faster are small compared to the risks of getting it irrecoverably wrong, and we could lower these risks by going slower

Limitations of report: This report isn’t robust oversight of frontier AI developers by itself. METR has some levers to incentivise companies’ participation, including some relevant legislation, but ultimately participants could have pulled out at any time if the result would be contrary to their interests. You can view it partly as a pilot exercise of what regulation (or formalized industry standards) could/should require, or what partners/suppliers/customers/employees should demand from frontier developers. Quoting from the report: “METR’s work relies on developing and maintaining strong working relationships with companies, and this impacted both how we designed the process for this pilot (e.g. offering the silent exit option) and lower-level judgment calls as the process unfolded (e.g. having a relatively high bar for what redactions we pushed back on). In some cases we refrained from making an unflattering claim because the claim was neither solidly defensible nor particularly relevant to our core assessment. We also made efforts not to invite salient comparisons between companies on capabilities or safety.” It doesn’t feel to me like this distorted our overall conclusions too much in this case. But that was partly because the conclusions weren’t that spicy. If our conclusions reflected very negatively on AI developers or would directly lead to e.g. govt intervention or public outcry, we’d be in a difficult position. We’d be trying to balance keeping the companies happy enough that they didn’t pull out of the program (using the “no-fault exit” mechanism) vs being transparent about our conclusions. We clearly need more robust mechanisms than this for providing accountability for AI developers.

Also: many things are out of scope! Firstly, we only consider “AI takeover” / “loss of control” risks: we don’t consider risks from human misuse (e.g. AI helping a terrorist make bioweapons), or other harms where the AI is not “deliberately” seeking power (e.g. impacts on mental health or diffuse societal impacts). Within “loss of control” risks, we don’t consider “sabotage” threat models (agents subverting AI development and making it easier for future AIs to evade human control). We’re just focusing on the “base case” of whether *current* agents could escape human control.

Elizabeth Barnes

@BethMayBarnes

Follow on 𝕏

twitter-thread.com/t/2057865013638107642

Elizabeth Barnes@BethMayBarnes

Our report focuses on claims that are (1) solidly defensible and (2) generally agreed within METR. Here I’ll give some personal opinions on how we should feel about the state of AI risk, and the IMO most important limitations of the report.

METR@METR_Evals

Elizabeth Barnes

Elizabeth Barnes
@BethMayBarnes

METR
@METR_Evals