Anthropic says in their system card that all their AI R&D evals are close to saturation, and reports a median self-reported uplift of 2X (mean over 3X!) for power users. They provide very little evidence ruling out imminent dramatic AI R&D acceleration. x.com/eli_lifland/st

Eli Lifland
@eli_lifland

Nov 24 25

On Anthropic's AI R&D suite 1, Opus 4.5 is similar to or better than human baselines on the majority of tasks. Summing up the results:
(a) Roughly equivalent to human given 8 hours
(b) A bit better than human given 4 hours
(c) Close to human given 8 hours
(d) Better than human given 4-8 hours
(e) Somewhat better than human given 4 hours
(f) Worse than human given 40 hours
So based on these very few data points maybe an 8-24 hour 50% time horizon on this task distribution?

I personally suspect that their self-report uplift numbers are inflated and that agent time horizons are still limited. But if taken at face value, then even the most aggressive scenarios (e.g. AI 2027 or blog.redwoodresearch.org/p/whats-up-wit) would have underestimated progress.

I operationalize Anthropic's prediction of "powerful AI" and explain why I'm skeptical

blog.redwoodresearch.org/p/whats-up-wit…

What's up with Anthropic predicting AGI by early 2027?

Such dramatic and surprising progress (if it held up to scrutiny) would make me think that a dramatic 'software intelligence explosion' was much more likely in the next few years, and I worry we'd be far from ready for such an outcome. forethought.org/research/how-q

Forethought paper modeling software intelligence explosion: 60% chance of >3 years progress in <1 year.

forethought.org/research/how-q…

How quick and big would a software intelligence explosion be?

Mostly I think this indicates that we urgently need better evals and measurement. I suspect that a proper uplift RCT, and more challenging AI R&D evaluations, would reveal Opus 4.5 is on a much more modest trend and still far away from the AI R&D 4 threshold (fully automating remote junior employees). But we need to build and run those assessments to know for sure, and the stakes are very high. x.com/CFGeek/status/

Anthropic has decided to activate their mitigations for AI R&D 4 early, and so claims this assessment is no longer load-bearing. I'm glad they are getting started on their risk reports, but I hope this doesn't mean deprioritizing efforts to bridge this critical measurement gap.

Hoping to dig much more into this in the near future (e.g. get an Opus 4.5 time horizon x.com/joel_bkr/statu). If you want to be on the ground floor of reconciling the contradictory evidence on this crucial question, consider applying to work with us! metr.org/careers

Careers at METR

Joel Becker
@joel_bkr

Nov 26 25

people seem to appreciate this (very incomplete) list of reasons why METR time horizon evals can take so long after model releases. high-quality science is hard! x.com/joel_bkr/statu

Hjalmar Wijk

@HjalmarWijk

Member of Technical Staff @ METR Trying to understand the transformative impacts of AI early enough to give the world a chance to react + shape what's to come

twitter-thread.com/t/1993752035536331113
