Hjalmar Wijk
@HjalmarWijk

Nov 26, 2025
6 tweets

Anthropic says in their system card that *all* their AI R&D evals are close to saturation, and reports a median self-reported uplift of 2X (mean over 3X!) for power users. They provide very little evidence ruling out imminent dramatic AI R&D acceleration. x.com/eli_lifland/st

Eli Lifland
@eli_lifland

On Anthropic's AI R&D suite 1, Opus 4.5 is similar to or better than human baselines on the majority of tasks. Summing up the results:
(a) Roughly equivalent to human given 8 hours
(b) A bit better than human given 4 hours
(c) Close to human given 8 hours
(d) Better than human given 4-8 hours
(e) Somewhat better than human given 4 hours
(f) Worse than human given 40 hours
So based on these very few data points maybe an 8-24 hour 50% time horizon on this task distribution?
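One rough way to see where an "8-24 hour" guess could come from is to bracket the horizon between the longest human time budget the model roughly matches and the shortest one it falls short of. The sketch below is only an illustration of that arithmetic, not the method used in the quoted tweet: the encoding of "roughly equivalent" and "close to" as a match, the use of the upper end of "4-8 hours" for task (d), and the `bracket_horizon` helper are all assumptions made here.

```python
import math

# (task label, human time budget in hours, model at least roughly matches the human?)
# Encoded from the (a)-(f) summary. Treating "roughly equivalent" and "close to"
# as a match is an assumption of this sketch.
observations = [
    ("a", 8, True),
    ("b", 4, True),
    ("c", 8, True),
    ("d", 8, True),    # "4-8 hours": upper end used (assumption)
    ("e", 4, True),
    ("f", 40, False),
]

def bracket_horizon(obs):
    """Return (lower, upper, geometric midpoint) of a crude horizon bracket.

    lower = longest human budget the model matched,
    upper = shortest human budget the model fell short of.
    Hypothetical helper; it needs at least one match and one miss.
    """
    matched = [hours for _, hours, ok in obs if ok]
    missed = [hours for _, hours, ok in obs if not ok]
    lower = max(matched)
    upper = min(missed)
    return lower, upper, math.sqrt(lower * upper)

lower, upper, midpoint = bracket_horizon(observations)
print(f"horizon bracket: {lower}-{upper} h, geometric midpoint ~{midpoint:.1f} h")
```

With only six tasks and a single miss, the bracket is wide (8 to 40 hours), which is consistent with the tweet's caveat about "very few data points".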
I personally suspect that their self-report uplift numbers are inflated and that agent time horizons are still limited. But if taken at face value, then even the most aggressive scenarios (e.g. AI 2027 or blog.redwoodresearch.org/p/whats-up-wit) would have underestimated progress.
Such dramatic and surprising progress (if it held up to scrutiny) would make me think that a dramatic 'software intelligence explosion' was much more likely in the next few years, and I worry we'd be far from ready for such an outcome. forethought.org/research/how-q
Mostly I think this indicates that we urgently need better evals and measurement. I suspect that a proper uplift RCT, and more challenging AI R&D evaluations, would reveal Opus 4.5 is on a much more modest trend and still far away from the AI R&D 4 threshold (fully automating remote junior employees). But we need to build and run those assessments to know for sure, and the stakes are very high. x.com/CFGeek/status/
Anthropic has decided to activate their mitigations for AI R&D 4 early, and so claims this assessment is no longer load-bearing. I'm glad they are getting started on their risk reports, but I hope this doesn't mean deprioritizing efforts to bridge this critical measurement gap.
Hoping to dig much more into this in the near future (e.g. get an Opus 4.5 time horizon x.com/joel_bkr/statu). If you want to be on the ground floor of reconciling the contradictory evidence on this crucial question, consider applying to work with us! metr.org/careers
Joel Becker
@joel_bkr

people seem to appreciate this (very incomplete) list of reasons why METR time horizon evals can take so long after model releases. high-quality science is hard! x.com/joel_bkr/statu
Hjalmar Wijk

@HjalmarWijk
Member of Technical Staff @ METR. Trying to understand the transformative impacts of AI early enough to give the world a chance to react + shape what's to come.