Elizabeth Barnes

@BethMayBarnes

Mar 20, 2025

Persnickety title would be: "there's an exponential trend with doubling time between ~2 and 12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions". More detail on how we thought about external validity in the paper and in this thread

Megan Kinniment

@MKinniment

Happy for this to be released! It’s the result of a lot of hard work from many of us at METR :) A big question is whether these results apply to ‘real’ tasks. Here’s some thoughts on that: