The persnickety title would be: "there's an exponential trend with a doubling time between ~2 and 12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions". More detail on how we thought about external validity is in the paper and this thread.
Happy for this to be released!
It’s the result of a lot of hard work from many of us at METR :)
A big question is whether these results apply to ‘real’ tasks.
Here are some thoughts on that: