Persnickety title would be: "there's an exponential trend with doubling time between ~2-12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions". More detail on how we thought about external validity in the paper and this thread

Megan Kinniment
@MKinniment

Mar 19 25

Happy for this to be released! It’s the result of a lot of hard work from many of us at METR :) A big question is whether these results apply to ‘real’ tasks. Here’s some thoughts on that:

Elizabeth Barnes
@BethMayBarnes

twitter-thread.com/t/1902759691727540329
