I agree it’s better than chess. I like the paper FWIW, but I don’t think we should conclude that models can just generically do hour-long tasks and that this doubles every 7 months (which is how I see people interpreting this paper).

Agree the result is not ‘AIs can do most tasks that humans can do in 1 hour’. I do think it should be an update towards capabilities following an exponential trend, at least on benchmarks. (The trend itself seems much more robust to e.g. changes in the task distribution than the exact time horizon is.)

Megan Kinniment

@MKinniment

I like agents, human or otherwise. @METR_Evals


twitter-thread.com/t/1902572540352197020

Tamay Besiroglu @tamaybes
