I agree it’s better than chess. I like the paper FWIW, but I don’t think we should conclude that models can generically do hour-long tasks and that this horizon doubles every 7 months (which is how I see people interpreting the paper).
Agree the result is not ‘AIs can do most tasks that humans can do in 1 hour’.
I do think it should be an update towards capabilities following an exponential trend, at least on benchmarks. (The trend itself seems much more robust to e.g. changes in the task distribution than the exact time horizon is.)
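
To make the "trend vs. exact horizon" distinction concrete, here is a rough sketch of what a fixed doubling time implies. The function names and the starting horizons (60 and 15 minutes) are purely illustrative assumptions, not numbers from the paper; only the 7-month doubling time is taken from the discussion above.

```python
import math

def horizon_minutes(h0_minutes: float, months: float, doubling_months: float = 7.0) -> float:
    """Extrapolated time horizon after `months`, assuming a clean exponential trend."""
    return h0_minutes * 2 ** (months / doubling_months)

def months_to_reach(h0_minutes: float, target_minutes: float, doubling_months: float = 7.0) -> float:
    """Months until the horizon reaches `target_minutes` under the same assumption."""
    return doubling_months * math.log2(target_minutes / h0_minutes)

# Two hypothetical starting horizons, e.g. from two different task distributions.
for h0 in (60.0, 15.0):
    after_two_doublings = horizon_minutes(h0, 14)
    growth_factor = after_two_doublings / h0  # 4x in both cases
    print(h0, after_two_doublings, growth_factor)
```

The growth factor over a given period is the same whatever the starting horizon is, which is the sense in which the doubling time can be robust even when the absolute horizon is not. The absolute numbers (and anything like `months_to_reach`) still depend entirely on the assumed starting point.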