00:02 : METR (Model Evaluation & Threat Research) introduction
00:50 : Question-answer and multiple-choice datasets
01:35 : Claude Plays Pokémon
02:00 : Paper: "Measuring AI Ability to Complete Long Tasks"
03:05 : The measure: "How long a task can a model complete?"