
@cyysky
Created October 24, 2024 20:11
1
00:00:00,000 --> 00:00:10,480
So welcome all. Welcome to this distinguished lecture series in AI. I'm Vishal Mishra. I'm the
2
00:00:10,480 --> 00:00:15,040
Vice Dean for Computing and AI in Columbia Engineering. This is the second lecture in our
3
00:00:15,040 --> 00:00:20,420
series. We seem to have a reasonably full house. People are still streaming in. So before we start,
4
00:00:20,480 --> 00:00:26,020
I'd like to invite Dean Shih-Fu Chang to give some opening remarks. All right. Good morning, everyone.
5
00:00:26,020 --> 00:00:28,020
Welcome to our
6
00:00:28,020 --> 00:00:34,580
It's really exciting.
7
00:00:35,060 --> 00:00:38,560
This is the first time I see we have an overflow space used today.
8
00:00:38,720 --> 00:00:40,980
Really so exciting about the topic and speaker.
9
00:00:41,540 --> 00:00:46,960
I want to thank Michelle and the team for organizing the AI lecture series this semester and throughout the year.
10
00:00:47,320 --> 00:00:51,560
I want to thank our president Katrina Armstrong for coming to support our event today.
11
00:00:51,560 --> 00:00:59,480
And as Vishal mentioned, this is the second in our AI lecture series across the school
12
00:00:59,480 --> 00:01:03,400
and is associated with the university initiative in AI.
13
00:01:03,580 --> 00:01:07,220
That's one of the priorities that President Armstrong is leading us
14
00:01:07,220 --> 00:01:10,060
for the school university-wide effort here.
15
00:01:10,720 --> 00:01:13,640
Last month, we launched this new AI lecture series,
16
00:01:14,160 --> 00:01:16,620
starting with our faculty member, Pierre Gentine,
17
00:01:16,740 --> 00:01:20,480
to talk about how AI can have an impact in different disciplines.
18
00:01:20,480 --> 00:01:23,860
And so last month we launched AI and climate projection.
19
00:01:23,860 --> 00:01:25,320
And today we're so excited,
20
00:01:25,320 --> 00:01:28,220
Dr. Yann LeCun is here to share his vision,
21
00:01:28,220 --> 00:01:30,440
his insight on a very exciting topic.
22
00:01:30,440 --> 00:01:31,960
You have seen his title.
23
00:01:31,960 --> 00:01:36,960
I have seen Yann talking many times at CVPR, ICML,
24
00:01:36,960 --> 00:01:38,220
learning representations,
25
00:01:38,220 --> 00:01:41,780
but today's topic is particularly intriguing.
26
00:01:41,780 --> 00:01:45,120
And his presence, as you can see from audience today,
27
00:01:45,120 --> 00:01:47,420
we have to open up overflow space.
28
00:01:47,420 --> 00:01:49,040
The event, once it was announced,
29
00:01:49,040 --> 00:01:50,460
Three minutes, sold out.
30
00:01:50,460 --> 00:01:52,820
You are the lucky ones, okay.
31
00:01:52,820 --> 00:01:55,700
And the lecture series, one of the efforts
32
00:01:55,700 --> 00:01:59,420
around AI and university, we are pursuing advances
33
00:01:59,420 --> 00:02:02,120
in the fundamental area, which is covered
34
00:02:02,120 --> 00:02:03,560
by today's lecture.
35
00:02:03,560 --> 00:02:06,540
We're also pursuing the impact in different disciplines
36
00:02:06,540 --> 00:02:10,940
in collaboration among all the 17 schools at Columbia.
37
00:02:10,940 --> 00:02:14,380
Climate, business, finance, policy, journalism, you name it.
38
00:02:14,380 --> 00:02:16,600
So we work with industry, community,
39
00:02:16,600 --> 00:02:22,340
create centers on AI and finance, AI on climate, AI on sports, AI and policy.
40
00:02:22,740 --> 00:02:23,920
So that's our effort today.
41
00:02:24,060 --> 00:02:30,500
We created a new course on AI in context to teach AI in the context of the humanities, in literature,
42
00:02:30,760 --> 00:02:32,340
in music, and philosophy.
43
00:02:32,780 --> 00:02:36,460
Today's topic, how could machines reach human-level intelligence?
44
00:02:36,780 --> 00:02:40,380
Just reading the title makes me so intrigued, so excited.
45
00:02:40,800 --> 00:02:45,580
So without further ado, let me invite Vishal, our Vice Dean of AI and Computing,
46
00:02:45,580 --> 00:02:47,900
to have an introduction of our speaker,
47
00:02:47,900 --> 00:02:49,240
Yann LeCun, today.
48
00:02:49,240 --> 00:02:50,080
He's here.
49
00:02:52,820 --> 00:02:53,660
Thanks, Shih-Fu.
50
00:02:55,940 --> 00:02:58,360
So, Yann, of course, needs no introduction.
51
00:03:03,480 --> 00:03:04,660
But just to embarrass him,
52
00:03:04,660 --> 00:03:08,140
I'll give a brief introduction of Yann.
53
00:03:08,140 --> 00:03:11,360
Now, this may come as a surprise to a lot of you,
54
00:03:11,360 --> 00:03:13,280
but it's true,
55
00:03:13,280 --> 00:03:15,620
and you'll never guess it from his accent.
56
00:03:15,620 --> 00:03:17,000
Yann is actually French.
57
00:03:18,080 --> 00:03:22,680
He got his PhD from the Sorbonne in 1987,
58
00:03:22,680 --> 00:03:24,280
and in his PhD thesis,
59
00:03:24,280 --> 00:03:28,020
he proposed an early form of back propagation.
60
00:03:28,020 --> 00:03:29,600
Now back propagation is the way
61
00:03:29,600 --> 00:03:32,540
all neural networks are trained now,
62
00:03:32,540 --> 00:03:36,440
and it sort of started from his PhD thesis.
63
00:03:37,620 --> 00:03:41,460
He joined AT&T Bell Labs in 1988.
64
00:03:41,460 --> 00:03:44,020
Before that, he spent a few months or a year
65
00:03:44,020 --> 00:03:46,920
with Jeff Hinton working as a postdoc.
66
00:03:49,840 --> 00:03:52,180
I, there was an alarm, okay.
67
00:03:52,180 --> 00:03:54,620
And he joined AT&T Bell Labs in 1988.
68
00:03:55,860 --> 00:03:58,040
Next year, he sort of stunned the world
69
00:03:58,040 --> 00:04:00,160
with this handwriting recognition system.
70
00:04:00,160 --> 00:04:01,460
And you'll see a video of that.
71
00:04:11,460 --> 00:04:38,300
.
72
00:04:38,300 --> 00:04:40,500
This was absolutely incredible at that time.
73
00:04:45,300 --> 00:04:47,900
And there you see Yann looking slightly different.
74
00:05:05,420 --> 00:05:08,120
After that came a long AI and neural nets
75
00:05:08,120 --> 00:05:13,120
winter. Yann joined AT&T Research in 1996,
76
00:05:14,320 --> 00:05:15,400
but he never gave up.
77
00:05:15,400 --> 00:05:19,760
He continued working on convolutional neural networks, CNNs,
78
00:05:19,760 --> 00:05:23,960
which were what he used for the handwriting recognition system.
79
00:05:23,960 --> 00:05:27,840
Around 2012, the deep learning revolution happened,
80
00:05:27,840 --> 00:05:29,680
and now CNNs are everywhere,
81
00:05:29,680 --> 00:05:31,680
whether it's his friend Elon Musk's cars,
82
00:05:33,580 --> 00:05:35,460
some people got what I meant,
83
00:05:35,460 --> 00:05:40,460
or Google Photos, everyone uses CNNs.
84
00:05:42,300 --> 00:05:47,300
In 2013, Yann joined Meta AI as the director of their AI lab
85
00:05:48,000 --> 00:05:49,940
and now he is the chief scientist.
86
00:05:49,940 --> 00:05:52,520
In 2018, he also won the Turing Award
87
00:05:52,520 --> 00:05:54,840
along with Jeff Hinton and Yoshua Bengio
88
00:05:56,220 --> 00:06:00,220
for his work in deep learning and artificial intelligence.
89
00:06:00,220 --> 00:06:02,560
In fact, Jeff was here yesterday.
90
00:06:02,560 --> 00:06:04,580
He was on campus and he was walking around
91
00:06:04,580 --> 00:06:06,180
And people were asking him for selfies.
92
00:06:06,180 --> 00:06:08,980
So he wanted to be here.
93
00:06:08,980 --> 00:06:10,660
Unfortunately, something urgent came up,
94
00:06:10,660 --> 00:06:12,200
so he couldn't be here.
95
00:06:12,200 --> 00:06:16,940
So as I mentioned, Yann won the Turing Award in 2018.
96
00:06:16,940 --> 00:06:19,860
And this is a Turing Award for computer science,
97
00:06:19,860 --> 00:06:22,580
not for physics or chemistry, which also get Nobel
98
00:06:22,580 --> 00:06:24,980
prizes these days.
99
00:06:24,980 --> 00:06:27,080
This is the original one.
100
00:06:27,080 --> 00:06:28,580
And he won the award in 2018.
101
00:06:28,580 --> 00:06:33,220
And he's also big into the selfie game.
102
00:06:33,220 --> 00:06:34,820
I took a selfie with him that day.
103
00:06:36,640 --> 00:06:39,080
And now with that, I'll invite Yann to tell us
104
00:06:39,080 --> 00:06:40,600
about human level intelligence.
105
00:06:48,720 --> 00:06:52,180
Thank you very much for this amazing introduction.
106
00:06:54,180 --> 00:06:56,740
A real pleasure to be here.
107
00:06:56,740 --> 00:06:59,620
The good thing about coming to give a talk here is that
108
00:07:00,740 --> 00:07:02,000
I didn't have to fly.
109
00:07:02,000 --> 00:07:08,180
Although if you ask people from downtown, they rarely go above 23rd Street.
110
00:07:11,700 --> 00:07:18,540
So, yeah, I mean, I worked really hard to lose my French accent in the last four decades or so,
111
00:07:18,680 --> 00:07:23,660
three and a half decades. But I just recently learned that if you speak English with a French
112
00:07:32,000 --> 00:07:35,320
I should speak with a very strong French accent.
113
00:07:36,120 --> 00:07:40,600
And perhaps, appear intelligent.
114
00:07:40,600 --> 00:07:46,800
Okay. What should appear intelligent is machines,
115
00:07:46,800 --> 00:07:49,320
and they do appear intelligent.
116
00:07:49,320 --> 00:07:52,600
A lot of people give them an IQ,
117
00:07:52,600 --> 00:07:53,640
whatever that means,
118
00:07:53,640 --> 00:07:56,520
that is actually much higher than they deserve.
119
00:07:56,520 --> 00:08:00,160
We are nowhere near being able to reach
120
00:08:00,160 --> 00:08:03,520
human intelligence or human level intelligence with machines,
121
00:08:03,520 --> 00:08:05,780
what some people call AGI,
122
00:08:05,780 --> 00:08:07,800
Artificial General Intelligence.
123
00:08:07,800 --> 00:08:09,660
I hate that term.
124
00:08:09,660 --> 00:08:13,040
I've been trying to fight against it.
125
00:08:13,040 --> 00:08:16,600
The reason is not that it's impossible for
126
00:08:16,600 --> 00:08:17,880
a machine to reach human intelligence.
127
00:08:17,880 --> 00:08:18,720
Of course, it's possible.
128
00:08:18,720 --> 00:08:20,720
There's no question at some point we'll have
129
00:08:20,720 --> 00:08:23,300
machines that are as intelligent as humans in
130
00:08:23,300 --> 00:08:25,080
all the domains where humans are intelligent.
131
00:08:25,080 --> 00:08:27,780
There's no question that they will go beyond this.
132
00:08:27,780 --> 00:08:32,480
But it's just because human intelligence is not general at all.
133
00:08:32,480 --> 00:08:34,960
We are very specialized animals.
134
00:08:34,960 --> 00:08:41,220
We have a hard time imagining that we are specialized because all the problems
135
00:08:41,220 --> 00:08:48,080
that we can fathom or imagine are problems that we can fathom or imagine.
136
00:08:48,080 --> 00:08:54,940
But there are many, many more problems that we can't even imagine in our wildest dreams.
137
00:08:54,940 --> 00:08:59,500
and so it makes us appear generally intelligent.
138
00:08:59,500 --> 00:09:01,760
We're not. We're specialized.
139
00:09:01,760 --> 00:09:03,520
So we should lose that term,
140
00:09:03,520 --> 00:09:05,300
artificial general intelligence.
141
00:09:05,300 --> 00:09:08,980
I prefer the term human level intelligence or a code name
142
00:09:08,980 --> 00:09:15,480
that we've adopted inside Meta is an acronym AMI,
143
00:09:15,480 --> 00:09:18,620
which means Advanced Machine Intelligence,
144
00:09:18,620 --> 00:09:21,220
which is kind of a little more loose.
145
00:09:21,220 --> 00:09:24,020
Also, we pronounce it AMI.
146
00:09:24,020 --> 00:09:27,900
Which in French means friend.
147
00:09:28,140 --> 00:09:30,340
Makes sense.
148
00:09:30,340 --> 00:09:33,380
Okay. So how can we ever reach
149
00:09:33,380 --> 00:09:35,260
human level intelligence with machines?
150
00:09:35,260 --> 00:09:37,940
Machines that can learn, of course,
151
00:09:37,940 --> 00:09:40,220
can remember, understand the physical world,
152
00:09:40,220 --> 00:09:43,140
have common sense, can plan, can reason,
153
00:09:43,140 --> 00:09:46,020
are behaving properly,
154
00:09:46,020 --> 00:09:50,500
not being unruly, dangerous, etc.
155
00:09:50,500 --> 00:09:52,940
And the first question we should ask ourselves is,
156
00:09:52,940 --> 00:09:54,620
Why would we want to build this?
157
00:09:54,620 --> 00:09:57,260
So obviously there is a big scientific question of what is
158
00:09:57,260 --> 00:09:59,580
intelligence and the best way to
159
00:09:59,580 --> 00:10:04,060
validate any theory we have about intelligence is to
160
00:10:04,060 --> 00:10:07,500
build an artifact that actually implements it.
161
00:10:07,500 --> 00:10:11,500
That's a very engineering approach to science if you want.
162
00:10:11,500 --> 00:10:14,700
But there is another good reason and the other good reason is that
163
00:10:14,700 --> 00:10:20,140
we need human level intelligence to amplify human intelligence.
164
00:10:20,140 --> 00:10:24,620
There's going to be a future in which we run
165
00:10:24,620 --> 00:10:29,700
around with AI assistants with us at all times,
166
00:10:29,700 --> 00:10:32,460
so we can ask them any question.
167
00:10:32,460 --> 00:10:34,280
They can answer any question we have.
168
00:10:34,280 --> 00:10:35,680
They can help us in our daily lives.
169
00:10:35,680 --> 00:10:38,100
They can solve problems for us.
170
00:10:38,100 --> 00:10:40,060
This will amplify human intelligence,
171
00:10:40,060 --> 00:10:42,100
perhaps in the way that the printing press has
172
00:10:42,100 --> 00:10:45,320
amplified human intelligence in the 15th century.
173
00:10:45,320 --> 00:10:49,420
So we need this for humanity.
174
00:10:49,420 --> 00:10:53,780
In fact, I'm wearing a pair of smart glasses right now.
175
00:10:53,780 --> 00:10:56,540
I can ask it questions.
176
00:10:56,540 --> 00:10:57,660
It goes through Meta AI,
177
00:10:57,660 --> 00:10:59,500
which is the product version of
178
00:10:59,500 --> 00:11:02,060
Llama 3 that many of you have heard of.
179
00:11:02,060 --> 00:11:04,340
I can ask it various things.
180
00:11:04,340 --> 00:11:06,780
So let me ask it something.
181
00:11:06,780 --> 00:11:09,060
I'm not going to use the microphone.
182
00:11:09,060 --> 00:11:13,780
Hey, Meta. Take a picture.
183
00:11:13,780 --> 00:11:16,700
You see that little light flash?
184
00:11:16,700 --> 00:11:19,420
Okay, you're all in the picture.
185
00:11:19,600 --> 00:11:21,900
You'll be on social network soon.
186
00:11:26,000 --> 00:11:28,660
So, you know, I could ask it, you know,
187
00:11:28,720 --> 00:11:30,020
more complex questions, obviously.
188
00:11:31,060 --> 00:11:36,720
And this thing can also recognize things through the camera.
189
00:11:36,840 --> 00:11:39,500
So you can ask it, what am I looking at?
190
00:11:39,860 --> 00:11:41,180
What is the species of plant?
191
00:11:42,340 --> 00:11:45,120
You know, you can look at a menu in Japanese
192
00:11:45,120 --> 00:11:46,340
and it will translate it for you.
193
00:11:46,340 --> 00:11:49,400
So, you know, these kinds of assistants are coming.
194
00:11:49,400 --> 00:11:51,020
They're still pretty stupid,
195
00:11:51,020 --> 00:11:53,680
but they're already useful.
196
00:11:53,680 --> 00:11:56,080
But there is a future maybe,
197
00:11:56,080 --> 00:11:57,880
you know, 10, 20 years from now,
198
00:11:57,880 --> 00:12:00,100
where they will be really smart and they will
199
00:12:00,100 --> 00:12:01,240
assist us in our daily lives.
200
00:12:01,240 --> 00:12:03,900
So we need those systems to have human level intelligence,
201
00:12:03,900 --> 00:12:05,940
because that's the best way for them to not be
202
00:12:05,940 --> 00:12:08,500
frustrating for us to interact with.
203
00:12:08,500 --> 00:12:10,340
Okay. So on the one hand,
204
00:12:10,340 --> 00:12:12,720
there is the really interesting scientific question
205
00:12:12,720 --> 00:12:14,800
of what is intelligence.
206
00:12:14,800 --> 00:12:18,560
In the middle there is the technological challenge
207
00:12:18,560 --> 00:12:20,480
of building intelligent machines.
208
00:12:20,480 --> 00:12:22,980
Then at the other end, it's actually useful.
209
00:12:22,980 --> 00:12:26,720
It will actually be useful for people and for humanity more generally.
210
00:12:26,720 --> 00:12:30,660
So all of the conditions are there.
211
00:12:30,660 --> 00:12:34,640
Then the more important condition is that there are people with
212
00:12:34,640 --> 00:12:41,820
a lot of resources willing to actually invest for this to be true, like Meta.
213
00:12:41,820 --> 00:12:52,400
So, the characteristics that we want of those machines are that they need to be able to understand the physical world.
214
00:12:52,660 --> 00:12:55,520
Current AI systems do not understand the physical world.
215
00:12:57,560 --> 00:13:01,900
They don't understand the physical world nearly as well as your house cat.
216
00:13:03,520 --> 00:13:07,320
And so, I've been saying, you know, and of course, newspapers can have like this kind of title.
217
00:13:07,320 --> 00:13:11,240
You know, Yann LeCun says AI is stupider than a cat.
218
00:13:11,240 --> 00:13:15,540
It's true, actually.
219
00:13:15,540 --> 00:13:19,400
We need AI systems that have persistent memory.
220
00:13:19,400 --> 00:13:22,940
We need them to be able to plan complex action sequences,
221
00:13:22,940 --> 00:13:25,660
which current systems are completely incapable of doing.
222
00:13:25,660 --> 00:13:27,760
We need them to be able to reason,
223
00:13:27,760 --> 00:13:29,740
and we need them to be controllable and safe.
224
00:13:29,740 --> 00:13:32,140
So basically, and by design,
225
00:13:32,140 --> 00:13:35,580
not by fine-tuning like it's done at the moment.
226
00:13:37,040 --> 00:13:40,740
That requires essentially new principles that are
227
00:13:40,740 --> 00:13:44,980
different from what current AI systems really are based on.
228
00:13:44,980 --> 00:13:48,980
So current systems, most of them anyway,
229
00:13:48,980 --> 00:13:51,800
perform inference by propagating signals through
230
00:13:51,800 --> 00:13:54,100
a bunch of layers of a neural net.
231
00:13:54,100 --> 00:13:58,640
I'm a big fan of that obviously, but it's very limited.
232
00:13:58,640 --> 00:14:03,260
There's only a small number of input-output functions that can
233
00:14:03,260 --> 00:14:06,100
be efficiently represented by feed-forward
234
00:14:06,100 --> 00:14:09,980
propagation through a bunch of layers in a neural net.
235
00:14:09,980 --> 00:14:13,300
There's a much more general approach to inference,
236
00:14:13,300 --> 00:14:17,140
which is not just running feed forward through a bunch of layers,
237
00:14:17,140 --> 00:14:19,480
but is based on optimization.
238
00:14:19,480 --> 00:14:22,900
So basically, there's an observation.
239
00:14:22,900 --> 00:14:28,780
You give the system a proposal for an output,
240
00:14:28,780 --> 00:14:31,380
and the system tells you to what extent
241
00:14:31,380 --> 00:14:34,340
the output is compatible with the observation.
242
00:14:34,340 --> 00:14:37,500
Okay. So I give you a picture of an elephant.
243
00:14:37,500 --> 00:14:42,040
I put the representation of the label elephant or the text,
244
00:14:42,040 --> 00:14:43,120
and the system tells you,
245
00:14:43,120 --> 00:14:45,060
yeah, those two things are compatible.
246
00:14:45,060 --> 00:14:49,600
The label elephant is a good label for that image.
247
00:14:49,600 --> 00:14:51,620
If you put the picture of a table,
248
00:14:51,620 --> 00:14:53,860
it says no, it's incompatible.
249
00:14:53,860 --> 00:14:56,980
So if you have a system that basically measures
250
00:14:56,980 --> 00:15:00,020
the compatibility between an input and an output,
251
00:15:00,020 --> 00:15:02,440
then through optimization and search,
252
00:15:02,440 --> 00:15:06,440
you can find an output that is most compatible with the input.
253
00:15:06,440 --> 00:15:10,100
This is intrinsically more powerful as an inference mechanism
254
00:15:10,100 --> 00:15:13,440
than just running feed forward through a bunch of layers.
255
00:15:13,440 --> 00:15:16,720
Because basically, any computational problem
256
00:15:16,720 --> 00:15:19,260
can be reduced to an optimization problem.
257
00:15:19,260 --> 00:15:23,460
So that's the very basic principle on
258
00:15:23,460 --> 00:15:25,720
which future AI systems should be built.
259
00:15:25,720 --> 00:15:27,940
Not propagating through a bunch of layers,
260
00:15:27,940 --> 00:15:30,040
but optimizing the answer so that
261
00:15:30,040 --> 00:15:31,680
it's most compatible with the input.
262
00:15:31,680 --> 00:15:34,440
Of course, this will involve deep learning system,
263
00:15:34,440 --> 00:15:36,160
back propagation, all that stuff.
264
00:15:36,160 --> 00:15:38,880
But the inference mechanism is very different.
265
00:15:38,880 --> 00:15:41,700
Now, this is not a new idea by any means.
266
00:15:41,700 --> 00:15:44,060
This type of inference is what is
267
00:15:44,060 --> 00:15:46,220
very standard in probabilistic inference.
268
00:15:46,220 --> 00:15:47,700
For example, if you have a graphical model,
269
00:15:47,700 --> 00:15:50,820
Bayesian network, you know the value of certain variables,
270
00:15:50,820 --> 00:15:53,340
you can infer the value of the other variables by
271
00:15:53,340 --> 00:15:56,400
minimizing a negative log likelihood or something like that,
272
00:15:56,400 --> 00:15:58,580
or with some energy function.
273
00:15:58,580 --> 00:16:01,180
So it's a very standard thing to do.
274
00:16:01,180 --> 00:16:02,780
There's nothing innovative about this,
275
00:16:02,780 --> 00:16:05,340
but people have forgotten about the fact that this is
276
00:16:05,340 --> 00:16:08,540
really much more powerful than feed-forward propagation.
277
00:16:08,540 --> 00:16:13,200
The framework that I like to use to explain this is called energy-based models.
278
00:16:13,200 --> 00:16:17,460
So basically, the function that measures the compatibility between X and Y,
279
00:16:17,460 --> 00:16:20,400
input and output, is an energy function that takes
280
00:16:20,400 --> 00:16:25,700
low values when input and output are compatible and larger values when they're not.
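(In symbols, a minimal way to write the energy-based formulation described here; the notation below is an editorial choice for this note, not the speaker's slides:)

```latex
% E(x, y) is low when input x and output y are compatible, higher otherwise,
% and inference picks the output that minimizes it.
E(x, y) \;\approx\; \text{degree of incompatibility of } (x, y),
\qquad
\hat{y} \;=\; \arg\min_{y}\; E(x, y)
```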
281
00:16:29,000 --> 00:16:33,600
So the type of inference that can take place to find
282
00:16:33,600 --> 00:16:35,840
the output could be a number of different things.
283
00:16:35,840 --> 00:16:41,480
If the representation of the output is continuous,
284
00:16:41,480 --> 00:16:43,460
and if the modules that we're talking about,
285
00:16:43,460 --> 00:16:45,620
the objectives, all the modules
286
00:16:45,620 --> 00:16:47,380
inside of the system are differentiable,
287
00:16:47,380 --> 00:16:49,820
you can use gradient-based optimization to find
288
00:16:49,820 --> 00:16:53,360
the best answer, or at least one good answer.
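(A minimal sketch of that gradient-based inference, with a toy hand-written energy function standing in for a trained network; all names and shapes are illustrative assumptions, not the speaker's code:)

```python
# Minimal sketch of inference by optimization: a toy differentiable energy
# E(x, y), and gradient descent over the output y only.
import torch

def energy(x, y):
    # Toy "compatibility" score: low when y matches a simple function of x.
    # A real system would use a trained neural network here.
    return ((y - torch.sin(x).sum()) ** 2).sum()

def infer(x, steps=200, lr=0.1):
    y = torch.zeros(1, requires_grad=True)      # initial guess for the output
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy(x, y).backward()                 # descend the energy w.r.t. y
        opt.step()
    return y.detach()

x = torch.tensor([0.3, 1.2, -0.7])
print(infer(x))                                 # lands near the lowest-energy answer
```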
289
00:16:53,360 --> 00:16:56,780
But you can imagine that the output is discrete,
290
00:16:56,780 --> 00:16:58,340
combinatorial, and then you have to use
291
00:16:58,340 --> 00:17:02,500
other types of combinatorial optimization algorithms
292
00:17:02,500 --> 00:17:06,900
to figure out the best output.
293
00:17:06,900 --> 00:17:07,960
If that's the case,
294
00:17:07,960 --> 00:17:12,280
then you're talking to the wrong LeCun,
295
00:17:12,280 --> 00:17:14,960
because my brother is actually,
296
00:17:14,960 --> 00:17:16,620
he works at Google, nobody's perfect,
297
00:17:16,620 --> 00:17:19,980
but he works on,
298
00:17:19,980 --> 00:17:22,360
he's an expert in combinatorial optimization.
299
00:17:25,680 --> 00:17:29,260
So this type of inference gives AI systems
300
00:17:29,260 --> 00:17:31,300
kind of zero-shot learning ability.
301
00:17:31,300 --> 00:17:31,960
What does that mean?
302
00:17:31,960 --> 00:17:34,860
It means you give them a problem and if they can,
303
00:17:34,860 --> 00:17:36,900
if you can formulate this problem in terms of
304
00:17:36,900 --> 00:17:38,880
an optimization problem, then you get a solution to
305
00:17:38,880 --> 00:17:42,020
that problem without the system having to learn anything.
306
00:17:42,020 --> 00:17:43,900
Right? That's zero-shot.
307
00:17:43,900 --> 00:17:46,120
You are given, and you are students,
308
00:17:46,120 --> 00:17:49,320
you're given a new mathematics problem or something.
309
00:17:49,320 --> 00:17:52,320
You can think about it and perhaps
310
00:17:52,320 --> 00:17:55,460
solve it without learning anything new.
311
00:17:55,460 --> 00:17:59,460
Right? That's called zero-shot.
312
00:17:59,460 --> 00:18:05,240
And in humans, some psychologists also call this system two.
313
00:18:05,240 --> 00:18:10,320
So basically you devote your entire attention and consciousness to
314
00:18:10,320 --> 00:18:13,740
solving a problem that you concentrate on and you think about it and it might
315
00:18:13,740 --> 00:18:16,840
take a long time to solve that problem.
316
00:18:16,840 --> 00:18:17,980
That's system two.
317
00:18:17,980 --> 00:18:22,220
System one is when you act reactively.
318
00:18:22,220 --> 00:18:23,200
You don't have to think about it,
319
00:18:23,200 --> 00:18:25,360
it's become kind of subconscious, automatic.
320
00:18:25,360 --> 00:18:27,140
So if you are an experienced driver,
321
00:18:27,140 --> 00:18:28,360
you drive on the highway,
322
00:18:28,360 --> 00:18:29,380
you don't have to think about it.
323
00:18:29,380 --> 00:18:30,780
it's going to become automatic.
324
00:18:30,780 --> 00:18:34,880
You can hold a conversation with someone and everything.
325
00:18:34,880 --> 00:18:37,520
If you're a beginner though,
326
00:18:37,520 --> 00:18:39,980
it's your first time driving a car,
327
00:18:39,980 --> 00:18:41,920
you pay close attention.
328
00:18:41,920 --> 00:18:43,260
You're using your system two,
329
00:18:43,260 --> 00:18:48,320
your entire capacity of your mind.
330
00:18:49,140 --> 00:18:53,520
So that's why we need to adopt this model.
331
00:18:53,520 --> 00:18:56,600
This framework of energy-based model is
332
00:18:56,600 --> 00:18:59,680
sort of the way to understand this at the theoretical level.
333
00:18:59,680 --> 00:19:01,420
I'm not gonna do a lot of theory here.
334
00:19:01,420 --> 00:19:03,300
This is a very diverse audience,
335
00:19:03,300 --> 00:19:05,300
but the basic idea is that,
336
00:19:05,300 --> 00:19:06,920
if you have two variables, X and Y,
337
00:19:06,920 --> 00:19:08,040
here they are scalars,
338
00:19:08,040 --> 00:19:12,380
but you can imagine that they are high dimensional inputs.
339
00:19:12,380 --> 00:19:16,800
The energy function is some sort of landscape
340
00:19:16,800 --> 00:19:20,800
where pairs of X and Y that are compatible
341
00:19:20,800 --> 00:19:23,500
have low energy, low altitude if you want,
342
00:19:23,500 --> 00:19:25,920
and then pairs of X and Y's that are not compatible
343
00:19:25,920 --> 00:19:27,280
have higher energy.
344
00:19:27,280 --> 00:19:30,260
And so the goal of learning now is to shape
345
00:19:30,260 --> 00:19:32,880
this energy surface in such a way that it gives
346
00:19:32,880 --> 00:19:35,360
low energy to things you observe,
347
00:19:35,360 --> 00:19:38,880
training data, pairs of XY that you observe,
348
00:19:38,880 --> 00:19:41,520
and then higher energy to everything else.
349
00:19:41,520 --> 00:19:43,400
The first part is super easy
350
00:19:43,400 --> 00:19:44,860
because we know how to do gradient descent.
351
00:19:44,860 --> 00:19:48,760
So you give a pair of XY that you know are compatible
352
00:19:48,760 --> 00:19:51,860
and you tweak the system so that the scalar output,
353
00:19:51,860 --> 00:19:55,840
the energy, the scalar energy output that it produces
354
00:19:55,840 --> 00:20:00,000
You can tweak the parameters inside your big neural net so that the output goes down.
355
00:20:00,000 --> 00:20:06,240
Easy. The difficulty is how to make sure that the energy is higher outside of the training sample.
356
00:20:06,240 --> 00:20:10,080
The training samples in this diagram are represented by the black dots.
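(One common family of methods for pushing energy up outside the training pairs is contrastive; here is a minimal illustrative sketch using a margin loss and random mismatched pairs as negatives, with a toy bilinear energy. None of this is the speaker's specific recipe:)

```python
# Minimal contrastive sketch: push energy down on observed (x, y) pairs
# and up on mismatched pairs, up to a margin of 1.
import torch

torch.manual_seed(0)
W = torch.randn(4, 4, requires_grad=True)            # parameters of E(x, y) = -x.W.y
opt = torch.optim.SGD([W], lr=0.05)

def energy(x, y):
    return -(x @ W @ y.T).diagonal()                  # low value = compatible pair

x = torch.randn(64, 4)
y = x + 0.1 * torch.randn(64, 4)                      # observed compatible pairs
y_neg = y[torch.randperm(64)]                         # mismatched pairs ("everything else")

for _ in range(200):
    opt.zero_grad()
    # Hinge loss: want E(x, y) to be at least 1 lower than E(x, y_neg).
    loss = torch.relu(1.0 + energy(x, y) - energy(x, y_neg)).mean()
    loss.backward()
    opt.step()
```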
357
00:20:12,720 --> 00:20:18,480
And at some level, a lot of literature in machine learning is devoted to that problem.
358
00:20:18,480 --> 00:20:23,840
It's not formulated in the way I just did, but in a probabilistic framework, for example.
359
00:20:23,840 --> 00:20:31,060
This problem of making sure the energy of things outside the training data is high,
360
00:20:31,060 --> 00:20:33,240
is a major issue.
361
00:20:33,240 --> 00:20:40,280
It usually encounters intractable mathematical problems.
362
00:20:40,280 --> 00:20:42,160
Let me skip this for now.
363
00:20:42,160 --> 00:20:47,880
Okay. So now, the whole craze of AI over the last couple of years,
364
00:20:47,880 --> 00:20:50,880
three years let's say, has been around LLMs,
365
00:20:50,880 --> 00:20:53,320
Large language models and large language models should be
366
00:20:53,320 --> 00:20:56,200
really called auto-regressive large language models.
367
00:20:56,200 --> 00:21:00,660
So what they do is they're trained on lots of texts and they're
368
00:21:00,660 --> 00:21:03,900
basically trained to produce the next word,
369
00:21:03,900 --> 00:21:08,600
to predict the next word from the sequence of words that precede it.
370
00:21:09,640 --> 00:21:14,360
That's all they've been trained to do.
371
00:21:14,840 --> 00:21:17,680
Once the system has been trained,
372
00:21:17,680 --> 00:21:20,620
you can of course show it a piece of text and then ask
373
00:21:20,620 --> 00:21:23,440
to predict the next word and then you inject that next word into
374
00:21:23,440 --> 00:21:26,080
the input and ask it to predict the second next word,
375
00:21:26,080 --> 00:21:27,780
shift that into the input,
376
00:21:27,780 --> 00:21:29,060
third word, etc.
377
00:21:29,060 --> 00:21:30,620
So that's auto-regressive prediction.
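(A minimal sketch of that auto-regressive loop, with a tiny untrained stand-in model instead of a trained transformer; names and sizes are illustrative assumptions:)

```python
# Auto-regressive prediction: predict the next token, shift it into the input, repeat.
import torch

vocab, dim = 100, 32
embed = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)

def next_token_logits(tokens):
    # Stand-in model: embed the context, average it, score every possible next token.
    return head(embed(tokens).mean(dim=0))

tokens = torch.tensor([1, 5, 7])                       # the prompt
for _ in range(10):                                    # generate 10 more tokens
    logits = next_token_logits(tokens)
    nxt = torch.distributions.Categorical(logits=logits).sample()
    tokens = torch.cat([tokens, nxt.unsqueeze(0)])     # inject the prediction into the input
print(tokens.tolist())
```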
378
00:21:30,620 --> 00:21:36,180
It's not a new concept; it's been around since before I was born.
379
00:21:36,180 --> 00:21:39,000
So not recent.
380
00:21:39,000 --> 00:21:41,400
But it's system one.
381
00:21:41,400 --> 00:21:44,400
It's feed forward propagation through a bunch of layers.
382
00:21:44,400 --> 00:21:46,300
There is a fixed amount of
383
00:21:46,300 --> 00:21:50,240
computation devoted to computing every new token.
384
00:21:50,240 --> 00:21:56,280
So if you want a system to spend more resources producing an answer,
385
00:21:56,280 --> 00:21:57,540
a system of this type,
386
00:21:57,540 --> 00:22:01,960
you basically have to artificially make it produce more tokens,
387
00:22:01,960 --> 00:22:03,640
which seems kind of a hack.
388
00:22:03,640 --> 00:22:05,400
That's called chain of thought.
389
00:22:05,400 --> 00:22:13,260
There's various techniques to do approximate planning or reasoning using this.
390
00:22:13,260 --> 00:22:18,200
You basically have the system produce lots and lots of candidate outputs by
391
00:22:18,200 --> 00:22:23,920
kind of changing the noise in the way it produces the sequences and then within
392
00:22:23,920 --> 00:22:28,140
the list of outputs that it produces you search for a good one essentially.
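(A minimal sketch of that sample-then-search pattern, sometimes called best-of-N; the generator and the scorer here are made-up placeholders, not any particular system:)

```python
# Produce many noisy candidate outputs, score them, keep the best one.
import random

def generate():
    # Stand-in for sampling one candidate answer with some noise.
    return [random.randrange(100) for _ in range(10)]

def score(candidate):
    # Stand-in "goodness" measure (a verifier or reward model in real systems).
    return -abs(sum(candidate) - 42)

candidates = [generate() for _ in range(16)]   # lots and lots of candidate outputs
best = max(candidates, key=score)              # a little bit of search / optimization
print(best, score(best))
```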
393
00:22:28,140 --> 00:22:32,080
So there's a little bit of search there, a little bit of optimization but it's
394
00:22:32,080 --> 00:22:37,580
kind of a hack. So I don't believe those methods will ever lead to true
395
00:22:37,580 --> 00:22:44,840
intelligent behavior. In fact cognitive scientists agree. Cognitive scientists
396
00:22:44,840 --> 00:22:50,540
have been looking at LLMs with a very critical eye and saying that this is not real intelligence.
397
00:22:50,540 --> 00:22:53,640
This is nothing like what we observe in people.
398
00:22:53,640 --> 00:22:59,840
Similarly, people coming from kind of the non-machine learning based AI community,
399
00:22:59,840 --> 00:23:03,240
people like Subbarao Kambhampati from Arizona State,
400
00:23:03,240 --> 00:23:05,740
have been saying LLMs really cannot plan.
401
00:23:05,740 --> 00:23:09,340
So Rao has a whole bunch of papers.
402
00:23:14,840 --> 00:23:20,400
The titles of those papers are things like: LLMs can't plan,
403
00:23:20,400 --> 00:23:22,720
LLMs still can't plan,
404
00:23:22,720 --> 00:23:25,680
LLMs really, really can't plan,
405
00:23:25,680 --> 00:23:31,080
and even LLMs that claim to be able to plan can't actually plan.
406
00:23:31,080 --> 00:23:37,340
So we have a big problem there that the people who claim
407
00:23:37,340 --> 00:23:40,120
that somehow we're going to take the current paradigm,
408
00:23:40,120 --> 00:23:44,420
make it bigger, spend trillions on data centers,
409
00:23:44,420 --> 00:23:48,280
and collect every piece of data in the world and train
410
00:23:48,280 --> 00:23:50,940
LLMs and they're going to reach human level intelligence.
411
00:23:50,940 --> 00:23:53,340
That's completely false in my opinion.
412
00:23:53,340 --> 00:23:54,780
I might be wrong,
413
00:23:54,780 --> 00:23:58,140
but in my opinion, that's completely hopeless.
414
00:23:58,140 --> 00:24:01,180
So the question is, what is not hopeless?
415
00:24:01,180 --> 00:24:07,720
So if we agree to this basic principle of inference by optimization,
416
00:24:07,720 --> 00:24:12,700
how can we sort of instantiate this in
417
00:24:12,700 --> 00:24:15,000
a real intelligent system?
418
00:24:15,000 --> 00:24:18,100
Basically, doing a little bit of introspection,
419
00:24:18,100 --> 00:24:21,180
when we think, the way we think is generally
420
00:24:21,180 --> 00:24:24,060
independent of the language that we might be able to
421
00:24:24,060 --> 00:24:26,220
express this thought in.
422
00:24:26,220 --> 00:24:29,140
I'm thinking about saying things here and it's
423
00:24:29,140 --> 00:24:31,660
independent of whether I'm giving
424
00:24:31,660 --> 00:24:33,900
this talk in English or French.
425
00:24:33,900 --> 00:24:37,940
So there is a thought that is independent of language,
426
00:24:37,940 --> 00:24:41,140
and LLMs don't have this capacity really.
427
00:24:41,140 --> 00:24:45,140
When we think we have a mental model of the situation that we think of.
428
00:24:45,140 --> 00:24:47,900
We're planning a sequence of actions.
429
00:24:47,900 --> 00:24:52,020
We have a mental model that allows us to predict
430
00:24:52,020 --> 00:24:54,660
what the consequences of our actions are going to be,
431
00:24:54,660 --> 00:24:57,260
so that if we set a goal for ourselves,
432
00:24:57,260 --> 00:25:02,100
we can figure out a sequence of actions that will satisfy this goal.
433
00:25:02,100 --> 00:25:07,680
So, an instantiation of the model I talked about earlier is one like this,
434
00:25:07,680 --> 00:25:11,240
where you observe the world through a perception module.
435
00:25:11,240 --> 00:25:12,800
Think of it as a big neural net.
436
00:25:12,800 --> 00:25:15,800
It gives you some idea of the current state of the world.
437
00:25:15,800 --> 00:25:17,140
Now, of course, the current state of the world
438
00:25:17,140 --> 00:25:18,720
is whatever you can perceive,
439
00:25:18,720 --> 00:25:20,080
but your idea of the state of the world
440
00:25:20,080 --> 00:25:23,920
also contains stuff that you perceived in the past,
441
00:25:23,920 --> 00:25:27,460
stuff that you know, facts that you know about the world.
442
00:25:27,460 --> 00:25:31,480
So if I take this bottle of water
443
00:25:31,480 --> 00:25:35,380
and I move it from this side to that side of the lectern,
444
00:25:35,380 --> 00:25:40,460
Your model of the world hasn't changed much.
445
00:25:40,460 --> 00:25:45,020
Most of your ideas about the state of the world haven't changed.
446
00:25:45,020 --> 00:25:50,420
What has changed is the content of this lectern and the position of that box.
447
00:25:50,420 --> 00:25:53,060
But other than that, not much.
448
00:25:53,060 --> 00:25:57,580
So the idea that somehow a perception gives you
449
00:25:57,580 --> 00:25:59,900
a complete picture of the state of the world is false.
450
00:25:59,900 --> 00:26:02,060
You need to combine this with a memory.
451
00:26:02,060 --> 00:26:04,260
So that's this memory module here.
452
00:26:04,260 --> 00:26:08,620
Combine your current perception with the content of your memory.
453
00:26:08,620 --> 00:26:11,200
That gives you an idea of the current state of the world.
454
00:26:11,200 --> 00:26:14,940
Now, what you're going to do is feed this to a world model,
455
00:26:14,940 --> 00:26:19,440
and you're going to hear that phrase many times in the rest of the talk.
456
00:26:19,440 --> 00:26:22,560
The role of this world model is to predict what
457
00:26:22,560 --> 00:26:25,220
the outcome of a sequence of actions is going to be.
458
00:26:25,220 --> 00:26:27,340
This could be actions that you're planning to take,
459
00:26:27,340 --> 00:26:29,540
or this could be the agent is planning to take,
460
00:26:29,540 --> 00:26:31,980
or actions that someone else may be taking,
461
00:26:31,980 --> 00:26:34,240
or some events that may be occurring.
462
00:26:34,240 --> 00:26:37,080
So predicting the outcome of
463
00:26:37,080 --> 00:26:40,920
a sequence of actions is what allows us to reason and plan.
464
00:26:41,800 --> 00:26:48,000
So you can probably tell that if I take this water bottle
465
00:26:48,000 --> 00:26:53,760
and I put it on its head and I lift my finger,
466
00:26:53,760 --> 00:26:57,320
you can have some pretty good idea of what's going to happen.
467
00:26:57,320 --> 00:26:59,080
It's probably going to fall, right?
468
00:26:59,080 --> 00:27:01,520
It's either going to fall on this side or that side.
469
00:27:01,520 --> 00:27:04,220
You may not be able to predict this because I'm balancing it.
470
00:27:04,220 --> 00:27:06,520
but it's going to fall on one side or the other.
471
00:27:06,520 --> 00:27:08,820
So to some extent, at an abstract level,
472
00:27:08,820 --> 00:27:10,440
you can say it's going to fall.
473
00:27:10,440 --> 00:27:12,720
I can't tell you exactly in which position,
474
00:27:12,720 --> 00:27:15,120
in which direction, but I can tell you it's going to fall.
475
00:27:15,120 --> 00:27:17,520
You have an intuitive physics model,
476
00:27:17,520 --> 00:27:20,440
which is in fact very sophisticated,
477
00:27:20,440 --> 00:27:23,280
even though the situation is incredibly simple.
478
00:27:23,280 --> 00:27:27,060
So that allows us to plan.
479
00:27:27,060 --> 00:27:29,200
This model of the world is what allows us to plan.
480
00:27:29,200 --> 00:27:34,200
So then we can have a system like this that has a task objective,
481
00:27:34,200 --> 00:27:38,040
it sets an objective for itself,
482
00:27:38,040 --> 00:27:42,680
or you set an objective that measures to what extent a task has been accomplished,
483
00:27:42,680 --> 00:27:48,200
whether the resulting state of the world matches some condition.
484
00:27:48,520 --> 00:27:53,560
You might also have a number of guardrail objectives,
485
00:27:53,560 --> 00:28:00,000
things that make sure that whatever actions the agent takes,
486
00:28:00,000 --> 00:28:03,360
nobody's going to get hurt, for example.
487
00:28:03,360 --> 00:28:08,360
So those square boxes are cost functions,
488
00:28:08,360 --> 00:28:10,840
they have an implicit scalar output,
489
00:28:10,840 --> 00:28:13,600
and the overall energy of the system is just the sum of
490
00:28:13,600 --> 00:28:18,120
the scalar outputs of all the red square boxes.
491
00:28:18,120 --> 00:28:19,760
The other modules there,
492
00:28:19,760 --> 00:28:22,040
the one with a round shape,
493
00:28:22,040 --> 00:28:24,920
are deterministic functions, neural nets, let's say,
494
00:28:24,920 --> 00:28:27,200
and the round shapes are variables.
495
00:28:27,200 --> 00:28:29,400
The action sequence is a latent variable,
496
00:28:29,400 --> 00:28:32,680
it's not observed, we're going to compute it by optimization.
497
00:28:32,680 --> 00:28:36,260
We're going to try to find a sequence of actions that minimize
498
00:28:36,260 --> 00:28:40,160
the sum of the task objective and the guardrail objectives,
499
00:28:40,160 --> 00:28:42,860
and that's going to be the output of the system.
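(A minimal sketch of planning as optimization over a latent action sequence, with a hand-written toy world model and toy task and guardrail costs; everything here is an illustrative assumption, not the speaker's architecture code:)

```python
# Find an action sequence that minimizes task cost + guardrail cost
# under a toy world model, by gradient descent over the actions.
import torch

def world_model(state, action):
    return state + 0.1 * action                    # toy dynamics: the action nudges the state

def task_cost(state, goal):
    return ((state - goal) ** 2).sum()             # how far the final state is from the goal

def guardrail_cost(actions):
    return torch.relu(actions.abs() - 1.0).sum()   # penalize overly large actions

state0 = torch.zeros(2)
goal = torch.tensor([1.0, -0.5])
actions = torch.zeros(5, 2, requires_grad=True)    # latent action sequence, found by optimization
opt = torch.optim.Adam([actions], lr=0.1)

for _ in range(300):                               # inference = optimize over the actions
    opt.zero_grad()
    s = state0
    for a in actions:                              # roll the same world model forward repeatedly
        s = world_model(s, a)
    cost = task_cost(s, goal) + guardrail_cost(actions)
    cost.backward()
    opt.step()

print(actions.detach())                            # the planned sequence of actions
```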
500
00:28:44,440 --> 00:28:47,160
Again, that's intrinsically more powerful than
501
00:28:47,160 --> 00:28:50,580
just running through a bunch of feed-forward layers.
502
00:28:50,580 --> 00:28:53,860
So that's the basic architecture.
503
00:28:53,860 --> 00:28:57,000
We can specialize this architecture further.
504
00:28:57,000 --> 00:28:58,820
For a sequence of actions,
505
00:28:58,820 --> 00:29:02,060
I might need to use my world model multiple times.
506
00:29:02,060 --> 00:29:06,680
So if I move that bottle from here to here,
507
00:29:06,680 --> 00:29:08,200
and then from here to here,
508
00:29:08,200 --> 00:29:09,460
that's a sequence of two actions.
509
00:29:09,460 --> 00:29:11,640
I don't need to have a separate model for those two actions.
510
00:29:11,640 --> 00:29:14,200
It's the same model that is just applied twice.
511
00:29:14,200 --> 00:29:17,580
So that's what's represented here,
512
00:29:17,580 --> 00:29:21,360
where action one and action two go into the same model,
513
00:29:21,360 --> 00:29:24,700
and it computes the resulting state.
514
00:29:24,700 --> 00:29:28,520
Planning a sequence of actions to optimize a cost function,
515
00:29:28,520 --> 00:29:30,920
according to a model that you run multiple times,
516
00:29:31,080 --> 00:29:35,480
is a completely standard method in optimal control called model predictive control.
517
00:29:36,040 --> 00:29:41,720
It's been around since the early 60s, so it's as old as me.
518
00:29:43,320 --> 00:29:50,920
And this is what you know the entire optimal control community uses to do motion planning.
519
00:29:50,920 --> 00:29:57,160
Robotics uses motion planning. NASA uses motion planning to you know plan the trajectory of
520
00:29:57,160 --> 00:29:58,920
rockets to rendezvous with the space station.
521
00:29:58,920 --> 00:30:00,780
It's this type of model.
522
00:30:00,780 --> 00:30:03,480
The difference here is that the world model is going to be learned.
523
00:30:03,480 --> 00:30:04,360
It's going to be trained.
524
00:30:04,360 --> 00:30:08,080
It's not going to be written by hand with a bunch of equations.
525
00:30:08,080 --> 00:30:10,340
It's going to be trained from data.
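(One naive illustration of "trained from data, not written down as equations": fit a small network to logged (state, action, next state) transitions by regression. This is only a toy to show the idea; it is not the speaker's method, which the rest of the talk develops differently:)

```python
# Fit a toy world model to logged transitions by regression.
import torch

model = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake logged experience; the true effect of the action is unknown to the model.
s = torch.randn(1024, 4)
a = torch.randn(1024, 2)
s_next = s + 0.1 * torch.cat([a, -a], dim=1)

for _ in range(500):
    opt.zero_grad()
    pred = model(torch.cat([s, a], dim=1))         # predict the next state from (state, action)
    loss = ((pred - s_next) ** 2).mean()
    loss.backward()
    opt.step()
```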
526
00:30:10,340 --> 00:30:13,540
Of course, the question is, how do we do this?
527
00:30:13,540 --> 00:30:14,840
I'll come to this in a second.
528
00:30:14,840 --> 00:30:18,320
Now, the sad thing about the world is two things.
529
00:30:18,320 --> 00:30:24,060
First thing is, you cannot run the world faster than real-time.
530
00:30:24,060 --> 00:30:27,500
That's the limitation.
531
00:30:27,500 --> 00:30:28,940
We have to deal with that.
532
00:30:28,940 --> 00:30:31,220
The second one is that the world is not deterministic.
533
00:30:31,220 --> 00:30:36,160
Or if it is deterministic as some physicists tell us it is,
534
00:30:36,160 --> 00:30:38,860
it's not entirely predictable because we don't have
535
00:30:38,860 --> 00:30:41,960
a full observation of the state of the world.
536
00:30:41,960 --> 00:30:45,260
The way you model
537
00:30:45,260 --> 00:30:48,720
non-deterministic functions out of deterministic functions,
538
00:30:48,720 --> 00:30:51,820
is that you feed them extra inputs that are latent variables.
539
00:30:51,820 --> 00:30:54,560
Those are variables whose values you don't know,
540
00:30:54,560 --> 00:30:57,480
and you can make them sweep through a bunch of
541
00:30:57,480 --> 00:31:01,100
values in a set, or you can sample them from distributions.
542
00:31:01,100 --> 00:31:03,260
For each value of the latent variable,
543
00:31:03,260 --> 00:31:06,260
you get a different prediction from your model.
544
00:31:06,260 --> 00:31:10,220
Okay. So a distribution over the latent variable implies
545
00:31:10,220 --> 00:31:13,580
a distribution over the output of the model.
546
00:31:13,580 --> 00:31:17,060
That's the way to handle uncertainty.
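(A minimal sketch of that latent-variable trick: a deterministic network gets an extra input z whose value is unknown; sampling z gives a distribution over predictions. Shapes and names are illustrative assumptions:)

```python
# A deterministic net with an extra latent input z; sampling z yields
# a distribution over predicted outcomes.
import torch

net = torch.nn.Linear(4 + 2 + 3, 4)                # inputs: state (4), action (2), latent z (3)

def predict(state, action, z):
    return net(torch.cat([state, action, z]))

state, action = torch.randn(4), torch.randn(2)
samples = [predict(state, action, torch.randn(3))  # one prediction per sampled value of z
           for _ in range(8)]
print(torch.stack(samples).std(dim=0))             # spread across samples = predicted uncertainty
```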
547
00:31:17,060 --> 00:31:20,260
Of course, you know, you have to plan in the presence of uncertainty.
548
00:31:20,260 --> 00:31:28,260
So you want to make sure that your plan will succeed regardless of what the values of the latent variable will be.
549
00:31:30,260 --> 00:31:37,260
But in fact, humans and animals don't do planning this way. We do hierarchical planning.
550
00:31:37,260 --> 00:31:44,260
So hierarchical planning means that we have multiple levels of abstraction for representing the state of the world.
551
00:31:44,260 --> 00:31:49,260
We don't represent the world always with the same level of abstraction.
552
00:31:49,260 --> 00:31:52,660
Let me take a concrete example here.
553
00:31:52,660 --> 00:31:56,100
So let's say I'm sitting in my office in NYU
554
00:31:56,100 --> 00:31:57,520
and I want to go to Paris.
555
00:31:58,540 --> 00:32:00,200
At a very high abstract level,
556
00:32:00,200 --> 00:32:02,380
I can predict that if I decide right now
557
00:32:02,380 --> 00:32:03,880
to be in Paris tomorrow morning,
558
00:32:05,420 --> 00:32:07,640
I can go to the airport tonight
559
00:32:07,640 --> 00:32:10,540
and catch a plane to Paris and fly overnight.
560
00:32:11,500 --> 00:32:13,380
That's a plan, it's a very high level plan.
561
00:32:13,380 --> 00:32:15,240
I can't predict all the details of what's gonna happen,
562
00:32:15,240 --> 00:32:16,540
but at a high level,
563
00:32:16,540 --> 00:32:21,540
I know that I need to go to the airport and then catch a plane.
564
00:32:21,540 --> 00:32:24,540
Now I have a sub-goal. How do I go to the airport?
565
00:32:24,540 --> 00:32:30,540
Well, I need to go down on the street and hail a taxi because we're in New York.
566
00:32:30,540 --> 00:32:33,540
How do I go down on the street?
567
00:32:33,540 --> 00:32:39,540
I need to go to the elevator, push the button, and then walk out the door.
568
00:32:39,540 --> 00:32:42,540
How do I go to the elevator?
569
00:32:42,540 --> 00:32:51,920
I need to stand up from my chair, pick up my bag, open the door, close the door, walk to the elevator, avoid all the obstacles that I perceive, push the button.
570
00:32:53,120 --> 00:32:54,420
How do I stand up from my chair?
571
00:32:56,060 --> 00:33:01,860
So there is a level below which language is insufficient to express what we need to do.
572
00:33:02,800 --> 00:33:05,080
You cannot explain to someone how you stand up from a chair.
573
00:33:06,540 --> 00:33:10,920
You kind of have to know this in your muscles.
574
00:33:10,920 --> 00:33:13,800
You need to understand the physical world to be able to do this.
575
00:33:13,800 --> 00:33:16,220
So that's the other limitation of LLMs.
576
00:33:16,220 --> 00:33:20,420
Their level of abstraction is high because they manipulate language,
577
00:33:20,420 --> 00:33:23,800
but they're not grounded on reality.
578
00:33:23,800 --> 00:33:27,380
They have no idea what the physical world is like.
579
00:33:27,380 --> 00:33:33,260
That drives them to make really stupid mistakes and appear very,
580
00:33:33,260 --> 00:33:35,540
very stupid in many situations.
581
00:33:35,540 --> 00:33:38,640
So we need systems that really go
582
00:33:38,640 --> 00:33:41,160
down all the way to that level.
583
00:33:41,160 --> 00:33:43,960
And this is what your house cat can do
584
00:33:43,960 --> 00:33:45,300
and LLMs cannot do.
585
00:33:46,140 --> 00:33:48,140
Which is why I'm saying your house cat is smarter
586
00:33:48,140 --> 00:33:50,800
than the smartest LLMs.
587
00:33:50,800 --> 00:33:54,020
Of course house cats don't have nearly as much
588
00:33:54,020 --> 00:33:58,600
abstract knowledge stored in their memory as an LLM.
589
00:33:58,600 --> 00:34:02,120
But they're really smart in their understanding of the world
590
00:34:02,120 --> 00:34:03,200
and their ability to plan.
591
00:34:03,200 --> 00:34:05,200
And they can plan hierarchically as well.
592
00:34:05,200 --> 00:34:13,540
So what we need there is, you know, world models that are at multiple levels of abstraction,
593
00:34:13,540 --> 00:34:16,420
and how to train this is not completely obvious.
594
00:34:16,420 --> 00:34:23,520
Okay, so this whole idea, this whole kind of spiel leads to a view of AI that I call
595
00:34:23,520 --> 00:34:25,420
Objective Driven AI Systems.
596
00:34:25,420 --> 00:34:26,760
It's a recent name.
597
00:34:26,760 --> 00:34:33,700
I wrote a vision paper two and a half years ago that I put online at this URL on OpenReview.
598
00:34:33,700 --> 00:34:41,300
It's not on arXiv because I want open comments and so that I can update this paper.
599
00:34:41,300 --> 00:34:47,260
And it's the groundwork for the talk I'm giving at the moment, but in the last two and a half
600
00:34:47,260 --> 00:34:51,380
years we've made progress towards that plan, so I'm going to give you some experimental
601
00:34:51,380 --> 00:34:55,340
results and things we built.
602
00:34:55,340 --> 00:34:59,760
So the architecture I'm proposing in that paper is a so-called cognitive architecture
603
00:34:59,760 --> 00:35:02,300
that has the components I just expressed,
604
00:35:02,300 --> 00:35:03,800
things like a perception module
605
00:35:03,800 --> 00:35:05,440
that estimates the state of the world,
606
00:35:05,440 --> 00:35:08,440
a memory that you can use,
607
00:35:08,440 --> 00:35:11,160
a world model which is kind of a centerpiece a little bit,
608
00:35:11,160 --> 00:35:12,940
a bunch of cost modules
609
00:35:12,940 --> 00:35:16,800
that are either defining tasks or guardrails,
610
00:35:16,800 --> 00:35:18,840
and then an actor, and what the actor does
611
00:35:18,840 --> 00:35:20,720
is that basically finding,
612
00:35:20,720 --> 00:35:22,520
doing this optimization procedure,
613
00:35:22,520 --> 00:35:24,020
finding the best sequence of actions
614
00:35:24,020 --> 00:35:26,380
to satisfy the objectives.
615
00:35:26,380 --> 00:35:28,600
There is a mysterious configurator module at the top,
616
00:35:28,600 --> 00:35:29,780
I'm not going to explain,
617
00:35:29,780 --> 00:35:36,160
but basically its role would be to set the goal for the current situation.
618
00:35:36,160 --> 00:35:37,160
Okay.
619
00:35:37,160 --> 00:35:43,100
Okay. So perhaps with an architecture of this type,
620
00:35:43,100 --> 00:35:45,840
we will have systems that understand the physical world, etc.
621
00:35:45,840 --> 00:35:51,000
and have the system two ability of reasoning.
622
00:35:51,000 --> 00:35:55,460
But then how can we learn those world models from sensory inputs?
623
00:35:55,460 --> 00:35:57,520
That's really kind of the trick.
624
00:35:57,520 --> 00:36:00,280
And the answer to this is self-supervised learning.
625
00:36:00,280 --> 00:36:07,280
So self-supervised learning is something that has been extremely successful in the context of natural language understanding over the last few years.
626
00:36:07,280 --> 00:36:10,080
Basically it's completely dominating NLP.
627
00:36:10,080 --> 00:36:15,160
Every NLP system, LLM, etc., is trained with self-supervised learning.
628
00:36:15,160 --> 00:36:19,080
What does that mean? It means that there is no difference between inputs and outputs.
629
00:36:19,080 --> 00:36:27,480
Basically you take a big input, you corrupt it in some way, and you train some gigantic neural net to restore the full input if you want.
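(A minimal masking/denoising sketch of that corrupt-and-restore recipe, on toy data; names and numbers are illustrative assumptions:)

```python
# Self-supervised learning sketch: corrupt part of the input and train
# a network to restore the missing part. No labels involved.
import torch

net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 16))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(500):
    x = torch.randn(128, 16)                       # the "big input"
    mask = (torch.rand_like(x) < 0.3).float()      # corrupt ~30% of each input
    x_corrupt = x * (1 - mask)
    opt.zero_grad()
    loss = ((net(x_corrupt) - x) ** 2 * mask).mean()   # restore the hidden parts
    loss.backward()
    opt.step()
```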
630
00:36:27,480 --> 00:36:32,480
But, you know, it's not going to be sufficient.
631
00:36:32,480 --> 00:36:36,480
We're still missing something. Another piece of evidence
632
00:36:36,480 --> 00:36:39,480
that we're missing something big about intelligence is that,
633
00:36:39,480 --> 00:36:44,480
although we have LLMs that can pass the bar exam,
634
00:36:44,480 --> 00:36:50,480
or some high school exams, maybe not calculus one, I don't know,
635
00:36:50,480 --> 00:36:56,480
we still do not have domestic robots that can accomplish tasks
636
00:36:56,480 --> 00:37:00,480
A 10 year old can learn in one shot or zero shot.
637
00:37:00,480 --> 00:37:02,480
The first time you ask a 10 year old,
638
00:37:02,480 --> 00:37:04,480
clear the dinner table and fill up the dishwasher,
639
00:37:04,480 --> 00:37:06,480
they're able to do it.
640
00:37:06,480 --> 00:37:08,480
They don't need to learn.
641
00:37:08,480 --> 00:37:10,480
They can just plan.
642
00:37:12,480 --> 00:37:14,480
Any 17 year old can learn to drive a car
643
00:37:14,480 --> 00:37:16,480
in about 20 hours of practice.
644
00:37:16,480 --> 00:37:20,480
We still do not have level 5 autonomous self-driving cars.
645
00:37:20,480 --> 00:37:22,480
We have level 2, we have level 3,
646
00:37:22,480 --> 00:37:24,480
so they're partially autonomous.
647
00:37:24,480 --> 00:37:29,360
We have some level fives in limited areas, but they are very
648
00:37:29,360 --> 00:37:32,700
instrumented and they cheat. They have a map of the entire environment, so if you
649
00:37:32,700 --> 00:37:36,660
think about the Waymo cars, that's where they are. And they certainly don't
650
00:37:36,660 --> 00:37:42,120
need only 20 hours of practice to learn to drive. So that's what we're missing,
651
00:37:42,120 --> 00:37:47,140
something big. And that's really a new version of the Moravec paradox that, you
652
00:37:47,140 --> 00:37:50,880
know, things that are easy for humans are difficult for AI and vice versa. And
653
00:37:50,880 --> 00:37:54,760
we've tended to neglect the complexity
654
00:37:54,760 --> 00:37:55,940
of dealing with the real world,
655
00:37:55,940 --> 00:38:00,720
like perception and action, motor control.
656
00:38:00,720 --> 00:38:02,320
Perhaps a reason for this
657
00:38:02,320 --> 00:38:05,480
resides in this really simple calculation.
658
00:38:05,480 --> 00:38:07,560
An LLM, a typical LLM of today,
659
00:38:07,560 --> 00:38:10,060
is trained on 20 trillion tokens, okay?
660
00:38:10,060 --> 00:38:11,360
Two times 10 to the 13.
661
00:38:13,300 --> 00:38:17,140
That corresponds to a little less than 20 trillion words,
662
00:38:17,140 --> 00:38:18,560
because the token is a subword unit.
663
00:38:18,560 --> 00:38:21,860
Each token usually is represented by three bytes
664
00:38:21,860 --> 00:38:22,680
or something like that.
665
00:38:22,680 --> 00:38:25,920
So that is a volume of training data
666
00:38:25,920 --> 00:38:27,880
of six times 10 to the 13 bytes.
667
00:38:29,800 --> 00:38:31,420
That would take a few hundred thousand years
668
00:38:31,420 --> 00:38:33,320
for any of us to read through that material.
669
00:38:33,320 --> 00:38:36,800
It's basically the entire text
670
00:38:36,800 --> 00:38:38,400
available publicly on the internet.
671
00:38:39,940 --> 00:38:43,200
Now a human child, a four-year-old,
672
00:38:43,200 --> 00:38:46,280
has been awake a total of 16,000 hours.
673
00:38:46,280 --> 00:38:49,780
That's what developmental psychologists tell me.
674
00:38:50,640 --> 00:38:52,040
Which by the way is not a lot of data,
675
00:38:52,040 --> 00:38:54,040
that's 30 minutes of YouTube uploads.
676
00:38:56,940 --> 00:39:00,880
And I don't know how much Instagram, I should.
677
00:39:02,000 --> 00:39:05,140
We have two million optic nerve fibers
678
00:39:05,140 --> 00:39:07,640
going to our brain through our eyes.
679
00:39:07,640 --> 00:39:09,800
The amount of information getting to the eyes is enormous
680
00:39:09,800 --> 00:39:12,040
because we have 100 million photosensors
681
00:39:12,040 --> 00:39:13,540
or something like that.
682
00:39:13,540 --> 00:39:15,660
But it's being reduced, squeezed down,
683
00:39:15,660 --> 00:39:18,100
to the optic nerve before it gets to the brain.
684
00:39:18,100 --> 00:39:20,540
And that's about two million nerve fibers,
685
00:39:20,540 --> 00:39:23,300
each carrying a little less than one byte per second,
686
00:39:23,300 --> 00:39:25,020
a few bits per second, okay?
687
00:39:25,020 --> 00:39:30,020
So the volume of data there is about 10 to the 14 bytes,
688
00:39:32,040 --> 00:39:32,880
maybe a little less.
689
00:39:32,880 --> 00:39:36,000
It's the same order of magnitude as the biggest LLM.
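(The arithmetic from this passage, written out; the numbers are the ones quoted in the talk:)

```python
# Back-of-the-envelope comparison: LLM training text vs. a four-year-old's visual input.
llm_tokens = 20e12                      # ~20 trillion tokens
llm_bytes = llm_tokens * 3              # ~3 bytes per token -> ~6e13 bytes of text

hours_awake = 16_000                    # a four-year-old's total waking hours
optic_fibers = 2e6                      # optic nerve fibers to the brain
bytes_per_fiber_per_second = 1          # "a little less than one byte per second"
child_bytes = hours_awake * 3600 * optic_fibers * bytes_per_fiber_per_second

print(f"LLM text:     {llm_bytes:.0e} bytes")    # ~6e13
print(f"Child vision: {child_bytes:.0e} bytes")  # ~1e14, same order of magnitude
```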
690
00:39:36,000 --> 00:39:38,900
In four years, a child has seen more data
691
00:39:40,260 --> 00:39:43,960
about the real world than the biggest LLM trained
692
00:39:43,960 --> 00:39:46,580
on the entirety of all the publicly available texts
693
00:39:46,580 --> 00:39:48,660
on the internet that would take any of us,
694
00:39:50,360 --> 00:39:52,560
you know, hundreds of millennia to read through.
695
00:39:53,500 --> 00:39:55,240
So that tells you we're never gonna reach
696
00:39:55,240 --> 00:39:57,120
human level intelligence by training on text.
697
00:39:57,120 --> 00:39:58,300
It's just not happening.
698
00:39:59,360 --> 00:40:01,960
Okay, we need systems to really understand the world
699
00:40:01,960 --> 00:40:05,900
through high bandwidth input, like vision or touch.
700
00:40:05,900 --> 00:40:07,320
Okay, blind people can get smart
701
00:40:07,320 --> 00:40:09,220
because they have other senses.
702
00:40:11,780 --> 00:40:13,880
And in fact, you know, if you look at how long it takes
703
00:40:13,880 --> 00:40:21,300
For children, infants, to learn basic concepts about the real world, it takes several months.
704
00:40:21,940 --> 00:40:29,500
So a child will learn the difference between animate and inanimate objects within the first
705
00:40:29,500 --> 00:40:34,040
three months of life, opening their eyes. Object permanence appears really early,
706
00:40:34,440 --> 00:40:39,980
maybe around two months. Notions of solidity, rigidity, and stability and support,
707
00:40:39,980 --> 00:40:45,340
that's in the first six months. So this idea that, you know, this is not going to be stable is going
708
00:40:45,340 --> 00:40:53,900
to fall. And then notions of intuitive physics like gravity, inertia, conservation of momentum,
709
00:40:53,900 --> 00:40:59,260
this kind of stuff, that we have an intuitive level that any animal has too, that only pops
710
00:40:59,260 --> 00:41:04,380
up around nine months in baby humans, much earlier in baby goats and other animals.
711
00:41:09,980 --> 00:41:14,940
Most of that is through observation. There's not much interaction. You know, babies can hardly
712
00:41:14,940 --> 00:41:20,380
affect the world in the first four months of life. They do afterwards. If you put an eight-month-old
713
00:41:20,380 --> 00:41:24,140
baby on a chair with a bunch of toys, the first thing they'll do is throw the toys on the ground
714
00:41:24,140 --> 00:41:28,460
because that's how they do the experiment about gravity. You know, does it apply to this new thing
715
00:41:28,460 --> 00:41:36,220
I'm seeing on my chair? Okay, so there is a very natural idea which is to transpose the stuff that
716
00:41:36,220 --> 00:41:38,820
has worked for text to video.
717
00:41:38,820 --> 00:41:42,360
Can we just train a generative model to learn to predict video?
718
00:41:42,360 --> 00:41:44,760
And then that system will just understand how the world works,
719
00:41:44,760 --> 00:41:48,020
because it's going to be able to predict what happens in the video.
720
00:41:48,020 --> 00:41:53,640
And it's been a bit of my obsession in terms of research for
721
00:41:53,640 --> 00:41:56,760
the last at least 15 years, if not more.
722
00:41:56,760 --> 00:41:59,460
Okay, so this predates LLMs and everything.
723
00:41:59,460 --> 00:42:01,520
Okay, this idea that you can learn by prediction,
724
00:42:01,520 --> 00:42:03,120
it's a very old concept in neuroscience,
725
00:42:03,120 --> 00:42:05,720
but it's something I've really been sort of,
726
00:42:05,720 --> 00:42:08,480
working on with my students,
727
00:42:08,480 --> 00:42:11,520
collaborators for many years.
728
00:42:11,520 --> 00:42:15,280
And the idea of course is to use a generative model, right?
729
00:42:15,280 --> 00:42:18,640
Give to a system a piece of video,
730
00:42:19,240 --> 00:42:23,320
and then try to predict what's going to happen next in the video.
731
00:42:23,320 --> 00:42:28,000
Just the same way that we train LLMs to predict what happens next in the text.
732
00:42:28,800 --> 00:42:33,560
Perhaps if you want the system to be kind of a world model,
733
00:42:33,560 --> 00:42:37,180
you can feed this model with an action variable,
734
00:42:37,180 --> 00:42:38,680
the A variable here,
735
00:42:38,680 --> 00:42:42,040
which in this case would simply be masking essentially.
736
00:42:42,040 --> 00:42:43,780
So take a video, mask a piece of it,
737
00:42:43,780 --> 00:42:45,600
let's say the second half of it,
738
00:42:45,600 --> 00:42:47,080
run it through some big neural net and
739
00:42:47,080 --> 00:42:50,500
train it to predict the second half of the full video.
740
00:42:50,760 --> 00:42:54,740
We tried for a good part of 15 years,
741
00:42:54,740 --> 00:42:56,500
it doesn't work.
742
00:42:56,500 --> 00:42:59,620
It doesn't work because there are many,
743
00:42:59,620 --> 00:43:02,000
many things that can happen in a video and a system of
744
00:43:02,000 --> 00:43:04,000
This type basically will just predict one thing.
745
00:43:05,700 --> 00:43:07,880
And so the problem with a system
746
00:43:07,880 --> 00:43:10,240
that predicts only one thing is that
747
00:43:10,240 --> 00:43:12,840
the best thing you can predict is the average
748
00:43:12,840 --> 00:43:15,640
of all the possible, plausible things that may happen.
749
00:43:15,640 --> 00:43:16,620
And you see an example here,
750
00:43:16,620 --> 00:43:19,060
that's an early paper in video prediction,
751
00:43:19,060 --> 00:43:20,720
trying to predict what's gonna happen
752
00:43:20,720 --> 00:43:24,820
in this really short six-frame video with this little girl.
753
00:43:24,820 --> 00:43:27,400
The first four frames are observed,
754
00:43:27,400 --> 00:43:30,460
the last two are predicted, and what you see is a blurry mess,
755
00:43:30,460 --> 00:43:31,640
because the system really cannot predict
756
00:43:31,640 --> 00:43:34,200
what's going to happen, so it predicts the average.
757
00:43:34,400 --> 00:43:36,840
You see this at the bottom as well,
758
00:43:36,840 --> 00:43:38,880
if you can play that video again.
759
00:43:38,880 --> 00:43:41,400
This is a top-down view of a highway,
760
00:43:41,400 --> 00:43:43,600
and the green things are like cars.
761
00:43:43,600 --> 00:43:46,800
The second column are predictions made by
762
00:43:46,800 --> 00:43:49,280
a neural net trying to predict what's going to happen in that video.
763
00:43:49,280 --> 00:43:52,960
You see those blurry extending cars
764
00:43:52,960 --> 00:43:55,720
because it really cannot predict what's happening.
765
00:43:55,720 --> 00:43:58,840
So the columns on the right are
766
00:43:58,840 --> 00:44:01,160
a different model that has a latent variable which is
767
00:44:01,160 --> 00:44:04,760
designed to capture the variability between the potential predictions,
768
00:44:04,760 --> 00:44:07,200
and those predictions are not blurry.
769
00:44:07,200 --> 00:44:14,180
So we thought that we had a good solution to that problem five years ago with latent variables,
770
00:44:14,180 --> 00:44:16,580
but it turns out to not work for real video.
771
00:44:16,580 --> 00:44:18,200
It works for simple videos like this one,
772
00:44:18,200 --> 00:44:20,980
but it doesn't for real world.
773
00:44:20,980 --> 00:44:24,120
So we can't train this thing on video.
774
00:44:24,120 --> 00:44:26,880
So the solution to that problem is interesting,
775
00:44:26,880 --> 00:44:30,060
is to abandon the whole idea of generative models.
776
00:44:30,060 --> 00:44:37,060
Everybody is talking about generative models like it's the new Messiah.
777
00:44:37,060 --> 00:44:41,420
What I'm telling you today is forget about generative models.
778
00:44:41,420 --> 00:44:45,120
Okay. The solution to that problem,
779
00:44:45,120 --> 00:44:48,280
we think, is what we call joint embedding architectures,
780
00:44:48,280 --> 00:44:51,680
or more precisely joint embedding predictive architectures.
781
00:44:51,680 --> 00:44:53,840
This is really the way to build a world model.
782
00:44:53,840 --> 00:44:56,180
So what does this consist of?
783
00:44:56,180 --> 00:44:58,000
It's you take that video,
784
00:44:58,000 --> 00:44:59,900
you corrupt it, you mask a piece of it,
785
00:44:59,900 --> 00:45:01,720
for example, okay?
786
00:45:01,720 --> 00:45:04,060
And you run it through a big neural net,
787
00:45:04,060 --> 00:45:05,920
but what the big neural net is trained to do
788
00:45:05,920 --> 00:45:08,520
is not predict all the pixels in the video,
789
00:45:08,520 --> 00:45:11,320
it's trained to predict an abstract representation
790
00:45:12,400 --> 00:45:14,360
of the future of that video, okay?
791
00:45:14,360 --> 00:45:16,280
So you take the original video,
792
00:45:16,280 --> 00:45:17,460
you take the masked one,
793
00:45:17,460 --> 00:45:18,960
you run them through encoders,
794
00:45:18,960 --> 00:45:21,520
now you have abstract representations
795
00:45:21,520 --> 00:45:24,920
of the full video and the corrupted one,
796
00:45:24,920 --> 00:45:26,820
and you train a predictor
797
00:45:26,820 --> 00:45:28,540
to predict the representation of the full video,
798
00:45:28,540 --> 00:45:30,900
from the representation of the corrupted one.
799
00:45:32,020 --> 00:45:32,820
Okay.
800
00:45:32,820 --> 00:45:33,700
This is called JEPA.
801
00:45:33,700 --> 00:45:35,660
That means Joint Embedding Predictive Architecture.
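A minimal sketch of that training setup in Python pseudocode; encoder, target_encoder, predictor and mask are placeholder modules, not the actual models from the papers, and as explained below this loss alone is not enough, because it can collapse:

import torch
import torch.nn.functional as F

def jepa_step(video, encoder, target_encoder, predictor, mask):
    # Representation of the full video (no gradient through this branch here).
    with torch.no_grad():
        s_y = target_encoder(video)
    # Representation of the corrupted (masked) video.
    s_x = encoder(mask(video))
    # Predict the representation of the full video from the corrupted one.
    s_pred = predictor(s_x)
    # Prediction error measured in representation space, not pixel space.
    return F.mse_loss(s_pred, s_y)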
802
00:45:35,660 --> 00:45:37,580
There's a bunch of papers from the last few years
803
00:45:37,580 --> 00:45:41,340
that my collaborators and I have published on this idea.
804
00:45:41,340 --> 00:45:43,780
And it solves the problem of having to predict
805
00:45:43,780 --> 00:45:47,100
all kinds of details that you really cannot predict.
806
00:45:47,100 --> 00:45:49,580
So if I were to take a video of this crowd,
807
00:45:50,980 --> 00:45:52,940
in fact I can take a video of this crowd.
808
00:45:55,020 --> 00:45:57,380
Okay, now I'm taking a video of you guys.
809
00:45:57,380 --> 00:46:01,460
Okay, and I slowly turn my head towards the right.
810
00:46:03,440 --> 00:46:04,780
Gonna shut down the video now.
811
00:46:06,740 --> 00:46:09,860
Certainly, a prediction system can predict this is a room,
812
00:46:09,860 --> 00:46:13,280
it's a conference room, there's people sitting everywhere.
813
00:46:13,280 --> 00:46:16,420
It may not be able to predict that all the chairs are full.
814
00:46:16,420 --> 00:46:18,000
It certainly cannot predict
815
00:46:18,000 --> 00:46:20,080
what every single one of you looks like.
816
00:46:20,080 --> 00:46:21,060
There's absolutely no way.
817
00:46:21,060 --> 00:46:22,800
It cannot predict what the texture on the wall
818
00:46:22,800 --> 00:46:26,860
is going to be, or even the color of the side.
819
00:46:26,860 --> 00:46:30,200
So there are things that are just completely unpredictable.
820
00:46:30,200 --> 00:46:31,620
You don't have the information to do it.
821
00:46:31,620 --> 00:46:34,260
And if you train a system to predict all those details,
822
00:46:34,260 --> 00:46:36,240
it's going to spend all of its resources
823
00:46:36,240 --> 00:46:37,660
predicting irrelevant details.
824
00:46:38,540 --> 00:46:40,220
So what a jet pad does when you train it,
825
00:46:40,220 --> 00:46:41,980
and I'm gonna tell you how you train this,
826
00:46:41,980 --> 00:46:45,700
is that it finds a trade-off between extracting
827
00:46:45,700 --> 00:46:48,040
as much information as possible from the input,
828
00:46:48,040 --> 00:46:50,340
but only extracting things that it can predict.
829
00:46:53,260 --> 00:46:55,100
And there is an issue with those kinds of architectures.
830
00:46:55,100 --> 00:47:01,100
Here is a contrast between the generative architecture that tried to reproduce Y directly
831
00:47:01,100 --> 00:47:06,640
and the joint embedding architecture which only tries to do prediction in representation
832
00:47:06,640 --> 00:47:09,560
space on the right.
833
00:47:09,560 --> 00:47:14,480
There's a problem with the joint embedding architecture and this is why we've only been
834
00:47:14,480 --> 00:47:16,100
working on this in recent years.
835
00:47:16,100 --> 00:47:21,100
It is the fact that if you just train the parameters of those neural nets to minimize
836
00:47:21,100 --> 00:47:23,940
the prediction error, it collapses.
837
00:47:23,940 --> 00:47:27,340
It basically ignores the inputs X and Y.
838
00:47:27,340 --> 00:47:29,400
It makes SX and SY,
839
00:47:29,400 --> 00:47:32,260
the two representations, constant,
840
00:47:32,260 --> 00:47:34,180
and then the prediction problem is trivial.
841
00:47:37,220 --> 00:47:39,200
And that's not a good thing.
842
00:47:39,200 --> 00:47:43,240
So that's an example of this energy-based framework
843
00:47:43,240 --> 00:47:44,960
that I was describing earlier.
844
00:47:46,060 --> 00:47:50,200
It gives zero energy to every pair of XY, essentially.
845
00:47:50,200 --> 00:47:51,420
But what you want is zero energy
846
00:47:51,420 --> 00:47:53,160
for the pairs of XY you're training on,
847
00:47:53,160 --> 00:47:55,940
but higher energy for things that you don't train it on,
848
00:47:55,940 --> 00:47:57,820
and that's the hard part.
849
00:47:57,820 --> 00:48:01,780
So next I'm going to explain how you make that possible,
850
00:48:01,780 --> 00:48:05,280
how you make sure that the pairs of XY
851
00:48:05,280 --> 00:48:07,480
that are not compatible have a higher energy.
852
00:48:09,740 --> 00:48:12,140
There's variations of those architectures,
853
00:48:12,140 --> 00:48:14,220
some of which can have latent variables
854
00:48:14,220 --> 00:48:17,140
or have the action conditioning if you want
855
00:48:17,140 --> 00:48:18,680
it to be a world model.
856
00:48:19,720 --> 00:48:22,240
And there's been papers on this for many years now.
857
00:48:22,240 --> 00:48:24,200
The earliest joint embedding architecture actually
858
00:48:24,200 --> 00:48:25,320
is from the early 90s.
859
00:48:25,320 --> 00:48:28,000
It's a paper of mine about Siamese networks.
860
00:48:30,060 --> 00:48:31,720
But we're gonna have to train
861
00:48:31,720 --> 00:48:34,240
those sort of generic architectures.
862
00:48:34,240 --> 00:48:36,400
So how do we do this?
863
00:48:37,440 --> 00:48:38,680
So remember this picture, right?
864
00:48:38,680 --> 00:48:41,260
We wanna give low energy to stuff that are compatible,
865
00:48:41,260 --> 00:48:43,260
things that we observe, training sets,
866
00:48:43,260 --> 00:48:44,940
training samples, X and Y,
867
00:48:44,940 --> 00:48:46,440
higher energy to everything else.
868
00:48:47,740 --> 00:48:48,860
So there are two sets of methods,
869
00:48:48,860 --> 00:48:51,840
contrastive methods and what I call regularized methods.
870
00:48:51,840 --> 00:49:00,640
So contrastive methods consist in basically generating contrastive pairs of X and Y that are not in the training set.
871
00:49:01,520 --> 00:49:04,560
So pick an X and pick another Y that's not compatible with it.
872
00:49:04,560 --> 00:49:06,920
And that gives you one of those green dots that you see flashing.
873
00:49:08,040 --> 00:49:13,860
And your loss function is going to consist in pushing down on the energy of the blue dots, which are the training samples,
874
00:49:14,040 --> 00:49:17,760
and then pushing up on the energy of the green dots, which are those contrastive samples.
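As a hedged illustration of that push-down, push-up idea, here is a simple margin-based contrastive loss in Python; the energy function and the margin value are placeholders:

import torch

def contrastive_loss(energy, x, y_pos, y_neg, margin=1.0):
    # Push down the energy of an observed (x, y) pair from the training set...
    e_pos = energy(x, y_pos)
    # ...and push up the energy of a mismatched, contrastive pair, up to a margin.
    e_neg = energy(x, y_neg)
    return e_pos + torch.clamp(margin - e_neg, min=0.0)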
875
00:49:17,760 --> 00:49:24,120
Okay, this is a good idea and there's a bunch of algorithms that people have used to train this.
876
00:49:24,120 --> 00:49:29,200
Some of them, for example, for joint embedding between images and text, are things like Clip
877
00:49:29,200 --> 00:49:36,960
from OpenAI. They use contrastive methods. SimCLR, from a team at Google that includes Geoff
878
00:49:36,960 --> 00:49:43,980
Hinton. And then Siamese nets back from the 90s that I used to advocate. The issue with contrastive
879
00:49:43,980 --> 00:49:47,980
methods is that the intrinsic dimension of the embedding that they produce is
880
00:49:47,980 --> 00:49:53,460
usually fairly low and so the representations that are learned by it
881
00:49:53,460 --> 00:49:57,480
are kind of degenerate a little bit. So I prefer the regularized method. What is
882
00:49:57,480 --> 00:50:02,100
the idea behind the regularized method? The idea is that you minimize the volume
883
00:50:02,100 --> 00:50:07,980
of space that can take low energy. So you have some sort of regularizer term in
884
00:50:07,980 --> 00:50:11,580
your loss function and that term basically measures the volume of stuff
885
00:50:11,580 --> 00:50:17,180
that has low energy and you try to minimize it. So what that means is that whenever you push down
886
00:50:17,180 --> 00:50:22,140
the energy of one region of that space, the rest has to go up because there's only a limited amount
887
00:50:22,140 --> 00:50:29,740
of low energy volume to go around. And you know that sounds a little abstract and mysterious,
888
00:50:29,740 --> 00:50:35,660
but in practice the way you do this is there's like a handful of methods to do this,
889
00:50:35,660 --> 00:50:39,660
which I'm going to explain in a second. Before that I'm going to tell you how you test how well
890
00:50:39,660 --> 00:50:40,840
those systems work, right?
891
00:50:40,840 --> 00:50:43,640
So in the context of image recognition,
892
00:50:43,640 --> 00:50:46,240
you give two images that you know are the same image,
893
00:50:46,240 --> 00:50:48,740
either, so you take an image and you corrupt it,
894
00:50:48,740 --> 00:50:50,980
or you transform it in some way.
895
00:50:50,980 --> 00:50:52,660
You change the scale, you rotate it,
896
00:50:52,660 --> 00:50:53,820
you change the colors a little bit,
897
00:50:53,820 --> 00:50:56,060
maybe you mask parts of it, okay?
898
00:50:56,060 --> 00:50:58,840
And then you train an encoder and a predictor
899
00:50:58,840 --> 00:51:01,020
so that the predictor predicts the representation
900
00:51:01,020 --> 00:51:03,220
of the full image from the representation
901
00:51:03,220 --> 00:51:05,940
of the corrupted one.
902
00:51:05,940 --> 00:51:07,600
And then once the system is trained,
903
00:51:07,600 --> 00:51:09,020
you chop off the predictor,
904
00:51:09,020 --> 00:51:11,520
you use the encoder as input to a classifier,
905
00:51:11,520 --> 00:51:14,440
and you train a supervised classifier to do things
906
00:51:14,440 --> 00:51:17,020
like object recognition or something of that type.
907
00:51:17,020 --> 00:51:19,940
So that's a way of measuring the quality of the features
908
00:51:19,940 --> 00:51:24,060
that have been learned by the system.
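A rough sketch of that evaluation protocol in Python: freeze the self-supervised encoder, drop the predictor, and train only a small classifier head on top (names and sizes are illustrative):

import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes):
    # Keep the pre-trained encoder frozen; only the linear head is trained.
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Linear(feature_dim, num_classes)
    return nn.Sequential(encoder, head)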
909
00:51:24,060 --> 00:51:28,560
There's been a number of papers on this,
910
00:51:28,560 --> 00:51:33,220
and what has been transpiring is that those methods work really
911
00:51:33,220 --> 00:51:36,180
well to train a system to extract
912
00:51:36,180 --> 00:51:37,660
generic features from images,
913
00:51:37,660 --> 00:51:39,660
the joint embedding architectures.
914
00:51:39,660 --> 00:51:42,020
There's been a lot of work also on
915
00:51:42,020 --> 00:51:45,080
generative architectures like autoencoders,
916
00:51:45,080 --> 00:51:47,500
variational autoencoders, VQVAEs,
917
00:51:47,500 --> 00:51:49,620
masked autoencoders, denoising autoencoders,
918
00:51:49,620 --> 00:51:51,780
all kinds of techniques of this type that basically,
919
00:51:51,780 --> 00:51:53,860
you give a corrupted version of an image,
920
00:51:53,860 --> 00:51:55,460
and then you train the system to
921
00:51:55,460 --> 00:51:57,780
recover the full image at the pixel level.
922
00:51:57,780 --> 00:52:00,180
Those methods do not work nearly as
923
00:52:00,180 --> 00:52:02,260
well as the joint embedding methods.
924
00:52:02,260 --> 00:52:04,700
We discovered this five or six years ago,
925
00:52:04,700 --> 00:52:09,460
not just us, but there was an accumulating amount of evidence showing that joint embedding
926
00:52:09,460 --> 00:52:17,300
was really superior to reconstruction-based systems, that is, to generative architectures.
927
00:52:17,300 --> 00:52:20,760
And at the time, the methods for training were only contrastive.
928
00:52:20,760 --> 00:52:25,260
But now we've found some other techniques, and one technique in particular that, or one
929
00:52:25,260 --> 00:52:30,940
set of techniques that attempt to maximize some measure of information, information content
930
00:52:30,940 --> 00:52:32,640
coming out of the encoder.
931
00:52:32,640 --> 00:52:36,320
So one of the criteria used for training is this minus i,
932
00:52:36,320 --> 00:52:38,040
the measure of information content.
933
00:52:38,040 --> 00:52:39,660
Since we minimize cost function,
934
00:52:39,660 --> 00:52:40,720
there is a minus sign in front,
935
00:52:40,720 --> 00:52:42,760
so you maximize information content.
936
00:52:42,760 --> 00:52:44,500
How do we do this?
937
00:52:44,500 --> 00:52:47,060
So one simple trick that we've used is something called
938
00:52:47,060 --> 00:52:49,640
variance covariance regularization.
939
00:52:49,640 --> 00:52:52,540
Or in the case where you don't have a predictor,
940
00:52:52,540 --> 00:52:55,880
it's VICReg, variance-invariance-covariance regularization.
941
00:52:55,880 --> 00:52:57,900
And there the idea is you take
942
00:52:57,900 --> 00:53:00,260
the representation coming out of the encoder and you say,
943
00:53:00,260 --> 00:53:04,900
First of all, you should not collapse to a fixed set of values.
944
00:53:04,900 --> 00:53:07,500
So the variance of each variable coming out of
945
00:53:07,500 --> 00:53:10,600
the encoder should be at least one, let's say.
946
00:53:10,600 --> 00:53:13,400
Okay. Now the system can still cheat and not produce
947
00:53:13,400 --> 00:53:16,020
very informative outputs by basically producing
948
00:53:16,020 --> 00:53:18,860
the same variable or very correlated variable for
949
00:53:18,860 --> 00:53:22,620
all the dimensions of the output representation.
950
00:53:22,620 --> 00:53:26,900
So another criterion tries to decorrelate those variables.
951
00:53:26,900 --> 00:53:29,760
And in fact, we use a trick where we expand the dimension.
952
00:53:29,760 --> 00:53:32,200
We take the representation, run it through a neural net
953
00:53:32,200 --> 00:53:33,680
that expands the dimension,
954
00:53:33,680 --> 00:53:35,000
and then decorrelate in that space,
955
00:53:35,000 --> 00:53:37,000
and that has the effect of actually making
956
00:53:37,000 --> 00:53:39,620
the original variable more independent of each other,
957
00:53:39,620 --> 00:53:41,080
not just uncorrelated.
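A hedged sketch of the variance and covariance terms just described (the target variance of one follows the talk; everything else, including the weighting, is illustrative):

import torch

def variance_covariance_penalty(z, eps=1e-4):
    # z: batch of embeddings, shape (N, D).
    z = z - z.mean(dim=0)
    # Variance term: each output dimension should keep a standard deviation of at least 1.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()
    # Covariance term: push off-diagonal covariances toward zero to decorrelate dimensions.
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]
    return var_loss + cov_loss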
958
00:53:41,960 --> 00:53:43,920
So it's a bit of a hack,
959
00:53:43,920 --> 00:53:46,000
because what we're trying to do here
960
00:53:46,000 --> 00:53:47,680
is maximizing information content,
961
00:53:47,680 --> 00:53:49,620
and what we should have to be able to do this
962
00:53:49,620 --> 00:53:52,280
is a lower bound on information content.
963
00:53:52,280 --> 00:53:54,040
But what I'm describing here
964
00:53:54,040 --> 00:53:56,680
is an upper bound on information content.
965
00:53:56,680 --> 00:53:58,280
So we're maximizing an upper bound,
966
00:53:58,280 --> 00:54:05,720
Then we cross our fingers that the actual information content will follow.
967
00:54:05,720 --> 00:54:06,520
Okay.
968
00:54:06,520 --> 00:54:09,720
And it works.
969
00:54:09,720 --> 00:54:13,880
So that's one set of techniques.
970
00:54:13,880 --> 00:54:15,160
I'm going to skip the theory.
971
00:54:15,160 --> 00:54:18,200
There is another set of methods called distillation,
972
00:54:18,200 --> 00:54:19,880
and those have proved to be extremely efficient.
973
00:54:21,080 --> 00:54:25,160
And there, it's another hack, and we only have partial,
974
00:54:25,160 --> 00:54:29,400
at least in my opinion, a partial theoretical understanding of why it works, but it does work.
975
00:54:30,760 --> 00:54:35,640
In there we share the weights between the two encoders with a technique called exponential
976
00:54:35,640 --> 00:54:40,440
moving average. So one encoder has the weights that are basically a temporal average of the
977
00:54:40,440 --> 00:54:44,680
weights of the other one for mysterious reasons. And we train the whole thing but we don't back
978
00:54:44,680 --> 00:54:50,280
propagate gradients to the one that gets this moving average, the one that gets the full input.
979
00:54:50,280 --> 00:54:54,180
And somehow this does not collapse and it works really well.
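A minimal sketch of that exponential-moving-average weight sharing in Python (the 0.999 decay rate is just an illustrative value):

import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, decay=0.999):
    # The target encoder's weights are a running temporal average of the online
    # encoder's weights; no gradients are backpropagated into the target branch.
    for p_online, p_target in zip(online_encoder.parameters(),
                                  target_encoder.parameters()):
        p_target.mul_(decay).add_(p_online, alpha=1.0 - decay)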
980
00:54:54,180 --> 00:54:56,020
It's called a distillation method.
981
00:54:56,020 --> 00:54:58,020
There's various versions of it.
982
00:54:58,020 --> 00:55:05,780
SimSiam, BYOL from DeepMind, Dinov2 from my colleagues in Paris at Meta, I-JEPA and V-JEPA
983
00:55:05,780 --> 00:55:09,780
from the people at Meta who work with me.
984
00:55:09,780 --> 00:55:10,780
This works amazingly well.
985
00:55:10,780 --> 00:55:16,300
It works so well, in fact, the Dinov2 version works incredibly well.
986
00:55:16,300 --> 00:55:18,780
It's a generic feature extractor for images.
987
00:55:18,780 --> 00:55:21,580
If you have some random computer vision problem,
988
00:55:21,580 --> 00:55:23,540
and no one has trained a system for that,
989
00:55:23,540 --> 00:55:26,020
just download Dinov2, it will extract features
990
00:55:26,020 --> 00:55:28,280
from your images, and then train a very simple
991
00:55:28,280 --> 00:55:30,780
classifier head on top of it with just a few examples,
992
00:55:30,780 --> 00:55:33,960
and it will likely solve your vision problem.
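A hedged sketch of that recipe in Python, assuming the publicly documented torch.hub entry point for Dinov2; the ViT-S/14 variant and the 10-class head are illustrative choices:

import torch
import torch.nn as nn

# Load a pre-trained Dinov2 backbone and freeze it.
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small classifier head trained on just a few labeled examples.
head = nn.Linear(384, 10)  # 384 = ViT-S/14 feature size, 10 = example number of classes

def classify(images):
    with torch.no_grad():
        feats = backbone(images)  # generic features from the frozen encoder
    return head(feats)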
993
00:55:33,960 --> 00:55:36,320
An example of this is, I'm not gonna bore you
994
00:55:36,320 --> 00:55:39,200
with tables of results, but example of this
995
00:55:39,200 --> 00:55:42,180
is a collaborator at Meta, Camille Couprie,
996
00:55:42,180 --> 00:55:47,020
who got satellite imaging images of the entire world,
997
00:55:47,020 --> 00:55:50,020
you know, in various frequency bands.
998
00:55:50,020 --> 00:55:52,020
And she also got LiDAR data.
999
00:55:52,020 --> 00:55:55,020
So the LiDAR data gives you, for a little piece of the world,
1000
00:55:55,020 --> 00:56:00,020
LiDAR data gives you the height of the canopy of vegetation.
1001
00:56:00,020 --> 00:56:05,020
And so she took the Dino features, applied them to the entire world,
1002
00:56:05,020 --> 00:56:09,020
and then used a trained classifier that was trained on the LiDAR data,
1003
00:56:09,020 --> 00:56:12,020
on the small amount of data, but applied it to the entire world.
1004
00:56:12,020 --> 00:56:16,020
And now what she has is an estimate of the height of the canopy for the entire Earth.
1005
00:56:16,020 --> 00:56:23,220
What that allows you to compute is an estimate of the amount of carbon captured in vegetation,
1006
00:56:23,220 --> 00:56:29,220
which is a very interesting piece of data for climate change. So that's an example. There's
1007
00:56:29,220 --> 00:56:34,340
other examples in medical imaging, in biological imaging, where Dino has been used for some success.
1008
00:56:35,060 --> 00:56:39,940
But this distillation method called I-JEPA that I briefly described earlier works extremely well
1009
00:56:39,940 --> 00:56:45,620
to learn visual features. Again, I'm not going to bore you with details. It's really much better than
1010
00:56:45,620 --> 00:56:48,560
the methods that are based on reconstruction.
1011
00:56:48,560 --> 00:56:52,860
Of course, the next thing we did was try to apply this to video.
1012
00:56:52,860 --> 00:56:54,120
Can we apply this to video?
1013
00:56:54,120 --> 00:56:56,360
So it turns out if you train a system of this type to make
1014
00:56:56,360 --> 00:56:57,660
temporal prediction in video,
1015
00:56:57,660 --> 00:56:58,880
it doesn't work very well.
1016
00:56:58,880 --> 00:57:02,420
You have to make it do spatial prediction,
1017
00:57:02,420 --> 00:57:04,000
which is very strange.
1018
00:57:04,000 --> 00:57:06,840
There, the features that are learned are really great.
1019
00:57:06,840 --> 00:57:10,640
You get good performance for that system when you use the
1020
00:57:10,640 --> 00:57:13,560
representation to classify actions in
1021
00:57:13,560 --> 00:57:16,060
videos and things of that type.
1022
00:57:17,120 --> 00:57:21,540
We even have tests now that the paper is being completed
1023
00:57:21,540 --> 00:57:24,520
that show that those systems have some level of common sense
1024
00:57:24,520 --> 00:57:25,460
and physical intuition.
1025
00:57:25,460 --> 00:57:27,880
You show them videos that are impossible because,
1026
00:57:27,880 --> 00:57:30,260
for example, an object spontaneously disappears
1027
00:57:30,260 --> 00:57:31,300
or something like that.
1028
00:57:31,300 --> 00:57:32,940
They say, whoa, something strange happened.
1029
00:57:32,940 --> 00:57:34,160
Their prediction error goes up.
1030
00:57:34,160 --> 00:57:37,660
And so those systems really are able to learn
1031
00:57:37,660 --> 00:57:39,960
some basic concepts about the world.
1032
00:57:39,960 --> 00:57:50,280
But then the last thing I want to say is that systems of this type are ones that basically
1033
00:57:50,280 --> 00:57:53,240
we can use to train a world model and we can use those world models for planning.
1034
00:57:53,240 --> 00:57:54,240
So this is new.
1035
00:57:54,240 --> 00:57:57,240
I haven't presented this yet.
1036
00:57:57,240 --> 00:58:03,760
The paper has been submitted, but this is the first time I talk publicly in English about
1037
00:58:03,760 --> 00:58:04,760
it.
1038
00:58:09,960 --> 00:58:16,680
the preview. So this is work by a PhD student at NYU,
1039
00:58:16,680 --> 00:58:21,880
Gaoyue Zhou, who is co-advised by myself and Lerrel Pinto, and she did a lot of this work
1040
00:58:21,880 --> 00:58:31,080
while she was an intern at Meta, and Hengkai Pan, who's also a student. And the basic architecture
1041
00:58:31,080 --> 00:58:37,240
here is that we use the features from Dinov2, okay, pre-trained, and we train a world model on
1042
00:58:37,240 --> 00:58:39,440
top of it, which is action-conditioned.
1043
00:58:39,440 --> 00:58:44,240
So basically, we take a picture of the world,
1044
00:58:44,240 --> 00:58:46,740
or the environment, whatever it is,
1045
00:58:46,740 --> 00:58:50,540
and then feed an action that we're going to take in
1046
00:58:50,540 --> 00:58:53,240
that environment and then observe
1047
00:58:53,240 --> 00:58:57,540
the result in the environment in terms of Dino features,
1048
00:58:57,540 --> 00:59:00,500
and then train the predictor to predict
1049
00:59:00,500 --> 00:59:03,860
the representation after the action as
1050
00:59:03,860 --> 00:59:05,380
a function of the input,
1051
00:59:05,380 --> 00:59:07,700
the previous state and the action.
1052
00:59:07,700 --> 00:59:10,220
So the predictor function takes
1053
00:59:10,220 --> 00:59:11,860
the previous state and the action and predicts
1054
00:59:11,860 --> 00:59:13,700
the next state essentially.
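A hedged sketch of that action-conditioned training step in Python; encoder stands for the frozen Dinov2 features and predictor for the world model being trained, and both names are placeholders:

import torch.nn.functional as F

def world_model_step(obs, action, next_obs, encoder, predictor):
    s_t = encoder(obs)            # frozen, pre-trained features of the current frame
    s_next = encoder(next_obs)    # features of the frame observed after the action
    s_pred = predictor(s_t, action)             # action-conditioned prediction
    return F.mse_loss(s_pred, s_next.detach())  # train only the predictor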
1055
00:59:13,700 --> 00:59:15,420
Then once we have that system,
1056
00:59:15,420 --> 00:59:18,620
we can do this optimization procedure I was telling you about,
1057
00:59:18,620 --> 00:59:22,460
to plan a sequence of actions to arrive at a particular result.
1058
00:59:22,460 --> 00:59:25,700
The cost is simply a Euclidean distance
1059
00:59:25,700 --> 00:59:27,220
between a predicted
1060
00:59:27,220 --> 00:59:29,700
end state, and a target state.
1061
00:59:29,700 --> 00:59:32,060
The way we compute the target state is that we show
1062
00:59:32,060 --> 00:59:33,740
an image to the encoder and we tell it,
1063
00:59:33,740 --> 00:59:37,220
you know, this representation is your target representation.
1064
00:59:37,220 --> 00:59:40,060
Take a sequence of actions so that the predicted state
1065
00:59:40,060 --> 00:59:42,540
matches that state.
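A rough sketch of that planning-by-optimization loop in Python, here with simple random shooting over candidate action sequences; the talk does not specify the optimizer, so this is just one plausible choice:

import torch

def plan(s_start, s_target, predictor, horizon=10, n_candidates=256, action_dim=4):
    # Sample candidate action sequences and roll the world model forward for each.
    candidates = torch.randn(n_candidates, horizon, action_dim)
    costs = []
    for seq in candidates:
        s = s_start
        for a in seq:
            s = predictor(s, a)
        # Cost: Euclidean distance between predicted end state and target representation.
        costs.append(torch.norm(s - s_target))
    best = torch.stack(costs).argmin()
    return candidates[best]   # executed open loop, as in the demo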
1066
00:59:43,480 --> 00:59:45,360
So we've tried this on several tasks.
1067
00:59:45,360 --> 00:59:46,940
So one of them is just, you know,
1068
00:59:46,940 --> 00:59:49,620
moving a dot through a simple maze.
1069
00:59:49,620 --> 00:59:52,260
Another one is moving a little,
1070
00:59:52,260 --> 00:59:53,500
let me repeat this video,
1071
00:59:54,760 --> 00:59:59,260
moving a little T object by pushing on it in various places
1072
00:59:59,260 --> 01:00:01,180
so that it's in a particular position.
1073
01:00:01,180 --> 01:00:02,640
That's called a push T problem.
1074
01:00:02,640 --> 01:00:07,640
And then other task of navigating through the environment,
1075
01:00:07,640 --> 01:00:09,200
going through a door in a wall,
1076
01:00:09,200 --> 01:00:12,480
and then pushing on sort of deformable objects
1077
01:00:12,480 --> 01:00:14,220
so they adopt a particular shape.
1078
01:00:14,220 --> 01:00:16,100
Okay, and I'll show you a more impressive example
1079
01:00:16,100 --> 01:00:16,860
in this one.
1080
01:00:16,860 --> 01:00:20,660
Okay, so the task, we can collect artificial data
1081
01:00:20,660 --> 01:00:23,760
because those are virtual environments
1082
01:00:23,760 --> 01:00:25,160
that we can simulate.
1083
01:00:25,160 --> 01:00:26,780
And then we experimented with various systems
1084
01:00:26,780 --> 01:00:30,640
that have been proposed in the past to solve that problem.
1085
01:00:30,640 --> 01:00:36,000
DreamerV3 is probably one of the most advanced ones, from DeepMind,
1086
01:00:36,000 --> 01:00:39,000
from Danijar Hafner at DeepMind.
1087
01:00:39,000 --> 01:00:42,200
And what you see here is visualization through
1088
01:00:42,200 --> 01:00:45,600
a decoder of the predicted state for a sequence of actions.
1089
01:00:45,600 --> 01:00:47,240
So at the top is a ground truth.
1090
01:00:47,240 --> 01:00:53,240
You execute a sequence of actions and see the result in the simulator.
1091
01:00:53,240 --> 01:00:58,280
And then each row is the result of a prediction by one of those models.
1092
01:00:58,280 --> 01:01:01,280
And what you see is some predictions become blurry,
1093
01:01:01,280 --> 01:01:04,280
some predictions become kind of weird.
1094
01:01:04,280 --> 01:01:08,280
Ours is pretty good, Iris is pretty good,
1095
01:01:08,280 --> 01:01:12,280
Dreamer v3 not so great.
1096
01:01:12,280 --> 01:01:14,280
This is the most interesting task.
1097
01:01:14,280 --> 01:01:17,280
It's called the granular environment,
1098
01:01:17,280 --> 01:01:21,280
and it's basically a bunch of blue chips on the table.
1099
01:01:21,280 --> 01:01:24,280
And an action is a motion by a robot arm,
1100
01:01:24,280 --> 01:01:26,280
which goes down on the table,
1101
01:01:26,280 --> 01:01:29,740
moves by some Delta X, Delta Y, and then lifts.
1102
01:01:29,740 --> 01:01:31,620
That's an action, it's four numbers.
1103
01:01:31,620 --> 01:01:38,520
X, Y, where you touch the table, Delta X, Delta Y, lift.
1104
01:01:38,520 --> 01:01:41,600
Okay. The question is,
1105
01:01:41,600 --> 01:01:45,000
so you can train a world model by just putting
1106
01:01:45,000 --> 01:01:47,180
a bunch of chips in random position and then taking
1107
01:01:47,180 --> 01:01:49,000
a random action and then observing the result,
1108
01:01:49,000 --> 01:01:50,980
and you train the predictor this way.
1109
01:01:50,980 --> 01:01:53,960
Once the predictor is trained,
1110
01:01:53,960 --> 01:01:58,280
So those are results of various techniques of planning.
1111
01:01:58,280 --> 01:02:00,960
So you can use the world model for planning a sequence of
1112
01:02:00,960 --> 01:02:03,400
actions to arrive at a particular goal.
1113
01:02:03,400 --> 01:02:05,680
So this is for the point maze environment,
1114
01:02:05,680 --> 01:02:08,380
but you might want to look at the other one, the granular.
1115
01:02:08,380 --> 01:02:14,580
So this is the, what's called a chamfer distance between
1116
01:02:14,580 --> 01:02:22,900
the end state in the image space of all the grains, if you want,
1117
01:02:22,900 --> 01:02:27,320
and the target measured through a chamfer distance.
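For reference, a minimal chamfer distance between two point sets in Python (a generic textbook definition, not necessarily the paper's exact implementation):

import torch

def chamfer_distance(a, b):
    # a: (N, 2) and b: (M, 2) point sets, e.g. grain positions in image space.
    d = torch.cdist(a, b)   # pairwise distances, shape (N, M)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()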
1118
01:02:27,320 --> 01:02:29,080
And what you see is that our method,
1119
01:02:29,080 --> 01:02:30,340
which is the blue one,
1120
01:02:30,340 --> 01:02:32,740
has much, much lower final error
1121
01:02:32,740 --> 01:02:34,760
than the other methods that we compared it with,
1122
01:02:34,760 --> 01:02:37,180
DreamerV3 and TD-MPC2.
1123
01:02:37,180 --> 01:02:40,660
And TD-MPC2 is a method that actually requires,
1124
01:02:40,660 --> 01:02:42,100
needs to be task specific,
1125
01:02:42,100 --> 01:02:45,400
so it's not as general as Dino World Model.
1126
01:02:46,700 --> 01:02:49,820
So here's a little demo of the system in action
1127
01:02:49,820 --> 01:02:52,120
for the various tasks.
1128
01:02:52,120 --> 01:02:53,800
Let me play this again.
1129
01:02:53,800 --> 01:02:55,380
Look at the push T.
1130
01:02:55,380 --> 01:03:00,380
Okay, so you see the dot moving in discrete steps
1131
01:03:01,580 --> 01:03:04,840
because for every tick of the simulation,
1132
01:03:04,840 --> 01:03:07,540
the same action is repeated five times.
1133
01:03:07,540 --> 01:03:09,560
So the actions are only produced
1134
01:03:09,560 --> 01:03:11,220
like every five time steps.
1135
01:03:11,220 --> 01:03:13,240
But it gets to the target.
1136
01:03:13,240 --> 01:03:17,480
The target is represented on the right,
1137
01:03:17,480 --> 01:03:19,300
and it actually kind of presents.
1138
01:03:19,300 --> 01:03:22,500
So this is for the granular in particular.
1139
01:03:22,500 --> 01:03:26,400
So the target is represented at the right.
1140
01:03:26,400 --> 01:03:28,600
And let me play this again.
1141
01:03:28,600 --> 01:03:32,020
We start from a random configuration of the chips,
1142
01:03:32,020 --> 01:03:33,620
and the system kind of pushes
1143
01:03:33,620 --> 01:03:35,080
the chips using those actions.
1144
01:03:35,080 --> 01:03:35,980
You don't see the actions,
1145
01:03:35,980 --> 01:03:38,940
but you only see the result by pushing
1146
01:03:38,940 --> 01:03:40,220
them so that they look like a square.
1147
01:03:40,220 --> 01:03:42,380
Now what's interesting about this is that it's
1148
01:03:42,380 --> 01:03:43,680
completely open loop.
1149
01:03:43,680 --> 01:03:48,300
So the system basically looks at the initial condition,
1150
01:03:48,300 --> 01:03:49,820
imagines the sequence of actions,
1151
01:03:49,820 --> 01:03:52,280
and then executes those actions blindly.
1152
01:03:52,280 --> 01:03:54,080
What you see here is a result of
1153
01:03:54,080 --> 01:03:56,500
executing those actions, open loop,
1154
01:03:56,500 --> 01:03:58,360
closing your eyes.
1155
01:03:58,360 --> 01:04:00,360
It's pretty cool.
1156
01:04:00,360 --> 01:04:03,220
All right, coming to the end now.
1157
01:04:03,220 --> 01:04:07,160
So I have five recommendations.
1158
01:04:07,160 --> 01:04:12,180
Abandon generative models in favor of those JEPA architectures.
1159
01:04:12,180 --> 01:04:14,580
Abandon probabilistic models in favor of
1160
01:04:14,580 --> 01:04:15,620
those energy-based models.
1161
01:04:15,620 --> 01:04:17,900
So something I haven't said is that in this context,
1162
01:04:17,900 --> 01:04:20,460
you can't really do probabilistic modeling,
1163
01:04:20,460 --> 01:04:21,360
it's intractable.
1164
01:04:22,640 --> 01:04:24,820
Abandon contrastive methods
1165
01:04:24,820 --> 01:04:28,720
in favor of those regularized methods.
1166
01:04:28,720 --> 01:04:30,200
And of course, abandon reinforcement learning,
1167
01:04:30,200 --> 01:04:32,140
but that I've been saying for 10 years.
1168
01:04:33,480 --> 01:04:36,180
And so if you're interested in human level AI,
1169
01:04:36,180 --> 01:04:37,940
don't work on LLMs.
1170
01:04:37,940 --> 01:04:40,800
You're a grad student, you're studying a PhD in AI,
1171
01:04:40,800 --> 01:04:42,220
do not work on LLMs.
1172
01:04:44,180 --> 01:04:45,240
It's not interesting.
1173
01:04:45,240 --> 01:04:51,640
I mean, first of all, it's not that interesting because it's not going to be the next revolution in AI.
1174
01:04:51,640 --> 01:04:55,640
It's not going to help systems understand the physical world and everything.
1175
01:04:55,640 --> 01:05:05,840
But it's also a very dangerous thing to do because there are enormous teams in industry with billions of dollars of resources working on this.
1176
01:05:05,840 --> 01:05:09,040
There's nothing you can bring to the table. Absolutely nothing.
1177
01:05:15,240 --> 01:05:20,780
working on LLMs, but the lifetime of this is going to be three years.
1178
01:05:20,780 --> 01:05:26,420
Three, five years from now, my prediction is no one in their right mind would use LLMs
1179
01:05:26,420 --> 01:05:27,880
in the form that they exist today.
1180
01:05:27,880 --> 01:05:30,360
I mean, they would be used as a component of a bigger system,
1181
01:05:30,360 --> 01:05:33,820
but the main architecture would be different.
1182
01:05:35,760 --> 01:05:38,320
There's a lot of problems to solve with this,
1183
01:05:38,320 --> 01:05:41,700
which I kind of swept under the rug,
1184
01:05:41,700 --> 01:05:43,700
and I'm not going to go through the laundry list,
1185
01:05:43,700 --> 01:05:45,740
but we don't know how to do hierarchical planning,
1186
01:05:45,740 --> 01:05:47,880
for example. So here is a good PhD topic,
1187
01:05:47,880 --> 01:05:49,520
if you're interested in this.
1188
01:05:49,520 --> 01:05:54,240
Just try to crack the nut of hierarchical planning.
1189
01:05:56,340 --> 01:05:59,340
There's all kinds of foundational,
1190
01:05:59,340 --> 01:06:01,720
theoretical issues with what I talked about here,
1191
01:06:01,720 --> 01:06:03,600
and energy-based models and things like this.
1192
01:06:03,600 --> 01:06:05,840
How to design objectives for SSL so
1193
01:06:05,840 --> 01:06:08,600
that the systems are driven to learn the right thing.
1194
01:06:08,600 --> 01:06:11,760
I've only talked about information maximization,
1195
01:06:11,760 --> 01:06:13,560
but there is all kinds of other things.
1196
01:06:13,560 --> 01:06:17,720
There's a little bit of RL you might need to do for adjusting the world model in real time.
1197
01:06:18,840 --> 01:06:24,200
But then, if we succeed in this program, which may take the better part of the next decade,
1198
01:06:25,080 --> 01:06:34,200
we might have virtual assistants that have human-level AI. What I think, though, is that those
1199
01:06:34,200 --> 01:06:39,080
platforms need to be open source. And so this is the political part of the talk, which is going to
1200
01:06:39,080 --> 01:06:45,480
be very short. You know, those platforms, LLMs or future AI
1201
01:06:45,480 --> 01:06:51,800
systems are incredibly expensive to train, the basic foundation models. So only a few companies
1202
01:06:51,800 --> 01:06:58,120
in the world can do it. And the problem that we're facing now is that the publicly available
1203
01:06:58,120 --> 01:07:04,360
data on the internet is not what we want, because it's mostly English. I mean, there is other
1204
01:07:04,360 --> 01:07:09,400
languages obviously, but for various reasons, regulatory reasons, all kinds of problems,
1205
01:07:09,400 --> 01:07:17,860
you do not have access to all the data in the world. Of every language in the world,
1206
01:07:17,860 --> 01:07:23,740
there are 4,000 languages or something like that that people use. All the cultures, all
1207
01:07:23,740 --> 01:07:31,140
the value systems, all the centers of interest, you just don't have all the data available.
1208
01:07:31,140 --> 01:07:35,740
So the future is one in which those systems would not be trained by a single company.
1209
01:07:35,740 --> 01:07:40,980
They will be trained in a distributed manner so that you all have big data centers in various
1210
01:07:40,980 --> 01:07:41,980
parts of the world.
1211
01:07:41,980 --> 01:07:47,640
They have access to local data, but they all contribute to training a large model that
1212
01:07:47,640 --> 01:07:54,140
will be worldwide and will eventually constitute the repository of all human knowledge.
1213
01:07:54,140 --> 01:07:58,460
This is a very lofty goal to try to attain, right?
1214
01:07:58,460 --> 01:08:02,040
Having a system that basically constitutes a repository of all human knowledge, but it's
1215
01:08:02,040 --> 01:08:06,900
a system you can talk to, you can ask questions to, it can serve as a tutor, as a professor
1216
01:08:06,900 --> 01:08:13,140
maybe, put a lot of us here out of a job.
1217
01:08:13,140 --> 01:08:15,460
It's a thing that we should really work towards.
1218
01:08:15,460 --> 01:08:21,600
It will amplify human intelligence, improve rational thought perhaps.
1219
01:08:21,600 --> 01:08:23,080
But it needs to be diverse also.
1220
01:08:28,460 --> 01:08:31,060
a handful of companies on the West Coast of the US.
1221
01:08:31,060 --> 01:08:32,260
That's completely unacceptable
1222
01:08:32,260 --> 01:08:34,060
to a lot of governments in the world,
1223
01:08:35,060 --> 01:08:37,040
democratic governments, right?
1224
01:08:37,040 --> 01:08:39,680
You need a diversity of AI assistants
1225
01:08:39,680 --> 01:08:41,420
for the same reason you need a diversity
1226
01:08:41,420 --> 01:08:44,720
of newspapers, magazines and the press.
1227
01:08:44,720 --> 01:08:47,360
You need a free press with diversity.
1228
01:08:48,380 --> 01:08:51,980
And we need free AI with diversity as well.
1229
01:08:58,460 --> 01:09:01,780
in AI, some of them are worried about the dangers
1230
01:09:01,780 --> 01:09:04,880
of making AI technology available to everyone.
1231
01:09:04,880 --> 01:09:09,100
I think the benefits far outweigh the dangers and the risks.
1232
01:09:10,040 --> 01:09:13,940
In fact, I think the main risk of AI in the future
1233
01:09:13,940 --> 01:09:17,720
is what will happen if AI is controlled
1234
01:09:17,720 --> 01:09:19,900
by a small number of commercial companies
1235
01:09:19,900 --> 01:09:22,900
that don't reveal how their AI systems work.
1236
01:09:22,900 --> 01:09:24,340
I think that's very dangerous.
1237
01:09:24,340 --> 01:09:33,940
So attempts to minimize the risk of AI by basically making open source AI illegal,
1238
01:09:33,940 --> 01:09:39,700
I think are completely misdirected and will actually reach the opposite result of the intended one.
1239
01:09:39,700 --> 01:09:42,580
It will make AI less safe.
1240
01:09:42,580 --> 01:09:50,780
So open research, open source AI must not be regulated out of existence.
1241
01:09:50,780 --> 01:09:53,460
A lot of politicians need to understand this.
1242
01:09:53,660 --> 01:09:57,700
There's an alliance of various companies that are really kind of subscribed to this model,
1243
01:09:57,700 --> 01:10:03,620
Meta, IBM, Intel, Sony, a lot of people in academia, a lot of startups, venture capitalists,
1244
01:10:03,620 --> 01:10:10,800
etc. And then a few companies who are kind of advocating for the opposite. That will
1245
01:10:10,800 --> 01:10:18,820
remain nameless. So, you know, perhaps if we do it right, we'll have systems that will
1246
01:10:18,820 --> 01:10:23,420
amplify human intelligence, as I was saying at the beginning of the talk. And this may
1247
01:10:23,420 --> 01:10:29,980
Bring about a new renaissance for humanity, you know, similar to what happened with the printing press in the 15th century.
1248
01:10:30,800 --> 01:10:35,180
And on this cosmic conclusion, I will thank you very much.
1249
01:10:47,380 --> 01:10:50,720
And by the way, these are pictures I took from my backyard in New Jersey.
1250
01:10:50,720 --> 01:10:59,040
Thank you, Yann. So Yann will take a few questions now. And for people who are leaving,
1251
01:10:59,320 --> 01:11:04,760
please leave from the Broadway entrance. Do not leave from the campus entrance. But yeah,
1252
01:11:04,760 --> 01:11:10,820
questions? Please line up on the mics if you have questions.
1253
01:11:20,720 --> 01:11:29,260
No sound.
1254
01:11:30,400 --> 01:11:31,240
Yeah, it works.
1255
01:11:38,760 --> 01:11:39,320
Hi.
1256
01:11:40,200 --> 01:11:42,260
Yann, thank you so much for coming.
1257
01:11:42,900 --> 01:11:46,960
I wanted to ask for 3D vision models,
1258
01:11:46,960 --> 01:11:48,600
what do you see business applications
1259
01:11:48,600 --> 01:11:50,180
in the next seven, eight years?
1260
01:11:50,180 --> 01:11:56,240
Yeah, I haven't talked about 3D.
1261
01:11:56,240 --> 01:12:01,220
I mean, some of my colleagues think there is something very special about 3D.
1262
01:12:01,220 --> 01:12:03,320
I don't necessarily think that's the case.
1263
01:12:03,320 --> 01:12:08,840
I mean, we're hoping that the next generation of these V-JEPA models will basically understand
1264
01:12:08,840 --> 01:12:12,720
the fact that the world is three-dimensional and there are objects in front of others and
1265
01:12:12,720 --> 01:12:13,720
things like that.
1266
01:12:13,720 --> 01:12:19,580
Now, there are applications for which you need 3D inference and reconstruction in 3D
1267
01:12:19,580 --> 01:12:22,520
If you want to have virtual objects in virtual environments
1268
01:12:22,520 --> 01:12:24,500
and things like this.
1269
01:12:24,500 --> 01:12:26,780
But frankly, I'm not a specialist.
1270
01:12:26,780 --> 01:12:29,000
I think there are specialists of that question here
1271
01:12:29,000 --> 01:12:31,380
at Columbia, actually.
1272
01:12:31,380 --> 01:12:32,700
Just one more question.
1273
01:12:32,700 --> 01:12:37,100
Do you really see V-JEPA models and Dinov2
1274
01:12:37,100 --> 01:12:40,080
having hierarchical planning like the kind you mentioned
1275
01:12:40,080 --> 01:12:41,240
earlier?
1276
01:12:41,240 --> 01:12:43,740
So it doesn't exist yet.
1277
01:12:43,740 --> 01:12:47,340
So this is something we're working on.
1278
01:12:47,340 --> 01:12:52,780
I hope we will get some results about this, you know, in the next year or two, something like that.
1279
01:12:53,580 --> 01:13:01,660
Thank you so much. Okay one question here. You talked about sorry
1280
01:13:06,220 --> 01:13:11,900
you talked about the benefits of AI and you think it's more beneficial than there are risks to it
1281
01:13:17,340 --> 01:13:25,900
West Coast, control the most advanced models. So why do you feel that the benefits outweigh the risks?
1282
01:13:25,900 --> 01:13:32,060
So that's not entirely true. Meta actually does not subscribe to this model that AI should be
1283
01:13:32,060 --> 01:13:38,620
proprietary and kept in its own hands. It releases a series of models called Llama, right? So Llama 1,
1284
01:13:38,620 --> 01:13:45,660
2, 3, 3.1, 3.2, which are state of the art or really close to it or better in certain measures.
1285
01:13:45,660 --> 01:13:51,900
And this is open source. It can be used freely by a lot of people around the world. It can be
1286
01:13:51,900 --> 01:14:01,740
fine-tuned for various languages or vertical applications. And it's... Llama 3 has been
1287
01:14:01,740 --> 01:14:06,140
downloaded, I think, 400 million times or something like this. It's just insane. And
1288
01:14:06,140 --> 01:14:15,580
every single company I talk to has either deployed it or is about to deploy products based on Llama.
1289
01:14:15,580 --> 01:14:22,580
There are people in Africa who are using it and training it to provide medical assistance, for example.
1290
01:14:22,580 --> 01:14:31,580
There's people in India that Meta is collaborating with so that future versions of Llama will speak all 22 official languages of India,
1291
01:14:31,580 --> 01:14:34,580
and perhaps at some point all the 1500 dialects or whatever.
1292
01:14:34,580 --> 01:14:40,580
So, you know, I think that's the way to make AI widely accessible to everyone in the world.
1293
01:14:40,580 --> 01:14:44,180
I mean, I'm really happy to be part of that effort.
1294
01:14:44,180 --> 01:14:47,800
I really wouldn't like to be part of kind of a closed effort.
1295
01:14:51,840 --> 01:14:52,340
Hi, Yann.
1296
01:14:52,340 --> 01:14:53,820
My name is Srikant.
1297
01:14:53,820 --> 01:14:55,560
I want to ask you, I'm curious to know
1298
01:14:55,560 --> 01:14:58,400
what you think about the capabilities of time series
1299
01:14:58,400 --> 01:15:01,460
foundation models, because I see that Amazon, Google,
1300
01:15:01,460 --> 01:15:04,400
Meta, everyone's trying to work in that domain.
1301
01:15:04,400 --> 01:15:07,100
But to me, intuitively, it feels like time series predictions
1302
01:15:07,100 --> 01:15:09,760
are a harder problem than language modeling.
1303
01:15:09,760 --> 01:15:12,860
What are your thoughts on the capabilities and limitations on this?
1304
01:15:12,860 --> 01:15:14,060
Yeah, okay.
1305
01:15:14,060 --> 01:15:18,580
I think you put your finger on an important point, which I forgot to mention.
1306
01:15:18,580 --> 01:15:24,580
The reason why language modeling works, why those predictive models that predict the next
1307
01:15:24,580 --> 01:15:27,980
word, the reason why they work for natural language and they don't work for images and
1308
01:15:27,980 --> 01:15:31,680
video, for example, is because language is discrete.
1309
01:15:31,680 --> 01:15:38,280
So to represent an uncertainty in the prediction when you have a discrete choice with a few possible outcomes,
1310
01:15:38,280 --> 01:15:40,280
It's easy.
1311
01:15:40,280 --> 01:15:45,100
You just produce a distribution, a probability distribution over all the possible outcomes.
1312
01:15:45,100 --> 01:15:46,100
And this is how LLMs work.
1313
01:15:46,100 --> 01:15:47,100
They are trained.
1314
01:15:47,100 --> 01:15:51,160
They actually produce a distribution over the next token.
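A small illustration of why the discrete case is easy, in Python (the vocabulary size is illustrative):

import torch

logits = torch.randn(50_000)              # one score per token in the vocabulary
probs = torch.softmax(logits, dim=-1)     # a proper distribution over the next token
# Over a finite vocabulary this is trivial; there is no equally simple way to write
# down a distribution over all possible next video frames.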
1315
01:15:51,160 --> 01:15:57,300
You can't do this with continuous variables, particularly high dimensional continuous variables
1316
01:15:57,300 --> 01:16:00,180
like video pixels.
1317
01:16:00,180 --> 01:16:06,480
So there, we're not able to represent distributions efficiently in high dimensional continuous
1318
01:16:06,480 --> 01:16:10,800
spaces beyond like simple ones like Gaussians, right?
1319
01:16:10,800 --> 01:16:15,720
So my answer to this is don't do it.
1320
01:16:15,720 --> 01:16:18,100
Do prediction in representation space.
1321
01:16:18,100 --> 01:16:20,160
And then if you need to have actual prediction
1322
01:16:20,160 --> 01:16:22,540
of the time series, have a decoder that does that
1323
01:16:22,540 --> 01:16:23,100
separately.
1324
01:16:23,100 --> 01:16:26,100
But actually training a system to predict
1325
01:16:26,100 --> 01:16:28,680
high dimensional continuous thing by regression
1326
01:16:28,680 --> 01:16:32,380
when you have uncertainty simply doesn't work.
1327
01:16:32,380 --> 01:16:34,760
That's the evidence we have by trying to,
1328
01:16:34,760 --> 01:16:38,200
There was a huge project at Meta called Video MAE.
1329
01:16:38,200 --> 01:16:40,280
So the idea was, you know, take a video,
1330
01:16:40,280 --> 01:16:41,920
mask some parts of it,
1331
01:16:41,920 --> 01:16:43,300
and then train some gigantic neural net
1332
01:16:43,300 --> 01:16:45,060
to predict the parts that are missing.
1333
01:16:45,060 --> 01:16:46,520
It was complete failure.
1334
01:16:46,520 --> 01:16:49,700
We abandoned that project.
1335
01:16:49,700 --> 01:16:52,440
We canceled it, because it was going nowhere, okay?
1336
01:16:52,440 --> 01:16:54,860
And this was really very large scale.
1337
01:16:54,860 --> 01:16:56,860
A lot of computing resources were devoted to this.
1338
01:16:56,860 --> 01:16:58,580
It just didn't work.
1339
01:16:58,580 --> 01:17:01,040
The JEPA stuff, though, does work.
1340
01:17:01,040 --> 01:17:03,660
So my hunch is that for time series,
1341
01:17:03,660 --> 01:17:07,900
there's probably a way to use a kind of similar idea.
1342
01:17:07,900 --> 01:17:08,780
SPEAKER 1
1343
01:17:08,780 --> 01:17:09,380
OK, thank you.
1344
01:17:12,580 --> 01:17:12,620
SPEAKER 1
1345
01:17:12,620 --> 01:17:14,540
Great talk.
1346
01:17:14,540 --> 01:17:17,200
So my question is, I think I agree with your framework
1347
01:17:17,200 --> 01:17:18,840
for you have some world model and you
1348
01:17:18,840 --> 01:17:20,900
want to optimize via that world model
1349
01:17:20,900 --> 01:17:22,200
and how you train the world model.
1350
01:17:22,200 --> 01:17:24,980
But my question is, how do you get intelligence
1351
01:17:24,980 --> 01:17:28,720
when the world model is inconsistent with the truth?
1352
01:17:28,720 --> 01:17:31,440
So as an example, let's say your world model only
1353
01:17:31,440 --> 01:17:33,240
has classical mechanics.
1354
01:17:33,240 --> 01:17:35,460
how do you discover special relativity?
1355
01:17:35,460 --> 01:17:38,220
Humans have somehow broken that boundary,
1356
01:17:38,220 --> 01:17:39,440
but I don't know how you do that
1357
01:17:39,440 --> 01:17:42,360
when your world model is only based on observed data.
1358
01:17:43,320 --> 01:17:45,260
Well, I mean, the type of world model
1359
01:17:45,260 --> 01:17:46,840
we're talking about here is,
1360
01:17:48,120 --> 01:17:50,700
what I would be happy with before I retire
1361
01:17:50,700 --> 01:17:52,800
or before my brain turns into béchamel sauce
1362
01:17:52,800 --> 01:17:57,800
is world models that are of the level of complexity
1363
01:17:58,020 --> 01:18:01,860
of a cat's world model, right, of the physical world,
1364
01:18:01,860 --> 01:18:03,400
Which is pretty sophisticated actually.
1365
01:18:03,400 --> 01:18:06,260
I mean, you can plan really complex actions.
1366
01:18:06,260 --> 01:18:07,360
So that's what we're talking about.
1367
01:18:07,360 --> 01:18:09,400
Now, you put your finger on something
1368
01:18:09,400 --> 01:18:11,340
that's really interesting,
1369
01:18:12,540 --> 01:18:15,940
which is a philosophical motivation behind JEPA,
1370
01:18:15,940 --> 01:18:20,140
and this idea that you need to lift the abstraction level
1371
01:18:21,580 --> 01:18:23,340
to be able to make predictions, right?
1372
01:18:24,540 --> 01:18:27,420
You cannot make predictions at the level of observation.
1373
01:18:27,420 --> 01:18:31,620
You have to find a good representation of reality
1374
01:18:31,620 --> 01:18:33,260
within which you can make predictions.
1375
01:18:33,260 --> 01:18:35,800
And that's the hardest problem really,
1376
01:18:35,800 --> 01:18:37,620
is to find that good representation space
1377
01:18:37,620 --> 01:18:38,940
that allows you to make predictions.
1378
01:18:38,940 --> 01:18:40,480
We do this all the time in science.
1379
01:18:40,480 --> 01:18:42,900
We do this all the time in everyday life without realizing it,
1380
01:18:42,900 --> 01:18:45,160
but we do this all the time in science.
1381
01:18:46,620 --> 01:18:47,760
If we didn't need to do this,
1382
01:18:47,760 --> 01:18:52,760
we could explain human society with quantum field theory.
1383
01:18:54,180 --> 01:18:55,020
Right?
1384
01:18:55,020 --> 01:18:55,860
Right.
1385
01:18:55,860 --> 01:18:56,880
But we can't, right?
1386
01:18:56,880 --> 01:19:00,340
Because the gap, you know, in abstraction is so large, right?
1387
01:19:00,340 --> 01:19:05,380
So we go from quantum field theory to particle physics and from particles to atoms and from
1388
01:19:05,380 --> 01:19:11,060
atoms to molecules, from molecules to materials, and then chemistry and, you know, blah blah blah,
1389
01:19:11,060 --> 01:19:17,780
right? And we go up the chain of abstraction so that at some level we have a representation
1390
01:19:17,780 --> 01:19:24,340
of physical objects and Newtonian mechanics, and for, you know, large scales it would be relativity.
1391
01:19:30,340 --> 01:19:34,880
human behavior, animal behavior, ecology, you know, this kind of stuff, right?
1392
01:19:34,880 --> 01:19:38,340
So we have all those levels of representation,
1393
01:19:38,340 --> 01:19:42,400
for which the crucial insight is to actually find a representation.
1394
01:19:42,400 --> 01:19:45,780
For example, let's take a planet. Let's take Jupiter, okay?
1395
01:19:45,780 --> 01:19:47,700
Jupiter is an incredibly complex object.
1396
01:19:47,700 --> 01:19:52,080
It's got, you know, complicated composition.
1397
01:19:52,080 --> 01:19:55,480
It's got weather. It's got all kinds of gases swirling around.
1398
01:19:55,480 --> 01:20:00,480
And, you know, very complex object, right?
1399
01:20:02,180 --> 01:20:05,840
Now, who would have thought that the only thing you need
1400
01:20:05,840 --> 01:20:10,420
to predict the trajectory of Jupiter is six numbers?
1401
01:20:10,420 --> 01:20:13,300
You need three positions, three velocities,
1402
01:20:13,300 --> 01:20:16,480
and you can predict the trajectory of Jupiter for centuries.
1403
01:20:18,460 --> 01:20:19,920
You know, that's a problem of learning
1404
01:20:19,920 --> 01:20:22,140
a good representation, right?
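As a toy illustration of that six-number state, three positions and three velocities, the following Python sketch propagates a Jupiter-like orbit under plain Newtonian gravity; the constants are rough textbook values, not a precise ephemeris.

import numpy as np

GM_SUN = 1.327e20                      # m^3/s^2, gravitational parameter of the Sun
r = np.array([7.78e11, 0.0, 0.0])      # position in meters (about 5.2 AU)
v = np.array([0.0, 1.31e4, 0.0])       # velocity in m/s (about 13 km/s)
dt = 86400.0                           # one-day time step

def step(r, v, dt):
    # One leapfrog (velocity Verlet) step of the two-body problem
    a = -GM_SUN * r / np.linalg.norm(r) ** 3
    v_half = v + 0.5 * dt * a
    r_new = r + dt * v_half
    a_new = -GM_SUN * r_new / np.linalg.norm(r_new) ** 3
    v_new = v_half + 0.5 * dt * a_new
    return r_new, v_new

for _ in range(4380):                  # roughly one Jovian year of daily steps
    r, v = step(r, v, dt)

Everything else about the planet, its weather, its composition, is abstracted away; the six numbers are the representation in which prediction becomes possible.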
1405
01:20:22,140 --> 01:20:23,740
So, is the proposal essentially
1406
01:20:23,740 --> 01:20:26,500
to do this hierarchical planning with hierarchical world
1407
01:20:26,500 --> 01:20:27,500
models as well?
1408
01:20:27,500 --> 01:20:28,000
Yeah.
1409
01:20:28,000 --> 01:20:28,500
OK.
1410
01:20:28,500 --> 01:20:29,000
Exactly.
1411
01:20:29,000 --> 01:20:29,500
Awesome.
1412
01:20:29,500 --> 01:20:31,900
Have a system that can build multiple levels of abstractions.
1413
01:20:31,900 --> 01:20:32,480
Great.
1414
01:20:32,480 --> 01:20:32,940
Thanks.
1415
01:20:32,940 --> 01:20:36,040
Which is really the idea behind deep learning, by the way.
1416
01:20:36,040 --> 01:20:36,540
OK.
1417
01:20:36,540 --> 01:20:38,120
We'll have two more questions, then we'll stop.
1418
01:20:38,120 --> 01:20:40,600
So we'll take one from there and one from there.
1419
01:20:40,600 --> 01:20:40,880
Yeah.
1420
01:20:40,880 --> 01:20:41,700
Hi.
1421
01:20:41,700 --> 01:20:45,640
My question is about the one type of generative model
1422
01:20:45,640 --> 01:20:49,820
that you haven't covered, which is the diffusion models, which
1423
01:20:49,820 --> 01:20:56,560
I believe are quite different from the generative models
1424
01:20:56,560 --> 01:21:00,200
that you mentioned, because they are more implicit and
1425
01:21:00,200 --> 01:21:04,360
they don't predict the explicit probability distribution
1426
01:21:04,360 --> 01:21:09,180
like the LLMs or VAEs or all the other generative ones that you
1427
01:21:09,180 --> 01:21:13,880
mentioned. What is your perspective on the potential of
1428
01:21:13,880 --> 01:21:19,820
those models, especially since they have some connection
1429
01:21:19,820 --> 01:21:26,540
to hierarchical planning, as you said, because when you use them for generating an image, like
1430
01:21:26,540 --> 01:21:32,580
in the first few time steps, it actually generates like very high level details and then on the
1431
01:21:32,580 --> 01:21:37,060
later time steps, it fills in the details, like the smaller details.
1432
01:21:37,060 --> 01:21:38,060
Yeah.
1433
01:21:38,060 --> 01:21:39,060
Okay.
1434
01:21:39,060 --> 01:21:41,480
So diffusion models can be seen as generative or not.
1435
01:21:41,480 --> 01:21:45,840
But the way to understand them, I think, is the following.
1436
01:21:45,840 --> 01:21:53,240
In a space of representations or images or whatever it is, you have, let's say, a manifold
1437
01:21:53,240 --> 01:21:56,080
of data.
1438
01:21:56,080 --> 01:22:00,420
Let's say natural images if you want to train an image generation system.
1439
01:22:00,420 --> 01:22:04,860
Or perhaps representations that are extracted by an encoder of the type that I talked about.
1440
01:22:04,860 --> 01:22:10,800
And that is basically a subset within the full space.
1441
01:22:10,800 --> 01:22:14,660
What a diffusion model does is that you give it a random vector in that space and it will
1442
01:22:14,660 --> 01:22:16,980
bring you back to that manifold.
1443
01:22:16,980 --> 01:22:21,800
Okay, and it will do this by training a vector field
1444
01:22:21,800 --> 01:22:26,300
so that at every location, random location in that space,
1445
01:22:26,300 --> 01:22:29,560
there is a vector that basically takes you back
1446
01:22:29,560 --> 01:22:32,900
to that manifold, perhaps in multiple steps.
1447
01:22:32,900 --> 01:22:34,760
Okay, that's what it does in the end.
1448
01:22:34,760 --> 01:22:36,860
It's trained in a particular way by reversing,
1449
01:22:38,960 --> 01:22:43,620
you know, a noisification chain, but that's what it does.
1450
01:22:43,620 --> 01:22:49,740
Now that's actually a particular way of implementing energy-based models of the types that I described.
1451
01:22:49,740 --> 01:22:53,520
Because you can think of this manifold of data as being kind of the minimum of an energy
1452
01:22:53,520 --> 01:22:54,600
function.
1453
01:22:54,600 --> 01:22:58,740
And if you had an energy function, you could compute the gradient of that energy function,
1454
01:22:58,740 --> 01:23:02,900
and that gradient of the energy function will take you back to that manifold.
1455
01:23:02,900 --> 01:23:09,020
So that's the energy-based view of inference or denoising or restoration or whatever you
1456
01:23:09,020 --> 01:23:11,760
want.
1457
01:23:11,760 --> 01:23:17,760
And diffusion models basically instead of having an energy function that you compute
1458
01:23:17,760 --> 01:23:22,440
the gradient of, they directly learn the vector field that basically would be the gradient
1459
01:23:22,440 --> 01:23:24,440
of that energy function.
1460
01:23:24,440 --> 01:23:25,620
That's the way to understand it.
1461
01:23:25,620 --> 01:23:27,880
So it's not disconnected from what I talked about.
1462
01:23:27,880 --> 01:23:32,160
It can be used usefully in the context of what I talked about.
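A hedged toy sketch of that picture, with a two-dimensional "manifold" (points on a circle) standing in for natural images: a small network is trained so that at noisy points it outputs a vector pointing back toward the data, and generation is just following that learned field from a random start. This is a caricature of denoising-style training, not a production diffusion model.

import math
import torch
import torch.nn as nn

# Toy 2-D "data manifold": points on the unit circle.
def sample_data(n):
    theta = 2 * math.pi * torch.rand(n)
    return torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)

field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

for _ in range(2000):
    x = sample_data(256)
    noise = 0.3 * torch.randn_like(x)
    # Learn a vector field that points from the noisy point back toward the data,
    # i.e. an estimate of the negative gradient of an implicit energy function.
    loss = nn.functional.mse_loss(field(x + noise), -noise)
    opt.zero_grad(); loss.backward(); opt.step()

# "Generation": start from a random vector and follow the field back to the manifold.
x = 2.0 * torch.randn(1, 2)
for _ in range(50):
    with torch.no_grad():
        x = x + 0.2 * field(x)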
1463
01:23:32,160 --> 01:23:34,520
And what about nature?
1464
01:23:34,520 --> 01:23:35,520
Yeah.
1465
01:23:35,520 --> 01:23:38,520
My name is Leon.
1466
01:23:38,520 --> 01:23:40,960
I really want to thank you for the talk.
1467
01:23:40,960 --> 01:23:44,960
My question was sort of about these world models you were talking about,
1468
01:23:44,960 --> 01:23:50,960
especially in terms of trying to get to actual like cat level or animal type intelligence.
1469
01:23:50,960 --> 01:24:01,960
So like in terms of like a giraffe, as soon as it's born, something is in its mind that lets it be able to run or even walk within moments.
1470
01:24:01,960 --> 01:24:07,960
And I think part of it is because the world model it has constrains the type of actions it takes,
1471
01:24:07,960 --> 01:24:14,460
That kind of thing seems to be what you're almost doing with DINO, trying to do these rule-based approaches.
1472
01:24:15,000 --> 01:24:18,900
I'm just wondering how do these world models evolve over time?
1473
01:24:19,040 --> 01:24:22,680
Like how much variability does it have?
1474
01:24:22,680 --> 01:24:30,000
Yeah, I mean so clearly you need the world model to be adjusted as you go, right?
1475
01:24:37,960 --> 01:24:42,200
particular force to grab it, but then as I grab it, I realize it's not that full, so
1476
01:24:42,200 --> 01:24:43,640
it's lighter.
1477
01:24:43,640 --> 01:24:49,580
I can adjust my world model of that system and then adjust my actions as a function of
1478
01:24:49,580 --> 01:24:50,580
this very quickly.
1479
01:24:50,580 --> 01:24:51,580
It's not learning, actually.
1480
01:24:51,580 --> 01:24:53,560
It's just adjusting a few parameters.
1481
01:24:53,560 --> 01:24:57,060
But in other situations, you need to learn.
1482
01:24:57,060 --> 01:25:01,740
You need to adapt your world model for the situation.
1483
01:25:01,740 --> 01:25:06,600
Even if you have a powerful world model, you're not going to be able to train it for all possible
1484
01:25:06,600 --> 01:25:10,780
situations and all possible configurations of the world.
1485
01:25:10,780 --> 01:25:14,920
And so there are parts of the state space
1486
01:25:14,920 --> 01:25:18,120
where your model is gonna be inaccurate.
1487
01:25:18,120 --> 01:25:20,780
And the system, if you want the system to plan accurately,
1488
01:25:20,780 --> 01:25:23,580
it needs to be able to detect when that happens.
1489
01:25:23,580 --> 01:25:26,940
So basically only plan within regions of the space
1490
01:25:26,940 --> 01:25:29,720
where the prediction of its own model is good,
1491
01:25:29,720 --> 01:25:31,660
and then adjust its model as it goes
1492
01:25:31,660 --> 01:25:33,800
if it's not the case.
1493
01:25:33,800 --> 01:25:36,400
That's where you need reinforcement learning basically.
1494
01:25:36,400 --> 01:25:38,500
Can I just ask a clarification question?
1495
01:25:38,900 --> 01:25:43,640
I think there are cases where I'm really confident in what I'm able to do,
1496
01:25:43,900 --> 01:25:49,640
but as soon as, let's say, I throw a ball, the physics of that ball is something really unpredictable.
1497
01:25:50,100 --> 01:25:52,280
How would you differentiate that in your world model?
1498
01:25:52,520 --> 01:25:53,340
Are there parameters?
1499
01:25:53,520 --> 01:25:57,100
Yeah, so this is adaptation on the fly of your world model
1500
01:25:57,100 --> 01:26:01,080
or perhaps adjustment of a few latent variables that represent what you don't know about the world,
1501
01:26:01,200 --> 01:26:02,960
like the wind speed and things like that.
1502
01:26:02,960 --> 01:26:06,500
So, I mean, there's various mechanisms for this.
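One way to picture that on-the-fly adjustment is the sketch below, assuming a frozen toy dynamics model with a single unknown latent (a hypothetical "wind" term): the latent is fit by gradient descent on the prediction error over the last few observations, and planning then uses the updated model. The function and variable names are illustrative, not the speaker's system.

import torch

def world_model(state, action, wind):
    # A frozen, pretrained dynamics model would go here; this toy version just
    # shifts the predicted next state by the unknown wind term.
    return state + action + wind

# A few recent (state, action, next_state) observations from the real world.
states  = torch.tensor([[0.0], [1.2], [2.1]])
actions = torch.tensor([[1.0], [1.0], [1.0]])
nexts   = torch.tensor([[1.2], [2.1], [3.3]])

wind = torch.zeros(1, requires_grad=True)   # latent variable for what we don't know
opt = torch.optim.SGD([wind], lr=0.5)

for _ in range(100):
    pred = world_model(states, actions, wind)
    loss = torch.nn.functional.mse_loss(pred, nexts)   # model error on recent experience
    opt.zero_grad(); loss.backward(); opt.step()

# 'wind' now absorbs the average systematic prediction error, and planning can
# proceed with world_model(state, action, wind) instead of retraining the whole model.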
1503
01:26:07,780 --> 01:26:09,600
Okay, let's thank the speaker again.