
@cyysky
Created October 24, 2024 20:11
1
00:00:00,000 --> 00:00:10,480
So welcome all. Welcome to this distinguished lecture series in AI. I'm Vishal Mishra. I'm the
2
00:00:10,480 --> 00:00:15,040
Vice Dean for Computing and AI in Columbia Engineering. This is the second lecture in our
3
00:00:15,040 --> 00:00:20,420
series. We seem to have a reasonably full house. People are still streaming in. So before we start,
4
00:00:20,480 --> 00:00:26,020
I'd like to invite Dean Shih-Fu Chang to give some opening remarks. All right. Good morning, everyone.
5
00:00:26,020 --> 00:00:28,020
Welcome to our
6
00:00:28,020 --> 00:00:34,580
It's really exciting.
7
00:00:35,060 --> 00:00:38,560
This is the first time I see we have an overflow space used today.
8
00:00:38,720 --> 00:00:40,980
Really so exciting about the topic and speaker.
9
00:00:41,540 --> 00:00:46,960
I want to thank Michelle and the team for organizing the AI lecture series this semester and throughout the year.
10
00:00:47,320 --> 00:00:51,560
I want to thank our president Katrina Armstrong for coming to support our event today.
11
00:00:51,560 --> 00:00:59,480
And as Vishal mentioned, this is the second in our AI lecture series across the school
12
00:00:59,480 --> 00:01:03,400
and is associated with the university initiative in AI.
13
00:01:03,580 --> 00:01:07,220
That's one of the priorities that President Armstrong is leading us
14
00:01:07,220 --> 00:01:10,060
for the school university-wide effort here.
15
00:01:10,720 --> 00:01:13,640
Last month, we launched this new AI lecture series,
16
00:01:14,160 --> 00:01:16,620
starting with our faculty member, Pierre Gentine,
17
00:01:16,740 --> 00:01:20,480
to talk about how AI can have an impact in different disciplines.
18
00:01:20,480 --> 00:01:23,860
And so last month we launched AI and climate projection.
19
00:01:23,860 --> 00:01:25,320
And today we're so excited,
20
00:01:25,320 --> 00:01:28,220
Dr. Yann LeCun is here to share his vision,
21
00:01:28,220 --> 00:01:30,440
his insight on a very exciting topic.
22
00:01:30,440 --> 00:01:31,960
You have seen his title.
23
00:01:31,960 --> 00:01:36,960
I have seen Yann talking many times at CVPR, ICML,
24
00:01:36,960 --> 00:01:38,220
learning representations,
25
00:01:38,220 --> 00:01:41,780
but today's topic is particularly intriguing.
26
00:01:41,780 --> 00:01:45,120
And his presence, as you can see from audience today,
27
00:01:45,120 --> 00:01:47,420
we have to open up overflow space.
28
00:01:47,420 --> 00:01:49,040
The event, once it was announced,
29
00:01:49,040 --> 00:01:50,460
Three minutes, sold out.
30
00:01:50,460 --> 00:01:52,820
You are the lucky ones, okay.
31
00:01:52,820 --> 00:01:55,700
And the lecture series, one of the efforts
32
00:01:55,700 --> 00:01:59,420
around AI and university, we are pursuing advances
33
00:01:59,420 --> 00:02:02,120
in the fundamental area, which is covered
34
00:02:02,120 --> 00:02:03,560
by today's lecture.
35
00:02:03,560 --> 00:02:06,540
We're also pursuing the impact in different disciplines
36
00:02:06,540 --> 00:02:10,940
in collaboration among all the 17 schools at Columbia.
37
00:02:10,940 --> 00:02:14,380
Climate, business, finance, policy, journalism, you name it.
38
00:02:14,380 --> 00:02:16,600
So we work with industry, community,
39
00:02:16,600 --> 00:02:22,340
create centers on AI and finance, AI on climate, AI on sports, AI and policy.
40
00:02:22,740 --> 00:02:23,920
So that's our effort today.
41
00:02:24,060 --> 00:02:30,500
We created a new course on AI in context to teach AI in the context of the humanities, in literature,
42
00:02:30,760 --> 00:02:32,340
in music, and philosophy.
43
00:02:32,780 --> 00:02:36,460
Today's topic, how could machines reach human-level intelligence?
44
00:02:36,780 --> 00:02:40,380
Just reading the title makes me so intrigued, so excited.
45
00:02:40,800 --> 00:02:45,580
So without further ado, let me invite Vishal, our Vice Dean of AI and Computing,
46
00:02:45,580 --> 00:02:47,900
to have an introduction of our speaker,
47
00:02:47,900 --> 00:02:49,240
Yann LeCun, today.
48
00:02:49,240 --> 00:02:50,080
He's here.
49
00:02:52,820 --> 00:02:53,660
Thanks, Shih-Fu.
50
00:02:55,940 --> 00:02:58,360
So, Yann, of course, needs no introduction.
51
00:03:03,480 --> 00:03:04,660
But just to embarrass him,
52
00:03:04,660 --> 00:03:08,140
I'll give a brief introduction of Yann.
53
00:03:08,140 --> 00:03:11,360
Now, this may come as a surprise to a lot of you,
54
00:03:11,360 --> 00:03:13,280
but it's true,
55
00:03:13,280 --> 00:03:15,620
and you'll never guess it from his accent.
56
00:03:15,620 --> 00:03:17,000
Yann is actually French.
57
00:03:18,080 --> 00:03:22,680
He got his PhD from the Sorbonne in 1987,
58
00:03:22,680 --> 00:03:24,280
and in his PhD thesis,
59
00:03:24,280 --> 00:03:28,020
he proposed an early form of back propagation.
60
00:03:28,020 --> 00:03:29,600
Now back propagation is the way
61
00:03:29,600 --> 00:03:32,540
all neural networks are trained now,
62
00:03:32,540 --> 00:03:36,440
and it sort of started from his PhD thesis.
63
00:03:37,620 --> 00:03:41,460
He joined AT&T Bell Labs in 1988.
64
00:03:41,460 --> 00:03:44,020
Before that, he spent a few months or a year
65
00:03:44,020 --> 00:03:46,920
with Jeff Hinton working as a postdoc.
66
00:03:49,840 --> 00:03:52,180
I, there was an alarm, okay.
67
00:03:52,180 --> 00:03:54,620
And he joined AT&T Bell Labs in 1988.
68
00:03:55,860 --> 00:03:58,040
Next year, he sort of stunned the world
69
00:03:58,040 --> 00:04:00,160
with this handwriting recognition system.
70
00:04:00,160 --> 00:04:01,460
And you'll see a video of that.
71
00:04:11,460 --> 00:04:38,300
.
72
00:04:38,300 --> 00:04:40,500
This was absolutely incredible at that time.
73
00:04:45,300 --> 00:04:47,900
And there you see Yann looking slightly different.
74
00:05:05,420 --> 00:05:08,120
After that came a long AI and neural nets
75
00:05:08,120 --> 00:05:13,120
winter. Yann joined AT&T Research in 1996,
76
00:05:14,320 --> 00:05:15,400
but he never gave up.
77
00:05:15,400 --> 00:05:19,760
He continued working on convolutional neural networks, CNNs,
78
00:05:19,760 --> 00:05:23,960
which were what he used for the handwriting recognition system.
79
00:05:23,960 --> 00:05:27,840
Around 2012, the deep learning revolution happened,
80
00:05:27,840 --> 00:05:29,680
and now CNNs are everywhere,
81
00:05:29,680 --> 00:05:31,680
whether it's his friend Elon Musk's cars,
82
00:05:33,580 --> 00:05:35,460
some people got what I meant,
83
00:05:35,460 --> 00:05:40,460
or Google Photos, everyone uses CNNs.
84
00:05:42,300 --> 00:05:47,300
In 2013, Yann joined Meta AI as the director of their AI lab
85
00:05:48,000 --> 00:05:49,940
and now he is the chief scientist.
86
00:05:49,940 --> 00:05:52,520
In 2018, he also won the Turing Award
87
00:05:52,520 --> 00:05:54,840
along with Jeff Hinton and Yoshua Bengio
88
00:05:56,220 --> 00:06:00,220
for his work in deep learning and artificial intelligence.
89
00:06:00,220 --> 00:06:02,560
In fact, Jeff was here yesterday.
90
00:06:02,560 --> 00:06:04,580
He was on campus and he was walking around
91
00:06:04,580 --> 00:06:06,180
And people were asking him for selfies.
92
00:06:06,180 --> 00:06:08,980
So he wanted to be here.
93
00:06:08,980 --> 00:06:10,660
Unfortunately, something urgent came up,
94
00:06:10,660 --> 00:06:12,200
so he couldn't be here.
95
00:06:12,200 --> 00:06:16,940
So as I mentioned, Yann won the Turing Award in 2018.
96
00:06:16,940 --> 00:06:19,860
And this is a Turing Award for computer science,
97
00:06:19,860 --> 00:06:22,580
not for physics or chemistry, which also get Nobel
98
00:06:22,580 --> 00:06:24,980
prizes these days.
99
00:06:24,980 --> 00:06:27,080
This is the original one.
100
00:06:27,080 --> 00:06:28,580
And he won the award in 2018.
101
00:06:28,580 --> 00:06:33,220
And he's also big into the selfie game.
102
00:06:33,220 --> 00:06:34,820
I took a selfie with him that day.
103
00:06:36,640 --> 00:06:39,080
And now with that, I'll invite Yann to tell us
104
00:06:39,080 --> 00:06:40,600
about human level intelligence.
105
00:06:48,720 --> 00:06:52,180
Thank you very much for this amazing introduction.
106
00:06:54,180 --> 00:06:56,740
A real pleasure to be here.
107
00:06:56,740 --> 00:06:59,620
The good thing about coming to give a talk here is that
108
00:07:00,740 --> 00:07:02,000
I didn't have to fly.
109
00:07:02,000 --> 00:07:08,180
Although if you ask people from downtown, they rarely go above 23rd Street.
110
00:07:11,700 --> 00:07:18,540
So, yeah, I mean, I worked really hard to lose my French accent in the last four decades or so,
111
00:07:18,680 --> 00:07:23,660
three and a half decades. But I just recently learned that if you speak English with a French
112
00:07:32,000 --> 00:07:35,320
I should speak with a very strong French accent.
113
00:07:36,120 --> 00:07:40,600
And perhaps, appear intelligent.
114
00:07:40,600 --> 00:07:46,800
Okay. What should appear intelligent is machines,
115
00:07:46,800 --> 00:07:49,320
and they do appear intelligent.
116
00:07:49,320 --> 00:07:52,600
A lot of people give them an IQ,
117
00:07:52,600 --> 00:07:53,640
whatever that means,
118
00:07:53,640 --> 00:07:56,520
that is actually much higher than they deserve.
119
00:07:56,520 --> 00:08:00,160
We are nowhere near being able to reach
120
00:08:00,160 --> 00:08:03,520
human intelligence or human level intelligence with machines,
121
00:08:03,520 --> 00:08:05,780
what some people call AGI,
122
00:08:05,780 --> 00:08:07,800
Artificial General Intelligence.
123
00:08:07,800 --> 00:08:09,660
I hate that term.
124
00:08:09,660 --> 00:08:13,040
I've been trying to fight against it.
125
00:08:13,040 --> 00:08:16,600
The reason is not that it's impossible for
126
00:08:16,600 --> 00:08:17,880
a machine to reach human intelligence.
127
00:08:17,880 --> 00:08:18,720
Of course, it's possible.
128
00:08:18,720 --> 00:08:20,720
There's no question at some point we'll have
129
00:08:20,720 --> 00:08:23,300
machines that are as intelligent as humans in
130
00:08:23,300 --> 00:08:25,080
all the domains where humans are intelligent.
131
00:08:25,080 --> 00:08:27,780
There's no question that they will go beyond this.
132
00:08:27,780 --> 00:08:32,480
But it's just because human intelligence is not general at all.
133
00:08:32,480 --> 00:08:34,960
We are very specialized animals.
134
00:08:34,960 --> 00:08:41,220
We have a hard time imagining that we are specialized because all the problems
135
00:08:41,220 --> 00:08:48,080
that we can fathom or imagine are problems that we can fathom or imagine.
136
00:08:48,080 --> 00:08:54,940
But there are many, many more problems that we can't even imagine in our wildest dreams.
137
00:08:54,940 --> 00:08:59,500
and so it makes us appear generally intelligent.
138
00:08:59,500 --> 00:09:01,760
We're not. We're specialized.
139
00:09:01,760 --> 00:09:03,520
So we should lose that term,
140
00:09:03,520 --> 00:09:05,300
artificial general intelligence.
141
00:09:05,300 --> 00:09:08,980
I prefer the term human level intelligence or a code name
142
00:09:08,980 --> 00:09:15,480
that we've adopted inside Meta is an acronym AMI,
143
00:09:15,480 --> 00:09:18,620
which means Advanced Machine Intelligence,
144
00:09:18,620 --> 00:09:21,220
which is kind of a little more loose.
145
00:09:21,220 --> 00:09:24,020
Also, we pronounce it AMI.
146
00:09:24,020 --> 00:09:27,900
Which in French means friend.
147
00:09:28,140 --> 00:09:30,340
Makes sense.
148
00:09:30,340 --> 00:09:33,380
Okay. So how can we ever reach
149
00:09:33,380 --> 00:09:35,260
human level intelligence with machines?
150
00:09:35,260 --> 00:09:37,940
Machines that can learn, of course,
151
00:09:37,940 --> 00:09:40,220
can remember, understand the physical world,
152
00:09:40,220 --> 00:09:43,140
have common sense, can plan, can reason,
153
00:09:43,140 --> 00:09:46,020
are behaving properly,
154
00:09:46,020 --> 00:09:50,500
not being unruly, dangerous, etc.
155
00:09:50,500 --> 00:09:52,940
And the first question we should ask ourselves is,
156
00:09:52,940 --> 00:09:54,620
Why would we want to build this?
157
00:09:54,620 --> 00:09:57,260
So obviously there is a big scientific question of what is
158
00:09:57,260 --> 00:09:59,580
intelligence and the best way to
159
00:09:59,580 --> 00:10:04,060
validate any theory we have about intelligence is to
160
00:10:04,060 --> 00:10:07,500
build an artifact that actually implements it.
161
00:10:07,500 --> 00:10:11,500
That's a very engineering approach to science if you want.
162
00:10:11,500 --> 00:10:14,700
But there is another good reason and the other good reason is that
163
00:10:14,700 --> 00:10:20,140
we need human level intelligence to amplify human intelligence.
164
00:10:20,140 --> 00:10:24,620
There's going to be a future in which we run
165
00:10:24,620 --> 00:10:29,700
around with AI assistants with us at all times,
166
00:10:29,700 --> 00:10:32,460
so we can ask them any question.
167
00:10:32,460 --> 00:10:34,280
They can answer any question we have.
168
00:10:34,280 --> 00:10:35,680
They can help us in our daily lives.
169
00:10:35,680 --> 00:10:38,100
They can solve problems for us.
170
00:10:38,100 --> 00:10:40,060
This will amplify human intelligence,
171
00:10:40,060 --> 00:10:42,100
perhaps in the way that the printing press has
172
00:10:42,100 --> 00:10:45,320
amplified human intelligence in the 15th century.
173
00:10:45,320 --> 00:10:49,420
So we need this for humanity.
174
00:10:49,420 --> 00:10:53,780
In fact, I'm wearing a pair of smart glasses right now.
175
00:10:53,780 --> 00:10:56,540
I can ask it questions.
176
00:10:56,540 --> 00:10:57,660
It goes through Meta AI,
177
00:10:57,660 --> 00:10:59,500
which is the product version of
178
00:10:59,500 --> 00:11:02,060
Llama 3 that many of you have heard of.
179
00:11:02,060 --> 00:11:04,340
I can ask it various things.
180
00:11:04,340 --> 00:11:06,780
So let me ask it something.
181
00:11:06,780 --> 00:11:09,060
I'm not going to use the microphone.
182
00:11:09,060 --> 00:11:13,780
Hey, Meta. Take a picture.
183
00:11:13,780 --> 00:11:16,700
You see that little light flash?
184
00:11:16,700 --> 00:11:19,420
Okay, you're all in the picture.
185
00:11:19,600 --> 00:11:21,900
You'll be on social network soon.
186
00:11:26,000 --> 00:11:28,660
So, you know, I could ask it, you know,
187
00:11:28,720 --> 00:11:30,020
more complex questions, obviously.
188
00:11:31,060 --> 00:11:36,720
And this thing can also recognize things through the camera.
189
00:11:36,840 --> 00:11:39,500
So you can ask it, what am I looking at?
190
00:11:39,860 --> 00:11:41,180
What is the species of plant?
191
00:11:42,340 --> 00:11:45,120
You know, you can look at a menu in Japanese
192
00:11:45,120 --> 00:11:46,340
and it will translate it for you.
193
00:11:46,340 --> 00:11:49,400
So, you know, these kinds of assistants are coming.
194
00:11:49,400 --> 00:11:51,020
They're still pretty stupid,
195
00:11:51,020 --> 00:11:53,680
but they're already useful.
196
00:11:53,680 --> 00:11:56,080
But there is a future maybe,
197
00:11:56,080 --> 00:11:57,880
you know, 10, 20 years from now,
198
00:11:57,880 --> 00:12:00,100
where they will be really smart and they will
199
00:12:00,100 --> 00:12:01,240
assist us in our daily lives.
200
00:12:01,240 --> 00:12:03,900
So we need those systems to have human level intelligence,
201
00:12:03,900 --> 00:12:05,940
because that's the best way for them to not be
202
00:12:05,940 --> 00:12:08,500
frustrating for us to interact with.
203
00:12:08,500 --> 00:12:10,340
Okay. So on the one hand,
204
00:12:10,340 --> 00:12:12,720
there is the really interesting scientific question
205
00:12:12,720 --> 00:12:14,800
of what is intelligence.
206
00:12:14,800 --> 00:12:18,560
In the middle there is the technological challenge
207
00:12:18,560 --> 00:12:20,480
of building intelligent machines.
208
00:12:20,480 --> 00:12:22,980
Then at the other end, it's actually useful.
209
00:12:22,980 --> 00:12:26,720
It will actually be useful for people and for humanity more generally.
210
00:12:26,720 --> 00:12:30,660
So all of the conditions are there.
211
00:12:30,660 --> 00:12:34,640
Then the more important condition is that there are people with
212
00:12:34,640 --> 00:12:41,820
a lot of resources willing to actually invest for this to be true, like Meta.
213
00:12:41,820 --> 00:12:52,400
So, the characteristics that we want of those machines are that they need to be able to understand the physical world.
214
00:12:52,660 --> 00:12:55,520
Current AI systems do not understand the physical world.
215
00:12:57,560 --> 00:13:01,900
They don't understand the physical world nearly as well as your house cat.
216
00:13:03,520 --> 00:13:07,320
And so, I've been saying, you know, and of course, newspapers can have like this kind of title.
217
00:13:07,320 --> 00:13:11,240
You know, Yann LeCun says AI is stupider than a cat.
218
00:13:11,240 --> 00:13:15,540
It's true, actually.
219
00:13:15,540 --> 00:13:19,400
We need AI systems that have persistent memory.
220
00:13:19,400 --> 00:13:22,940
We need them to be able to plan complex action sequences,
221
00:13:22,940 --> 00:13:25,660
which current systems are completely incapable of doing.
222
00:13:25,660 --> 00:13:27,760
We need them to be able to reason,
223
00:13:27,760 --> 00:13:29,740
and we need them to be controllable and safe.
224
00:13:29,740 --> 00:13:32,140
So basically, and by design,
225
00:13:32,140 --> 00:13:35,580
not by fine-tuning like it's done at the moment.
226
00:13:37,040 --> 00:13:40,740
That requires essentially new principles that are
227
00:13:40,740 --> 00:13:44,980
different from what current AI systems really are based on.
228
00:13:44,980 --> 00:13:48,980
So current systems, most of them anyway,
229
00:13:48,980 --> 00:13:51,800
perform inference by propagating signals through
230
00:13:51,800 --> 00:13:54,100
a bunch of layers of a neural net.
231
00:13:54,100 --> 00:13:58,640
I'm a big fan of that obviously, but it's very limited.
232
00:13:58,640 --> 00:14:03,260
There's only a small number of input-output functions that can
233
00:14:03,260 --> 00:14:06,100
be efficiently represented by feed-forward
234
00:14:06,100 --> 00:14:09,980
propagation through a bunch of layers in a neural net.
235
00:14:09,980 --> 00:14:13,300
There's a much more general approach to inference,
236
00:14:13,300 --> 00:14:17,140
which is not just running feed forward through a bunch of layers,
237
00:14:17,140 --> 00:14:19,480
but is based on optimization.
238
00:14:19,480 --> 00:14:22,900
So basically, there's an observation.
239
00:14:22,900 --> 00:14:28,780
You give the system a proposal for an output,
240
00:14:28,780 --> 00:14:31,380
and the system tells you to what extent
241
00:14:31,380 --> 00:14:34,340
the output is compatible with the observation.
242
00:14:34,340 --> 00:14:37,500
Okay. So I give you a picture of an elephant.
243
00:14:37,500 --> 00:14:42,040
I put the representation of the label elephant or the text,
244
00:14:42,040 --> 00:14:43,120
and the system tells you,
245
00:14:43,120 --> 00:14:45,060
yeah, those two things are compatible.
246
00:14:45,060 --> 00:14:49,600
The label elephant is a good label for that image.
247
00:14:49,600 --> 00:14:51,620
If you put the picture of a table,
248
00:14:51,620 --> 00:14:53,860
it says no, it's incompatible.
249
00:14:53,860 --> 00:14:56,980
So if you have a system that basically measures
250
00:14:56,980 --> 00:15:00,020
the compatibility between an input and an output,
251
00:15:00,020 --> 00:15:02,440
then through optimization and search,
252
00:15:02,440 --> 00:15:06,440
you can find an output that is most compatible with the input.
253
00:15:06,440 --> 00:15:10,100
This is intrinsically more powerful as an inference mechanism
254
00:15:10,100 --> 00:15:13,440
than just running feed forward through a bunch of layers.
255
00:15:13,440 --> 00:15:16,720
Because basically, any computational problem
256
00:15:16,720 --> 00:15:19,260
can be reduced to an optimization problem.
257
00:15:19,260 --> 00:15:23,460
So that's the very basic principle on
258
00:15:23,460 --> 00:15:25,720
which future AI systems should be built.
259
00:15:25,720 --> 00:15:27,940
Not propagating through a bunch of layers,
260
00:15:27,940 --> 00:15:30,040
but optimizing the answer so that
261
00:15:30,040 --> 00:15:31,680
it's most compatible with the input.
262
00:15:31,680 --> 00:15:34,440
Of course, this will involve deep learning system,
263
00:15:34,440 --> 00:15:36,160
back propagation, all that stuff.
264
00:15:36,160 --> 00:15:38,880
But the inference mechanism is very different.
265
00:15:38,880 --> 00:15:41,700
Now, this is not a new idea by any means.
266
00:15:41,700 --> 00:15:44,060
This type of inference is what is
267
00:15:44,060 --> 00:15:46,220
very standard in probabilistic inference.
268
00:15:46,220 --> 00:15:47,700
For example, if you have a graphical model,
269
00:15:47,700 --> 00:15:50,820
Bayesian network, you know the value of certain variables,
270
00:15:50,820 --> 00:15:53,340
you can infer the value of the other variables by
271
00:15:53,340 --> 00:15:56,400
minimizing a negative log likelihood or something like that,
272
00:15:56,400 --> 00:15:58,580
or with some energy function.
273
00:15:58,580 --> 00:16:01,180
So it's a very standard thing to do.
274
00:16:01,180 --> 00:16:02,780
There's nothing innovative about this,
275
00:16:02,780 --> 00:16:05,340
but people have forgotten about the fact that this is
276
00:16:05,340 --> 00:16:08,540
really much more powerful than feed-forward propagation.
277
00:16:08,540 --> 00:16:13,200
The framework that I like to use to explain this is called energy-based models.
278
00:16:13,200 --> 00:16:17,460
So basically, the function that measures the compatibility between X and Y,
279
00:16:17,460 --> 00:16:20,400
input and output, is an energy function that takes
280
00:16:20,400 --> 00:16:25,700
low values when input and output are compatible and larger values when they're not.
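(In symbols, a minimal way to write the energy-based formulation described here; the notation below is an editorial choice for this note, not the speaker's slides:)

```latex
% E(x, y) is low when input x and output y are compatible, higher otherwise,
% and inference picks the output that minimizes it.
E(x, y) \;\approx\; \text{degree of incompatibility of } (x, y),
\qquad
\hat{y} \;=\; \arg\min_{y}\; E(x, y)
```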
281
00:16:29,000 --> 00:16:33,600
So the type of inference that can take place to find
282
00:16:33,600 --> 00:16:35,840
the output could be a number of different things.
283
00:16:35,840 --> 00:16:41,480
If the representation of the output is continuous,
284
00:16:41,480 --> 00:16:43,460
and if the modules that we're talking about,
285
00:16:43,460 --> 00:16:45,620
the objectives, all the modules
286
00:16:45,620 --> 00:16:47,380
inside of the system are differentiable,
287
00:16:47,380 --> 00:16:49,820
you can use gradient-based optimization to find
288
00:16:49,820 --> 00:16:53,360
the best answer, or at least one good answer.
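(A minimal sketch of that gradient-based inference, with a toy hand-written energy function standing in for a trained network; all names and shapes are illustrative assumptions, not the speaker's code:)

```python
# Minimal sketch of inference by optimization: a toy differentiable energy
# E(x, y), and gradient descent over the output y only.
import torch

def energy(x, y):
    # Toy "compatibility" score: low when y matches a simple function of x.
    # A real system would use a trained neural network here.
    return ((y - torch.sin(x).sum()) ** 2).sum()

def infer(x, steps=200, lr=0.1):
    y = torch.zeros(1, requires_grad=True)      # initial guess for the output
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy(x, y).backward()                 # descend the energy w.r.t. y
        opt.step()
    return y.detach()

x = torch.tensor([0.3, 1.2, -0.7])
print(infer(x))                                 # lands near the lowest-energy answer
```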
289
00:16:53,360 --> 00:16:56,780
But you can imagine that the output is discrete,
290
00:16:56,780 --> 00:16:58,340
combinatorial, and then you have to use
291
00:16:58,340 --> 00:17:02,500
other types of combinatorial optimization algorithms
292
00:17:02,500 --> 00:17:06,900
to figure out the best output.
293
00:17:06,900 --> 00:17:07,960
If that's the case,
294
00:17:07,960 --> 00:17:12,280
then you're talking to the wrong LeCun,
295
00:17:12,280 --> 00:17:14,960
because my brother is actually,
296
00:17:14,960 --> 00:17:16,620
he works at Google, nobody's perfect,
297
00:17:16,620 --> 00:17:19,980
but he works on,
298
00:17:19,980 --> 00:17:22,360
he's an expert in combinatorial optimization.
299
00:17:25,680 --> 00:17:29,260
So this type of inference gives AI systems
300
00:17:29,260 --> 00:17:31,300
kind of zero-shot learning ability.
301
00:17:31,300 --> 00:17:31,960
What does that mean?
302
00:17:31,960 --> 00:17:34,860
It means you give them a problem and if they can,
303
00:17:34,860 --> 00:17:36,900
if you can formulate this problem in terms of
304
00:17:36,900 --> 00:17:38,880
an optimization problem, then you get a solution to
305
00:17:38,880 --> 00:17:42,020
that problem without the system having to learn anything.
306
00:17:42,020 --> 00:17:43,900
Right? That's zero-shot.
307
00:17:43,900 --> 00:17:46,120
You are given, and you are students,
308
00:17:46,120 --> 00:17:49,320
you're given a new mathematics problem or something.
309
00:17:49,320 --> 00:17:52,320
You can think about it and perhaps
310
00:17:52,320 --> 00:17:55,460
solve it without learning anything new.
311
00:17:55,460 --> 00:17:59,460
Right? That's called zero-shot.
312
00:17:59,460 --> 00:18:05,240
And in humans, some psychologists also call this system two.
313
00:18:05,240 --> 00:18:10,320
So basically you devote your entire attention and consciousness to
314
00:18:10,320 --> 00:18:13,740
solving a problem that you concentrate on and you think about it and it might
315
00:18:13,740 --> 00:18:16,840
take a long time to solve that problem.
316
00:18:16,840 --> 00:18:17,980
That's system two.
317
00:18:17,980 --> 00:18:22,220
System one is when you act reactively.
318
00:18:22,220 --> 00:18:23,200
You don't have to think about it,
319
00:18:23,200 --> 00:18:25,360
it's become kind of subconscious, automatic.
320
00:18:25,360 --> 00:18:27,140
So if you are an experienced driver,
321
00:18:27,140 --> 00:18:28,360
you drive on the highway,
322
00:18:28,360 --> 00:18:29,380
you don't have to think about it.
323
00:18:29,380 --> 00:18:30,780
it's going to become automatic.
324
00:18:30,780 --> 00:18:34,880
You can hold a conversation with someone and everything.
325
00:18:34,880 --> 00:18:37,520
If you're a beginner though,
326
00:18:37,520 --> 00:18:39,980
it's your first time driving a car,
327
00:18:39,980 --> 00:18:41,920
you pay close attention.
328
00:18:41,920 --> 00:18:43,260
You're using your system two,
329
00:18:43,260 --> 00:18:48,320
your entire capacity of your mind.
330
00:18:49,140 --> 00:18:53,520
So that's why we need to adopt this model.
331
00:18:53,520 --> 00:18:56,600
This framework of energy-based model is
332
00:18:56,600 --> 00:18:59,680
sort of the way to understand this at the theoretical level.
333
00:18:59,680 --> 00:19:01,420
I'm not gonna do a lot of theory here.
334
00:19:01,420 --> 00:19:03,300
This is a very diverse audience,
335
00:19:03,300 --> 00:19:05,300
but the basic idea is that,
336
00:19:05,300 --> 00:19:06,920
if you have two variables, X and Y,
337
00:19:06,920 --> 00:19:08,040
here they are scalars,
338
00:19:08,040 --> 00:19:12,380
but you can imagine that they are high dimensional inputs.
339
00:19:12,380 --> 00:19:16,800
The energy function is some sort of landscape
340
00:19:16,800 --> 00:19:20,800
where pairs of X and Y that are compatible
341
00:19:20,800 --> 00:19:23,500
have low energy, low altitude if you want,
342
00:19:23,500 --> 00:19:25,920
and then pairs of X and Y's that are not compatible
343
00:19:25,920 --> 00:19:27,280
have higher energy.
344
00:19:27,280 --> 00:19:30,260
And so the goal of learning now is to shape
345
00:19:30,260 --> 00:19:32,880
this energy surface in such a way that it gives
346
00:19:32,880 --> 00:19:35,360
low energy to things you observe,
347
00:19:35,360 --> 00:19:38,880
training data, pairs of XY that you observe,
348
00:19:38,880 --> 00:19:41,520
and then higher energy to everything else.
349
00:19:41,520 --> 00:19:43,400
The first part is super easy
350
00:19:43,400 --> 00:19:44,860
because we know how to do gradient descent.
351
00:19:44,860 --> 00:19:48,760
So you give a pair of XY that you know are compatible
352
00:19:48,760 --> 00:19:51,860
and you tweak the system so that the scalar output,
353
00:19:51,860 --> 00:19:55,840
the energy, the scalar energy output that it produces
354
00:19:55,840 --> 00:20:00,000
You can tweak the parameters inside your big neural net so that the output goes down.
355
00:20:00,000 --> 00:20:06,240
Easy. The difficulty is how to make sure that the energy is higher outside of the training sample.
356
00:20:06,240 --> 00:20:10,080
The training samples in this diagram are represented by the black dots.
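(One common family of methods for pushing energy up outside the training pairs is contrastive; here is a minimal illustrative sketch using a margin loss and random mismatched pairs as negatives, with a toy bilinear energy. None of this is the speaker's specific recipe:)

```python
# Minimal contrastive sketch: push energy down on observed (x, y) pairs
# and up on mismatched pairs, up to a margin of 1.
import torch

torch.manual_seed(0)
W = torch.randn(4, 4, requires_grad=True)            # parameters of E(x, y) = -x.W.y
opt = torch.optim.SGD([W], lr=0.05)

def energy(x, y):
    return -(x @ W @ y.T).diagonal()                  # low value = compatible pair

x = torch.randn(64, 4)
y = x + 0.1 * torch.randn(64, 4)                      # observed compatible pairs
y_neg = y[torch.randperm(64)]                         # mismatched pairs ("everything else")

for _ in range(200):
    opt.zero_grad()
    # Hinge loss: want E(x, y) to be at least 1 lower than E(x, y_neg).
    loss = torch.relu(1.0 + energy(x, y) - energy(x, y_neg)).mean()
    loss.backward()
    opt.step()
```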
357
00:20:12,720 --> 00:20:18,480
And at some level, a lot of literature in machine learning is devoted to that problem.
358
00:20:18,480 --> 00:20:23,840
It's not formulated in the way I just did, but in a probabilistic framework, for example.
359
00:20:23,840 --> 00:20:31,060
This problem of making sure the energy of things outside the training data is high,
360
00:20:31,060 --> 00:20:33,240
is a major issue.
361
00:20:33,240 --> 00:20:40,280
It usually encounters intractable mathematical problems.
362
00:20:40,280 --> 00:20:42,160
Let me skip this for now.
363
00:20:42,160 --> 00:20:47,880
Okay. So now, the whole craze of AI over the last couple of years,
364
00:20:47,880 --> 00:20:50,880
three years let's say, has been around LLMs,
365
00:20:50,880 --> 00:20:53,320
Large language models and large language models should be
366
00:20:53,320 --> 00:20:56,200
really called auto-regressive large language models.
367
00:20:56,200 --> 00:21:00,660
So what they do is they're trained on lots of texts and they're
368
00:21:00,660 --> 00:21:03,900
basically trained to produce the next word,
369
00:21:03,900 --> 00:21:08,600
to predict the next word from the sequence of words that precede it.
370
00:21:09,640 --> 00:21:14,360
That's all they've been trained to do.
371
00:21:14,840 --> 00:21:17,680
Once the system has been trained,
372
00:21:17,680 --> 00:21:20,620
you can of course show it a piece of text and then ask
373
00:21:20,620 --> 00:21:23,440
to predict the next word and then you inject that next word into
374
00:21:23,440 --> 00:21:26,080
the input and ask it to predict the second next word,
375
00:21:26,080 --> 00:21:27,780
shift that into the input,
376
00:21:27,780 --> 00:21:29,060
third word, etc.
377
00:21:29,060 --> 00:21:30,620
So that's auto-regressive prediction.
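(A minimal sketch of that auto-regressive loop, with a tiny untrained stand-in model instead of a trained transformer; names and sizes are illustrative assumptions:)

```python
# Auto-regressive prediction: predict the next token, shift it into the input, repeat.
import torch

vocab, dim = 100, 32
embed = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)

def next_token_logits(tokens):
    # Stand-in model: embed the context, average it, score every possible next token.
    return head(embed(tokens).mean(dim=0))

tokens = torch.tensor([1, 5, 7])                       # the prompt
for _ in range(10):                                    # generate 10 more tokens
    logits = next_token_logits(tokens)
    nxt = torch.distributions.Categorical(logits=logits).sample()
    tokens = torch.cat([tokens, nxt.unsqueeze(0)])     # inject the prediction into the input
print(tokens.tolist())
```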
378
00:21:30,620 --> 00:21:36,180
It's not a new concept; it's been around since before I was born.
379
00:21:36,180 --> 00:21:39,000
So not recent.
380
00:21:39,000 --> 00:21:41,400
But it's system one.
381
00:21:41,400 --> 00:21:44,400
It's feed forward propagation through a bunch of layers.
382
00:21:44,400 --> 00:21:46,300
There is a fixed amount of
383
00:21:46,300 --> 00:21:50,240
computation devoted to computing every new token.
384
00:21:50,240 --> 00:21:56,280
So if you want a system to spend more resources producing an answer,
385
00:21:56,280 --> 00:21:57,540
a system of this type,
386
00:21:57,540 --> 00:22:01,960
you basically have to artificially make it produce more tokens,
387
00:22:01,960 --> 00:22:03,640
which seems kind of a hack.
388
00:22:03,640 --> 00:22:05,400
That's called chain of thought.
389
00:22:05,400 --> 00:22:13,260
There's various techniques to do approximate planning or reasoning using this.
390
00:22:13,260 --> 00:22:18,200
You basically have the system produce lots and lots of candidate outputs by
391
00:22:18,200 --> 00:22:23,920
kind of changing the noise in the way it produces the sequences and then within
392
00:22:23,920 --> 00:22:28,140
the list of outputs that it produces you search for a good one essentially.
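(A minimal sketch of that sample-then-search pattern, sometimes called best-of-N; the generator and the scorer here are made-up placeholders, not any particular system:)

```python
# Produce many noisy candidate outputs, score them, keep the best one.
import random

def generate():
    # Stand-in for sampling one candidate answer with some noise.
    return [random.randrange(100) for _ in range(10)]

def score(candidate):
    # Stand-in "goodness" measure (a verifier or reward model in real systems).
    return -abs(sum(candidate) - 42)

candidates = [generate() for _ in range(16)]   # lots and lots of candidate outputs
best = max(candidates, key=score)              # a little bit of search / optimization
print(best, score(best))
```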
393
00:22:28,140 --> 00:22:32,080
So there's a little bit of search there, a little bit of optimization but it's
394
00:22:32,080 --> 00:22:37,580
kind of a hack. So I don't believe those methods will ever lead to true
395
00:22:37,580 --> 00:22:44,840
intelligent behavior. In fact cognitive scientists agree. Cognitive scientists
396
00:22:44,840 --> 00:22:50,540
have been looking at LLMs with a very critical eye and saying that this is not real intelligence.
397
00:22:50,540 --> 00:22:53,640
This is nothing like what we observe in people.
398
00:22:53,640 --> 00:22:59,840
Similarly, people coming from kind of the non-machine learning based AI community,
399
00:22:59,840 --> 00:23:03,240
people like Subbarao Kambhampati from Arizona State,
400
00:23:03,240 --> 00:23:05,740
have been saying LLMs really cannot plan.
401
00:23:05,740 --> 00:23:09,340
So Rao has a whole bunch of papers.
402
00:23:14,840 --> 00:23:20,400
The titles of those papers are things like: LLMs can't plan,
403
00:23:20,400 --> 00:23:22,720
LLMs still can't plan,
404
00:23:22,720 --> 00:23:25,680
LLMs really, really can't plan,
405
00:23:25,680 --> 00:23:31,080
and even LLMs that claim to be able to plan can't actually plan.
406
00:23:31,080 --> 00:23:37,340
So we have a big problem there that the people who claim
407
00:23:37,340 --> 00:23:40,120
that somehow we're going to take the current paradigm,
408
00:23:40,120 --> 00:23:44,420
make it bigger, spend trillions on data centers,
409
00:23:44,420 --> 00:23:48,280
and collect every piece of data in the world and train
410
00:23:48,280 --> 00:23:50,940
LLMs and they're going to reach human level intelligence.
411
00:23:50,940 --> 00:23:53,340
That's completely false in my opinion.
412
00:23:53,340 --> 00:23:54,780
I might be wrong,
413
00:23:54,780 --> 00:23:58,140
but in my opinion, that's completely hopeless.
414
00:23:58,140 --> 00:24:01,180
So the question is, what is not hopeless?
415
00:24:01,180 --> 00:24:07,720
So if we agree to this basic principle of inference by optimization,
416
00:24:07,720 --> 00:24:12,700
how can we sort of instantiate this in
417
00:24:12,700 --> 00:24:15,000
a real intelligent system?
418
00:24:15,000 --> 00:24:18,100
Basically, doing a little bit of introspection,
419
00:24:18,100 --> 00:24:21,180
when we think, the way we think is generally
420
00:24:21,180 --> 00:24:24,060
independent of the language that we might be able to
421
00:24:24,060 --> 00:24:26,220
express this thought in.
422
00:24:26,220 --> 00:24:29,140
I'm thinking about saying things here and it's
423
00:24:29,140 --> 00:24:31,660
independent of whether I'm giving
424
00:24:31,660 --> 00:24:33,900
this talk in English or French.
425
00:24:33,900 --> 00:24:37,940
So there is a thought that is independent of language,
426
00:24:37,940 --> 00:24:41,140
and LLMs don't have this capacity really.
427
00:24:41,140 --> 00:24:45,140
When we think we have a mental model of the situation that we think of.
428
00:24:45,140 --> 00:24:47,900
We're planning a sequence of actions.
429
00:24:47,900 --> 00:24:52,020
We have a mental model that allows us to predict
430
00:24:52,020 --> 00:24:54,660
what the consequences of our actions are going to be,
431
00:24:54,660 --> 00:24:57,260
so that if we set a goal for ourselves,
432
00:24:57,260 --> 00:25:02,100
we can figure out a sequence of actions that will satisfy this goal.
433
00:25:02,100 --> 00:25:07,680
So, an instantiation of the model I talked about earlier is one like this,
434
00:25:07,680 --> 00:25:11,240
where you observe the world through a perception module.
435
00:25:11,240 --> 00:25:12,800
Think of it as a big neural net.
436
00:25:12,800 --> 00:25:15,800
It gives you some idea of the current state of the world.
437
00:25:15,800 --> 00:25:17,140
Now, of course, the current state of the world
438
00:25:17,140 --> 00:25:18,720
is whatever you can perceive,
439
00:25:18,720 --> 00:25:20,080
but your idea of the state of the world
440
00:25:20,080 --> 00:25:23,920
also contains stuff that you perceived in the past,
441
00:25:23,920 --> 00:25:27,460
stuff that you know, facts that you know about the world.
442
00:25:27,460 --> 00:25:31,480
So if I take this bottle of water
443
00:25:31,480 --> 00:25:35,380
and I move it from this side to that side of the lectern,
444
00:25:35,380 --> 00:25:40,460
Your model of the world hasn't changed much.
445
00:25:40,460 --> 00:25:45,020
Most of your ideas about the state of the world haven't changed.
446
00:25:45,020 --> 00:25:50,420
What has changed is the content of this lectern and the position of that box.
447
00:25:50,420 --> 00:25:53,060
But other than that, not much.
448
00:25:53,060 --> 00:25:57,580
So the idea that somehow a perception gives you
449
00:25:57,580 --> 00:25:59,900
a complete picture of the state of the world is false.
450
00:25:59,900 --> 00:26:02,060
You need to combine this with a memory.
451
00:26:02,060 --> 00:26:04,260
So that's this memory module here.
452
00:26:04,260 --> 00:26:08,620
Combine your current perception with the content of your memory.
453
00:26:08,620 --> 00:26:11,200
That gives you an idea of the current state of the world.
454
00:26:11,200 --> 00:26:14,940
Now, what you're going to do is feed this to a world model,
455
00:26:14,940 --> 00:26:19,440
and you're going to hear that phrase many times in the rest of the talk.
456
00:26:19,440 --> 00:26:22,560
The role of this world model is to predict what
457
00:26:22,560 --> 00:26:25,220
the outcome of a sequence of actions is going to be.
458
00:26:25,220 --> 00:26:27,340
This could be actions that you're planning to take,
459
00:26:27,340 --> 00:26:29,540
or this could be the agent is planning to take,
460
00:26:29,540 --> 00:26:31,980
or actions that someone else may be taking,
461
00:26:31,980 --> 00:26:34,240
or some events that may be occurring.
462
00:26:34,240 --> 00:26:37,080
So predicting the outcome of
463
00:26:37,080 --> 00:26:40,920
a sequence of actions is what allows us to reason and plan.
464
00:26:41,800 --> 00:26:48,000
So you can probably tell that if I take this water bottle
465
00:26:48,000 --> 00:26:53,760
and I put it on its head and I lift my finger,
466
00:26:53,760 --> 00:26:57,320
you can have some pretty good idea of what's going to happen.
467
00:26:57,320 --> 00:26:59,080
It's probably going to fall, right?
468
00:26:59,080 --> 00:27:01,520
It's either going to fall on this side or that side.
469
00:27:01,520 --> 00:27:04,220
You may not be able to predict this because I'm balancing it.
470
00:27:04,220 --> 00:27:06,520
but it's going to fall on one side or the other.
471
00:27:06,520 --> 00:27:08,820
So to some extent, at an abstract level,
472
00:27:08,820 --> 00:27:10,440
you can say it's going to fall.
473
00:27:10,440 --> 00:27:12,720
I can't tell you exactly in which position,
474
00:27:12,720 --> 00:27:15,120
in which direction, but I can tell you it's going to fall.
475
00:27:15,120 --> 00:27:17,520
You have an intuitive physics model,
476
00:27:17,520 --> 00:27:20,440
which is in fact very sophisticated,
477
00:27:20,440 --> 00:27:23,280
even though the situation is incredibly simple.
478
00:27:23,280 --> 00:27:27,060
So that allows us to plan.
479
00:27:27,060 --> 00:27:29,200
This model of the world is what allows us to plan.
480
00:27:29,200 --> 00:27:34,200
So then we can have a system like this that has a task objective,
481
00:27:34,200 --> 00:27:38,040
it sets an objective for itself,
482
00:27:38,040 --> 00:27:42,680
or you set an objective that measures to what extent a task has been accomplished,
483
00:27:42,680 --> 00:27:48,200
whether the resulting state of the world matches some condition.
484
00:27:48,520 --> 00:27:53,560
You might also have a number of guardrail objectives,
485
00:27:53,560 --> 00:28:00,000
things that make sure that whatever actions the agent takes,
486
00:28:00,000 --> 00:28:03,360
nobody's going to get hurt, for example.
487
00:28:03,360 --> 00:28:08,360
So those square boxes are cost functions,
488
00:28:08,360 --> 00:28:10,840
they have an implicit scalar output,
489
00:28:10,840 --> 00:28:13,600
and the overall energy of the system is just the sum of
490
00:28:13,600 --> 00:28:18,120
the scalar outputs of all the red square boxes.
491
00:28:18,120 --> 00:28:19,760
The other modules there,
492
00:28:19,760 --> 00:28:22,040
the one with a round shape,
493
00:28:22,040 --> 00:28:24,920
are deterministic functions, neural nets, let's say,
494
00:28:24,920 --> 00:28:27,200
and the round shapes are variables.
495
00:28:27,200 --> 00:28:29,400
The action sequence is a latent variable,
496
00:28:29,400 --> 00:28:32,680
it's not observed, we're going to compute it by optimization.
497
00:28:32,680 --> 00:28:36,260
We're going to try to find a sequence of actions that minimize
498
00:28:36,260 --> 00:28:40,160
the sum of the task objective and the guardrail objectives,
499
00:28:40,160 --> 00:28:42,860
and that's going to be the output of the system.
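(A minimal sketch of planning as optimization over a latent action sequence, with a hand-written toy world model and toy task and guardrail costs; everything here is an illustrative assumption, not the speaker's architecture code:)

```python
# Find an action sequence that minimizes task cost + guardrail cost
# under a toy world model, by gradient descent over the actions.
import torch

def world_model(state, action):
    return state + 0.1 * action                    # toy dynamics: the action nudges the state

def task_cost(state, goal):
    return ((state - goal) ** 2).sum()             # how far the final state is from the goal

def guardrail_cost(actions):
    return torch.relu(actions.abs() - 1.0).sum()   # penalize overly large actions

state0 = torch.zeros(2)
goal = torch.tensor([1.0, -0.5])
actions = torch.zeros(5, 2, requires_grad=True)    # latent action sequence, found by optimization
opt = torch.optim.Adam([actions], lr=0.1)

for _ in range(300):                               # inference = optimize over the actions
    opt.zero_grad()
    s = state0
    for a in actions:                              # roll the same world model forward repeatedly
        s = world_model(s, a)
    cost = task_cost(s, goal) + guardrail_cost(actions)
    cost.backward()
    opt.step()

print(actions.detach())                            # the planned sequence of actions
```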
500
00:28:44,440 --> 00:28:47,160
Again, that's intrinsically more powerful than
501
00:28:47,160 --> 00:28:50,580
just running through a bunch of feed-forward layers.
502
00:28:50,580 --> 00:28:53,860
So that's the basic architecture.
503
00:28:53,860 --> 00:28:57,000
We can specialize this architecture further.
504
00:28:57,000 --> 00:28:58,820
For a sequence of actions,
505
00:28:58,820 --> 00:29:02,060
I might need to use my world model multiple times.
506
00:29:02,060 --> 00:29:06,680
So if I move that bottle from here to here,
507
00:29:06,680 --> 00:29:08,200
and then from here to here,
508
00:29:08,200 --> 00:29:09,460
that's a sequence of two actions.
509
00:29:09,460 --> 00:29:11,640
I don't need to have a separate model for those two actions.
510
00:29:11,640 --> 00:29:14,200
It's the same model that is just applied twice.
511
00:29:14,200 --> 00:29:17,580
So that's what's represented here,
512
00:29:17,580 --> 00:29:21,360
where action one and action two go into the same model,
513
00:29:21,360 --> 00:29:24,700
and it computes the resulting state.
514
00:29:24,700 --> 00:29:28,520
Planning a sequence of actions to optimize a cost function,
515
00:29:28,520 --> 00:29:30,920
according to a model that you run multiple times,
516
00:29:31,080 --> 00:29:35,480
is a completely standard method in optimal control called model predictive control.
517
00:29:36,040 --> 00:29:41,720
It's been around since the early 60s, so it's as old as me.
518
00:29:43,320 --> 00:29:50,920
And this is what you know the entire optimal control community uses to do motion planning.
519
00:29:50,920 --> 00:29:57,160
Robotics uses motion planning. NASA uses motion planning to you know plan the trajectory of
520
00:29:57,160 --> 00:29:58,920
rockets to rendezvous with the space station.
521
00:29:58,920 --> 00:30:00,780
It's this type of model.
522
00:30:00,780 --> 00:30:03,480
The difference here is that the world model is going to be learned.
523
00:30:03,480 --> 00:30:04,360
It's going to be trained.
524
00:30:04,360 --> 00:30:08,080
It's not going to be written by hand with a bunch of equations.
525
00:30:08,080 --> 00:30:10,340
It's going to be trained from data.
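(One naive illustration of "trained from data, not written down as equations": fit a small network to logged (state, action, next state) transitions by regression. This is only a toy to show the idea; it is not the speaker's method, which the rest of the talk develops differently:)

```python
# Fit a toy world model to logged transitions by regression.
import torch

model = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake logged experience; the true effect of the action is unknown to the model.
s = torch.randn(1024, 4)
a = torch.randn(1024, 2)
s_next = s + 0.1 * torch.cat([a, -a], dim=1)

for _ in range(500):
    opt.zero_grad()
    pred = model(torch.cat([s, a], dim=1))         # predict the next state from (state, action)
    loss = ((pred - s_next) ** 2).mean()
    loss.backward()
    opt.step()
```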
526
00:30:10,340 --> 00:30:13,540
Of course, the question is, how do we do this?
527
00:30:13,540 --> 00:30:14,840
I'll come to this in a second.
528
00:30:14,840 --> 00:30:18,320
Now, the sad thing about the world is two things.
529
00:30:18,320 --> 00:30:24,060
First thing is, you cannot run the world faster than real-time.
530
00:30:24,060 --> 00:30:27,500
That's the limitation.
531
00:30:27,500 --> 00:30:28,940
We have to deal with that.
532
00:30:28,940 --> 00:30:31,220
The second one is that the world is not deterministic.
533
00:30:31,220 --> 00:30:36,160
Or if it is deterministic as some physicists tell us it is,
534
00:30:36,160 --> 00:30:38,860
it's not entirely predictable because we don't have
535
00:30:38,860 --> 00:30:41,960
a full observation of the state of the world.
536
00:30:41,960 --> 00:30:45,260
The way you model
537
00:30:45,260 --> 00:30:48,720
non-deterministic functions out of deterministic functions,
538
00:30:48,720 --> 00:30:51,820
is that you feed them extra inputs that are latent variables.
539
00:30:51,820 --> 00:30:54,560
Those are variables whose values you don't know,
540
00:30:54,560 --> 00:30:57,480
and you can make them sweep through a bunch of
541
00:30:57,480 --> 00:31:01,100
values in a set, or you can sample them from distributions.
542
00:31:01,100 --> 00:31:03,260
For each value of the latent variable,
543
00:31:03,260 --> 00:31:06,260
you get a different prediction from your model.
544
00:31:06,260 --> 00:31:10,220
Okay. So a distribution over the latent variable implies
545
00:31:10,220 --> 00:31:13,580
a distribution over the output of the model.
546
00:31:13,580 --> 00:31:17,060
That's the way to handle uncertainty.
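(A minimal sketch of that latent-variable trick: a deterministic network gets an extra input z whose value is unknown; sampling z gives a distribution over predictions. Shapes and names are illustrative assumptions:)

```python
# A deterministic net with an extra latent input z; sampling z yields
# a distribution over predicted outcomes.
import torch

net = torch.nn.Linear(4 + 2 + 3, 4)                # inputs: state (4), action (2), latent z (3)

def predict(state, action, z):
    return net(torch.cat([state, action, z]))

state, action = torch.randn(4), torch.randn(2)
samples = [predict(state, action, torch.randn(3))  # one prediction per sampled value of z
           for _ in range(8)]
print(torch.stack(samples).std(dim=0))             # spread across samples = predicted uncertainty
```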
547
00:31:17,060 --> 00:31:20,260
Of course, you know, you have to plan in the presence of uncertainty.
548
00:31:20,260 --> 00:31:28,260
So you want to make sure that your plan will succeed regardless of what the values of the latent variable will be.
549
00:31:30,260 --> 00:31:37,260
But in fact, humans and animals don't do planning this way. We do hierarchical planning.
550
00:31:37,260 --> 00:31:44,260
So hierarchical planning means that we have multiple levels of abstraction for representing the state of the world.
551
00:31:44,260 --> 00:31:49,260
We don't represent the world always with the same level of abstraction.
552
00:31:49,260 --> 00:31:52,660
Let me take a concrete example here.
553
00:31:52,660 --> 00:31:56,100
So let's say I'm sitting in my office in NYU
554
00:31:56,100 --> 00:31:57,520
and I want to go to Paris.
555
00:31:58,540 --> 00:32:00,200
At a very high abstract level,
556
00:32:00,200 --> 00:32:02,380
I can predict that if I decide right now
557
00:32:02,380 --> 00:32:03,880
to be in Paris tomorrow morning,
558
00:32:05,420 --> 00:32:07,640
I can go to the airport tonight
559
00:32:07,640 --> 00:32:10,540
and catch a plane to Paris and fly overnight.
560
00:32:11,500 --> 00:32:13,380
That's a plan, it's a very high level plan.
561
00:32:13,380 --> 00:32:15,240
I can't predict all the details of what's gonna happen,
562
00:32:15,240 --> 00:32:16,540
but at a high level,
563
00:32:16,540 --> 00:32:21,540
I know that I need to go to the airport and then catch a plane.
564
00:32:21,540 --> 00:32:24,540
Now I have a sub-goal. How do I go to the airport?
565
00:32:24,540 --> 00:32:30,540
Well, I need to go down on the street and hail a taxi because we're in New York.
566
00:32:30,540 --> 00:32:33,540
How do I go down on the street?
567
00:32:33,540 --> 00:32:39,540
I need to go to the elevator, push the button, and then walk out the door.
568
00:32:39,540 --> 00:32:42,540
How do I go to the elevator?
569
00:32:42,540 --> 00:32:51,920
I need to stand up from my chair, pick up my bag, open the door, close the door, walk to the elevator, avoid all the obstacles that I perceive, push the button.
570
00:32:53,120 --> 00:32:54,420
How do I stand up from my chair?
571
00:32:56,060 --> 00:33:01,860
So there is a level below which language is insufficient to express what we need to do.
572
00:33:02,800 --> 00:33:05,080
You cannot explain to someone how you stand up from a chair.
573
00:33:06,540 --> 00:33:10,920
You kind of have to know this in your muscles.
574
00:33:10,920 --> 00:33:13,800
You need to understand the physical world to be able to do this.
575
00:33:13,800 --> 00:33:16,220
So that's the other limitation of LLMs.
576
00:33:16,220 --> 00:33:20,420
Their level of abstraction is high because they manipulate language,
577
00:33:20,420 --> 00:33:23,800
but they're not grounded on reality.
578
00:33:23,800 --> 00:33:27,380
They have no idea what the physical world is like.
579
00:33:27,380 --> 00:33:33,260
That drives them to make really stupid mistakes and appear very,
580
00:33:33,260 --> 00:33:35,540
very stupid in many situations.
581
00:33:35,540 --> 00:33:38,640
So we need systems that really go
582
00:33:38,640 --> 00:33:41,160
down all the way to that level.
583
00:33:41,160 --> 00:33:43,960
And this is what your house cat can do
584
00:33:43,960 --> 00:33:45,300
and LLMs cannot do.
585
00:33:46,140 --> 00:33:48,140
Which is why I'm saying your house cat is smarter
586
00:33:48,140 --> 00:33:50,800
than the smartest LLMs.
587
00:33:50,800 --> 00:33:54,020
Of course house cats don't have nearly as much
588
00:33:54,020 --> 00:33:58,600
abstract knowledge stored in their memory as an LLM.
589
00:33:58,600 --> 00:34:02,120
But they're really smart in their understanding of the world
590
00:34:02,120 --> 00:34:03,200
and their ability to plan.
591
00:34:03,200 --> 00:34:05,200
And they can plan hierarchically as well.
592
00:34:05,200 --> 00:34:13,540
So what we need there is, you know, world models that are at multiple levels of abstraction,
593
00:34:13,540 --> 00:34:16,420
and how to train this is not completely obvious.
594
00:34:16,420 --> 00:34:23,520
Okay, so this whole idea, this whole kind of spiel leads to a view of AI that I call
595
00:34:23,520 --> 00:34:25,420
Objective Driven AI Systems.
596
00:34:25,420 --> 00:34:26,760
It's a recent name.
597
00:34:26,760 --> 00:34:33,700
I wrote a vision paper two and a half years ago that I put online at this URL on OpenReview.
598
00:34:33,700 --> 00:34:41,300
It's not on arXiv because I want open comments and so that I can update this paper.
599
00:34:41,300 --> 00:34:47,260
And it's the groundwork for the talk I'm giving at the moment, but in the last two and a half
600
00:34:47,260 --> 00:34:51,380
years we've made progress towards that plan, so I'm going to give you some experimental
601
00:34:51,380 --> 00:34:55,340
results and things we built.
602
00:34:55,340 --> 00:34:59,760
So the architecture I'm proposing in that paper is a so-called cognitive architecture
603
00:34:59,760 --> 00:35:02,300
that has the components I just expressed,
604
00:35:02,300 --> 00:35:03,800
things like a perception module
605
00:35:03,800 --> 00:35:05,440
that estimates the state of the world,
606
00:35:05,440 --> 00:35:08,440
a memory that you can use,
607
00:35:08,440 --> 00:35:11,160
a world model which is kind of a centerpiece a little bit,
608
00:35:11,160 --> 00:35:12,940
a bunch of cost modules
609
00:35:12,940 --> 00:35:16,800
that are either defining tasks or guardrails,
610
00:35:16,800 --> 00:35:18,840
and then an actor, and what the actor does
611
00:35:18,840 --> 00:35:20,720
is that basically finding,
612
00:35:20,720 --> 00:35:22,520
doing this optimization procedure,
613
00:35:22,520 --> 00:35:24,020
finding the best sequence of actions
614
00:35:24,020 --> 00:35:26,380
to satisfy the objectives.
615
00:35:26,380 --> 00:35:28,600
There is a mysterious configurator module at the top,
616
00:35:28,600 --> 00:35:29,780
I'm not going to explain,
617
00:35:29,780 --> 00:35:36,160
but basically its role would be to set the goal for the current situation.
618
00:35:36,160 --> 00:35:37,160
Okay.
619
00:35:37,160 --> 00:35:43,100
Okay. So perhaps with an architecture of this type,
620
00:35:43,100 --> 00:35:45,840
we will have systems that understand the physical world, etc.
621
00:35:45,840 --> 00:35:51,000
and have the system two ability of reasoning.
622
00:35:51,000 --> 00:35:55,460
But then how can we learn those world models from sensory inputs?
623
00:35:55,460 --> 00:35:57,520
That's really kind of the trick.
624
00:35:57,520 --> 00:36:00,280
And the answer to this is self-supervised learning.
625
00:36:00,280 --> 00:36:07,280
So self-supervised learning is something that has been extremely successful in the context of natural language understanding over the last few years.
626
00:36:07,280 --> 00:36:10,080
Basically it's completely dominating NLP.
627
00:36:10,080 --> 00:36:15,160
Every NLP system, LLM, etc., is trained with self-supervised learning.
628
00:36:15,160 --> 00:36:19,080
What does that mean? It means that there is no difference between inputs and outputs.
629
00:36:19,080 --> 00:36:27,480
Basically you take a big input, you corrupt it in some way, and you train some gigantic neural net to restore the full input if you want.
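(A minimal masking/denoising sketch of that corrupt-and-restore recipe, on toy data; names and numbers are illustrative assumptions:)

```python
# Self-supervised learning sketch: corrupt part of the input and train
# a network to restore the missing part. No labels involved.
import torch

net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 16))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(500):
    x = torch.randn(128, 16)                       # the "big input"
    mask = (torch.rand_like(x) < 0.3).float()      # corrupt ~30% of each input
    x_corrupt = x * (1 - mask)
    opt.zero_grad()
    loss = ((net(x_corrupt) - x) ** 2 * mask).mean()   # restore the hidden parts
    loss.backward()
    opt.step()
```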
630
00:36:27,480 --> 00:36:32,480
But, you know, it's not going to be sufficient.
631
00:36:32,480 --> 00:36:36,480
We're still missing something. Another piece of evidence
632
00:36:36,480 --> 00:36:39,480
that we're missing something big about intelligence is that,
633
00:36:39,480 --> 00:36:44,480
although we have LLMs that can pass the bar exam,
634
00:36:44,480 --> 00:36:50,480
or some high school exams, maybe not calculus one, I don't know,
635
00:36:50,480 --> 00:36:56,480
we still do not have domestic robots that can accomplish tasks
636
00:36:56,480 --> 00:37:00,480
A 10 year old can learn in one shot or zero shot.
637
00:37:00,480 --> 00:37:02,480
The first time you ask a 10 year old,
638
00:37:02,480 --> 00:37:04,480
clear the dinner table and fill up the dishwasher,
639
00:37:04,480 --> 00:37:06,480
they're able to do it.
640
00:37:06,480 --> 00:37:08,480
They don't need to learn.
641
00:37:08,480 --> 00:37:10,480
They can just plan.
642
00:37:12,480 --> 00:37:14,480
Any 17 year old can learn to drive a car
643
00:37:14,480 --> 00:37:16,480
in about 20 hours of practice.
644
00:37:16,480 --> 00:37:20,480
We still do not have level 5 autonomous self-driving cars.
645
00:37:20,480 --> 00:37:22,480
We have level 2, we have level 3,
646
00:37:22,480 --> 00:37:24,480
so they're partially autonomous.
647
00:37:24,480 --> 00:37:29,360
We have some level fives in limited areas, but they are very
648
00:37:29,360 --> 00:37:32,700
instrumented and they cheat. They have a map of the entire environment, so if you
649
00:37:32,700 --> 00:37:36,660
think about the Waymo cars, that's where they are. And they certainly don't
650
00:37:36,660 --> 00:37:42,120
need only 20 hours of practice to learn to drive. So that's what we're missing,
651
00:37:42,120 --> 00:37:47,140
something big. And that's really a new version of the Moravec paradox that, you
652
00:37:47,140 --> 00:37:50,880
know, things that are easy for humans are difficult for AI and vice versa. And
653
00:37:50,880 --> 00:37:54,760
we've tended to neglect the complexity
654
00:37:54,760 --> 00:37:55,940
of dealing with the real world,
655
00:37:55,940 --> 00:38:00,720
like perception and action, motor control.
656
00:38:00,720 --> 00:38:02,320
Perhaps a reason for this
657
00:38:02,320 --> 00:38:05,480
resides in this really simple calculation.
658
00:38:05,480 --> 00:38:07,560
An LLM, a typical LLM of today,
659
00:38:07,560 --> 00:38:10,060
is trained on 20 trillion tokens, okay?
660
00:38:10,060 --> 00:38:11,360
Two times 10 to the 13.
661
00:38:13,300 --> 00:38:17,140
That corresponds to a little less than 20 trillion words,
662
00:38:17,140 --> 00:38:18,560
because the token is a subword unit.
663
00:38:18,560 --> 00:38:21,860
Each token usually is represented by three bytes
664
00:38:21,860 --> 00:38:22,680
or something like that.
665
00:38:22,680 --> 00:38:25,920
So that is a volume of training data
666
00:38:25,920 --> 00:38:27,880
of six times 10 to the 13 bytes.
667
00:38:29,800 --> 00:38:31,420
That would take a few hundred thousand years
668
00:38:31,420 --> 00:38:33,320
for any of us to read through that material.
669
00:38:33,320 --> 00:38:36,800
It's basically the entire text
670
00:38:36,800 --> 00:38:38,400
available publicly on the internet.
671
00:38:39,940 --> 00:38:43,200
Now a human child, a four-year-old,
672
00:38:43,200 --> 00:38:46,280
has been awake a total of 16,000 hours.
673
00:38:46,280 --> 00:38:49,780
That's what developmental psychologists tell me.
674
00:38:50,640 --> 00:38:52,040
Which by the way is not a lot of data,
675
00:38:52,040 --> 00:38:54,040
that's 30 minutes of YouTube uploads.
676
00:38:56,940 --> 00:39:00,880
And I don't know how much Instagram, I should.
677
00:39:02,000 --> 00:39:05,140
We have two million optic nerve fibers
678
00:39:05,140 --> 00:39:07,640
going to our brain through our eyes.
679
00:39:07,640 --> 00:39:09,800
The amount of information getting to the eyes is enormous
680
00:39:09,800 --> 00:39:12,040
because we have 100 million photosensors
681
00:39:12,040 --> 00:39:13,540
or something like that.
682
00:39:13,540 --> 00:39:15,660
But it's being reduced, squeezed down,
683
00:39:15,660 --> 00:39:18,100
to the optic nerve before it gets to the brain.
684
00:39:18,100 --> 00:39:20,540
And that's about two million nerve fibers,
685
00:39:20,540 --> 00:39:23,300
each carrying a little less than one byte per second,
686
00:39:23,300 --> 00:39:25,020
a few bits per second, okay?
687
00:39:25,020 --> 00:39:30,020
So the volume of data there is about 10 to the 14 bytes,
688
00:39:32,040 --> 00:39:32,880
maybe a little less.
689
00:39:32,880 --> 00:39:36,000
It's the same order of magnitude as the biggest LLM.
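(The arithmetic from this passage, written out; the numbers are the ones quoted in the talk:)

```python
# Back-of-the-envelope comparison: LLM training text vs. a four-year-old's visual input.
llm_tokens = 20e12                      # ~20 trillion tokens
llm_bytes = llm_tokens * 3              # ~3 bytes per token -> ~6e13 bytes of text

hours_awake = 16_000                    # a four-year-old's total waking hours
optic_fibers = 2e6                      # optic nerve fibers to the brain
bytes_per_fiber_per_second = 1          # "a little less than one byte per second"
child_bytes = hours_awake * 3600 * optic_fibers * bytes_per_fiber_per_second

print(f"LLM text:     {llm_bytes:.0e} bytes")    # ~6e13
print(f"Child vision: {child_bytes:.0e} bytes")  # ~1e14, same order of magnitude
```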
690
00:39:36,000 --> 00:39:38,900
In four years, a child has seen more data
691
00:39:40,260 --> 00:39:43,960
about the real world than the biggest LLM trained
692
00:39:43,960 --> 00:39:46,580
on the entirety of all the publicly available texts
693
00:39:46,580 --> 00:39:48,660
on the internet that would take any of us,
694
00:39:50,360 --> 00:39:52,560
you know, hundreds of millennia to read through.
695
00:39:53,500 --> 00:39:55,240
So that tells you we're never gonna reach
696
00:39:55,240 --> 00:39:57,120
human level intelligence by training on text.
697
00:39:57,120 --> 00:39:58,300
It's just not happening.
698
00:39:59,360 --> 00:40:01,960
Okay, we need systems to really understand the world
699
00:40:01,960 --> 00:40:05,900
through high bandwidth input, like vision or touch.
700
00:40:05,900 --> 00:40:07,320
Okay, blind people can get smart
701
00:40:07,320 --> 00:40:09,220
because they have other senses.
702
00:40:11,780 --> 00:40:13,880
And in fact, you know, if you look at how long it takes
703
00:40:13,880 --> 00:40:21,300
For children, infants, to learn basic concepts about the real world, it takes several months.
704
00:40:21,940 --> 00:40:29,500
So a child will learn the difference between animate and inanimate objects within the first
705
00:40:29,500 --> 00:40:34,040
three months of life, opening their eyes. Object permanence appears really early,
706
00:40:34,440 --> 00:40:39,980
maybe around two months. Notions of solidity, rigidity, and stability and support,
707
00:40:39,980 --> 00:40:45,340
that's in the first six months. So this idea that, you know, this is not going to be stable is going
708
00:40:45,340 --> 00:40:53,900
to fall. And then notions of intuitive physics like gravity, inertia, conservation of momentum,
709
00:40:53,900 --> 00:40:59,260
this kind of stuff, that we have an intuitive level that any animal has too, that only pops
710
00:40:59,260 --> 00:41:04,380
up around nine months in baby humans, much earlier in baby goats and other animals.
711
00:41:09,980 --> 00:41:14,940
Most of that is through observation. There's not much interaction. You know, babies can hardly
712
00:41:14,940 --> 00:41:20,380
affect the world in the first four months of life. They do afterwards. If you put an eight-month-old
713
00:41:20,380 --> 00:41:24,140
baby on a chair with a bunch of toys, the first thing they'll do is throw the toys on the ground
714
00:41:24,140 --> 00:41:28,460
because that's how they do the experiment about gravity. You know, does it apply to this new thing
715
00:41:28,460 --> 00:41:36,220
I'm seeing on my chair? Okay, so there is a very natural idea which is to transpose the stuff that
716
00:41:36,220 --> 00:41:38,820
has worked for text to video.
717
00:41:38,820 --> 00:41:42,360
Can we just train a generative model to learn to predict video?
718
00:41:42,360 --> 00:41:44,760
And then that system will just understand how the world works,
719
00:41:44,760 --> 00:41:48,020
because it's going to be able to predict what happens in the video.
720
00:41:48,020 --> 00:41:53,640
And it's been a bit of my obsession in terms of research for
721
00:41:53,640 --> 00:41:56,760
the last at least 15 years, if not more.
722
00:41:56,760 --> 00:41:59,460
Okay, so this predates LLMs and everything.
723
00:41:59,460 --> 00:42:01,520
Okay, this idea that you can learn by prediction,
724
00:42:01,520 --> 00:42:03,120
it's a very old concept in neuroscience,
725
00:42:03,120 --> 00:42:05,720
but it's something I've really been sort of,
726
00:42:05,720 --> 00:42:08,480
working on with my students,
727
00:42:08,480 --> 00:42:11,520
collaborators for many years.
728
00:42:11,520 --> 00:42:15,280
And the idea of course is to use a generative model, right?
729
00:42:15,280 --> 00:42:18,640
Give to a system a piece of video,
730
00:42:19,240 --> 00:42:23,320
and then try to predict what's going to happen next in the video.
731
00:42:23,320 --> 00:42:28,000
Just the same way that we train LLMs to predict what happens next in the text.
732
00:42:28,800 --> 00:42:33,560
Perhaps if you want the system to be kind of a world model,
733
00:42:33,560 --> 00:42:37,180
you can feed this model with an action variable,
734
00:42:37,180 --> 00:42:38,680
the A variable here,
735
00:42:38,680 --> 00:42:42,040
which in this case would simply be masking essentially.
736
00:42:42,040 --> 00:42:43,780
So take a video, mask a piece of it,
737
00:42:43,780 --> 00:42:45,600
let's say the second half of it,
738
00:42:45,600 --> 00:42:47,080
run it through some big neural net and
739
00:42:47,080 --> 00:42:50,500
train it to predict the second half of the full video.
740
00:42:50,760 --> 00:42:54,740
We tried for a good part of 15 years,
741
00:42:54,740 --> 00:42:56,500
it doesn't work.
742
00:42:56,500 --> 00:42:59,620
It doesn't work because there are many,
743
00:42:59,620 --> 00:43:02,000
many things that can happen in a video and a system of
744
00:43:02,000 --> 00:43:04,000
This type basically will just predict one thing.
745
00:43:05,700 --> 00:43:07,880
And so the problem with a system
746
00:43:07,880 --> 00:43:10,240
that predicts only one thing is that
747
00:43:10,240 --> 00:43:12,840
the best thing you can predict is the average
748
00:43:12,840 --> 00:43:15,640
of all the possible, plausible things that may happen.
749
00:43:15,640 --> 00:43:16,620
And you see an example here,
750
00:43:16,620 --> 00:43:19,060
that's an early paper in video prediction,
751
00:43:19,060 --> 00:43:20,720
trying to predict what's gonna happen
752
00:43:20,720 --> 00:43:24,820
in this really short six-frame video with this little girl.
753
00:43:24,820 --> 00:43:27,400
The first four frames are observed,
754
00:43:27,400 --> 00:43:30,460
the last two are predicted, and what you see is a blurry mess,
755
00:43:30,460 --> 00:43:31,640
because the system really cannot predict
756
00:43:31,640 --> 00:43:34,200
what's going to happen, so it predicts the average.
757
00:43:34,400 --> 00:43:36,840
You see this at the bottom as well,
758
00:43:36,840 --> 00:43:38,880
if you can play that video again.
759
00:43:38,880 --> 00:43:41,400
This is a top-down view of a highway,
760
00:43:41,400 --> 00:43:43,600
and the green things are like cars.
761
00:43:43,600 --> 00:43:46,800
The second column are predictions made by
762
00:43:46,800 --> 00:43:49,280
a neural net trying to predict what's going to happen in that video.
763
00:43:49,280 --> 00:43:52,960
You see those blurry extending cars
764
00:43:52,960 --> 00:43:55,720
because it really cannot predict what's happening.
765
00:43:55,720 --> 00:43:58,840
So the columns on the right are
766
00:43:58,840 --> 00:44:01,160
a different model that has a latent variable which is
767
00:44:01,160 --> 00:44:04,760
designed to capture the variability between the potential predictions,
768
00:44:04,760 --> 00:44:07,200
and those predictions are not blurry.
769
00:44:07,200 --> 00:44:14,180
So we thought that we had a good solution to that problem five years ago with latent variables,
770
00:44:14,180 --> 00:44:16,580
but it turns out to not work for real video.
771
00:44:16,580 --> 00:44:18,200
It works for simple videos like this one,
772
00:44:18,200 --> 00:44:20,980
but it doesn't for real world.
773
00:44:20,980 --> 00:44:24,120
So we can't train this thing on video.
774
00:44:24,120 --> 00:44:26,880
So the solution to that problem is interesting,
775
00:44:26,880 --> 00:44:30,060
is to abandon the whole idea of generative models.
776
00:44:30,060 --> 00:44:37,060
Everybody is talking about generative models like it's the new Messiah.
777
00:44:37,060 --> 00:44:41,420
What I'm telling you today is forget about generative models.
778
00:44:41,420 --> 00:44:45,120
Okay. The solution to that problem,
779
00:44:45,120 --> 00:44:48,280
we think, is what we call joint embedding architectures,
780
00:44:48,280 --> 00:44:51,680
or more precisely joint embedding predictive architectures.
781
00:44:51,680 --> 00:44:53,840
This is really the way to build a world model.
782
00:44:53,840 --> 00:44:56,180
So what does this consist of?
783
00:44:56,180 --> 00:44:58,000
It's you take that video,
784
00:44:58,000 --> 00:44:59,900
you corrupt it, you mask a piece of it,
785
00:44:59,900 --> 00:45:01,720
for example, okay?
786
00:45:01,720 --> 00:45:04,060
And you run it through a big neural net,
787
00:45:04,060 --> 00:45:05,920
but what the big neural net is trained to do
788
00:45:05,920 --> 00:45:08,520
is not predict all the pixels in the video,
789
00:45:08,520 --> 00:45:11,320
it's trained to predict an abstract representation
790
00:45:12,400 --> 00:45:14,360
of the future of that video, okay?
791
00:45:14,360 --> 00:45:16,280
So you take the original video,
792
00:45:16,280 --> 00:45:17,460
you take the masked one,
793
00:45:17,460 --> 00:45:18,960
you run them through encoders,
794
00:45:18,960 --> 00:45:21,520
now you have abstract representations
795
00:45:21,520 --> 00:45:24,920
of the full video and the corrupted one,
796
00:45:24,920 --> 00:45:26,820
and you train a predictor
797
00:45:26,820 --> 00:45:28,540
to predict the representation of the full video,
798
00:45:28,540 --> 00:45:30,900
from the representation of the corrupted one.
799
00:45:32,020 --> 00:45:32,820
Okay.
800
00:45:32,820 --> 00:45:33,700
This is called JEPA.
801
00:45:33,700 --> 00:45:35,660
That means Joint Embedding Predictive Architecture.
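A minimal sketch of that training setup in Python pseudocode; encoder, target_encoder, predictor and mask are placeholder modules, not the actual models from the papers, and as explained below this loss alone is not enough, because it can collapse:

import torch
import torch.nn.functional as F

def jepa_step(video, encoder, target_encoder, predictor, mask):
    # Representation of the full video (no gradient through this branch here).
    with torch.no_grad():
        s_y = target_encoder(video)
    # Representation of the corrupted (masked) video.
    s_x = encoder(mask(video))
    # Predict the representation of the full video from the corrupted one.
    s_pred = predictor(s_x)
    # Prediction error measured in representation space, not pixel space.
    return F.mse_loss(s_pred, s_y)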
802
00:45:35,660 --> 00:45:37,580
There's a bunch of papers from the last few years
803
00:45:37,580 --> 00:45:41,340
that my collaborators and I have published on this idea.
804
00:45:41,340 --> 00:45:43,780
And it solves the problem of having to predict
805
00:45:43,780 --> 00:45:47,100
all kinds of details that you really cannot predict.
806
00:45:47,100 --> 00:45:49,580
So if I were to take a video of this crowd,
807
00:45:50,980 --> 00:45:52,940
in fact I can take a video of this crowd.
808
00:45:55,020 --> 00:45:57,380
Okay, now I'm taking a video of you guys.
809
00:45:57,380 --> 00:46:01,460
Okay, and I slowly turn my head towards the right.
810
00:46:03,440 --> 00:46:04,780
Gonna shut down the video now.
811
00:46:06,740 --> 00:46:09,860
Certainly, a prediction system can predict this is a room,
812
00:46:09,860 --> 00:46:13,280
it's a conference room, there's people sitting everywhere.
813
00:46:13,280 --> 00:46:16,420
It may not be able to predict that all the chairs are full.
814
00:46:16,420 --> 00:46:18,000
It certainly cannot predict
815
00:46:18,000 --> 00:46:20,080
what every single one of you looks like.
816
00:46:20,080 --> 00:46:21,060
There's absolutely no way.
817
00:46:21,060 --> 00:46:22,800
It cannot predict what the texture on the wall
818
00:46:22,800 --> 00:46:26,860
is going to be, or even the color of the side.
819
00:46:26,860 --> 00:46:30,200
So there are things that are just completely unpredictable.
820
00:46:30,200 --> 00:46:31,620
You don't have the information to do it.
821
00:46:31,620 --> 00:46:34,260
And if you train a system to predict all those details,
822
00:46:34,260 --> 00:46:36,240
it's going to spend all of its resources
823
00:46:36,240 --> 00:46:37,660
predicting irrelevant details.
824
00:46:38,540 --> 00:46:40,220
So what a jet pad does when you train it,
825
00:46:40,220 --> 00:46:41,980
and I'm gonna tell you how you train this,
826
00:46:41,980 --> 00:46:45,700
is that it finds a trade-off between extracting
827
00:46:45,700 --> 00:46:48,040
as much information as possible from the input,
828
00:46:48,040 --> 00:46:50,340
but only extracting things that it can predict.
829
00:46:53,260 --> 00:46:55,100
And there is an issue with those kinds of architectures.
830
00:46:55,100 --> 00:47:01,100
Here is a contrast between the generative architecture that tried to reproduce Y directly
831
00:47:01,100 --> 00:47:06,640
and the joint embedding architecture which only tries to do prediction in representation
832
00:47:06,640 --> 00:47:09,560
space on the right.
833
00:47:09,560 --> 00:47:14,480
There's a problem with the joint embedding architecture and this is why we've only been
834
00:47:14,480 --> 00:47:16,100
working on this in recent years.
835
00:47:16,100 --> 00:47:21,100
It is the fact that if you just train the parameters of those neural nets to minimize
836
00:47:21,100 --> 00:47:23,940
the prediction error, it collapses.
837
00:47:23,940 --> 00:47:27,340
It basically ignores the inputs X and Y.
838
00:47:27,340 --> 00:47:29,400
It makes SX and SY,
839
00:47:29,400 --> 00:47:32,260
the two representations, constant,
840
00:47:32,260 --> 00:47:34,180
and then the prediction problem is trivial.
841
00:47:37,220 --> 00:47:39,200
And that's not a good thing.
842
00:47:39,200 --> 00:47:43,240
So that's an example of this energy-based framework
843
00:47:43,240 --> 00:47:44,960
that I was describing earlier.
844
00:47:46,060 --> 00:47:50,200
It gives zero energy to every pair of XY, essentially.
845
00:47:50,200 --> 00:47:51,420
But what you want is zero energy
846
00:47:51,420 --> 00:47:53,160
for the pairs of XY you're training on,
847
00:47:53,160 --> 00:47:55,940
but higher energy for things that you don't train it on,
848
00:47:55,940 --> 00:47:57,820
and that's the hard part.
849
00:47:57,820 --> 00:48:01,780
So next I'm going to explain how you make that possible,
850
00:48:01,780 --> 00:48:05,280
how you make sure that the pairs of XY
851
00:48:05,280 --> 00:48:07,480
that are not compatible have a higher energy.
852
00:48:09,740 --> 00:48:12,140
There's variations of those architectures,
853
00:48:12,140 --> 00:48:14,220
some of which can have latent variables
854
00:48:14,220 --> 00:48:17,140
or have the action conditioning if you want
855
00:48:17,140 --> 00:48:18,680
it to be a world model.
856
00:48:19,720 --> 00:48:22,240
And there's been papers on this for many years now.
857
00:48:22,240 --> 00:48:24,200
The earliest joint embedding architecture actually
858
00:48:24,200 --> 00:48:25,320
is from the early 90s.
859
00:48:25,320 --> 00:48:28,000
It's a paper of mine about Siamese networks.
860
00:48:30,060 --> 00:48:31,720
But we're gonna have to train
861
00:48:31,720 --> 00:48:34,240
those sort of generic architectures.
862
00:48:34,240 --> 00:48:36,400
So how do we do this?
863
00:48:37,440 --> 00:48:38,680
So remember this picture, right?
864
00:48:38,680 --> 00:48:41,260
We wanna give low energy to stuff that are compatible,
865
00:48:41,260 --> 00:48:43,260
things that we observe, training sets,
866
00:48:43,260 --> 00:48:44,940
training samples, X and Y,
867
00:48:44,940 --> 00:48:46,440
higher energy to everything else.
868
00:48:47,740 --> 00:48:48,860
So there are two sets of methods,
869
00:48:48,860 --> 00:48:51,840
contrastive methods and what I call regularized methods.
870
00:48:51,840 --> 00:49:00,640
So contrastive methods consist in basically generating contrastive pairs of X and Y that are not in the training set.
871
00:49:01,520 --> 00:49:04,560
So pick an X and pick another Y that's not compatible with it.
872
00:49:04,560 --> 00:49:06,920
And that gives you one of those green dots that you see flashing.
873
00:49:08,040 --> 00:49:13,860
And your loss function is going to consist in pushing down on the energy of the blue dots, which are the training samples,
874
00:49:14,040 --> 00:49:17,760
and then pushing up on the energy of the green dots, which are those contrastive samples.
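As a hedged illustration of that push-down, push-up idea, here is a simple margin-based contrastive loss in Python; the energy function and the margin value are placeholders:

import torch

def contrastive_loss(energy, x, y_pos, y_neg, margin=1.0):
    # Push down the energy of an observed (x, y) pair from the training set...
    e_pos = energy(x, y_pos)
    # ...and push up the energy of a mismatched, contrastive pair, up to a margin.
    e_neg = energy(x, y_neg)
    return e_pos + torch.clamp(margin - e_neg, min=0.0)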
875
00:49:17,760 --> 00:49:24,120
Okay, this is a good idea and there's a bunch of algorithms that people have used to train this.
876
00:49:24,120 --> 00:49:29,200
Some of them, for example, for joint embedding between images and text, are things like Clip
877
00:49:29,200 --> 00:49:36,960
from OpenAI. They use contrastive methods. SimCLR, from a team at Google that includes Geoff
878
00:49:36,960 --> 00:49:43,980
Hinton. And then Siamese nets back from the 90s that I used to advocate. The issue with contrastive
879
00:49:43,980 --> 00:49:47,980
methods is that the intrinsic dimension of the embedding that they produce is
880
00:49:47,980 --> 00:49:53,460
usually fairly low and so the representations that are learned by it
881
00:49:53,460 --> 00:49:57,480
are kind of degenerate a little bit. So I prefer the regularized method. What is
882
00:49:57,480 --> 00:50:02,100
the idea behind the regularized method? The idea is that you minimize the volume
883
00:50:02,100 --> 00:50:07,980
of space that can take low energy. So you have some sort of regularizer term in
884
00:50:07,980 --> 00:50:11,580
your loss function and that term basically measures the volume of stuff
885
00:50:11,580 --> 00:50:17,180
that has low energy and you try to minimize it. So what that means is that whenever you push down
886
00:50:17,180 --> 00:50:22,140
the energy of one region of that space, the rest has to go up because there's only a limited amount
887
00:50:22,140 --> 00:50:29,740
of low energy volume to go around. And you know that sounds a little abstract and mysterious,
888
00:50:29,740 --> 00:50:35,660
but in practice the way you do this is there's like a handful of methods to do this,
889
00:50:35,660 --> 00:50:39,660
which I'm going to explain in a second. Before that I'm going to tell you how you test how well
890
00:50:39,660 --> 00:50:40,840
those systems work, right?
891
00:50:40,840 --> 00:50:43,640
So in the context of image recognition,
892
00:50:43,640 --> 00:50:46,240
you give two images that you know are the same image,
893
00:50:46,240 --> 00:50:48,740
either, so you take an image and you corrupt it,
894
00:50:48,740 --> 00:50:50,980
or you transform it in some way.
895
00:50:50,980 --> 00:50:52,660
You change the scale, you rotate it,
896
00:50:52,660 --> 00:50:53,820
you change the colors a little bit,
897
00:50:53,820 --> 00:50:56,060
maybe you mask parts of it, okay?
898
00:50:56,060 --> 00:50:58,840
And then you train an encoder and a predictor
899
00:50:58,840 --> 00:51:01,020
so that the predictor predicts the representation
900
00:51:01,020 --> 00:51:03,220
of the full image from the representation
901
00:51:03,220 --> 00:51:05,940
of the corrupted one.
902
00:51:05,940 --> 00:51:07,600
And then once the system is trained,
903
00:51:07,600 --> 00:51:09,020
you chop off the predictor,
904
00:51:09,020 --> 00:51:11,520
you use the encoder as input to a classifier,
905
00:51:11,520 --> 00:51:14,440
and you train a supervised classifier to do things
906
00:51:14,440 --> 00:51:17,020
like object recognition or something of that type.
907
00:51:17,020 --> 00:51:19,940
So that's a way of measuring the quality of the features
908
00:51:19,940 --> 00:51:24,060
that have been learned by the system.
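A rough sketch of that evaluation protocol in Python: freeze the self-supervised encoder, drop the predictor, and train only a small classifier head on top (names and sizes are illustrative):

import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes):
    # Keep the pre-trained encoder frozen; only the linear head is trained.
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Linear(feature_dim, num_classes)
    return nn.Sequential(encoder, head)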
909
00:51:24,060 --> 00:51:28,560
There's been a number of papers on this,
910
00:51:28,560 --> 00:51:33,220
and what has been transpiring is that those methods work really
911
00:51:33,220 --> 00:51:36,180
well to train a system to extract
912
00:51:36,180 --> 00:51:37,660
generic features from images,
913
00:51:37,660 --> 00:51:39,660
the joint embedding architectures.
914
00:51:39,660 --> 00:51:42,020
There's been a lot of work also on
915
00:51:42,020 --> 00:51:45,080
generative architectures like autoencoders,
916
00:51:45,080 --> 00:51:47,500
variational autoencoders, VQVAEs,
917
00:51:47,500 --> 00:51:49,620
masked autoencoders, denoising autoencoders,
918
00:51:49,620 --> 00:51:51,780
all kinds of techniques of this type that basically,
919
00:51:51,780 --> 00:51:53,860
you give a corrupted version of an image,
920
00:51:53,860 --> 00:51:55,460
and then you train the system to
921
00:51:55,460 --> 00:51:57,780
recover the full image at the pixel level.
922
00:51:57,780 --> 00:52:00,180
Those methods do not work nearly as
923
00:52:00,180 --> 00:52:02,260
well as the joint embedding methods.
924
00:52:02,260 --> 00:52:04,700
We discovered this five or six years ago,
925
00:52:04,700 --> 00:52:09,460
not just us, but there was an accumulating amount of evidence showing that joint embedding
926
00:52:09,460 --> 00:52:17,300
was really superior to reconstruction-based systems, that is, to generative architectures.
927
00:52:17,300 --> 00:52:20,760
And at the time, the methods for training were only contrastive.
928
00:52:20,760 --> 00:52:25,260
But now we've found some other techniques, and one technique in particular that, or one
929
00:52:25,260 --> 00:52:30,940
set of techniques that attempt to maximize some measure of information, information content
930
00:52:30,940 --> 00:52:32,640
coming out of the encoder.
931
00:52:32,640 --> 00:52:36,320
So one of the criteria used for training is this minus i,
932
00:52:36,320 --> 00:52:38,040
the measure of information content.
933
00:52:38,040 --> 00:52:39,660
Since we minimize cost function,
934
00:52:39,660 --> 00:52:40,720
there is a minus sign in front,
935
00:52:40,720 --> 00:52:42,760
so you maximize information content.
936
00:52:42,760 --> 00:52:44,500
How do we do this?
937
00:52:44,500 --> 00:52:47,060
So one simple trick that we've used is something called
938
00:52:47,060 --> 00:52:49,640
variance covariance regularization.
939
00:52:49,640 --> 00:52:52,540
Or in the case where you don't have a predictor,
940
00:52:52,540 --> 00:52:55,880
it's VICReg, variance-invariance-covariance regularization.
941
00:52:55,880 --> 00:52:57,900
And there the idea is you take
942
00:52:57,900 --> 00:53:00,260
the representation coming out of the encoder and you say,
943
00:53:00,260 --> 00:53:04,900
First of all, you should not collapse to a fixed set of values.
944
00:53:04,900 --> 00:53:07,500
So the variance of each variable coming out of
945
00:53:07,500 --> 00:53:10,600
the encoder should be at least one, let's say.
946
00:53:10,600 --> 00:53:13,400
Okay. Now the system can still cheat and not produce
947
00:53:13,400 --> 00:53:16,020
very informative outputs by basically producing
948
00:53:16,020 --> 00:53:18,860
the same variable or very correlated variable for
949
00:53:18,860 --> 00:53:22,620
all the dimensions of the output representation.
950
00:53:22,620 --> 00:53:26,900
So another criterion tries to decorrelate those variables.
951
00:53:26,900 --> 00:53:29,760
And in fact, we use a trick where we expand the dimension.
952
00:53:29,760 --> 00:53:32,200
We take the representation, run it through a neural net
953
00:53:32,200 --> 00:53:33,680
that expands the dimension,
954
00:53:33,680 --> 00:53:35,000
and then decorrelate in that space,
955
00:53:35,000 --> 00:53:37,000
and that has the effect of actually making
956
00:53:37,000 --> 00:53:39,620
the original variable more independent of each other,
957
00:53:39,620 --> 00:53:41,080
not just uncorrelated.
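A hedged sketch of the variance and covariance terms just described (the target variance of one follows the talk; everything else, including the weighting, is illustrative):

import torch

def variance_covariance_penalty(z, eps=1e-4):
    # z: batch of embeddings, shape (N, D).
    z = z - z.mean(dim=0)
    # Variance term: each output dimension should keep a standard deviation of at least 1.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()
    # Covariance term: push off-diagonal covariances toward zero to decorrelate dimensions.
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]
    return var_loss + cov_loss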
958
00:53:41,960 --> 00:53:43,920
So it's a bit of a hack,
959
00:53:43,920 --> 00:53:46,000
because what we're trying to do here
960
00:53:46,000 --> 00:53:47,680
is maximizing information content,
961
00:53:47,680 --> 00:53:49,620
and what we should have to be able to do this
962
00:53:49,620 --> 00:53:52,280
is a lower bound on information content.
963
00:53:52,280 --> 00:53:54,040
But what I'm describing here
964
00:53:54,040 --> 00:53:56,680
is an upper bound on information content.
965
00:53:56,680 --> 00:53:58,280
So we're maximizing an upper bound,
966
00:53:58,280 --> 00:54:05,720
Then we cross our fingers that the actual information content will follow.
967
00:54:05,720 --> 00:54:06,520
Okay.
968
00:54:06,520 --> 00:54:09,720
And it works.
969
00:54:09,720 --> 00:54:13,880
So that's one set of techniques.
970
00:54:13,880 --> 00:54:15,160
I'm going to skip the theory.
971
00:54:15,160 --> 00:54:18,200
There is another set of methods called distillation,
972
00:54:18,200 --> 00:54:19,880
and those have proved to be extremely efficient.
973
00:54:21,080 --> 00:54:25,160
And there, it's another hack, and we only have partial,
974
00:54:25,160 --> 00:54:29,400
at least in my opinion, a partial theoretical understanding of why it works, but it does work.
975
00:54:30,760 --> 00:54:35,640
In there we share the weights between the two encoders with a technique called exponential
976
00:54:35,640 --> 00:54:40,440
moving average. So one encoder has the weights that are basically a temporal average of the
977
00:54:40,440 --> 00:54:44,680
weights of the other one for mysterious reasons. And we train the whole thing but we don't back
978
00:54:44,680 --> 00:54:50,280
propagate gradients to the one that gets this moving average, the one that gets the full input.
979
00:54:50,280 --> 00:54:54,180
And somehow this does not collapse and it works really well.
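A minimal sketch of that exponential-moving-average weight sharing in Python (the 0.999 decay rate is just an illustrative value):

import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, decay=0.999):
    # The target encoder's weights are a running temporal average of the online
    # encoder's weights; no gradients are backpropagated into the target branch.
    for p_online, p_target in zip(online_encoder.parameters(),
                                  target_encoder.parameters()):
        p_target.mul_(decay).add_(p_online, alpha=1.0 - decay)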
980
00:54:54,180 --> 00:54:56,020
It's called a distillation method.
981
00:54:56,020 --> 00:54:58,020
There's various versions of it.
982
00:54:58,020 --> 00:55:05,780
SimSiam, BYOL from DeepMind, Dinov2 from my colleagues in Paris at Meta, I-JEPA and V-JEPA
983
00:55:05,780 --> 00:55:09,780
from the people at Meta who work with me.
984
00:55:09,780 --> 00:55:10,780
This works amazingly well.
985
00:55:10,780 --> 00:55:16,300
It works so well, in fact, the Dinov2 version works incredibly well.
986
00:55:16,300 --> 00:55:18,780
It's a generic feature extractor for images.
987
00:55:18,780 --> 00:55:21,580
If you have some random computer vision problem,
988
00:55:21,580 --> 00:55:23,540
and no one has trained a system for that,
989
00:55:23,540 --> 00:55:26,020
just download Dinov2, it will extract features
990
00:55:26,020 --> 00:55:28,280
from your images, and then train a very simple
991
00:55:28,280 --> 00:55:30,780
classifier head on top of it with just a few examples,
992
00:55:30,780 --> 00:55:33,960
and it will likely solve your vision problem.
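A hedged sketch of that recipe in Python, assuming the publicly documented torch.hub entry point for Dinov2; the ViT-S/14 variant and the 10-class head are illustrative choices:

import torch
import torch.nn as nn

# Load a pre-trained Dinov2 backbone and freeze it.
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small classifier head trained on just a few labeled examples.
head = nn.Linear(384, 10)  # 384 = ViT-S/14 feature size, 10 = example number of classes

def classify(images):
    with torch.no_grad():
        feats = backbone(images)  # generic features from the frozen encoder
    return head(feats)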
993
00:55:33,960 --> 00:55:36,320
An example of this is, I'm not gonna bore you
994
00:55:36,320 --> 00:55:39,200
with tables of results, but example of this
995
00:55:39,200 --> 00:55:42,180
is a collaborator at Meta, Camille Couprie,
996
00:55:42,180 --> 00:55:47,020
who got satellite imaging images of the entire world,
997
00:55:47,020 --> 00:55:50,020
you know, in various frequency bands.
998
00:55:50,020 --> 00:55:52,020
And she also got LiDAR data.
999
00:55:52,020 --> 00:55:55,020
So the LiDAR data gives you, for a little piece of the world,
1000
00:55:55,020 --> 00:56:00,020
LiDAR data gives you the height of the canopy of vegetation.
1001
00:56:00,020 --> 00:56:05,020
And so she took the Dino features, applied them to the entire world,
1002
00:56:05,020 --> 00:56:09,020
and then used a trained classifier that was trained on the LiDAR data,
1003
00:56:09,020 --> 00:56:12,020
on the small amount of data, but applied it to the entire world.
1004
00:56:12,020 --> 00:56:16,020
And now what she has is an estimate of the height of the canopy for the entire Earth.
1005
00:56:16,020 --> 00:56:23,220
What that allows you to compute is an estimate of the amount of carbon captured in vegetation,
1006
00:56:23,220 --> 00:56:29,220
which is a very interesting piece of data for climate change. So that's an example. There's
1007
00:56:29,220 --> 00:56:34,340
other examples in medical imaging, in biological imaging, where Dino has been used for some success.
1008
00:56:35,060 --> 00:56:39,940
But this distillation method called I-JEPA that I briefly described earlier works extremely well
1009
00:56:39,940 --> 00:56:45,620
to learn visual features. Again, I'm not going to bore you with details. It's really much better than
1010
00:56:45,620 --> 00:56:48,560
the methods that are based on reconstruction.
1011
00:56:48,560 --> 00:56:52,860
Of course, the next thing we did was try to apply this to video.
1012
00:56:52,860 --> 00:56:54,120
Can we apply this to video?
1013
00:56:54,120 --> 00:56:56,360
So it turns out if you train a system of this type to make
1014
00:56:56,360 --> 00:56:57,660
temporal prediction in video,
1015
00:56:57,660 --> 00:56:58,880
it doesn't work very well.
1016
00:56:58,880 --> 00:57:02,420
You have to make it do spatial prediction,
1017
00:57:02,420 --> 00:57:04,000
which is very strange.
1018
00:57:04,000 --> 00:57:06,840
There, the features that are learned are really great.
1019
00:57:06,840 --> 00:57:10,640
You get good performance for that system when you use the
1020
00:57:10,640 --> 00:57:13,560
representation to classify actions in
1021
00:57:13,560 --> 00:57:16,060
videos and things of that type.
1022
00:57:17,120 --> 00:57:21,540
We even have tests now that the paper is being completed
1023
00:57:21,540 --> 00:57:24,520
that show that those systems have some level of common sense
1024
00:57:24,520 --> 00:57:25,460
and physical intuition.
1025
00:57:25,460 --> 00:57:27,880
You show them videos that are impossible because,
1026
00:57:27,880 --> 00:57:30,260
for example, an object spontaneously disappears
1027
00:57:30,260 --> 00:57:31,300
or something like that.
1028
00:57:31,300 --> 00:57:32,940
They say, whoa, something strange happened.
1029
00:57:32,940 --> 00:57:34,160
Their prediction error goes up.
1030
00:57:34,160 --> 00:57:37,660
And so those systems really are able to learn
1031
00:57:37,660 --> 00:57:39,960
some basic concepts about the world.
1032
00:57:39,960 --> 00:57:50,280
But then the last thing I want to say is that systems of this type are ones that basically
1033
00:57:50,280 --> 00:57:53,240
we can use to train a world model and we can use those world models for planning.
1034
00:57:53,240 --> 00:57:54,240
So this is new.
1035
00:57:54,240 --> 00:57:57,240
I haven't presented this yet.
1036
00:57:57,240 --> 00:58:03,760
The paper has been submitted, but this is the first time I talk publicly in English about
1037
00:58:03,760 --> 00:58:04,760
it.
1038
00:58:09,960 --> 00:58:16,680
the preview. So this is work by a PhD student at NYU,
1039
00:58:16,680 --> 00:58:21,880
Gaoyue Zhou, who is co-advised by myself and Lerrel Pinto, and she did a lot of this work
1040
00:58:21,880 --> 00:58:31,080
while she was an intern at Meta, and Hengkai Pan, who's also a student. And the basic architecture
1041
00:58:31,080 --> 00:58:37,240
here is that we use the features from Dinov2, okay, pre-trained, and we train a world model on
1042
00:58:37,240 --> 00:58:39,440
top of it, which is action-conditioned.
1043
00:58:39,440 --> 00:58:44,240
So basically, we take a picture of the world,
1044
00:58:44,240 --> 00:58:46,740
or the environment, whatever it is,
1045
00:58:46,740 --> 00:58:50,540
and then feed an action that we're going to take in
1046
00:58:50,540 --> 00:58:53,240
that environment and then observe
1047
00:58:53,240 --> 00:58:57,540
the result in the environment in terms of Dino features,
1048
00:58:57,540 --> 00:59:00,500
and then train the predictor to predict
1049
00:59:00,500 --> 00:59:03,860
the representation after the action as
1050
00:59:03,860 --> 00:59:05,380
a function of the input,
1051
00:59:05,380 --> 00:59:07,700
the previous state and the action.
1052
00:59:07,700 --> 00:59:10,220
So the predictor function takes
1053
00:59:10,220 --> 00:59:11,860
the previous state and the action and predicts
1054
00:59:11,860 --> 00:59:13,700
the next state essentially.
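A hedged sketch of that action-conditioned training step in Python; encoder stands for the frozen Dinov2 features and predictor for the world model being trained, and both names are placeholders:

import torch.nn.functional as F

def world_model_step(obs, action, next_obs, encoder, predictor):
    s_t = encoder(obs)            # frozen, pre-trained features of the current frame
    s_next = encoder(next_obs)    # features of the frame observed after the action
    s_pred = predictor(s_t, action)             # action-conditioned prediction
    return F.mse_loss(s_pred, s_next.detach())  # train only the predictor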
1055
00:59:13,700 --> 00:59:15,420
Then once we have that system,
1056
00:59:15,420 --> 00:59:18,620
we can do this optimization procedure I was telling you about,
1057
00:59:18,620 --> 00:59:22,460
to plan a sequence of actions to arrive at a particular result.
1058
00:59:22,460 --> 00:59:25,700
The cost is simply a Euclidean distance
1059
00:59:25,700 --> 00:59:27,220
between a predicted
1060
00:59:27,220 --> 00:59:29,700
end state, and a target state.
1061
00:59:29,700 --> 00:59:32,060
The way we compute the target state is that we show
1062
00:59:32,060 --> 00:59:33,740
an image to the encoder and we tell it,
1063
00:59:33,740 --> 00:59:37,220
you know, this representation is your target representation.
1064
00:59:37,220 --> 00:59:40,060
Take a sequence of actions so that the predicted state
1065
00:59:40,060 --> 00:59:42,540
matches that state.
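A rough sketch of that planning-by-optimization loop in Python, here with simple random shooting over candidate action sequences; the talk does not specify the optimizer, so this is just one plausible choice:

import torch

def plan(s_start, s_target, predictor, horizon=10, n_candidates=256, action_dim=4):
    # Sample candidate action sequences and roll the world model forward for each.
    candidates = torch.randn(n_candidates, horizon, action_dim)
    costs = []
    for seq in candidates:
        s = s_start
        for a in seq:
            s = predictor(s, a)
        # Cost: Euclidean distance between predicted end state and target representation.
        costs.append(torch.norm(s - s_target))
    best = torch.stack(costs).argmin()
    return candidates[best]   # executed open loop, as in the demo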
1066
00:59:43,480 --> 00:59:45,360
So we've tried this on several tasks.
1067
00:59:45,360 --> 00:59:46,940
So one of them is just, you know,
1068
00:59:46,940 --> 00:59:49,620
moving a dot through a simple maze.
1069
00:59:49,620 --> 00:59:52,260
Another one is moving a little,
1070
00:59:52,260 --> 00:59:53,500
let me repeat this video,
1071
00:59:54,760 --> 00:59:59,260
moving a little T object by pushing on it in various places
1072
00:59:59,260 --> 01:00:01,180
so that it's in a particular position.
1073
01:00:01,180 --> 01:00:02,640
That's called a push T problem.
1074
01:00:02,640 --> 01:00:07,640
And then other task of navigating through the environment,
1075
01:00:07,640 --> 01:00:09,200
going through a door in a wall,
1076
01:00:09,200 --> 01:00:12,480
and then pushing on sort of deformable objects
1077
01:00:12,480 --> 01:00:14,220
so they adopt a particular shape.
1078
01:00:14,220 --> 01:00:16,100
Okay, and I'll show you a more impressive example
1079
01:00:16,100 --> 01:00:16,860
in this one.
1080
01:00:16,860 --> 01:00:20,660
Okay, so the task, we can collect artificial data
1081
01:00:20,660 --> 01:00:23,760
because those are virtual environments
1082
01:00:23,760 --> 01:00:25,160
that we can simulate.
1083
01:00:25,160 --> 01:00:26,780
And then we experimented with various systems
1084
01:00:26,780 --> 01:00:30,640
that have been proposed in the past to solve that problem.
1085
01:00:30,640 --> 01:00:36,000
DreamerV3 is probably one of the most advanced ones, from DeepMind,
1086
01:00:36,000 --> 01:00:39,000
from Danijar Hafner at DeepMind.
1087
01:00:39,000 --> 01:00:42,200
And what you see here is visualization through
1088
01:00:42,200 --> 01:00:45,600
a decoder of the predicted state for a sequence of actions.
1089
01:00:45,600 --> 01:00:47,240
So at the top is a ground truth.
1090
01:00:47,240 --> 01:00:53,240
You execute a sequence of actions and see the result in the simulator.
1091
01:00:53,240 --> 01:00:58,280
And then each row is the result of a prediction by one of those models.
1092
01:00:58,280 --> 01:01:01,280
And what you see is some predictions become blurry,
1093
01:01:01,280 --> 01:01:04,280
some predictions become kind of weird.
1094
01:01:04,280 --> 01:01:08,280
Ours is pretty good, Iris is pretty good,
1095
01:01:08,280 --> 01:01:12,280
Dreamer v3 not so great.
1096
01:01:12,280 --> 01:01:14,280
This is the most interesting task.
1097
01:01:14,280 --> 01:01:17,280
It's called the granular environment,
1098
01:01:17,280 --> 01:01:21,280
and it's basically a bunch of blue chips on the table.
1099
01:01:21,280 --> 01:01:24,280
And an action is a motion by a robot arm,
1100
01:01:24,280 --> 01:01:26,280
which goes down on the table,
1101
01:01:26,280 --> 01:01:29,740
moves by some Delta X, Delta Y, and then lifts.
1102
01:01:29,740 --> 01:01:31,620
That's an action, it's four numbers.
1103
01:01:31,620 --> 01:01:38,520
X, Y, where you touch the table, Delta X, Delta Y, lift.
1104
01:01:38,520 --> 01:01:41,600
Okay. The question is,
1105
01:01:41,600 --> 01:01:45,000
so you can train a world model by just putting
1106
01:01:45,000 --> 01:01:47,180
a bunch of chips in random position and then taking
1107
01:01:47,180 --> 01:01:49,000
a random action and then observing the result,
1108
01:01:49,000 --> 01:01:50,980
and you train the predictor this way.
1109
01:01:50,980 --> 01:01:53,960
Once the predictor is trained,
1110
01:01:53,960 --> 01:01:58,280
So those are results of various techniques of planning.
1111
01:01:58,280 --> 01:02:00,960
So you can use the world model for planning a sequence of
1112
01:02:00,960 --> 01:02:03,400
actions to arrive at a particular goal.
1113
01:02:03,400 --> 01:02:05,680
So this is for the point maze environment,
1114
01:02:05,680 --> 01:02:08,380
but you might want to look at the other one, the granular.
1115
01:02:08,380 --> 01:02:14,580
So this is the, what's called a chamfer distance between
1116
01:02:14,580 --> 01:02:22,900
the end state in the image space of all the grains, if you want,
1117
01:02:22,900 --> 01:02:27,320
and the target measured through a chamfer distance.
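For reference, a minimal chamfer distance between two point sets in Python (a generic textbook definition, not necessarily the paper's exact implementation):

import torch

def chamfer_distance(a, b):
    # a: (N, 2) and b: (M, 2) point sets, e.g. grain positions in image space.
    d = torch.cdist(a, b)   # pairwise distances, shape (N, M)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()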
1118
01:02:27,320 --> 01:02:29,080
And what you see is that our method,
1119
01:02:29,080 --> 01:02:30,340
which is the blue one,
1120
01:02:30,340 --> 01:02:32,740
has much, much lower final error
1121
01:02:32,740 --> 01:02:34,760
than the other methods that we compared it with,
1122
01:02:34,760 --> 01:02:37,180
DreamerV3 and TD-MPC2.
1123
01:02:37,180 --> 01:02:40,660
And TD-MPC2 is a method that actually requires,
1124
01:02:40,660 --> 01:02:42,100
needs to be task specific,
1125
01:02:42,100 --> 01:02:45,400
so it's not as general as Dino World Model.
1126
01:02:46,700 --> 01:02:49,820
So here's a little demo of the system in action
1127
01:02:49,820 --> 01:02:52,120
for the various tasks.
1128
01:02:52,120 --> 01:02:53,800
Let me play this again.
1129
01:02:53,800 --> 01:02:55,380
Look at the push T.
1130
01:02:55,380 --> 01:03:00,380
Okay, so you see the dot moving in discrete steps
1131
01:03:01,580 --> 01:03:04,840
because for every tick of the simulation,
1132
01:03:04,840 --> 01:03:07,540
the same action is repeated five times.
1133
01:03:07,540 --> 01:03:09,560
So the actions are only produced
1134
01:03:09,560 --> 01:03:11,220
like every five time steps.
1135
01:03:11,220 --> 01:03:13,240
But it gets to the target.
1136
01:03:13,240 --> 01:03:17,480
The target is represented on the right,
1137
01:03:17,480 --> 01:03:19,300
and it actually kind of presents.
1138
01:03:19,300 --> 01:03:22,500
So this is for the granular in particular.
1139
01:03:22,500 --> 01:03:26,400
So the target is represented at the right.
1140
01:03:26,400 --> 01:03:28,600
And let me play this again.
1141
01:03:28,600 --> 01:03:32,020
We start from a random configuration of the chips,
1142
01:03:32,020 --> 01:03:33,620
and the system kind of pushes
1143
01:03:33,620 --> 01:03:35,080
the chips using those actions.
1144
01:03:35,080 --> 01:03:35,980
You don't see the actions,
1145
01:03:35,980 --> 01:03:38,940
but you only see the result by pushing
1146
01:03:38,940 --> 01:03:40,220
them so that they look like a square.
1147
01:03:40,220 --> 01:03:42,380
Now what's interesting about this is that it's
1148
01:03:42,380 --> 01:03:43,680
completely open loop.
1149
01:03:43,680 --> 01:03:48,300
So the system basically looks at the initial condition,
1150
01:03:48,300 --> 01:03:49,820
imagines the sequence of actions,
1151
01:03:49,820 --> 01:03:52,280
and then executes those actions blindly.
1152
01:03:52,280 --> 01:03:54,080
What you see here is a result of
1153
01:03:54,080 --> 01:03:56,500
executing those actions, open loop,
1154
01:03:56,500 --> 01:03:58,360
closing your eyes.
1155
01:03:58,360 --> 01:04:00,360
It's pretty cool.
1156
01:04:00,360 --> 01:04:03,220
All right, coming to the end now.
1157
01:04:03,220 --> 01:04:07,160
So I have five recommendations.
1158
01:04:07,160 --> 01:04:12,180
Abandon generative models in favor of those JEPA architectures.
1159
01:04:12,180 --> 01:04:14,580
Abandon probabilistic models in favor of
1160
01:04:14,580 --> 01:04:15,620
those energy-based models.
1161
01:04:15,620 --> 01:04:17,900
So something I haven't said is that in this context,
1162
01:04:17,900 --> 01:04:20,460
you can't really do probabilistic modeling,
1163
01:04:20,460 --> 01:04:21,360
it's intractable.
1164
01:04:22,640 --> 01:04:24,820
Abandon contrastive methods
1165
01:04:24,820 --> 01:04:28,720
in favor of those regularized methods.
1166
01:04:28,720 --> 01:04:30,200
And of course, abandon reinforcement learning,
1167
01:04:30,200 --> 01:04:32,140
but that I've been saying for 10 years.
1168
01:04:33,480 --> 01:04:36,180
And so if you're interested in human level AI,
1169
01:04:36,180 --> 01:04:37,940
don't work on LLMs.
1170
01:04:37,940 --> 01:04:40,800
You're a grad student, you're studying a PhD in AI,
1171
01:04:40,800 --> 01:04:42,220
do not work on LLMs.
1172
01:04:44,180 --> 01:04:45,240
It's not interesting.
1173
01:04:45,240 --> 01:04:51,640
I mean, first of all, it's not that interesting because it's not going to be the next revolution in AI.
1174
01:04:51,640 --> 01:04:55,640
It's not going to help systems understand the physical world and everything.
1175
01:04:55,640 --> 01:05:05,840
But it's also a very dangerous thing to do because there are enormous teams in industry with billions of dollars of resources working on this.
1176
01:05:05,840 --> 01:05:09,040
There's nothing you can bring to the table. Absolutely nothing.
1177
01:05:15,240 --> 01:05:20,780
working on LLMs, but the lifetime of this is going to be three years.
1178
01:05:20,780 --> 01:05:26,420
Three, five years from now, my prediction is no one in their right mind would use LLMs
1179
01:05:26,420 --> 01:05:27,880
in the form that they exist today.
1180
01:05:27,880 --> 01:05:30,360
I mean, they would be used as a component of a bigger system,
1181
01:05:30,360 --> 01:05:33,820
but the main architecture would be different.
1182
01:05:35,760 --> 01:05:38,320
There's a lot of problems to solve with this,
1183
01:05:38,320 --> 01:05:41,700
which I kind of swept under the rug,
1184
01:05:41,700 --> 01:05:43,700
and I'm not going to go through the laundry list,
1185
01:05:43,700 --> 01:05:45,740
but we don't know how to do hierarchical planning,
1186
01:05:45,740 --> 01:05:47,880
for example. So here is a good PhD topic,
1187
01:05:47,880 --> 01:05:49,520
if you're interested in this.
1188
01:05:49,520 --> 01:05:54,240
Just try to crack the nut of hierarchical planning.
1189
01:05:56,340 --> 01:05:59,340
There's all kinds of foundational,
1190
01:05:59,340 --> 01:06:01,720
theoretical issues with what I talked about here,
1191
01:06:01,720 --> 01:06:03,600
and energy-based models and things like this.
1192
01:06:03,600 --> 01:06:05,840
How to design objectives for SSL so
1193
01:06:05,840 --> 01:06:08,600
that the systems are driven to learn the right thing.
1194
01:06:08,600 --> 01:06:11,760
I've only talked about information maximization,
1195
01:06:11,760 --> 01:06:13,560
but there is all kinds of other things.
1196
01:06:13,560 --> 01:06:17,720
There's a little bit of RL you might need to do for adjusting the world model in real time.
1197
01:06:18,840 --> 01:06:24,200
But then, if we succeed in this program, which may take the better part of the next decade,
1198
01:06:25,080 --> 01:06:34,200
we might have virtual assistants that have human-level AI. What I think, though, is that those
1199
01:06:34,200 --> 01:06:39,080
platforms need to be open source. And so this is the political part of the talk, which is going to
1200
01:06:39,080 --> 01:06:45,480
be very short. You know, those platforms, LLMs or future AI
1201
01:06:45,480 --> 01:06:51,800
systems are incredibly expensive to train, the basic foundation models. So only a few companies
1202
01:06:51,800 --> 01:06:58,120
in the world can do it. And the problem that we're facing now is that the publicly available
1203
01:06:58,120 --> 01:07:04,360
data on the internet is not what we want, because it's mostly English. I mean, there is other
1204
01:07:04,360 --> 01:07:09,400
languages obviously, but for various reasons, regulatory reasons, all kinds of problems,
1205
01:07:09,400 --> 01:07:17,860
you do not have access to all the data in the world. Of every language in the world,
1206
01:07:17,860 --> 01:07:23,740
there are 4,000 languages or something like that that people use. All the cultures, all
1207
01:07:23,740 --> 01:07:31,140
the value systems, all the centers of interest, you just don't have all the data available.
1208
01:07:31,140 --> 01:07:35,740
So the future is one in which those systems would not be trained by a single company.
1209
01:07:35,740 --> 01:07:40,980
They will be trained in a distributed manner so that you all have big data centers in various
1210
01:07:40,980 --> 01:07:41,980
parts of the world.
1211
01:07:41,980 --> 01:07:47,640
They have access to local data, but they all contribute to training a large model that
1212
01:07:47,640 --> 01:07:54,140
will be worldwide and will eventually constitute the repository of all human knowledge.
1213
01:07:54,140 --> 01:07:58,460
This is a very lofty goal to try to attain, right?
1214
01:07:58,460 --> 01:08:02,040
Having a system that basically constitutes a repository of all human knowledge, but it's
1215
01:08:02,040 --> 01:08:06,900
a system you can talk to, you can ask questions to, it can serve as a tutor, as a professor
1216
01:08:06,900 --> 01:08:13,140
maybe, put a lot of us here out of a job.
1217
01:08:13,140 --> 01:08:15,460
It's a thing that we should really work towards.
1218
01:08:15,460 --> 01:08:21,600
It will amplify human intelligence, improve rational thought perhaps.
1219
01:08:21,600 --> 01:08:23,080
But it needs to be diverse also.
1220
01:08:28,460 --> 01:08:31,060
a handful of companies on the West Coast of the US.
1221
01:08:31,060 --> 01:08:32,260
That's completely unacceptable
1222
01:08:32,260 --> 01:08:34,060
to a lot of governments in the world,
1223
01:08:35,060 --> 01:08:37,040
democratic governments, right?
1224
01:08:37,040 --> 01:08:39,680
You need a diversity of AI assistants
1225
01:08:39,680 --> 01:08:41,420
for the same reason you need a diversity
1226
01:08:41,420 --> 01:08:44,720
of newspapers, magazines and the press.
1227
01:08:44,720 --> 01:08:47,360
You need a free press with diversity.
1228
01:08:48,380 --> 01:08:51,980
And we need free AI with diversity as well.
1229
01:08:58,460 --> 01:09:01,780
in AI, some of them are worried about the dangers
1230
01:09:01,780 --> 01:09:04,880
of making AI technology available to everyone.
1231
01:09:04,880 --> 01:09:09,100
I think the benefits far outweigh the dangers and the risks.
1232
01:09:10,040 --> 01:09:13,940
In fact, I think the main risk of AI in the future
1233
01:09:13,940 --> 01:09:17,720
is what will happen if AI is controlled
1234
01:09:17,720 --> 01:09:19,900
by a small number of commercial companies
1235
01:09:19,900 --> 01:09:22,900
that don't reveal how their AI systems work.
1236
01:09:22,900 --> 01:09:24,340
I think that's very dangerous.
1237
01:09:24,340 --> 01:09:33,940
So attempts to minimize the risk of AI by basically making open source AI illegal,
1238
01:09:33,940 --> 01:09:39,700
I think are completely misdirected and will actually reach the opposite result of the intended one.
1239
01:09:39,700 --> 01:09:42,580
It will make AI less safe.
1240
01:09:42,580 --> 01:09:50,780
So open research, open source AI must not be regulated out of existence.
1241
01:09:50,780 --> 01:09:53,460
A lot of politicians need to understand this.
1242
01:09:53,660 --> 01:09:57,700
There's an alliance of various companies that are really kind of subscribed to this model,
1243
01:09:57,700 --> 01:10:03,620
Meta, IBM, Intel, Sony, a lot of people in academia, a lot of startups, venture capitalists,
1244
01:10:03,620 --> 01:10:10,800
etc. And then a few companies who are kind of advocating for the opposite. That will
1245
01:10:10,800 --> 01:10:18,820
remain nameless. So, you know, perhaps if we do it right, we'll have systems that will
1246
01:10:18,820 --> 01:10:23,420
amplify human intelligence, as I was saying at the beginning of the talk. And this may
1247
01:10:23,420 --> 01:10:29,980
Bring about a new renaissance for humanity, you know, similar to what happened with the printing press in the 15th century.
1248
01:10:30,800 --> 01:10:35,180
And on this cosmic conclusion, I will thank you very much.
1249
01:10:47,380 --> 01:10:50,720
And by the way, these are pictures I took from my backyard in New Jersey.
1250
01:10:50,720 --> 01:10:59,040
Thank you, Yann. So Yann will take a few questions now. And for people who are leaving,
1251
01:10:59,320 --> 01:11:04,760
please leave from the Broadway entrance. Do not leave from the campus entrance. But yeah,
1252
01:11:04,760 --> 01:11:10,820
questions? Please line up on the mics if you have questions.
1253
01:11:20,720 --> 01:11:29,260
No sound.
1254
01:11:30,400 --> 01:11:31,240
Yeah, it works.
1255
01:11:38,760 --> 01:11:39,320
Hi.
1256
01:11:40,200 --> 01:11:42,260
Yann, thank you so much for coming.
1257
01:11:42,900 --> 01:11:46,960
I wanted to ask for 3D vision models,
1258
01:11:46,960 --> 01:11:48,600
what do you see business applications
1259
01:11:48,600 --> 01:11:50,180
in the next seven, eight years?
1260
01:11:50,180 --> 01:11:56,240
Yeah, I haven't talked about 3D.
1261
01:11:56,240 --> 01:12:01,220
I mean, some of my colleagues think there is something very special about 3D.
1262
01:12:01,220 --> 01:12:03,320
I don't necessarily think that's the case.
1263
01:12:03,320 --> 01:12:08,840
I mean, we're hoping that the next generation of these V-JEPA models will basically understand
1264
01:12:08,840 --> 01:12:12,720
the fact that the world is three-dimensional and there are objects in front of others and
1265
01:12:12,720 --> 01:12:13,720
things like that.
1266
01:12:13,720 --> 01:12:19,580
Now, there are applications for which you need 3D inference and reconstruction in 3D
1267
01:12:19,580 --> 01:12:22,520
If you want to have virtual objects in virtual environments
1268
01:12:22,520 --> 01:12:24,500
and things like this.
1269
01:12:24,500 --> 01:12:26,780
But frankly, I'm not a specialist.
1270
01:12:26,780 --> 01:12:29,000
I think there are specialists of that question here
1271
01:12:29,000 --> 01:12:31,380
at Columbia, actually.
1272
01:12:31,380 --> 01:12:32,700
Just one more question.
1273
01:12:32,700 --> 01:12:37,100
Do you really see V-JEPA models and Dinov2
1274
01:12:37,100 --> 01:12:40,080
having hierarchical planning like the kind you mentioned
1275
01:12:40,080 --> 01:12:41,240
earlier?
1276
01:12:41,240 --> 01:12:43,740
So it doesn't exist yet.
1277
01:12:43,740 --> 01:12:47,340
So this is something we're working on.
1278
01:12:47,340 --> 01:12:52,780
I hope we will get some results about this, you know, in the next year or two, something like that.
1279
01:12:53,580 --> 01:13:01,660
Thank you so much. Okay one question here. You talked about sorry
1280
01:13:06,220 --> 01:13:11,900
you talked about the benefits of AI and you think it's more beneficial than there are risks to it
1281
01:13:17,340 --> 01:13:25,900
West Coast, control the most advanced models. So why do you feel that the benefits outweigh the risks?
1282
01:13:25,900 --> 01:13:32,060
So that's not entirely true. Meta actually does not subscribe to this model that AI should be
1283
01:13:32,060 --> 01:13:38,620
proprietary and kept in its own hands. It releases a series of models called Llama, right? So Llama 1,
1284
01:13:38,620 --> 01:13:45,660
2, 3, 3.1, 3.2, which are state of the art or really close to it or better in certain measures.
1285
01:13:45,660 --> 01:13:51,900
And this is open source. It can be used freely by a lot of people around the world. It can be
1286
01:13:51,900 --> 01:14:01,740
fine-tuned for various languages or vertical applications. And it's... Llama 3 has been
1287
01:14:01,740 --> 01:14:06,140
downloaded, I think, 400 million times or something like this. It's just insane. And
1288
01:14:06,140 --> 01:14:15,580
every single company I talk to has either deployed it or is about to deploy products based on Llama.
1289
01:14:15,580 --> 01:14:22,580
There are people in Africa who are using it and training it to provide medical assistance, for example.
1290
01:14:22,580 --> 01:14:31,580
There's people in India that Meta is collaborating with so that future versions of Llama will speak all 22 official languages of India,
1291
01:14:31,580 --> 01:14:34,580
and perhaps at some point all the 1500 dialects or whatever.
1292
01:14:34,580 --> 01:14:40,580
So, you know, I think that's the way to make AI widely accessible to everyone in the world.
1293
01:14:40,580 --> 01:14:44,180
I mean, I'm really happy to be part of that effort.
1294
01:14:44,180 --> 01:14:47,800
I really wouldn't like to be part of kind of a closed effort.
1295
01:14:51,840 --> 01:14:52,340
Hi, Yann.
1296
01:14:52,340 --> 01:14:53,820
My name is Srikant.
1297
01:14:53,820 --> 01:14:55,560
I want to ask you, I'm curious to know
1298
01:14:55,560 --> 01:14:58,400
what you think about the capabilities of time series
1299
01:14:58,400 --> 01:15:01,460
foundation models, because I see that Amazon, Google,
1300
01:15:01,460 --> 01:15:04,400
Meta, everyone's trying to work in that domain.
1301
01:15:04,400 --> 01:15:07,100
But to me, intuitively, it feels like time series predictions
1302
01:15:07,100 --> 01:15:09,760
are a harder problem than language modeling.
1303
01:15:09,760 --> 01:15:12,860
What are your thoughts on the capabilities and limitations on this?
1304
01:15:12,860 --> 01:15:14,060
Yeah, okay.
1305
01:15:14,060 --> 01:15:18,580
I think you put your finger on an important point, which I forgot to mention.
1306
01:15:18,580 --> 01:15:24,580
The reason why language modeling works, why those predictive models that predict the next
1307
01:15:24,580 --> 01:15:27,980
word, the reason why they work for natural language and they don't work for images and
1308
01:15:27,980 --> 01:15:31,680
video, for example, is because language is discrete.
1309
01:15:31,680 --> 01:15:38,280
So to represent an uncertainty in the prediction when you have a discrete choice with a few possible outcomes,
1310
01:15:38,280 --> 01:15:40,280
It's easy.
1311
01:15:40,280 --> 01:15:45,100
You just produce a distribution, a probability distribution over all the possible outcomes.
1312
01:15:45,100 --> 01:15:46,100
And this is how LLMs work.
1313
01:15:46,100 --> 01:15:47,100
They are trained.
1314
01:15:47,100 --> 01:15:51,160
They actually produce a distribution over the next token.
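A small illustration of why the discrete case is easy, in Python (the vocabulary size is illustrative):

import torch

logits = torch.randn(50_000)              # one score per token in the vocabulary
probs = torch.softmax(logits, dim=-1)     # a proper distribution over the next token
# Over a finite vocabulary this is trivial; there is no equally simple way to write
# down a distribution over all possible next video frames.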
1315
01:15:51,160 --> 01:15:57,300
You can't do this with continuous variables, particularly high dimensional continuous variables
1316
01:15:57,300 --> 01:16:00,180
like video pixels.
1317
01:16:00,180 --> 01:16:06,480
So there, we're not able to represent distributions efficiently in high dimensional continuous
1318
01:16:06,480 --> 01:16:10,800
spaces beyond like simple ones like Gaussians, right?
1319
01:16:10,800 --> 01:16:15,720
So my answer to this is don't do it.
1320
01:16:15,720 --> 01:16:18,100
Do prediction in representation space.
1321
01:16:18,100 --> 01:16:20,160
And then if you need to have actual prediction
1322
01:16:20,160 --> 01:16:22,540
of the time series, have a decoder that does that
1323
01:16:22,540 --> 01:16:23,100
separately.
1324
01:16:23,100 --> 01:16:26,100
But actually training a system to predict
1325
01:16:26,100 --> 01:16:28,680
high dimensional continuous thing by regression
1326
01:16:28,680 --> 01:16:32,380
when you have uncertainty simply doesn't work.
1327
01:16:32,380 --> 01:16:34,760
That's the evidence we have by trying to,
1328
01:16:34,760 --> 01:16:38,200
There was a huge project at Meta called Video MAE.
1329
01:16:38,200 --> 01:16:40,280
So the idea was, you know, take a video,
1330
01:16:40,280 --> 01:16:41,920
mask some parts of it,
1331
01:16:41,920 --> 01:16:43,300
and then train some gigantic neural net
1332
01:16:43,300 --> 01:16:45,060
to predict the parts that are missing.
1333
01:16:45,060 --> 01:16:46,520
It was complete failure.
1334
01:16:46,520 --> 01:16:49,700
We abandoned that project.
1335
01:16:49,700 --> 01:16:52,440
We canceled it, because it was going nowhere, okay?
1336
01:16:52,440 --> 01:16:54,860
And this was really very large scale.
1337
01:16:54,860 --> 01:16:56,860
A lot of computing resources were devoted to this.
1338
01:16:56,860 --> 01:16:58,580
It just didn't work.
1339
01:16:58,580 --> 01:17:01,040
The JEPA stuff, though, does work.
1340
01:17:01,040 --> 01:17:03,660
So my hunch is that for time series,
1341
01:17:03,660 --> 01:17:07,900
there's probably a way to use a kind of similar idea.
1342
01:17:07,900 --> 01:17:08,780
SPEAKER 1
1343
01:17:08,780 --> 01:17:09,380
OK, thank you.
1344
01:17:12,580 --> 01:17:12,620
SPEAKER 1
1345
01:17:12,620 --> 01:17:14,540
Great talk.
1346
01:17:14,540 --> 01:17:17,200
So my question is, I think I agree with your framework
1347
01:17:17,200 --> 01:17:18,840
for you have some world model and you
1348
01:17:18,840 --> 01:17:20,900
want to optimize via that world model
1349
01:17:20,900 --> 01:17:22,200
and how you train the world model.
1350
01:17:22,200 --> 01:17:24,980
But my question is, how do you get intelligence
1351
01:17:24,980 --> 01:17:28,720
when the world model is inconsistent with the truth?
1352
01:17:28,720 --> 01:17:31,440
So as an example, let's say your world model only
1353
01:17:31,440 --> 01:17:33,240
has classical mechanics.
1354
01:17:33,240 --> 01:17:35,460
how do you discover special relativity?
1355
01:17:35,460 --> 01:17:38,220
Humans have somehow broken that boundary,
1356
01:17:38,220 --> 01:17:39,440
but I don't know how you do that
1357
01:17:39,440 --> 01:17:42,360
when your world model is only based on observed data.
1358
01:17:43,320 --> 01:17:45,260
Well, I mean, the type of world model
1359
01:17:45,260 --> 01:17:46,840
we're talking about here is,
1360
01:17:48,120 --> 01:17:50,700
what I would be happy with before I retire
1361
01:17:50,700 --> 01:17:52,800
or before my brain turns into béchamel sauce
1362
01:17:52,800 --> 01:17:57,800
is world models that are of the level of complexity
1363
01:17:58,020 --> 01:18:01,860
of a cat's world model, right, of the physical world,
1364
01:18:01,860 --> 01:18:03,400
Which is pretty sophisticated actually.
1365
01:18:03,400 --> 01:18:06,260
I mean, you can plan really complex actions.
1366
01:18:06,260 --> 01:18:07,360
So that's what we're talking about.
1367
01:18:07,360 --> 01:18:09,400
Now, you put your finger on something
1368
01:18:09,400 --> 01:18:11,340
that's really interesting,
1369
01:18:12,540 --> 01:18:15,940
which is a philosophical motivation behind JEPA,
1370
01:18:15,940 --> 01:18:20,140
and this idea that you need to lift the abstraction level
1371
01:18:21,580 --> 01:18:23,340
to be able to make predictions, right?
1372
01:18:24,540 --> 01:18:27,420
You cannot make predictions at the level of observation.
1373
01:18:27,420 --> 01:18:31,620
You have to find a good representation of reality
1374
01:18:31,620 --> 01:18:33,260
within which you can make predictions.
1375
01:18:33,260 --> 01:18:35,800
And that's the hardest problem really,
1376
01:18:35,800 --> 01:18:37,620
is to find that good representation space
1377
01:18:37,620 --> 01:18:38,940
that allows you to make predictions.
1378
01:18:38,940 --> 01:18:40,480
We do this all the time in science.
1379
01:18:40,480 --> 01:18:42,900
We do this all the time in everyday life without realizing it,
1380
01:18:42,900 --> 01:18:45,160
but we do this all the time in science.
1381
01:18:46,620 --> 01:18:47,760
If we didn't need to do this,
1382
01:18:47,760 --> 01:18:52,760
we could explain human society with quantum field theory.
1383
01:18:54,180 --> 01:18:55,020
Right?
1384
01:18:55,020 --> 01:18:55,860
Right.
1385
01:18:55,860 --> 01:18:56,880
But we can't, right?
1386
01:18:56,880 --> 01:19:00,340
Because the gap, you know, in abstraction is so large, right?
1387
01:19:00,340 --> 01:19:05,380
So we go from quantum field theory to particle physics and from particles to atoms and from
1388
01:19:05,380 --> 01:19:11,060
atoms to molecules, from molecules to materials, and then chemistry and, you know, blah blah blah,
1389
01:19:11,060 --> 01:19:17,780
right? And we go up the chain of abstraction so that at some level we have a representation
1390
01:19:17,780 --> 01:19:24,340
of physical objects and Newtonian mechanics, and for, you know, large scales it would be relativity.
1391
01:19:30,340 --> 01:19:34,880
human behavior, animal behavior, ecology, you know, this kind of stuff, right?
1392
01:19:34,880 --> 01:19:38,340
So we have all those levels of representation,
1393
01:19:38,340 --> 01:19:42,400
for which the crucial insight is to actually find a representation.
1394
01:19:42,400 --> 01:19:45,780
For example, let's take a planet. Let's take Jupiter, okay?
1395
01:19:45,780 --> 01:19:47,700
Jupiter is an incredibly complex object.
1396
01:19:47,700 --> 01:19:52,080
It's got, you know, complicated composition.
1397
01:19:52,080 --> 01:19:55,480
It's got weather. It's got all kinds of gases swirling around.
1398
01:19:55,480 --> 01:20:00,480
And, you know, very complex object, right?
1399
01:20:02,180 --> 01:20:05,840
Now, who would have thought that the only thing you need
1400
01:20:05,840 --> 01:20:10,420
to predict the trajectory of Jupiter is six numbers?
1401
01:20:10,420 --> 01:20:13,300
You need three positions, three velocities,
1402
01:20:13,300 --> 01:20:16,480
and you can predict the trajectory of Jupiter for centuries.
1403
01:20:18,460 --> 01:20:19,920
You know, that's a problem of learning
1404
01:20:19,920 --> 01:20:22,140
a good representation, right?
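As a toy illustration of that six-number state, three positions and three velocities, the following Python sketch propagates a Jupiter-like orbit under plain Newtonian gravity; the constants are rough textbook values, not a precise ephemeris.

import numpy as np

GM_SUN = 1.327e20                      # m^3/s^2, gravitational parameter of the Sun
r = np.array([7.78e11, 0.0, 0.0])      # position in meters (about 5.2 AU)
v = np.array([0.0, 1.31e4, 0.0])       # velocity in m/s (about 13 km/s)
dt = 86400.0                           # one-day time step

def step(r, v, dt):
    # One leapfrog (velocity Verlet) step of the two-body problem
    a = -GM_SUN * r / np.linalg.norm(r) ** 3
    v_half = v + 0.5 * dt * a
    r_new = r + dt * v_half
    a_new = -GM_SUN * r_new / np.linalg.norm(r_new) ** 3
    v_new = v_half + 0.5 * dt * a_new
    return r_new, v_new

for _ in range(4380):                  # roughly one Jovian year of daily steps
    r, v = step(r, v, dt)

Everything else about the planet, its weather, its composition, is abstracted away; the six numbers are the representation in which prediction becomes possible.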
1405
01:20:22,140 --> 01:20:23,740
So, is the proposal essentially
1406
01:20:23,740 --> 01:20:26,500
to do this hierarchical planning with hierarchical world
1407
01:20:26,500 --> 01:20:27,500
models as well?
1408
01:20:27,500 --> 01:20:28,000
Yeah.
1409
01:20:28,000 --> 01:20:28,500
OK.
1410
01:20:28,500 --> 01:20:29,000
Exactly.
1411
01:20:29,000 --> 01:20:29,500
Awesome.
1412
01:20:29,500 --> 01:20:31,900
Have a system that can build multiple levels of abstractions.
1413
01:20:31,900 --> 01:20:32,480
Great.
1414
01:20:32,480 --> 01:20:32,940
Thanks.
1415
01:20:32,940 --> 01:20:36,040
Which is really the idea behind deep learning, by the way.
1416
01:20:36,040 --> 01:20:36,540
OK.
1417
01:20:36,540 --> 01:20:38,120
We'll have two more questions, then we'll stop.
1418
01:20:38,120 --> 01:20:40,600
So we'll take one from there and one from there.
1419
01:20:40,600 --> 01:20:40,880
Yeah.
1420
01:20:40,880 --> 01:20:41,700
Hi.
1421
01:20:41,700 --> 01:20:45,640
My question is about the one type of generative model
1422
01:20:45,640 --> 01:20:49,820
that you haven't covered, which is the diffusion models, which
1423
01:20:49,820 --> 01:20:56,560
I believe are quite different from the generative models
1424
01:20:56,560 --> 01:21:00,200
that you mentioned, because they are more implicit and
1425
01:21:00,200 --> 01:21:04,360
they don't predict the explicit probability distribution
1426
01:21:04,360 --> 01:21:09,180
like the LLMs or VAEs or all the other generative ones that you
1427
01:21:09,180 --> 01:21:13,880
mentioned. What is your perspective on the potential of
1428
01:21:13,880 --> 01:21:19,820
those models, especially since they have some connection
1429
01:21:19,820 --> 01:21:26,540
to hierarchical planning, as you said, because when you use them for generating an image, like
1430
01:21:26,540 --> 01:21:32,580
in the first few time steps, it actually generates like very high level details and then on the
1431
01:21:32,580 --> 01:21:37,060
later time steps, it fills in the details, like the smaller details.
1432
01:21:37,060 --> 01:21:38,060
Yeah.
1433
01:21:38,060 --> 01:21:39,060
Okay.
1434
01:21:39,060 --> 01:21:41,480
So diffusion models can be seen as generative or not.
1435
01:21:41,480 --> 01:21:45,840
But the way to understand them, I think, is the following.
1436
01:21:45,840 --> 01:21:53,240
In a space of representations or images or whatever it is, you have, let's say, a manifold
1437
01:21:53,240 --> 01:21:56,080
of data.
1438
01:21:56,080 --> 01:22:00,420
Let's say natural images if you want to train an image generation system.
1439
01:22:00,420 --> 01:22:04,860
Or perhaps representations that are extracted by an encoder of the type that I talked about.
1440
01:22:04,860 --> 01:22:10,800
And that is basically a subset within the full space.
1441
01:22:10,800 --> 01:22:14,660
What a diffusion model does is that you give it a random vector in that space and it will
1442
01:22:14,660 --> 01:22:16,980
bring you back to that manifold.
1443
01:22:16,980 --> 01:22:21,800
Okay, and it will do this by training a vector field
1444
01:22:21,800 --> 01:22:26,300
so that at every location, random location in that space,
1445
01:22:26,300 --> 01:22:29,560
there is a vector that basically takes you back
1446
01:22:29,560 --> 01:22:32,900
to that manifold, perhaps in multiple steps.
1447
01:22:32,900 --> 01:22:34,760
Okay, that's what it does in the end.
1448
01:22:34,760 --> 01:22:36,860
It's trained in a particular way by reversing,
1449
01:22:38,960 --> 01:22:43,620
you know, a noisification chain, but that's what it does.
1450
01:22:43,620 --> 01:22:49,740
Now that's actually a particular way of implementing energy-based models of the types that I described.
1451
01:22:49,740 --> 01:22:53,520
Because you can think of this manifold of data as being kind of the minimum of an energy
1452
01:22:53,520 --> 01:22:54,600
function.
1453
01:22:54,600 --> 01:22:58,740
And if you had an energy function, you could compute the gradient of that energy function,
1454
01:22:58,740 --> 01:23:02,900
and that gradient of the energy function will take you back to that manifold.
1455
01:23:02,900 --> 01:23:09,020
So that's the energy-based view of inference or denoising or restoration or whatever you
1456
01:23:09,020 --> 01:23:11,760
want.
1457
01:23:11,760 --> 01:23:17,760
And diffusion models basically instead of having an energy function that you compute
1458
01:23:17,760 --> 01:23:22,440
the gradient of, they directly learn the vector field that basically would be the gradient
1459
01:23:22,440 --> 01:23:24,440
of that energy function.
1460
01:23:24,440 --> 01:23:25,620
That's the way to understand it.
1461
01:23:25,620 --> 01:23:27,880
So it's not disconnected from what I talked about.
1462
01:23:27,880 --> 01:23:32,160
It can be used usefully in the context of what I talked about.
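A hedged toy sketch of that picture, with a two-dimensional "manifold" (points on a circle) standing in for natural images: a small network is trained so that at noisy points it outputs a vector pointing back toward the data, and generation is just following that learned field from a random start. This is a caricature of denoising-style training, not a production diffusion model.

import math
import torch
import torch.nn as nn

# Toy 2-D "data manifold": points on the unit circle.
def sample_data(n):
    theta = 2 * math.pi * torch.rand(n)
    return torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)

field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

for _ in range(2000):
    x = sample_data(256)
    noise = 0.3 * torch.randn_like(x)
    # Learn a vector field that points from the noisy point back toward the data,
    # i.e. an estimate of the negative gradient of an implicit energy function.
    loss = nn.functional.mse_loss(field(x + noise), -noise)
    opt.zero_grad(); loss.backward(); opt.step()

# "Generation": start from a random vector and follow the field back to the manifold.
x = 2.0 * torch.randn(1, 2)
for _ in range(50):
    with torch.no_grad():
        x = x + 0.2 * field(x)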
1463
01:23:32,160 --> 01:23:34,520
And what about nature?
1464
01:23:34,520 --> 01:23:35,520
Yeah.
1465
01:23:35,520 --> 01:23:38,520
My name is Leon.
1466
01:23:38,520 --> 01:23:40,960
I really want to thank you for the talk.
1467
01:23:40,960 --> 01:23:44,960
My question was sort of about these world models you were talking about,
1468
01:23:44,960 --> 01:23:50,960
especially in terms of trying to get to actual like cat level or animal type intelligence.
1469
01:23:50,960 --> 01:24:01,960
So like in terms of like a giraffe, as soon as it's born, something is in its mind that lets it be able to run or even walk within moments.
1470
01:24:01,960 --> 01:24:07,960
And I think part of it is because the world model it has constrains the type of actions it takes,
1471
01:24:07,960 --> 01:24:14,460
That kind of thing seems to be what you're almost doing with DINO, trying to do these rule-based approaches.
1472
01:24:15,000 --> 01:24:18,900
I'm just wondering how do these world models evolve over time?
1473
01:24:19,040 --> 01:24:22,680
Like how much variability does it have?
1474
01:24:22,680 --> 01:24:30,000
Yeah, I mean so clearly you need the world model to be adjusted as you go, right?
1475
01:24:37,960 --> 01:24:42,200
particular force to grab it, but then as I grab it, I realize it's not that full, so
1476
01:24:42,200 --> 01:24:43,640
it's lighter.
1477
01:24:43,640 --> 01:24:49,580
I can adjust my world model of that system and then adjust my actions as a function of
1478
01:24:49,580 --> 01:24:50,580
this very quickly.
1479
01:24:50,580 --> 01:24:51,580
It's not learning, actually.
1480
01:24:51,580 --> 01:24:53,560
It's just adjusting a few parameters.
1481
01:24:53,560 --> 01:24:57,060
But in other situations, you need to learn.
1482
01:24:57,060 --> 01:25:01,740
You need to adapt your world model for the situation.
1483
01:25:01,740 --> 01:25:06,600
Even if you have a powerful world model, you're not going to be able to train it for all possible
1484
01:25:06,600 --> 01:25:10,780
situations and all possible configurations of the world.
1485
01:25:10,780 --> 01:25:14,920
And so there are parts of the state space
1486
01:25:14,920 --> 01:25:18,120
where your model is gonna be inaccurate.
1487
01:25:18,120 --> 01:25:20,780
And the system, if you want the system to plan accurately,
1488
01:25:20,780 --> 01:25:23,580
it needs to be able to detect when that happens.
1489
01:25:23,580 --> 01:25:26,940
So basically only plan within regions of the space
1490
01:25:26,940 --> 01:25:29,720
where the prediction of its own model is good,
1491
01:25:29,720 --> 01:25:31,660
and then adjust its model as it goes
1492
01:25:31,660 --> 01:25:33,800
if it's not the case.
1493
01:25:33,800 --> 01:25:36,400
That's where you need reinforcement learning basically.
1494
01:25:36,400 --> 01:25:38,500
Can I just ask a clarification question?
1495
01:25:38,900 --> 01:25:43,640
I think there are cases where I'm really confident in what I'm able to do,
1496
01:25:43,900 --> 01:25:49,640
but as soon as, let's say, I throw a ball, the physics of that ball is something really unpredictable.
1497
01:25:50,100 --> 01:25:52,280
How would you differentiate that in your world model?
1498
01:25:52,520 --> 01:25:53,340
Are there parameters?
1499
01:25:53,520 --> 01:25:57,100
Yeah, so this is adaptation on the fly of your world model
1500
01:25:57,100 --> 01:26:01,080
or perhaps adjustment of a few latent variables that represent what you don't know about the world,
1501
01:26:01,200 --> 01:26:02,960
like the wind speed and things like that.
1502
01:26:02,960 --> 01:26:06,500
So, I mean, there's various mechanisms for this.
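One way to picture that on-the-fly adjustment is the sketch below, assuming a frozen toy dynamics model with a single unknown latent (a hypothetical "wind" term): the latent is fit by gradient descent on the prediction error over the last few observations, and planning then uses the updated model. The function and variable names are illustrative, not the speaker's system.

import torch

def world_model(state, action, wind):
    # A frozen, pretrained dynamics model would go here; this toy version just
    # shifts the predicted next state by the unknown wind term.
    return state + action + wind

# A few recent (state, action, next_state) observations from the real world.
states  = torch.tensor([[0.0], [1.2], [2.1]])
actions = torch.tensor([[1.0], [1.0], [1.0]])
nexts   = torch.tensor([[1.2], [2.1], [3.3]])

wind = torch.zeros(1, requires_grad=True)   # latent variable for what we don't know
opt = torch.optim.SGD([wind], lr=0.5)

for _ in range(100):
    pred = world_model(states, actions, wind)
    loss = torch.nn.functional.mse_loss(pred, nexts)   # model error on recent experience
    opt.zero_grad(); loss.backward(); opt.step()

# 'wind' now absorbs the average systematic prediction error, and planning can
# proceed with world_model(state, action, wind) instead of retraining the whole model.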
1503
01:26:07,780 --> 01:26:09,600
Okay, let's thank the speaker again.