1 | |
00:00:00,000 --> 00:00:10,480 | |
So welcome all. Welcome to this distinguished lecture series in AI. I'm Vishal Mishra. I'm the | |
2 | |
00:00:10,480 --> 00:00:15,040 | |
Vice Dean for Computing and AI in Columbia Engineering. This is the second lecture in our | |
3 | |
00:00:15,040 --> 00:00:20,420 | |
series. We seem to have a reasonably full house. People are still streaming in. So before we start, | |
4 | |
00:00:20,480 --> 00:00:26,020 | |
I'd like to invite Dean Shifu Chang to give some opening remarks. All right. Good morning, everyone. | |
5 | |
00:00:26,020 --> 00:00:28,020 | |
Welcome to our | |
6 | |
00:00:28,020 --> 00:00:34,580 | |
It's really exciting. | |
7 | |
00:00:35,060 --> 00:00:38,560 | |
This is the first time I see we have an overflow space used today. | |
8 | |
00:00:38,720 --> 00:00:40,980 | |
Really so exciting about the topic and speaker. | |
9 | |
00:00:41,540 --> 00:00:46,960 | |
I want to thank Michelle and the team for organizing the AI lecture series this semester and throughout the year. | |
10 | |
00:00:47,320 --> 00:00:51,560 | |
I want to thank our president Katrina Armstrong for coming to support our event today. | |
11 | |
00:00:51,560 --> 00:00:59,480 | |
And as Vishal mentioned, this is the second in our AI lecture series across the school | |
12 | |
00:00:59,480 --> 00:01:03,400 | |
and is associated with the university initiative in AI. | |
13 | |
00:01:03,580 --> 00:01:07,220 | |
That's one of the priorities that President Armstrong is leading us | |
14 | |
00:01:07,220 --> 00:01:10,060 | |
for the school university-wide effort here. | |
15 | |
00:01:10,720 --> 00:01:13,640 | |
Last month, we launched this new AI lecture series, | |
16 | |
00:01:14,160 --> 00:01:16,620 | |
starting with our faculty member, Pierre Gentine, | |
17 | |
00:01:16,740 --> 00:01:20,480 | |
to talk about how AI can have an impact in different disciplines. | |
18 | |
00:01:20,480 --> 00:01:23,860 | |
And so last month we launched AI and climate projection. | |
19 | |
00:01:23,860 --> 00:01:25,320 | |
And today we're so excited, | |
20 | |
00:01:25,320 --> 00:01:28,220 | |
Dr. Yann LeCun is here to share his vision, | |
21 | |
00:01:28,220 --> 00:01:30,440 | |
his insight on a very exciting topic. | |
22 | |
00:01:30,440 --> 00:01:31,960 | |
You have seen his title. | |
23 | |
00:01:31,960 --> 00:01:36,960 | |
I have seen Yann talking many times at CVPR, ICML, | |
24 | |
00:01:36,960 --> 00:01:38,220 | |
learning representations (ICLR), | |
25 | |
00:01:38,220 --> 00:01:41,780 | |
but today's topic is particularly intriguing. | |
26 | |
00:01:41,780 --> 00:01:45,120 | |
And his presence, as you can see from the audience today, | |
27 | |
00:01:45,120 --> 00:01:47,420 | |
we had to open up an overflow space. | |
28 | |
00:01:47,420 --> 00:01:49,040 | |
The event, once it was announced, | |
29 | |
00:01:49,040 --> 00:01:50,460 | |
three minutes, sold out. | |
30 | |
00:01:50,460 --> 00:01:52,820 | |
You are the lucky ones, okay. | |
31 | |
00:01:52,820 --> 00:01:55,700 | |
And the lecture series, one of the efforts | |
32 | |
00:01:55,700 --> 00:01:59,420 | |
around AI and university, we are pursuing advances | |
33 | |
00:01:59,420 --> 00:02:02,120 | |
in the fundamental area, which is covered | |
34 | |
00:02:02,120 --> 00:02:03,560 | |
by today's lecture. | |
35 | |
00:02:03,560 --> 00:02:06,540 | |
We're also pursuing the impact in different discipline | |
36 | |
00:02:06,540 --> 00:02:10,940 | |
in collaboration among all the 17 schools at Columbia. | |
37 | |
00:02:10,940 --> 00:02:14,380 | |
Climate, business, finance, policy, journalism, you name it. | |
38 | |
00:02:14,380 --> 00:02:16,600 | |
So we work with industry, community, | |
39 | |
00:02:16,600 --> 00:02:22,340 | |
create centers on AI and finance, AI on climate, AI on sports, AI and policy. | |
40 | |
00:02:22,740 --> 00:02:23,920 | |
So that's our effort today. | |
41 | |
00:02:24,060 --> 00:02:30,500 | |
We create a new course on AI in context to teach AI in the context of humanity, in literature, | |
42 | |
00:02:30,760 --> 00:02:32,340 | |
in music, and philosophy. | |
43 | |
00:02:32,780 --> 00:02:36,460 | |
Today's topic, how could machine reach human-level intelligence? | |
44 | |
00:02:36,780 --> 00:02:40,380 | |
Just by reading the title makes me so intrigued, so excited. | |
45 | |
00:02:40,800 --> 00:02:45,580 | |
So without further ado, let me invite Vishal, our Vice Dean of AI and Computing, | |
46 | |
00:02:45,580 --> 00:02:47,900 | |
to have an introduction of our speaker, | |
47 | |
00:02:47,900 --> 00:02:49,240 | |
Yann LeCun, today. | |
48 | |
00:02:49,240 --> 00:02:50,080 | |
He's here. | |
49 | |
00:02:52,820 --> 00:02:53,660 | |
Thanks, Shifu. | |
50 | |
00:02:55,940 --> 00:02:58,360 | |
So, Yann, of course, needs no introduction. | |
51 | |
00:03:03,480 --> 00:03:04,660 | |
But just to embarrass him, | |
52 | |
00:03:04,660 --> 00:03:08,140 | |
I'll give a brief introduction of Yann. | |
53 | |
00:03:08,140 --> 00:03:11,360 | |
Now, this may come as a surprise to a lot of you, | |
54 | |
00:03:11,360 --> 00:03:13,280 | |
but it's true, | |
55 | |
00:03:13,280 --> 00:03:15,620 | |
and you'll never guess it from his accent. | |
56 | |
00:03:15,620 --> 00:03:17,000 | |
Yann is actually French. | |
57 | |
00:03:18,080 --> 00:03:22,680 | |
He got his PhD from the Sorbonne in 1987, | |
58 | |
00:03:22,680 --> 00:03:24,280 | |
and in his PhD thesis, | |
59 | |
00:03:24,280 --> 00:03:28,020 | |
he proposed an early form of back propagation. | |
60 | |
00:03:28,020 --> 00:03:29,600 | |
Now back propagation is the way | |
61 | |
00:03:29,600 --> 00:03:32,540 | |
all neural networks are trained now, | |
62 | |
00:03:32,540 --> 00:03:36,440 | |
and it sort of started from his PhD thesis. | |
63 | |
00:03:37,620 --> 00:03:41,460 | |
He joined AT&T Bell Labs in 1988. | |
64 | |
00:03:41,460 --> 00:03:44,020 | |
Before that, he spent a few months or a year | |
65 | |
00:03:44,020 --> 00:03:46,920 | |
with Jeff Hinton working as a postdoc. | |
66 | |
00:03:49,840 --> 00:03:52,180 | |
I, there was an alarm, okay. | |
67 | |
00:03:52,180 --> 00:03:54,620 | |
And he joined AT&T Bell Labs in 1988. | |
68 | |
00:03:55,860 --> 00:03:58,040 | |
Next year, he sort of stunned the world | |
69 | |
00:03:58,040 --> 00:04:00,160 | |
with this handwriting recognition system. | |
70 | |
00:04:00,160 --> 00:04:01,460 | |
And you'll see a video of that. | |
71 | |
00:04:11,460 --> 00:04:38,300 | |
. | |
72 | |
00:04:38,300 --> 00:04:40,500 | |
This was absolutely incredible at that time. | |
73 | |
00:04:45,300 --> 00:04:47,900 | |
And there you see Yann looking slightly different. | |
74 | |
00:05:05,420 --> 00:05:08,120 | |
After that came a long AI and neural nets | |
75 | |
00:05:08,120 --> 00:05:13,120 | |
winter. Yann joined AT&T Research in 1996, | |
76 | |
00:05:14,320 --> 00:05:15,400 | |
but he never gave up. | |
77 | |
00:05:15,400 --> 00:05:19,760 | |
He continued working on convolutional neural networks, CNNs, | |
78 | |
00:05:19,760 --> 00:05:23,960 | |
which were what he used for the handwriting recognition system. | |
79 | |
00:05:23,960 --> 00:05:27,840 | |
Around 2012, the deep learning revolution happened, | |
80 | |
00:05:27,840 --> 00:05:29,680 | |
and now CNNs are everywhere, | |
81 | |
00:05:29,680 --> 00:05:31,680 | |
whether in his friend Elon Musk's cars, | |
82 | |
00:05:33,580 --> 00:05:35,460 | |
some people got what I meant, | |
83 | |
00:05:35,460 --> 00:05:40,460 | |
or Google Photos, everyone uses CNNs. | |
84 | |
00:05:42,300 --> 00:05:47,300 | |
In 2013, Yann joined Meta AI as the director of their AI lab | |
85 | |
00:05:48,000 --> 00:05:49,940 | |
and now he is the chief scientist. | |
86 | |
00:05:49,940 --> 00:05:52,520 | |
In 2018, he also won the Turing Award | |
87 | |
00:05:52,520 --> 00:05:54,840 | |
along with Jeff Hinton and Yoshua Bengio | |
88 | |
00:05:56,220 --> 00:06:00,220 | |
for his work in deep learning and artificial intelligence. | |
89 | |
00:06:00,220 --> 00:06:02,560 | |
In fact, Jeff was here yesterday. | |
90 | |
00:06:02,560 --> 00:06:04,580 | |
He was on campus and he was walking around | |
91 | |
00:06:04,580 --> 00:06:06,180 | |
And people were asking him for selfies. | |
92 | |
00:06:06,180 --> 00:06:08,980 | |
So he wanted to be here. | |
93 | |
00:06:08,980 --> 00:06:10,660 | |
Unfortunately, something urgent came up, | |
94 | |
00:06:10,660 --> 00:06:12,200 | |
so he couldn't be here. | |
95 | |
00:06:12,200 --> 00:06:16,940 | |
So as I mentioned, Yann won the Turing Award in 2013, or 2018. | |
96 | |
00:06:16,940 --> 00:06:19,860 | |
And this is a Turing Award for computer science, | |
97 | |
00:06:19,860 --> 00:06:22,580 | |
not for physics or chemistry, which are also known as Nobel | |
98 | |
00:06:22,580 --> 00:06:24,980 | |
prizes these days. | |
99 | |
00:06:24,980 --> 00:06:27,080 | |
This is the original one. | |
100 | |
00:06:27,080 --> 00:06:28,580 | |
And he won the award in 2018. | |
101 | |
00:06:28,580 --> 00:06:33,220 | |
And he's also big into the selfie game. | |
102 | |
00:06:33,220 --> 00:06:34,820 | |
I took a selfie with him that day. | |
103 | |
00:06:36,640 --> 00:06:39,080 | |
And now with that, I'll invite Yann to tell us | |
104 | |
00:06:39,080 --> 00:06:40,600 | |
about human level intelligence. | |
105 | |
00:06:48,720 --> 00:06:52,180 | |
Thank you very much for this amazing introduction. | |
106 | |
00:06:54,180 --> 00:06:56,740 | |
A real pleasure to be here. | |
107 | |
00:06:56,740 --> 00:06:59,620 | |
The good thing to come give a talk here is that | |
108 | |
00:07:00,740 --> 00:07:02,000 | |
I didn't have to fly. | |
109 | |
00:07:02,000 --> 00:07:08,180 | |
Although if you ask people from downtown, they rarely go above 23rd Street. | |
110 | |
00:07:11,700 --> 00:07:18,540 | |
So, yeah, I mean, I worked really hard to lose my French accent in the last four decades or so, | |
111 | |
00:07:18,680 --> 00:07:23,660 | |
three and a half decades. But I just recently learned that if you speak English with a French | |
112 | |
00:07:32,000 --> 00:07:35,320 | |
I should speak with a very strong French accent. | |
113 | |
00:07:36,120 --> 00:07:40,600 | |
And perhaps, appear intelligent. | |
114 | |
00:07:40,600 --> 00:07:46,800 | |
Okay. What should appear intelligent is machines, | |
115 | |
00:07:46,800 --> 00:07:49,320 | |
and they do appear intelligent. | |
116 | |
00:07:49,320 --> 00:07:52,600 | |
We, a lot of people give them IQ, | |
117 | |
00:07:52,600 --> 00:07:53,640 | |
whatever that means, | |
118 | |
00:07:53,640 --> 00:07:56,520 | |
that is actually much higher than they deserve. | |
119 | |
00:07:56,520 --> 00:08:00,160 | |
We are nowhere near being able to reach | |
120 | |
00:08:00,160 --> 00:08:03,520 | |
human intelligence or human level intelligence with machines, | |
121 | |
00:08:03,520 --> 00:08:05,780 | |
what some people call AGI, | |
122 | |
00:08:05,780 --> 00:08:07,800 | |
Artificial General Intelligence. | |
123 | |
00:08:07,800 --> 00:08:09,660 | |
I hate that term. | |
124 | |
00:08:09,660 --> 00:08:13,040 | |
I've been trying to fight against it. | |
125 | |
00:08:13,040 --> 00:08:16,600 | |
The reason is not that it's impossible for | |
126 | |
00:08:16,600 --> 00:08:17,880 | |
a machine to reach human intelligence. | |
127 | |
00:08:17,880 --> 00:08:18,720 | |
Of course, it's possible. | |
128 | |
00:08:18,720 --> 00:08:20,720 | |
There's no question at some point we'll have | |
129 | |
00:08:20,720 --> 00:08:23,300 | |
machines that are as intelligent as humans in | |
130 | |
00:08:23,300 --> 00:08:25,080 | |
all the domains where humans are intelligent. | |
131 | |
00:08:25,080 --> 00:08:27,780 | |
There's no question that they will go beyond this. | |
132 | |
00:08:27,780 --> 00:08:32,480 | |
But it's just because human intelligence is not general at all. | |
133 | |
00:08:32,480 --> 00:08:34,960 | |
We are very specialized animals. | |
134 | |
00:08:34,960 --> 00:08:41,220 | |
We have a hard time imagining that we are specialized because all the problems | |
135 | |
00:08:41,220 --> 00:08:48,080 | |
that we can fathom or imagine are problems that we can fathom or imagine. | |
136 | |
00:08:48,080 --> 00:08:54,940 | |
But there are many, many more problems that we can't even imagine in our wildest dreams. | |
137 | |
00:08:54,940 --> 00:08:59,500 | |
and so it makes us appear generally intelligent. | |
138 | |
00:08:59,500 --> 00:09:01,760 | |
We're not. We're specialized. | |
139 | |
00:09:01,760 --> 00:09:03,520 | |
So we should lose that term, | |
140 | |
00:09:03,520 --> 00:09:05,300 | |
artificial general intelligence. | |
141 | |
00:09:05,300 --> 00:09:08,980 | |
I prefer the term human level intelligence or a code name | |
142 | |
00:09:08,980 --> 00:09:15,480 | |
that we've adopted inside Meta, which is the acronym AMI, | |
143 | |
00:09:15,480 --> 00:09:18,620 | |
which means Advanced Machine Intelligence, | |
144 | |
00:09:18,620 --> 00:09:21,220 | |
which is kind of a little more loose. | |
145 | |
00:09:21,220 --> 00:09:24,020 | |
Also, we pronounce it AMI. | |
146 | |
00:09:24,020 --> 00:09:27,900 | |
Which in French means friend. | |
147 | |
00:09:28,140 --> 00:09:30,340 | |
Makes sense. | |
148 | |
00:09:30,340 --> 00:09:33,380 | |
Okay. So how can we ever reach | |
149 | |
00:09:33,380 --> 00:09:35,260 | |
human level intelligence with machines? | |
150 | |
00:09:35,260 --> 00:09:37,940 | |
Machines that can learn, of course, | |
151 | |
00:09:37,940 --> 00:09:40,220 | |
can remember, understand the physical world, | |
152 | |
00:09:40,220 --> 00:09:43,140 | |
have common sense, can plan, can reason, | |
153 | |
00:09:43,140 --> 00:09:46,020 | |
are behaving properly, | |
154 | |
00:09:46,020 --> 00:09:50,500 | |
not being unruly, dangerous, etc. | |
155 | |
00:09:50,500 --> 00:09:52,940 | |
And the first question we should ask ourselves is, | |
156 | |
00:09:52,940 --> 00:09:54,620 | |
Why would we want to build this? | |
157 | |
00:09:54,620 --> 00:09:57,260 | |
So obviously there is a big scientific question of what is | |
158 | |
00:09:57,260 --> 00:09:59,580 | |
intelligence and the best way to | |
159 | |
00:09:59,580 --> 00:10:04,060 | |
validate any theory we have about intelligence is to | |
160 | |
00:10:04,060 --> 00:10:07,500 | |
build an artifact that actually implements it. | |
161 | |
00:10:07,500 --> 00:10:11,500 | |
That's a very engineering approach to science if you want. | |
162 | |
00:10:11,500 --> 00:10:14,700 | |
But there is another good reason and the other good reason is that | |
163 | |
00:10:14,700 --> 00:10:20,140 | |
we need human level intelligence to amplify human intelligence. | |
164 | |
00:10:20,140 --> 00:10:24,620 | |
There's going to be a future in which we run | |
165 | |
00:10:24,620 --> 00:10:29,700 | |
around with AI assistants with us at all times, | |
166 | |
00:10:29,700 --> 00:10:32,460 | |
so we can ask them any question. | |
167 | |
00:10:32,460 --> 00:10:34,280 | |
They can answer any question we have. | |
168 | |
00:10:34,280 --> 00:10:35,680 | |
They can help us in our daily lives. | |
169 | |
00:10:35,680 --> 00:10:38,100 | |
They can solve problems for us. | |
170 | |
00:10:38,100 --> 00:10:40,060 | |
This will amplify human intelligence, | |
171 | |
00:10:40,060 --> 00:10:42,100 | |
perhaps in the way that the printing press has | |
172 | |
00:10:42,100 --> 00:10:45,320 | |
amplified human intelligence in the 15th century. | |
173 | |
00:10:45,320 --> 00:10:49,420 | |
So we need this for humanity. | |
174 | |
00:10:49,420 --> 00:10:53,780 | |
In fact, I'm wearing a pair of smart glasses right now. | |
175 | |
00:10:53,780 --> 00:10:56,540 | |
I can ask it questions. | |
176 | |
00:10:56,540 --> 00:10:57,660 | |
It goes through Meta AI, | |
177 | |
00:10:57,660 --> 00:10:59,500 | |
which is the product version of | |
178 | |
00:10:59,500 --> 00:11:02,060 | |
Llama 3 that many of you have heard of. | |
179 | |
00:11:02,060 --> 00:11:04,340 | |
I can ask it various things. | |
180 | |
00:11:04,340 --> 00:11:06,780 | |
So let me ask it something. | |
181 | |
00:11:06,780 --> 00:11:09,060 | |
I'm not going to use the microphone. | |
182 | |
00:11:09,060 --> 00:11:13,780 | |
Hey, Meta. Take a picture. | |
183 | |
00:11:13,780 --> 00:11:16,700 | |
You see that little light flash? | |
184 | |
00:11:16,700 --> 00:11:19,420 | |
Okay, you're all in the picture. | |
185 | |
00:11:19,600 --> 00:11:21,900 | |
You'll be on social networks soon. | |
186 | |
00:11:26,000 --> 00:11:28,660 | |
So, you know, I could ask it, you know, | |
187 | |
00:11:28,720 --> 00:11:30,020 | |
more complex questions, obviously. | |
188 | |
00:11:31,060 --> 00:11:36,720 | |
And this thing can also recognize things through the camera. | |
189 | |
00:11:36,840 --> 00:11:39,500 | |
So you can ask it, what am I looking at? | |
190 | |
00:11:39,860 --> 00:11:41,180 | |
What is the species of plant? | |
191 | |
00:11:42,340 --> 00:11:45,120 | |
You know, you can look at a menu in Japanese | |
192 | |
00:11:45,120 --> 00:11:46,340 | |
and it will translate it for you. | |
193 | |
00:11:46,340 --> 00:11:49,400 | |
So, you know, these kinds of assistants are coming. | |
194 | |
00:11:49,400 --> 00:11:51,020 | |
They're still pretty stupid, | |
195 | |
00:11:51,020 --> 00:11:53,680 | |
but they're already useful. | |
196 | |
00:11:53,680 --> 00:11:56,080 | |
But there is a future maybe, | |
197 | |
00:11:56,080 --> 00:11:57,880 | |
you know, 10, 20 years from now, | |
198 | |
00:11:57,880 --> 00:12:00,100 | |
where they will be really smart and they will | |
199 | |
00:12:00,100 --> 00:12:01,240 | |
assist us in our daily lives. | |
200 | |
00:12:01,240 --> 00:12:03,900 | |
So we need those systems to have human level intelligence, | |
201 | |
00:12:03,900 --> 00:12:05,940 | |
because that's the best way for them to not be | |
202 | |
00:12:05,940 --> 00:12:08,500 | |
frustrating for us to interact with. | |
203 | |
00:12:08,500 --> 00:12:10,340 | |
Okay. So on the one hand, | |
204 | |
00:12:10,340 --> 00:12:12,720 | |
there is the really interesting scientific question | |
205 | |
00:12:12,720 --> 00:12:14,800 | |
of what is intelligence. | |
206 | |
00:12:14,800 --> 00:12:18,560 | |
In the middle there is the technological challenge | |
207 | |
00:12:18,560 --> 00:12:20,480 | |
of building intelligent machines. | |
208 | |
00:12:20,480 --> 00:12:22,980 | |
Then at the other end, it's actually useful. | |
209 | |
00:12:22,980 --> 00:12:26,720 | |
It will actually be useful for people and for humanity more generally. | |
210 | |
00:12:26,720 --> 00:12:30,660 | |
So all of the conditions are met. | |
211 | |
00:12:30,660 --> 00:12:34,640 | |
And the most important condition is that there are people with | |
212 | |
00:12:34,640 --> 00:12:41,820 | |
a lot of resources willing to actually invest for this to be true, like Meta. | |
213 | |
00:12:41,820 --> 00:12:52,400 | |
So, the characteristics that we want of those machines is that they need to be able to understand the physical world. | |
214 | |
00:12:52,660 --> 00:12:55,520 | |
Current AI systems do not understand the physical world. | |
215 | |
00:12:57,560 --> 00:13:01,900 | |
They don't understand the physical world nearly as well as your house cat. | |
216 | |
00:13:03,520 --> 00:13:07,320 | |
And so, I've been saying, you know, and of course, newspapers can have like this kind of title. | |
217 | |
00:13:07,320 --> 00:13:11,240 | |
You know, Yann says AI is stupider than a cat. | |
218 | |
00:13:11,240 --> 00:13:15,540 | |
It's true, actually. | |
219 | |
00:13:15,540 --> 00:13:19,400 | |
We need AI systems that have persistent memory. | |
220 | |
00:13:19,400 --> 00:13:22,940 | |
We need them to be able to plan complex action sequences, | |
221 | |
00:13:22,940 --> 00:13:25,660 | |
which current systems are completely incapable of doing. | |
222 | |
00:13:25,660 --> 00:13:27,760 | |
We need them to be able to reason, | |
223 | |
00:13:27,760 --> 00:13:29,740 | |
and we need them to be controllable and safe. | |
224 | |
00:13:29,740 --> 00:13:32,140 | |
So basically, and by design, | |
225 | |
00:13:32,140 --> 00:13:35,580 | |
not by fine-tuning like it's done at the moment. | |
226 | |
00:13:37,040 --> 00:13:40,740 | |
That requires essentially new principles that are | |
227 | |
00:13:40,740 --> 00:13:44,980 | |
different from what current AI systems really are based on. | |
228 | |
00:13:44,980 --> 00:13:48,980 | |
So current systems, most of them anyway, | |
229 | |
00:13:48,980 --> 00:13:51,800 | |
perform inference by propagating signals through | |
230 | |
00:13:51,800 --> 00:13:54,100 | |
a bunch of layers of a neural net. | |
231 | |
00:13:54,100 --> 00:13:58,640 | |
I'm a big fan of that obviously, but it's very limited. | |
232 | |
00:13:58,640 --> 00:14:03,260 | |
There's only a small number of input-output functions that can | |
233 | |
00:14:03,260 --> 00:14:06,100 | |
be efficiently represented by feed-forward | |
234 | |
00:14:06,100 --> 00:14:09,980 | |
propagation through a bunch of layers in a neural net. | |
235 | |
00:14:09,980 --> 00:14:13,300 | |
There's a much more general approach to inference, | |
236 | |
00:14:13,300 --> 00:14:17,140 | |
which is not just running feed forward to a bunch of layers, | |
237 | |
00:14:17,140 --> 00:14:19,480 | |
but is based on optimization. | |
238 | |
00:14:19,480 --> 00:14:22,900 | |
So basically, there's an observation. | |
239 | |
00:14:22,900 --> 00:14:28,780 | |
You give the system a proposal for an output, | |
240 | |
00:14:28,780 --> 00:14:31,380 | |
and the system tells you to what extent | |
241 | |
00:14:31,380 --> 00:14:34,340 | |
the output is compatible with the observation. | |
242 | |
00:14:34,340 --> 00:14:37,500 | |
Okay. So I give you a picture of an elephant. | |
243 | |
00:14:37,500 --> 00:14:42,040 | |
I put the representation of the label elephant or the text, | |
244 | |
00:14:42,040 --> 00:14:43,120 | |
and the system tells you, | |
245 | |
00:14:43,120 --> 00:14:45,060 | |
yeah, those two things are compatible. | |
246 | |
00:14:45,060 --> 00:14:49,600 | |
The label elephant is a good label for that image. | |
247 | |
00:14:49,600 --> 00:14:51,620 | |
If you put the picture of a table, | |
248 | |
00:14:51,620 --> 00:14:53,860 | |
it says no, it's incompatible. | |
249 | |
00:14:53,860 --> 00:14:56,980 | |
So if you have a system that basically measures | |
250 | |
00:14:56,980 --> 00:15:00,020 | |
the compatibility between an input and an output, | |
251 | |
00:15:00,020 --> 00:15:02,440 | |
then through optimization and search, | |
252 | |
00:15:02,440 --> 00:15:06,440 | |
you can find an output that is most compatible with the input. | |
253 | |
00:15:06,440 --> 00:15:10,100 | |
This is intrinsically more powerful as an inference mechanism | |
254 | |
00:15:10,100 --> 00:15:13,440 | |
than just running feed forward through a bunch of layers. | |
255 | |
00:15:13,440 --> 00:15:16,720 | |
Because basically, any computational problem | |
256 | |
00:15:16,720 --> 00:15:19,260 | |
can be reduced to an optimization problem. | |
257 | |
00:15:19,260 --> 00:15:23,460 | |
So that's the very basic principle on | |
258 | |
00:15:23,460 --> 00:15:25,720 | |
which future AI system should be built. | |
259 | |
00:15:25,720 --> 00:15:27,940 | |
Not propagating through a bunch of layers, | |
260 | |
00:15:27,940 --> 00:15:30,040 | |
but optimizing the answer so that | |
261 | |
00:15:30,040 --> 00:15:31,680 | |
it's most compatible with the input. | |
262 | |
00:15:31,680 --> 00:15:34,440 | |
Of course, this will involve deep learning system, | |
263 | |
00:15:34,440 --> 00:15:36,160 | |
back propagation, all that stuff. | |
264 | |
00:15:36,160 --> 00:15:38,880 | |
But the inference mechanism is very different. | |
265 | |
00:15:38,880 --> 00:15:41,700 | |
Now, this is not a new idea by any means. | |
266 | |
00:15:41,700 --> 00:15:44,060 | |
This type of inference is what is | |
267 | |
00:15:44,060 --> 00:15:46,220 | |
very standard in probabilistic inference. | |
268 | |
00:15:46,220 --> 00:15:47,700 | |
For example, if you have a graphical model, | |
269 | |
00:15:47,700 --> 00:15:50,820 | |
Bayesian network, you know the value of certain variables, | |
270 | |
00:15:50,820 --> 00:15:53,340 | |
you can infer the value of the other variables by | |
271 | |
00:15:53,340 --> 00:15:56,400 | |
minimizing a negative log likelihood or something like that, | |
272 | |
00:15:56,400 --> 00:15:58,580 | |
or with some energy function. | |
273 | |
00:15:58,580 --> 00:16:01,180 | |
So it's a very standard thing to do. | |
274 | |
00:16:01,180 --> 00:16:02,780 | |
There's nothing innovative about this, | |
275 | |
00:16:02,780 --> 00:16:05,340 | |
but people have forgotten about the fact that this is | |
276 | |
00:16:05,340 --> 00:16:08,540 | |
really much more powerful than feed-forward propagation. | |
277 | |
00:16:08,540 --> 00:16:13,200 | |
The framework that I like to use to explain this is called energy-based models. | |
278 | |
00:16:13,200 --> 00:16:17,460 | |
So basically, the function that measures the compatibility between X and Y, | |
279 | |
00:16:17,460 --> 00:16:20,400 | |
input and output, is an energy function that takes | |
280 | |
00:16:20,400 --> 00:16:25,700 | |
low values when input and output are compatible and larger values when they're not. | |
281 | |
00:16:29,000 --> 00:16:33,600 | |
So the type of inference that can take place to find | |
282 | |
00:16:33,600 --> 00:16:35,840 | |
the output could be a number of different things. | |
283 | |
00:16:35,840 --> 00:16:41,480 | |
If the representation of the output is continuous, | |
284 | |
00:16:41,480 --> 00:16:43,460 | |
and if the modules that we're talking about, | |
285 | |
00:16:43,460 --> 00:16:45,620 | |
the objectives, all the modules | |
286 | |
00:16:45,620 --> 00:16:47,380 | |
inside of the system are differentiable, | |
287 | |
00:16:47,380 --> 00:16:49,820 | |
you can use gradient-based optimization to find | |
288 | |
00:16:49,820 --> 00:16:53,360 | |
the best answer, or at least one good answer. | |
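(A minimal sketch of the inference-by-optimization idea described above, assuming PyTorch; the toy energy function and the names `energy` and `infer` are illustrative, not from the talk.)

```python
# Sketch: inference as gradient descent over the output, not one feed-forward pass.
import torch

def energy(x, y):
    # Toy compatibility score: low when y is close to a "correct" function of x.
    # A real system would use a trained neural net here.
    return ((y - torch.sin(x)) ** 2).sum()

def infer(x, steps=200, lr=0.1):
    y = torch.zeros_like(x, requires_grad=True)   # initial guess for the output
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):                        # gradient-based search in output space
        opt.zero_grad()
        e = energy(x, y)
        e.backward()
        opt.step()
    return y.detach()

x = torch.linspace(0, 3, 5)
print(infer(x))   # converges toward sin(x), the minimum-energy answer for this toy E
```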
289 | |
00:16:53,360 --> 00:16:56,780 | |
But you can imagine that the output is discrete, | |
290 | |
00:16:56,780 --> 00:16:58,340 | |
combinatorial, and then you have to use | |
291 | |
00:16:58,340 --> 00:17:02,500 | |
other types of combinatorial optimization algorithms | |
292 | |
00:17:02,500 --> 00:17:06,900 | |
to figure out the best output. | |
293 | |
00:17:06,900 --> 00:17:07,960 | |
If that's the case, | |
294 | |
00:17:07,960 --> 00:17:12,280 | |
then you're talking to the wrong LeCun, | |
295 | |
00:17:12,280 --> 00:17:14,960 | |
because my brother is actually, | |
296 | |
00:17:14,960 --> 00:17:16,620 | |
he works at Google, nobody's perfect, | |
297 | |
00:17:16,620 --> 00:17:19,980 | |
but he works on, | |
298 | |
00:17:19,980 --> 00:17:22,360 | |
he's an expert in combinatorial optimization. | |
299 | |
00:17:25,680 --> 00:17:29,260 | |
So this type of inference gives AI systems | |
300 | |
00:17:29,260 --> 00:17:31,300 | |
kind of zero-shot learning ability. | |
301 | |
00:17:31,300 --> 00:17:31,960 | |
What does that mean? | |
302 | |
00:17:31,960 --> 00:17:34,860 | |
It means you give them a problem and if they can, | |
303 | |
00:17:34,860 --> 00:17:36,900 | |
if you can formulate this problem in terms of | |
304 | |
00:17:36,900 --> 00:17:38,880 | |
an optimization problem, then you get a solution to | |
305 | |
00:17:38,880 --> 00:17:42,020 | |
that problem without the system having to learn anything. | |
306 | |
00:17:42,020 --> 00:17:43,900 | |
Right? That's zero-shot. | |
307 | |
00:17:43,900 --> 00:17:46,120 | |
You are given, and you are students, | |
308 | |
00:17:46,120 --> 00:17:49,320 | |
you're given a new mathematics problem, something. | |
309 | |
00:17:49,320 --> 00:17:52,320 | |
You can think about it and perhaps | |
310 | |
00:17:52,320 --> 00:17:55,460 | |
solve it without learning anything new. | |
311 | |
00:17:55,460 --> 00:17:59,460 | |
Right? That's called a zero-shot skill. | |
312 | |
00:17:59,460 --> 00:18:05,240 | |
and in humans some psychologists also call this system two. | |
313 | |
00:18:05,240 --> 00:18:10,320 | |
So basically you devote your entire attention and consciousness to | |
314 | |
00:18:10,320 --> 00:18:13,740 | |
solving a problem that you concentrate on and you think about it and it might | |
315 | |
00:18:13,740 --> 00:18:16,840 | |
take a long time to solve that problem. | |
316 | |
00:18:16,840 --> 00:18:17,980 | |
That's system two. | |
317 | |
00:18:17,980 --> 00:18:22,220 | |
System one is when you act reactively. | |
318 | |
00:18:22,220 --> 00:18:23,200 | |
You don't have to think about it, | |
319 | |
00:18:23,200 --> 00:18:25,360 | |
it's become kind of subconscious, automatic. | |
320 | |
00:18:25,360 --> 00:18:27,140 | |
So if you are an experienced driver, | |
321 | |
00:18:27,140 --> 00:18:28,360 | |
you drive on the highway, | |
322 | |
00:18:28,360 --> 00:18:29,380 | |
you don't have to think about it. | |
323 | |
00:18:29,380 --> 00:18:30,780 | |
it's going to become automatic. | |
324 | |
00:18:30,780 --> 00:18:34,880 | |
You can hold a conversation with someone and everything. | |
325 | |
00:18:34,880 --> 00:18:37,520 | |
If you're a beginner though, | |
326 | |
00:18:37,520 --> 00:18:39,980 | |
it's your first time driving a car, | |
327 | |
00:18:39,980 --> 00:18:41,920 | |
you pay close attention. | |
328 | |
00:18:41,920 --> 00:18:43,260 | |
You're using your system two, | |
329 | |
00:18:43,260 --> 00:18:48,320 | |
the entire capacity of your mind. | |
330 | |
00:18:49,140 --> 00:18:53,520 | |
So that's why we need to adopt this model. | |
331 | |
00:18:53,520 --> 00:18:56,600 | |
This framework of energy-based model is | |
332 | |
00:18:56,600 --> 00:18:59,680 | |
sort of the way to understand this at the theoretical level. | |
333 | |
00:18:59,680 --> 00:19:01,420 | |
I'm not gonna do a lot of theory here. | |
334 | |
00:19:01,420 --> 00:19:03,300 | |
This is a very diverse audience, | |
335 | |
00:19:03,300 --> 00:19:05,300 | |
but the basic idea is that, | |
336 | |
00:19:05,300 --> 00:19:06,920 | |
if you have two variables, X and Y, | |
337 | |
00:19:06,920 --> 00:19:08,040 | |
here they are scalars, | |
338 | |
00:19:08,040 --> 00:19:12,380 | |
but you can imagine that they are high dimensional inputs. | |
339 | |
00:19:12,380 --> 00:19:16,800 | |
The energy function is some sort of landscape | |
340 | |
00:19:16,800 --> 00:19:20,800 | |
where pairs of X and Y that are compatible | |
341 | |
00:19:20,800 --> 00:19:23,500 | |
have low energy, a low altitude if you want, | |
342 | |
00:19:23,500 --> 00:19:25,920 | |
and then pairs of X and Y's that are not compatible | |
343 | |
00:19:25,920 --> 00:19:27,280 | |
have higher energy. | |
344 | |
00:19:27,280 --> 00:19:30,260 | |
And so the goal of learning now is to shape | |
345 | |
00:19:30,260 --> 00:19:32,880 | |
this energy surface in such a way that it gives | |
346 | |
00:19:32,880 --> 00:19:35,360 | |
low energy to things you observe, | |
347 | |
00:19:35,360 --> 00:19:38,880 | |
training data, pairs of XY that you observe, | |
348 | |
00:19:38,880 --> 00:19:41,520 | |
and then higher energy to everything else. | |
349 | |
00:19:41,520 --> 00:19:43,400 | |
The first part is super easy | |
350 | |
00:19:43,400 --> 00:19:44,860 | |
because we know how to do gradient descent. | |
351 | |
00:19:44,860 --> 00:19:48,760 | |
So you give a pair of XY that you know are compatible | |
352 | |
00:19:48,760 --> 00:19:51,860 | |
and you tweak the system so that the scalar output, | |
353 | |
00:19:51,860 --> 00:19:55,840 | |
the energy, the scalar energy output that it produces | |
354 | |
00:19:55,840 --> 00:20:00,000 | |
You can tweak the parameters inside your big neural net so that the output goes down. | |
355 | |
00:20:00,000 --> 00:20:06,240 | |
Easy. The difficulty is how to make sure that the energy is higher outside of the training sample. | |
356 | |
00:20:06,240 --> 00:20:10,080 | |
The training samples in this diagram are represented by the black dots. | |
357 | |
00:20:12,720 --> 00:20:18,480 | |
And at some level, a lot of literature in machine learning is devoted to that problem. | |
358 | |
00:20:18,480 --> 00:20:23,840 | |
It's not formulated in the way I just did, but in a probabilistic framework, for example. | |
359 | |
00:20:23,840 --> 00:20:31,060 | |
This problem of making sure the energy of things outside the training data is high, | |
360 | |
00:20:31,060 --> 00:20:33,240 | |
is a major issue. | |
361 | |
00:20:33,240 --> 00:20:40,280 | |
It usually encounters intractable mathematical problems. | |
362 | |
00:20:40,280 --> 00:20:42,160 | |
Let me skip this for now. | |
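(A hedged sketch of one contrastive way to shape an energy surface as described: push the energy of observed pairs down and push it up on mismatched pairs via a margin. The tiny MLP, the sine data, and the margin value are illustrative assumptions, not from the talk.)

```python
# Sketch: lower the energy of observed (x, y) pairs, raise it elsewhere.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))  # E(x, y) -> scalar

def energy(x, y):
    return net(torch.stack([x, y], dim=-1)).squeeze(-1)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
margin = 1.0
for step in range(500):
    x = torch.rand(64) * 6.28
    y_pos = torch.sin(x)                         # observed, compatible outputs
    y_neg = torch.rand(64) * 2 - 1               # arbitrary incompatible outputs
    loss = (energy(x, y_pos)                     # push down on the training pairs...
            + torch.relu(margin - energy(x, y_neg))).mean()  # ...push up elsewhere
    opt.zero_grad()
    loss.backward()
    opt.step()
```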
363 | |
00:20:42,160 --> 00:20:47,880 | |
Okay. So now, the whole craze of AI over the last couple of years, | |
364 | |
00:20:47,880 --> 00:20:50,880 | |
three years let's say, has been around LLMs, | |
365 | |
00:20:50,880 --> 00:20:53,320 | |
Large language models and large language models should be | |
366 | |
00:20:53,320 --> 00:20:56,200 | |
really called auto-regressive large language models. | |
367 | |
00:20:56,200 --> 00:21:00,660 | |
So what they do is they're trained on lots of texts and they're | |
368 | |
00:21:00,660 --> 00:21:03,900 | |
basically trained to produce the next word, | |
369 | |
00:21:03,900 --> 00:21:08,600 | |
to predict the next word from a sequence of words that preceded it. | |
370 | |
00:21:09,640 --> 00:21:14,360 | |
That's all they've been trained to do. | |
371 | |
00:21:14,840 --> 00:21:17,680 | |
Once the system has been trained, | |
372 | |
00:21:17,680 --> 00:21:20,620 | |
you can of course show it a piece of text and then ask | |
373 | |
00:21:20,620 --> 00:21:23,440 | |
to predict the next word and then you inject that next word into | |
374 | |
00:21:23,440 --> 00:21:26,080 | |
the input and ask it to predict the second next word, | |
375 | |
00:21:26,080 --> 00:21:27,780 | |
shift that into the input, | |
376 | |
00:21:27,780 --> 00:21:29,060 | |
third word, etc. | |
377 | |
00:21:29,060 --> 00:21:30,620 | |
So that's auto-regressive prediction. | |
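(A small sketch of the auto-regressive loop just described; the `next_token_probs` toy table is a hypothetical stand-in for a trained LLM.)

```python
# Sketch: predict the next word, shift it into the input, repeat.
import random

def next_token_probs(context):
    # Placeholder for a trained network; a fixed toy table keyed on the last word.
    table = {"the": {"cat": 0.6, "dog": 0.4}, "cat": {"sat": 1.0},
             "dog": {"ran": 1.0}, "sat": {"<eos>": 1.0}, "ran": {"<eos>": 1.0}}
    return table.get(context[-1], {"<eos>": 1.0})

def generate(prompt, max_len=10):
    tokens = list(prompt)
    for _ in range(max_len):
        probs = next_token_probs(tokens)                      # predict the next word...
        nxt = random.choices(list(probs), list(probs.values()))[0]
        if nxt == "<eos>":
            break
        tokens.append(nxt)                                    # ...and feed it back in
    return tokens

print(generate(["the"]))   # e.g. ['the', 'cat', 'sat']
```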
378 | |
00:21:30,620 --> 00:21:36,180 | |
It's not a new concept; it's been around since before I was born. | |
379 | |
00:21:36,180 --> 00:21:39,000 | |
So not recent. | |
380 | |
00:21:39,000 --> 00:21:41,400 | |
But it's system one. | |
381 | |
00:21:41,400 --> 00:21:44,400 | |
It's feed forward propagation through a bunch of layers. | |
382 | |
00:21:44,400 --> 00:21:46,300 | |
There is a fixed amount of | |
383 | |
00:21:46,300 --> 00:21:50,240 | |
computation devoted to computing every new token. | |
384 | |
00:21:50,240 --> 00:21:56,280 | |
So if you want a system to spend more resources producing an answer, | |
385 | |
00:21:56,280 --> 00:21:57,540 | |
a system of this type, | |
386 | |
00:21:57,540 --> 00:22:01,960 | |
you basically have to artificially make it produce more tokens, | |
387 | |
00:22:01,960 --> 00:22:03,640 | |
which seems kind of a hack. | |
388 | |
00:22:03,640 --> 00:22:05,400 | |
That's called chain of thought. | |
389 | |
00:22:05,400 --> 00:22:13,260 | |
There's various techniques to do approximate planning or reasoning using this. | |
390 | |
00:22:13,260 --> 00:22:18,200 | |
You basically have the system produce lots and lots of candidate outputs by | |
391 | |
00:22:18,200 --> 00:22:23,920 | |
kind of changing the noise in the way it produces the sequences and then within | |
392 | |
00:22:23,920 --> 00:22:28,140 | |
the list of outputs that it produces you search for a good one essentially. | |
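(A sketch of the "produce many candidates, then search for a good one" idea; `generate` and `score` are purely illustrative stand-ins for an LLM sampler and a verifier.)

```python
# Sketch: best-of-n sampling with a scoring function over the candidates.
import random

def generate(prompt):
    # Stand-in for sampling one completion from an LLM at nonzero temperature.
    return prompt + " " + " ".join(random.choice(["a", "b", "c"]) for _ in range(4))

def score(candidate):
    # Stand-in for a verifier / reward model; here: just count "a" tokens.
    return candidate.split().count("a")

def best_of_n(prompt, n=16):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)            # a little bit of search over outputs

print(best_of_n("the answer is"))
```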
393 | |
00:22:28,140 --> 00:22:32,080 | |
So there's a little bit of search there, a little bit of optimization but it's | |
394 | |
00:22:32,080 --> 00:22:37,580 | |
kind of a hack. So I don't believe those methods will ever lead to true | |
395 | |
00:22:37,580 --> 00:22:44,840 | |
intelligent behavior. In fact cognitive scientists agree. Cognitive scientists | |
396 | |
00:22:44,840 --> 00:22:50,540 | |
have been looking at LLMs with a very critical eye and saying that this is not real intelligence. | |
397 | |
00:22:50,540 --> 00:22:53,640 | |
This is nothing like what we observe in people. | |
398 | |
00:22:53,640 --> 00:22:59,840 | |
Similarly, people coming from kind of the non-machine learning based AI community, | |
399 | |
00:22:59,840 --> 00:23:03,240 | |
people like Subbarao Kambhampati from Arizona State, | |
400 | |
00:23:03,240 --> 00:23:05,740 | |
have been saying LLMs really cannot plan. | |
401 | |
00:23:05,740 --> 00:23:09,340 | |
So Rao has a whole bunch of papers. | |
402 | |
00:23:14,840 --> 00:23:20,400 | |
The titles of those papers are like: LLMs can't plan, | |
403 | |
00:23:20,400 --> 00:23:22,720 | |
LLMs still can't plan, | |
404 | |
00:23:22,720 --> 00:23:25,680 | |
LLMs really, really can't plan, | |
405 | |
00:23:25,680 --> 00:23:31,080 | |
and even LLMs that claim to be able to plan can't actually plan. | |
406 | |
00:23:31,080 --> 00:23:37,340 | |
So we have a big problem there with the people who claim | |
407 | |
00:23:37,340 --> 00:23:40,120 | |
that somehow we're going to take the current paradigm, | |
408 | |
00:23:40,120 --> 00:23:44,420 | |
make it bigger, spend trillions on data centers, | |
409 | |
00:23:44,420 --> 00:23:48,280 | |
and collect every piece of data in the world and train | |
410 | |
00:23:48,280 --> 00:23:50,940 | |
LLMs and they're going to reach human level intelligence. | |
411 | |
00:23:50,940 --> 00:23:53,340 | |
That's completely false in my opinion. | |
412 | |
00:23:53,340 --> 00:23:54,780 | |
I might be wrong, | |
413 | |
00:23:54,780 --> 00:23:58,140 | |
but in my opinion, that's completely hopeless. | |
414 | |
00:23:58,140 --> 00:24:01,180 | |
So the question is, what is not hopeless? | |
415 | |
00:24:01,180 --> 00:24:07,720 | |
So if we agree to this basic principle of inference through optimization, | |
416 | |
00:24:07,720 --> 00:24:12,700 | |
how can we sort of instantiate this in | |
417 | |
00:24:12,700 --> 00:24:15,000 | |
a real intelligent system? | |
418 | |
00:24:15,000 --> 00:24:18,100 | |
Basically, doing a little bit of introspection, | |
419 | |
00:24:18,100 --> 00:24:21,180 | |
when we think, the way we think is generally | |
420 | |
00:24:21,180 --> 00:24:24,060 | |
independent of the language that we might be able to | |
421 | |
00:24:24,060 --> 00:24:26,220 | |
express this thought in. | |
422 | |
00:24:26,220 --> 00:24:29,140 | |
I'm thinking about saying things here and it's | |
423 | |
00:24:29,140 --> 00:24:31,660 | |
independent of whether I'm giving | |
424 | |
00:24:31,660 --> 00:24:33,900 | |
this talk in English or French. | |
425 | |
00:24:33,900 --> 00:24:37,940 | |
So there is a thought that is independent of language, | |
426 | |
00:24:37,940 --> 00:24:41,140 | |
and LLMs don't have this capacity really. | |
427 | |
00:24:41,140 --> 00:24:45,140 | |
When we think we have a mental model of the situation that we think of. | |
428 | |
00:24:45,140 --> 00:24:47,900 | |
We're planning a sequence of actions. | |
429 | |
00:24:47,900 --> 00:24:52,020 | |
We have a mental model that allows us to predict | |
430 | |
00:24:52,020 --> 00:24:54,660 | |
what the consequences of our actions are going to be, | |
431 | |
00:24:54,660 --> 00:24:57,260 | |
so that if we set a goal for ourselves, | |
432 | |
00:24:57,260 --> 00:25:02,100 | |
we can figure out a sequence of actions that will satisfy this goal. | |
433 | |
00:25:02,100 --> 00:25:07,680 | |
So, an instantiation of the model I talked about earlier is one like this, | |
434 | |
00:25:07,680 --> 00:25:11,240 | |
where you observe the world through a perception module. | |
435 | |
00:25:11,240 --> 00:25:12,800 | |
Think of it as a big neural net. | |
436 | |
00:25:12,800 --> 00:25:15,800 | |
It gives you some idea of the current state of the world. | |
437 | |
00:25:15,800 --> 00:25:17,140 | |
Now, of course, the current state of the world | |
438 | |
00:25:17,140 --> 00:25:18,720 | |
is whatever you can perceive, | |
439 | |
00:25:18,720 --> 00:25:20,080 | |
but your idea of the state of the world | |
440 | |
00:25:20,080 --> 00:25:23,920 | |
also contains stuff that you perceived in the past, | |
441 | |
00:25:23,920 --> 00:25:27,460 | |
stuff that you know, facts that you know about the world. | |
442 | |
00:25:27,460 --> 00:25:31,480 | |
So if I take this bottle of water | |
443 | |
00:25:31,480 --> 00:25:35,380 | |
and I move it from this side to that side of the lectern, | |
444 | |
00:25:35,380 --> 00:25:40,460 | |
Your model of the world hasn't changed much. | |
445 | |
00:25:40,460 --> 00:25:45,020 | |
Most of your ideas about the state of the world haven't changed. | |
446 | |
00:25:45,020 --> 00:25:50,420 | |
What has changed is the content of this lectern and the position of that box. | |
447 | |
00:25:50,420 --> 00:25:53,060 | |
But other than that, not much. | |
448 | |
00:25:53,060 --> 00:25:57,580 | |
So the idea that somehow a perception gives you | |
449 | |
00:25:57,580 --> 00:25:59,900 | |
a complete picture of the state of the world is false. | |
450 | |
00:25:59,900 --> 00:26:02,060 | |
You need to combine this with a memory. | |
451 | |
00:26:02,060 --> 00:26:04,260 | |
So that's this memory module here. | |
452 | |
00:26:04,260 --> 00:26:08,620 | |
Combine your current perception with the content of your memory. | |
453 | |
00:26:08,620 --> 00:26:11,200 | |
That gives you an idea of the current state of the world. | |
454 | |
00:26:11,200 --> 00:26:14,940 | |
Now, what you're going to do is feed this to a world model, | |
455 | |
00:26:14,940 --> 00:26:19,440 | |
and you're going to hear that phrase many times in the rest of the talk. | |
456 | |
00:26:19,440 --> 00:26:22,560 | |
The role of this world model is to predict what | |
457 | |
00:26:22,560 --> 00:26:25,220 | |
the outcome of a sequence of actions is going to be. | |
458 | |
00:26:25,220 --> 00:26:27,340 | |
This could be actions that you're planning to take, | |
459 | |
00:26:27,340 --> 00:26:29,540 | |
or actions that the agent is planning to take, | |
460 | |
00:26:29,540 --> 00:26:31,980 | |
or actions that someone else may be taking, | |
461 | |
00:26:31,980 --> 00:26:34,240 | |
or some events that may be occurring. | |
462 | |
00:26:34,240 --> 00:26:37,080 | |
So predicting the outcome of | |
463 | |
00:26:37,080 --> 00:26:40,920 | |
a sequence of actions is what allows us to reason and plan. | |
464 | |
00:26:41,800 --> 00:26:48,000 | |
So you can probably tell that if I take this water bottle | |
465 | |
00:26:48,000 --> 00:26:53,760 | |
and I put it on its head and I lift my finger, | |
466 | |
00:26:53,760 --> 00:26:57,320 | |
you can have some pretty good idea of what's going to happen. | |
467 | |
00:26:57,320 --> 00:26:59,080 | |
It's probably going to fall, right? | |
468 | |
00:26:59,080 --> 00:27:01,520 | |
It's either going to fall on this side or that side. | |
469 | |
00:27:01,520 --> 00:27:04,220 | |
You may not be able to predict this because I'm balancing it. | |
470 | |
00:27:04,220 --> 00:27:06,520 | |
but it's going to fall on one side or the other. | |
471 | |
00:27:06,520 --> 00:27:08,820 | |
So to some extent, at an abstract level, | |
472 | |
00:27:08,820 --> 00:27:10,440 | |
you can say it's going to fall. | |
473 | |
00:27:10,440 --> 00:27:12,720 | |
I can't tell you exactly in which position, | |
474 | |
00:27:12,720 --> 00:27:15,120 | |
in which direction, but I can tell you it's going to fall. | |
475 | |
00:27:15,120 --> 00:27:17,520 | |
You have an intuitive physics model, | |
476 | |
00:27:17,520 --> 00:27:20,440 | |
which is in fact very sophisticated, | |
477 | |
00:27:20,440 --> 00:27:23,280 | |
even though the situation is incredibly simple. | |
478 | |
00:27:23,280 --> 00:27:27,060 | |
So that allows us to plan. | |
479 | |
00:27:27,060 --> 00:27:29,200 | |
This model of the world is what allows us to plan. | |
480 | |
00:27:29,200 --> 00:27:34,200 | |
So then we can have a system like this that has a task objective, | |
481 | |
00:27:34,200 --> 00:27:38,040 | |
It sets an objective for itself, | |
482 | |
00:27:38,040 --> 00:27:42,680 | |
or you set an objective that measures to what extent a task has been accomplished, | |
483 | |
00:27:42,680 --> 00:27:48,200 | |
whether the resulting state of the world matches some condition. | |
484 | |
00:27:48,520 --> 00:27:53,560 | |
You might also have a number of guardrail objectives, | |
485 | |
00:27:53,560 --> 00:28:00,000 | |
things that make sure that whatever actions the agent takes, | |
486 | |
00:28:00,000 --> 00:28:03,360 | |
nobody's going to get hurt, for example. | |
487 | |
00:28:03,360 --> 00:28:08,360 | |
So those square boxes are cost functions, | |
488 | |
00:28:08,360 --> 00:28:10,840 | |
they have an implicit scalar output, | |
489 | |
00:28:10,840 --> 00:28:13,600 | |
and the overall energy of the system is just the sum of | |
490 | |
00:28:13,600 --> 00:28:18,120 | |
the scalar outputs of all the red square boxes. | |
491 | |
00:28:18,120 --> 00:28:19,760 | |
The other modules there, | |
492 | |
00:28:19,760 --> 00:28:22,040 | |
the one with a round shape, | |
493 | |
00:28:22,040 --> 00:28:24,920 | |
are deterministic functions, neural nets, let's say, | |
494 | |
00:28:24,920 --> 00:28:27,200 | |
and the round shapes are variables. | |
495 | |
00:28:27,200 --> 00:28:29,400 | |
The action sequence is a latent variable, | |
496 | |
00:28:29,400 --> 00:28:32,680 | |
it's not observed, we're going to compute it by optimization. | |
497 | |
00:28:32,680 --> 00:28:36,260 | |
We're going to try to find a sequence of actions that minimize | |
498 | |
00:28:36,260 --> 00:28:40,160 | |
the sum of the task objective and the guardrail objectives, | |
499 | |
00:28:40,160 --> 00:28:42,860 | |
and that's going to be the output of the system. | |
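(A hedged sketch of the planning loop just described: a toy world model rolled forward over a latent action sequence, with the total energy, task cost plus guardrail cost, minimized by gradient descent. Every function here is an illustrative assumption, not Meta's implementation; note the same world model is applied at every step, as in model predictive control.)

```python
# Sketch: find the action sequence that minimizes task + guardrail objectives.
import torch

def world_model(state, action):
    return state + action                        # toy dynamics: actions move the state

def task_cost(state, goal):
    return ((state - goal) ** 2).sum()           # reach the goal state

def guardrail_cost(actions):
    return (actions ** 2).sum() * 0.1            # e.g. keep actions small / safe

def plan(state0, goal, horizon=5, steps=300, lr=0.1):
    actions = torch.zeros(horizon, 2, requires_grad=True)   # latent action sequence
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s = state0
        for a in actions:                        # same world model applied at each step
            s = world_model(s, a)
        energy = task_cost(s, goal) + guardrail_cost(actions)
        energy.backward()
        opt.step()
    return actions.detach()

print(plan(torch.zeros(2), torch.tensor([3.0, -1.0])))
```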
500 | |
00:28:44,440 --> 00:28:47,160 | |
Again, that's intrinsically more powerful than | |
501 | |
00:28:47,160 --> 00:28:50,580 | |
just running through a bunch of feed-forward layers. | |
502 | |
00:28:50,580 --> 00:28:53,860 | |
So that's the basic architecture. | |
503 | |
00:28:53,860 --> 00:28:57,000 | |
We can specialize this architecture further. | |
504 | |
00:28:57,000 --> 00:28:58,820 | |
For a sequence of actions, | |
505 | |
00:28:58,820 --> 00:29:02,060 | |
I might need to use my world model multiple times. | |
506 | |
00:29:02,060 --> 00:29:06,680 | |
So if I move that bottle from here to here, | |
507 | |
00:29:06,680 --> 00:29:08,200 | |
and then from here to here, | |
508 | |
00:29:08,200 --> 00:29:09,460 | |
that's a sequence of two actions. | |
509 | |
00:29:09,460 --> 00:29:11,640 | |
I don't need to have a separate model for those two actions. | |
510 | |
00:29:11,640 --> 00:29:14,200 | |
It's the same model that is just applied twice. | |
511 | |
00:29:14,200 --> 00:29:17,580 | |
So that's what's represented here, | |
512 | |
00:29:17,580 --> 00:29:21,360 | |
where action one and action two go into the same model, | |
513 | |
00:29:21,360 --> 00:29:24,700 | |
and it computes the resulting state. | |
514 | |
00:29:24,700 --> 00:29:28,520 | |
Planning a sequence of actions to optimize a cost function, | |
515 | |
00:29:28,520 --> 00:29:30,920 | |
according to a model that you run multiple times, | |
516 | |
00:29:31,080 --> 00:29:35,480 | |
is a completely standard method in optimal control called model predictive control. | |
517 | |
00:29:36,040 --> 00:29:41,720 | |
It's been around since the early 60s, so it's as old as me. | |
518 | |
00:29:43,320 --> 00:29:50,920 | |
And this is what you know the entire optimal control community uses to do motion planning. | |
519 | |
00:29:50,920 --> 00:29:57,160 | |
Robotics uses motion planning. NASA uses motion planning to you know plan the trajectory of | |
520 | |
00:29:57,160 --> 00:29:58,920 | |
rockets to rendezvous with the space station. | |
521 | |
00:29:58,920 --> 00:30:00,780 | |
It's this type of model. | |
522 | |
00:30:00,780 --> 00:30:03,480 | |
The difference here is that the world model is going to be learned. | |
523 | |
00:30:03,480 --> 00:30:04,360 | |
It's going to be trained. | |
524 | |
00:30:04,360 --> 00:30:08,080 | |
It's not going to be written by hand with a bunch of equations. | |
525 | |
00:30:08,080 --> 00:30:10,340 | |
It's going to be trained from data. | |
526 | |
00:30:10,340 --> 00:30:13,540 | |
Of course, the question is, how do we do this? | |
527 | |
00:30:13,540 --> 00:30:14,840 | |
I'll come to this in a second. | |
528 | |
00:30:14,840 --> 00:30:18,320 | |
Now, the sad thing about the world is two things. | |
529 | |
00:30:18,320 --> 00:30:24,060 | |
First thing is, you cannot run the world faster than real-time. | |
530 | |
00:30:24,060 --> 00:30:27,500 | |
That's the limitation. | |
531 | |
00:30:27,500 --> 00:30:28,940 | |
We have to deal with that. | |
532 | |
00:30:28,940 --> 00:30:31,220 | |
The second one is that the world is not deterministic. | |
533 | |
00:30:31,220 --> 00:30:36,160 | |
Or if it is deterministic as some physicists tell us it is, | |
534 | |
00:30:36,160 --> 00:30:38,860 | |
it's not entirely predictable because we don't have | |
535 | |
00:30:38,860 --> 00:30:41,960 | |
a full observation of the state of the world. | |
536 | |
00:30:41,960 --> 00:30:45,260 | |
The way you model | |
537 | |
00:30:45,260 --> 00:30:48,720 | |
non-deterministic functions out of deterministic functions, | |
538 | |
00:30:48,720 --> 00:30:51,820 | |
is that you feed them extra inputs that are latent variables. | |
539 | |
00:30:51,820 --> 00:30:54,560 | |
Those are variables whose values you don't know, | |
540 | |
00:30:54,560 --> 00:30:57,480 | |
and you can make them sweep through | |
541 | |
00:30:57,480 --> 00:31:01,100 | |
a set, or you can sample them from distributions. | |
542 | |
00:31:01,100 --> 00:31:03,260 | |
For each value of the latent variable, | |
543 | |
00:31:03,260 --> 00:31:06,260 | |
you get a different prediction from your model. | |
544 | |
00:31:06,260 --> 00:31:10,220 | |
Okay. So a distribution over the latent variable implies | |
545 | |
00:31:10,220 --> 00:31:13,580 | |
a distribution over the output of the model. | |
546 | |
00:31:13,580 --> 00:31:17,060 | |
That's the way to handle uncertainty. | |
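(A tiny sketch of the latent-variable idea: a deterministic toy predictor plus a sampled latent z turns one observation into a distribution over predictions; the predictor is made up for illustration.)

```python
# Sketch: a deterministic model with a latent input z yields a distribution of outcomes.
import torch

def predictor(state, z):
    # Deterministic toy world model; z absorbs everything we cannot observe.
    return state + 1.0 + 0.5 * z

state = torch.tensor([2.0])
samples = torch.stack([predictor(state, torch.randn(1)) for _ in range(1000)])
print(samples.mean().item(), samples.std().item())   # distribution over predictions
```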
547 | |
00:31:17,060 --> 00:31:20,260 | |
Of course, you know, you have to plan in the presence of uncertainty. | |
548 | |
00:31:20,260 --> 00:31:28,260 | |
So you want to make sure that your plan will succeed regardless of what the values of the latent variable will be. | |
549 | |
00:31:30,260 --> 00:31:37,260 | |
But in fact, humans and animals don't do planning this way. We do hierarchical planning. | |
550 | |
00:31:37,260 --> 00:31:44,260 | |
So hierarchical planning means that we have multiple levels of abstraction for representing the state of the world. | |
551 | |
00:31:44,260 --> 00:31:49,260 | |
We don't represent the world always with the same level of abstraction. | |
552 | |
00:31:49,260 --> 00:31:52,660 | |
Let me take a concrete example here. | |
553 | |
00:31:52,660 --> 00:31:56,100 | |
So let's say I'm sitting in my office in NYU | |
554 | |
00:31:56,100 --> 00:31:57,520 | |
and I want to go to Paris. | |
555 | |
00:31:58,540 --> 00:32:00,200 | |
At a very high abstract level, | |
556 | |
00:32:00,200 --> 00:32:02,380 | |
I can predict that if I decide right now | |
557 | |
00:32:02,380 --> 00:32:03,880 | |
to be in Paris tomorrow morning, | |
558 | |
00:32:05,420 --> 00:32:07,640 | |
I can go to the airport tonight | |
559 | |
00:32:07,640 --> 00:32:10,540 | |
and catch a plane to Paris and fly overnight. | |
560 | |
00:32:11,500 --> 00:32:13,380 | |
That's a plan, it's a very high level plan. | |
561 | |
00:32:13,380 --> 00:32:15,240 | |
I can't predict all the details of what's gonna happen, | |
562 | |
00:32:15,240 --> 00:32:16,540 | |
but at a high level, | |
563 | |
00:32:16,540 --> 00:32:21,540 | |
I know that I need to go to the airport and then catch a plane. | |
564 | |
00:32:21,540 --> 00:32:24,540 | |
Now I have a sub-goal. How do I go to the airport? | |
565 | |
00:32:24,540 --> 00:32:30,540 | |
Well, I need to go down on the street and hail a taxi because we're in New York. | |
566 | |
00:32:30,540 --> 00:32:33,540 | |
How do I go down on the street? | |
567 | |
00:32:33,540 --> 00:32:39,540 | |
I need to go to the elevator, push the button, and then walk out the door. | |
568 | |
00:32:39,540 --> 00:32:42,540 | |
How do I go to the elevator? | |
569 | |
00:32:42,540 --> 00:32:51,920 | |
I need to stand up from my chair, pick up my bag, open the door, close the door, walk to the elevator, avoid all the obstacles that I perceive, push the button. | |
570 | |
00:32:53,120 --> 00:32:54,420 | |
How do I stand up from my chair? | |
571 | |
00:32:56,060 --> 00:33:01,860 | |
So there is a level below which language is insufficient to express what we need to do. | |
572 | |
00:33:02,800 --> 00:33:05,080 | |
You cannot explain to someone how you stand up from a chair. | |
573 | |
00:33:06,540 --> 00:33:10,920 | |
You sort of have to know this in your muscles. | |
574 | |
00:33:10,920 --> 00:33:13,800 | |
You need to understand the physical world to be able to do this. | |
575 | |
00:33:13,800 --> 00:33:16,220 | |
So that's the other limitation of LLMs. | |
576 | |
00:33:16,220 --> 00:33:20,420 | |
Their level of abstraction is high because they manipulate language, | |
577 | |
00:33:20,420 --> 00:33:23,800 | |
but they're not grounded in reality. | |
578 | |
00:33:23,800 --> 00:33:27,380 | |
They have no idea what the physical world is like. | |
579 | |
00:33:27,380 --> 00:33:33,260 | |
That drives them to make really stupid mistakes and appear very, | |
580 | |
00:33:33,260 --> 00:33:35,540 | |
very stupid in many situations. | |
581 | |
00:33:35,540 --> 00:33:38,640 | |
So we need systems that really go | |
582 | |
00:33:38,640 --> 00:33:41,160 | |
all the way down to that level. | |
583 | |
00:33:41,160 --> 00:33:43,960 | |
And this is what your house cat can do | |
584 | |
00:33:43,960 --> 00:33:45,300 | |
and LLMs cannot do. | |
585 | |
00:33:46,140 --> 00:33:48,140 | |
Which is why I'm saying your house cat is smarter | |
586 | |
00:33:48,140 --> 00:33:50,800 | |
than the smartest LLMs. | |
587 | |
00:33:50,800 --> 00:33:54,020 | |
Of course house cats don't have nearly as much | |
588 | |
00:33:54,020 --> 00:33:58,600 | |
abstract knowledge stored in their memory as an LLM. | |
589 | |
00:33:58,600 --> 00:34:02,120 | |
But they're really smart in their understanding of the world | |
590 | |
00:34:02,120 --> 00:34:03,200 | |
and their ability to plan. | |
591 | |
00:34:03,200 --> 00:34:05,200 | |
And they can plan hierarchically as well. | |
592 | |
00:34:05,200 --> 00:34:13,540 | |
So what we need there is, you know, world models that are at multiple levels of abstraction, | |
593 | |
00:34:13,540 --> 00:34:16,420 | |
and how to train this is not completely obvious. | |
594 | |
00:34:16,420 --> 00:34:23,520 | |
Okay, so this whole idea, this whole kind of spiel leads to a view of AI that I call | |
595 | |
00:34:23,520 --> 00:34:25,420 | |
Objective Driven AI Systems. | |
596 | |
00:34:25,420 --> 00:34:26,760 | |
It's a recent name. | |
597 | |
00:34:26,760 --> 00:34:33,700 | |
I wrote a vision paper two and a half years ago that I put online at this URL on Open | |
598 | |
00:34:33,700 --> 00:34:41,300 | |
Review, not on arXiv, because I want comments and so that I can update this paper. | |
599 | |
00:34:41,300 --> 00:34:47,260 | |
And it's the groundwork for the talk I'm giving at the moment, but in the last two and a half | |
600 | |
00:34:47,260 --> 00:34:51,380 | |
years we've made progress towards that plan, so I'm going to give you some experimental | |
601 | |
00:34:51,380 --> 00:34:55,340 | |
results and things we built. | |
602 | |
00:34:55,340 --> 00:34:59,760 | |
So the architecture I'm proposing in that paper is a so-called cognitive architecture | |
603 | |
00:34:59,760 --> 00:35:02,300 | |
that has the components I just expressed, | |
604 | |
00:35:02,300 --> 00:35:03,800 | |
things like a perception module | |
605 | |
00:35:03,800 --> 00:35:05,440 | |
that estimates the state of the world, | |
606 | |
00:35:05,440 --> 00:35:08,440 | |
a memory that you can use, | |
607 | |
00:35:08,440 --> 00:35:11,160 | |
a world model which is kind of a centerpiece a little bit, | |
608 | |
00:35:11,160 --> 00:35:12,940 | |
a bunch of cost modules | |
609 | |
00:35:12,940 --> 00:35:16,800 | |
that are either defining tasks or guardrails, | |
610 | |
00:35:16,800 --> 00:35:18,840 | |
and then an actor, and what the actor does | |
611 | |
00:35:18,840 --> 00:35:20,720 | |
is basically to carry out | |
612 | |
00:35:20,720 --> 00:35:22,520 | |
this optimization procedure, | |
613 | |
00:35:22,520 --> 00:35:24,020 | |
finding the best sequence of actions | |
614 | |
00:35:24,020 --> 00:35:26,380 | |
to satisfy the objectives. | |
615 | |
00:35:26,380 --> 00:35:28,600 | |
There is a mysterious configurator module at the top, | |
616 | |
00:35:28,600 --> 00:35:29,780 | |
I'm not going to explain, | |
617 | |
00:35:29,780 --> 00:35:36,160 | |
but basically its role would be to set the goal for the current situation. | |
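As a rough illustration of the kind of inference-time optimization the actor performs, here is a minimal Python sketch. The world model, the cost, and the random-shooting search are toy stand-ins made up for illustration, not the actual system described in the talk.

import numpy as np

# A toy stand-in for a learned world model: state and action are 2-D vectors.
def world_model(state, action):
    return state + action

# A toy task objective (cost module): squared distance to a goal state.
def cost(state, goal):
    return np.sum((state - goal) ** 2)

def plan(initial_state, goal, horizon=5, n_candidates=256, seed=0):
    """Random-shooting search: pick the action sequence whose predicted
    final state minimizes the cost, by rolling out the world model."""
    rng = np.random.default_rng(seed)
    best_cost, best_actions = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        state = initial_state
        for a in actions:
            state = world_model(state, a)   # imagine the consequences of the actions
        c = cost(state, goal)
        if c < best_cost:
            best_cost, best_actions = c, actions
    return best_actions, best_cost

actions, final_cost = plan(np.zeros(2), np.array([3.0, -2.0]))
print("first planned action:", actions[0], "predicted final cost:", final_cost)

A real instantiation would replace the toy dynamics with a learned world model and the random search with a better optimizer, for instance gradient-based planning.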
618 | |
00:35:36,160 --> 00:35:37,160 | |
Okay. | |
619 | |
00:35:37,160 --> 00:35:43,100 | |
Okay. So perhaps with an architecture of this type, | |
620 | |
00:35:43,100 --> 00:35:45,840 | |
we will have systems that understand the physical world, etc. | |
621 | |
00:35:45,840 --> 00:35:51,000 | |
but that also have the System 2 ability of reasoning. | |
622 | |
00:35:51,000 --> 00:35:55,460 | |
But then how can we learn those world models from sensory inputs? | |
623 | |
00:35:55,460 --> 00:35:57,520 | |
That's really kind of the trick. | |
624 | |
00:35:57,520 --> 00:36:00,280 | |
And the answer to this is self-supervised learning. | |
625 | |
00:36:00,280 --> 00:36:07,280 | |
So self-supervised learning is something that has been extremely successful in the context of natural language understanding over the last few years. | |
626 | |
00:36:07,280 --> 00:36:10,080 | |
Basically it's completely dominating NLP. | |
627 | |
00:36:10,080 --> 00:36:15,160 | |
Every NLP system, LLM, etc. is trained with self-supervised learning. | |
628 | |
00:36:15,160 --> 00:36:19,080 | |
What does that mean? It means that there is no difference between inputs and outputs. | |
629 | |
00:36:19,080 --> 00:36:27,480 | |
Basically you take a big input, you corrupt it in some way, and you train some gigantic neural net to restore the full input if you want. | |
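A minimal sketch of that corrupt-and-reconstruct objective, in Python with PyTorch. The data here is random noise and the network is tiny, purely to show the shape of the training loop; this is an illustrative assumption, not how any particular LLM is trained.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "data": inputs and targets are the same vectors, so no labels are needed.
data = torch.randn(1024, 16)

denoiser = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(200):
    x = data[torch.randint(0, len(data), (32,))]
    mask = (torch.rand_like(x) < 0.5).float()
    x_corrupted = x * (1 - mask)          # corrupt the input by masking half of it
    x_hat = denoiser(x_corrupted)         # train a net to restore the full input
    loss = ((x_hat - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("final reconstruction loss:", float(loss))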
630 | |
00:36:27,480 --> 00:36:32,480 | |
But, you know, it's not going to be sufficient. | |
631 | |
00:36:32,480 --> 00:36:36,480 | |
We're still missing something. Another piece of evidence | |
632 | |
00:36:36,480 --> 00:36:39,480 | |
that we're missing something big about intelligence is that, | |
633 | |
00:36:39,480 --> 00:36:44,480 | |
although we have LLMs that can pass the bar exam, | |
634 | |
00:36:44,480 --> 00:36:50,480 | |
or some high school exams, maybe not calculus one, I don't know, | |
635 | |
00:36:50,480 --> 00:36:56,480 | |
we still do not have domestic robots that can accomplish tasks | |
636 | |
00:36:56,480 --> 00:37:00,480 | |
a 10-year-old can learn in one shot or zero shot. | |
637 | |
00:37:00,480 --> 00:37:02,480 | |
The first time you ask a 10 year old, | |
638 | |
00:37:02,480 --> 00:37:04,480 | |
clear the dinner table and fill up the dishwasher, | |
639 | |
00:37:04,480 --> 00:37:06,480 | |
they're able to do it. | |
640 | |
00:37:06,480 --> 00:37:08,480 | |
They don't need to learn. | |
641 | |
00:37:08,480 --> 00:37:10,480 | |
They can just plan. | |
642 | |
00:37:12,480 --> 00:37:14,480 | |
Any 17 year old can learn to drive a car | |
643 | |
00:37:14,480 --> 00:37:16,480 | |
in about 20 hours of practice. | |
644 | |
00:37:16,480 --> 00:37:20,480 | |
We still do not have level 5 autonomous self-driving cars. | |
645 | |
00:37:20,480 --> 00:37:22,480 | |
We have level 2, we have level 3, | |
646 | |
00:37:22,480 --> 00:37:24,480 | |
so they're partially autonomous. | |
647 | |
00:37:24,480 --> 00:37:29,360 | |
We have some level fives in limited areas, but they are very | |
648 | |
00:37:29,360 --> 00:37:32,700 | |
instrumented and they cheat. They have a map of the entire environment, so if you | |
649 | |
00:37:32,700 --> 00:37:36,660 | |
think about the Waymo cars, that's where they are. And they certainly don't | |
650 | |
00:37:36,660 --> 00:37:42,120 | |
need only 20 hours of practice to learn to drive. So that's what we're missing, | |
651 | |
00:37:42,120 --> 00:37:47,140 | |
something big. And that's really a new version of the Moravec paradox that, you | |
652 | |
00:37:47,140 --> 00:37:50,880 | |
know, things that are easy for humans are difficult for AI and vice versa. And | |
653 | |
00:37:50,880 --> 00:37:54,760 | |
we've tended to neglect the complexity | |
654 | |
00:37:54,760 --> 00:37:55,940 | |
of dealing with the real world, | |
655 | |
00:37:55,940 --> 00:38:00,720 | |
like perception and action, motor control. | |
656 | |
00:38:00,720 --> 00:38:02,320 | |
Perhaps a reason for this | |
657 | |
00:38:02,320 --> 00:38:05,480 | |
resides in this really simple calculation. | |
658 | |
00:38:05,480 --> 00:38:07,560 | |
An LLM, a typical LLM of today, | |
659 | |
00:38:07,560 --> 00:38:10,060 | |
is trained on 20 trillion tokens, okay? | |
660 | |
00:38:10,060 --> 00:38:11,360 | |
2 × 10^13. | |
661 | |
00:38:13,300 --> 00:38:17,140 | |
That corresponds to a little less than 20 trillion words, | |
662 | |
00:38:17,140 --> 00:38:18,560 | |
because the token is a subword unit. | |
663 | |
00:38:18,560 --> 00:38:21,860 | |
Each token usually is represented by three bytes | |
664 | |
00:38:21,860 --> 00:38:22,680 | |
or something like that. | |
665 | |
00:38:22,680 --> 00:38:25,920 | |
So that is a volume of training data | |
666 | |
00:38:25,920 --> 00:38:27,880 | |
of 6 × 10^13 bytes. | |
667 | |
00:38:29,800 --> 00:38:31,420 | |
That would take a few hundred thousand years | |
668 | |
00:38:31,420 --> 00:38:33,320 | |
for any of us to read through that material. | |
669 | |
00:38:33,320 --> 00:38:36,800 | |
It's basically the entire text | |
670 | |
00:38:36,800 --> 00:38:38,400 | |
available publicly on the internet. | |
671 | |
00:38:39,940 --> 00:38:43,200 | |
Now a human child, a four-year-old, | |
672 | |
00:38:43,200 --> 00:38:46,280 | |
has been awake a total of 16,000 hours. | |
673 | |
00:38:46,280 --> 00:38:49,780 | |
That's what developmental psychologists tell me. | |
674 | |
00:38:50,640 --> 00:38:52,040 | |
Which by the way is not a lot of data, | |
675 | |
00:38:52,040 --> 00:38:54,040 | |
that's 30 minutes of YouTube uploads. | |
676 | |
00:38:56,940 --> 00:39:00,880 | |
And I don't know how much Instagram, I should. | |
677 | |
00:39:02,000 --> 00:39:05,140 | |
We have two million optic nerve fibers | |
678 | |
00:39:05,140 --> 00:39:07,640 | |
going to our brain through our eyes. | |
679 | |
00:39:07,640 --> 00:39:09,800 | |
The amount of information getting to the eyes is enormous | |
680 | |
00:39:09,800 --> 00:39:12,040 | |
because we have 100 million photosensors | |
681 | |
00:39:12,040 --> 00:39:13,540 | |
or something like that. | |
682 | |
00:39:13,540 --> 00:39:15,660 | |
But it's being reduced, squeezed down | |
683 | |
00:39:15,660 --> 00:39:18,100 | |
to the optic nerve before it gets to the brain. | |
684 | |
00:39:18,100 --> 00:39:20,540 | |
And that's about two million nerve fibers, | |
685 | |
00:39:20,540 --> 00:39:23,300 | |
each carrying a little less than one byte per second, | |
686 | |
00:39:23,300 --> 00:39:25,020 | |
a few bits per second, okay? | |
687 | |
00:39:25,020 --> 00:39:30,020 | |
So the volume of data there is about 10 to the 14 bytes, | |
688 | |
00:39:32,040 --> 00:39:32,880 | |
maybe a little less. | |
689 | |
00:39:32,880 --> 00:39:36,000 | |
It's the same order of magnitude as the biggest LLM. | |
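The back-of-the-envelope arithmetic, restated in a few lines of Python using the figures quoted above (orders of magnitude only):

# Orders-of-magnitude only, using the figures quoted in the talk.
llm_tokens = 2e13                 # ~20 trillion training tokens
bytes_per_token = 3               # a token is roughly 3 bytes
llm_bytes = llm_tokens * bytes_per_token            # ~6e13 bytes of text

hours_awake = 16_000              # a four-year-old's total waking time
optic_fibers = 2e6                # optic nerve fibers to the brain
bytes_per_fiber_per_sec = 1       # roughly one byte per second per fiber
child_bytes = hours_awake * 3600 * optic_fibers * bytes_per_fiber_per_sec   # ~1e14 bytes

print(f"LLM text: {llm_bytes:.1e} bytes; child's visual input: {child_bytes:.1e} bytes")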
690 | |
00:39:36,000 --> 00:39:38,900 | |
In four years, a child has seen more data | |
691 | |
00:39:40,260 --> 00:39:43,960 | |
about the real world than the biggest LLM trained | |
692 | |
00:39:43,960 --> 00:39:46,580 | |
on the entirety of all the publicly available texts | |
693 | |
00:39:46,580 --> 00:39:48,660 | |
on the internet, text that would take any of us, | |
694 | |
00:39:50,360 --> 00:39:52,560 | |
you know, hundreds of millennia to read through. | |
695 | |
00:39:53,500 --> 00:39:55,240 | |
So that tells you we're never gonna reach | |
696 | |
00:39:55,240 --> 00:39:57,120 | |
human level intelligence by training on text. | |
697 | |
00:39:57,120 --> 00:39:58,300 | |
It's just not happening. | |
698 | |
00:39:59,360 --> 00:40:01,960 | |
Okay, we need systems to really understand the world | |
699 | |
00:40:01,960 --> 00:40:05,900 | |
through high bandwidth input, like vision or touch. | |
700 | |
00:40:05,900 --> 00:40:07,320 | |
Okay, blind people can get smart | |
701 | |
00:40:07,320 --> 00:40:09,220 | |
because they have other senses. | |
702 | |
00:40:11,780 --> 00:40:13,880 | |
And in fact, you know, if you look at how long it takes | |
703 | |
00:40:13,880 --> 00:40:21,300 | |
for children, infants, to learn basic concepts about the real world, it takes several months. | |
704 | |
00:40:21,940 --> 00:40:29,500 | |
So a child will learn the difference between animate and inanimate objects within the first | |
705 | |
00:40:29,500 --> 00:40:34,040 | |
three months of life, opening their eyes. Object permanence appears really early, | |
706 | |
00:40:34,440 --> 00:40:39,980 | |
maybe around two months. Notions of solidity, rigidity, and stability and support, | |
707 | |
00:40:39,980 --> 00:40:45,340 | |
that's in the first six months. So this idea that, you know, this is not going to be stable is going | |
708 | |
00:40:45,340 --> 00:40:53,900 | |
to fall. And then notions of intuitive physics like gravity, inertia, conservation of momentum, | |
709 | |
00:40:53,900 --> 00:40:59,260 | |
this kind of stuff, that we have an intuitive level that any animal has too, that only pops | |
710 | |
00:40:59,260 --> 00:41:04,380 | |
up around nine months in baby humans, much earlier in baby goats and other animals. | |
711 | |
00:41:09,980 --> 00:41:14,940 | |
Most of that is through observation. There's not much interaction. You know, babies can hardly | |
712 | |
00:41:14,940 --> 00:41:20,380 | |
affect the world in the first four months of life. They do afterwards. If you put an eight-month-old | |
713 | |
00:41:20,380 --> 00:41:24,140 | |
baby on a chair with a bunch of toys, the first thing they'll do is throw the toys on the ground | |
714 | |
00:41:24,140 --> 00:41:28,460 | |
because that's how they do the experiment about gravity. You know, does it apply to this new thing | |
715 | |
00:41:28,460 --> 00:41:36,220 | |
I'm seeing on my chair? Okay, so there is a very natural idea which is to transpose the stuff that | |
716 | |
00:41:36,220 --> 00:41:38,820 | |
has worked for text to video. | |
717 | |
00:41:38,820 --> 00:41:42,360 | |
Can we just train a generative model to learn to predict video? | |
718 | |
00:41:42,360 --> 00:41:44,760 | |
And then that system will just understand how the world works, | |
719 | |
00:41:44,760 --> 00:41:48,020 | |
because it's going to be able to predict what happens in the video. | |
720 | |
00:41:48,020 --> 00:41:53,640 | |
And it's been a bit of my obsession in terms of research for | |
721 | |
00:41:53,640 --> 00:41:56,760 | |
the last at least 15 years, if not more. | |
722 | |
00:41:56,760 --> 00:41:59,460 | |
Okay, so this predates LLMs and everything. | |
723 | |
00:41:59,460 --> 00:42:01,520 | |
Okay, this idea that you can learn by prediction, | |
724 | |
00:42:01,520 --> 00:42:03,120 | |
it's a very old concept in neuroscience, | |
725 | |
00:42:03,120 --> 00:42:05,720 | |
but it's something I've really been sort of, | |
726 | |
00:42:05,720 --> 00:42:08,480 | |
working on with my students, | |
727 | |
00:42:08,480 --> 00:42:11,520 | |
collaborators for many years. | |
728 | |
00:42:11,520 --> 00:42:15,280 | |
And the idea of course is to use a generative model, right? | |
729 | |
00:42:15,280 --> 00:42:18,640 | |
Give to a system a piece of video, | |
730 | |
00:42:19,240 --> 00:42:23,320 | |
and then try to predict what's going to happen next in the video. | |
731 | |
00:42:23,320 --> 00:42:28,000 | |
Just the same way that we train LLMs to predict what happens next in the text. | |
732 | |
00:42:28,800 --> 00:42:33,560 | |
Perhaps if you want the system to be kind of a world model, | |
733 | |
00:42:33,560 --> 00:42:37,180 | |
you can feed this model with an action variable, | |
734 | |
00:42:37,180 --> 00:42:38,680 | |
the A variable here, | |
735 | |
00:42:38,680 --> 00:42:42,040 | |
which in this case would simply be masking essentially. | |
736 | |
00:42:42,040 --> 00:42:43,780 | |
So take a video, mask a piece of it, | |
737 | |
00:42:43,780 --> 00:42:45,600 | |
let's say the second half of it, | |
738 | |
00:42:45,600 --> 00:42:47,080 | |
run it through some big neural net and | |
739 | |
00:42:47,080 --> 00:42:50,500 | |
train it to predict the second half of the full video. | |
740 | |
00:42:50,760 --> 00:42:54,740 | |
We tried for a good part of 15 years, | |
741 | |
00:42:54,740 --> 00:42:56,500 | |
it doesn't work. | |
742 | |
00:42:56,500 --> 00:42:59,620 | |
It doesn't work because there are many, | |
743 | |
00:42:59,620 --> 00:43:02,000 | |
many things that can happen in a video and a system of | |
744 | |
00:43:02,000 --> 00:43:04,000 | |
this type basically will just predict one thing. | |
745 | |
00:43:05,700 --> 00:43:07,880 | |
And so one way to deal with this problem | |
746 | |
00:43:07,880 --> 00:43:10,240 | |
of predicting one thing, so it's gonna predict one thing. | |
747 | |
00:43:10,240 --> 00:43:12,840 | |
So the best thing you can predict is the average | |
748 | |
00:43:12,840 --> 00:43:15,640 | |
of all the possible, plausible things that may happen. | |
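A tiny numerical illustration of why this produces blur, assuming two equally plausible futures: the prediction that minimizes squared error is their average.

import numpy as np

# Two equally plausible futures for the same past: the content moves left (-1) or right (+1).
futures = np.array([-1.0, +1.0])

candidates = np.linspace(-1.5, 1.5, 301)
best = min((np.mean((futures - c) ** 2), c) for c in candidates)[1]
print(best)   # ~0.0: the squared-error-optimal prediction is the average of the futures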
749 | |
00:43:15,640 --> 00:43:16,620 | |
And you see an example here, | |
750 | |
00:43:16,620 --> 00:43:19,060 | |
that's an early paper in video prediction, | |
751 | |
00:43:19,060 --> 00:43:20,720 | |
trying to predict what's gonna happen | |
752 | |
00:43:20,720 --> 00:43:24,820 | |
in this really short six-frame video with this little girl. | |
753 | |
00:43:24,820 --> 00:43:27,400 | |
The first four frames are observed, | |
754 | |
00:43:27,400 --> 00:43:30,460 | |
the last two are predicted, and what you see is a blurry mess, | |
755 | |
00:43:30,460 --> 00:43:31,640 | |
because the system really cannot predict | |
756 | |
00:43:31,640 --> 00:43:34,200 | |
what's going to happen, so it predicts the average. | |
757 | |
00:43:34,400 --> 00:43:36,840 | |
You see this at the bottom as well, | |
758 | |
00:43:36,840 --> 00:43:38,880 | |
if you can play that video again. | |
759 | |
00:43:38,880 --> 00:43:41,400 | |
This is a top-down view of a highway, | |
760 | |
00:43:41,400 --> 00:43:43,600 | |
and the green things are like cars. | |
761 | |
00:43:43,600 --> 00:43:46,800 | |
The second column shows predictions made by a | |
762 | |
00:43:46,800 --> 00:43:49,280 | |
neural net trying to predict what's going to happen in that video. | |
763 | |
00:43:49,280 --> 00:43:52,960 | |
You see those blurry extending cars | |
764 | |
00:43:52,960 --> 00:43:55,720 | |
because it really cannot predict what's happening. | |
765 | |
00:43:55,720 --> 00:43:58,840 | |
So the columns on the right are | |
766 | |
00:43:58,840 --> 00:44:01,160 | |
a different model that has a latent variable which is | |
767 | |
00:44:01,160 --> 00:44:04,760 | |
designed to capture the variability between the potential prediction, | |
768 | |
00:44:04,760 --> 00:44:07,200 | |
and those predictions are not blurry. | |
769 | |
00:44:07,200 --> 00:44:14,180 | |
So we thought that we had a good solution to that problem five years ago with latent variables, | |
770 | |
00:44:14,180 --> 00:44:16,580 | |
but it turns out to not work for real video. | |
771 | |
00:44:16,580 --> 00:44:18,200 | |
It works for simple videos like this one, | |
772 | |
00:44:18,200 --> 00:44:20,980 | |
but it doesn't for real world. | |
773 | |
00:44:20,980 --> 00:44:24,120 | |
So we can't train this thing on video. | |
774 | |
00:44:24,120 --> 00:44:26,880 | |
So the solution to that problem is interesting, | |
775 | |
00:44:26,880 --> 00:44:30,060 | |
is to abandon the whole idea of generative models. | |
776 | |
00:44:30,060 --> 00:44:37,060 | |
Everybody is talking about generative models like they're the new Messiah. | |
777 | |
00:44:37,060 --> 00:44:41,420 | |
What I'm telling you today is: forget about generative models. | |
778 | |
00:44:41,420 --> 00:44:45,120 | |
Okay. The solution to that problem, | |
779 | |
00:44:45,120 --> 00:44:48,280 | |
we think, is what we call joint embedding architectures, | |
780 | |
00:44:48,280 --> 00:44:51,680 | |
or more precisely joint embedding predictive architectures. | |
781 | |
00:44:51,680 --> 00:44:53,840 | |
This is really the way to build a world model. | |
782 | |
00:44:53,840 --> 00:44:56,180 | |
So what does this consist of? | |
783 | |
00:44:56,180 --> 00:44:58,000 | |
It's you take that video, | |
784 | |
00:44:58,000 --> 00:44:59,900 | |
you corrupt it, you mask a piece of it, | |
785 | |
00:44:59,900 --> 00:45:01,720 | |
for example, okay? | |
786 | |
00:45:01,720 --> 00:45:04,060 | |
And you run it through a big neural net, | |
787 | |
00:45:04,060 --> 00:45:05,920 | |
but what the big neural net is trained to do | |
788 | |
00:45:05,920 --> 00:45:08,520 | |
is not predict all the pixels in the video, | |
789 | |
00:45:08,520 --> 00:45:11,320 | |
it's trained to predict an abstract representation | |
790 | |
00:45:12,400 --> 00:45:14,360 | |
of the future of that video, okay? | |
791 | |
00:45:14,360 --> 00:45:16,280 | |
So you take the original video, | |
792 | |
00:45:16,280 --> 00:45:17,460 | |
you take the masked one, | |
793 | |
00:45:17,460 --> 00:45:18,960 | |
you run them through encoders, | |
794 | |
00:45:18,960 --> 00:45:21,520 | |
now you have abstract representations | |
795 | |
00:45:21,520 --> 00:45:24,920 | |
of the full video and the corrupted one, | |
796 | |
00:45:24,920 --> 00:45:26,820 | |
and you train a predictor | |
797 | |
00:45:26,820 --> 00:45:28,540 | |
to predict the representation of the full video, | |
798 | |
00:45:28,540 --> 00:45:30,900 | |
from the representation of the corrupted one. | |
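A minimal sketch of that training objective in PyTorch, with made-up toy encoders and shapes. It only shows where the loss is computed, in representation space rather than pixel space, and ignores for now the collapse problem discussed below.

import torch
import torch.nn as nn

torch.manual_seed(0)
enc_x = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))   # encoder for the corrupted input
enc_y = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))   # encoder for the full input
predictor = nn.Linear(8, 8)

y = torch.randn(32, 16)                              # the full input (e.g., a clip, flattened)
x = y * (torch.rand_like(y) > 0.5).float()           # the corrupted / masked version

s_x, s_y = enc_x(x), enc_y(y)                        # abstract representations of both
loss = ((predictor(s_x) - s_y) ** 2).mean()          # predict the representation, not the pixels
print(float(loss))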
799 | |
00:45:32,020 --> 00:45:32,820 | |
Okay. | |
800 | |
00:45:32,820 --> 00:45:33,700 | |
This is called JEPA. | |
801 | |
00:45:33,700 --> 00:45:35,660 | |
That means Joint Embedding Predictive Architecture. | |
802 | |
00:45:35,660 --> 00:45:37,580 | |
There's a bunch of papers from the last few years | |
803 | |
00:45:37,580 --> 00:45:41,340 | |
that my collaborators and I have published on this idea. | |
804 | |
00:45:41,340 --> 00:45:43,780 | |
And it solves the problem of having to predict | |
805 | |
00:45:43,780 --> 00:45:47,100 | |
all kinds of details that you really cannot predict. | |
806 | |
00:45:47,100 --> 00:45:49,580 | |
So if I were to take a video of this crowd, | |
807 | |
00:45:50,980 --> 00:45:52,940 | |
in fact I can take a video of this crowd. | |
808 | |
00:45:55,020 --> 00:45:57,380 | |
Okay, now I'm taking a video of you guys. | |
809 | |
00:45:57,380 --> 00:46:01,460 | |
Okay, and I slowly turn my head towards the right. | |
810 | |
00:46:03,440 --> 00:46:04,780 | |
Gonna shut down the video now. | |
811 | |
00:46:06,740 --> 00:46:09,860 | |
Certainly, a prediction system can predict this is a room, | |
812 | |
00:46:09,860 --> 00:46:13,280 | |
it's a conference room, there's people sitting everywhere. | |
813 | |
00:46:13,280 --> 00:46:16,420 | |
It may not be able to predict that all the chairs are full. | |
814 | |
00:46:16,420 --> 00:46:18,000 | |
It certainly cannot predict | |
815 | |
00:46:18,000 --> 00:46:20,080 | |
what every single one of you looks like. | |
816 | |
00:46:20,080 --> 00:46:21,060 | |
There's absolutely no way. | |
817 | |
00:46:21,060 --> 00:46:22,800 | |
It cannot predict what the texture on the wall | |
818 | |
00:46:22,800 --> 00:46:26,860 | |
is going to be, or even the color of the side. | |
819 | |
00:46:26,860 --> 00:46:30,200 | |
So there are things that are just completely unpredictable. | |
820 | |
00:46:30,200 --> 00:46:31,620 | |
You don't have the information to do it. | |
821 | |
00:46:31,620 --> 00:46:34,260 | |
And if you train a system to predict all those details, | |
822 | |
00:46:34,260 --> 00:46:36,240 | |
it's going to spend all of its resources | |
823 | |
00:46:36,240 --> 00:46:37,660 | |
predicting irrelevant details. | |
824 | |
00:46:38,540 --> 00:46:40,220 | |
So what a JEPA does when you train it, | |
825 | |
00:46:40,220 --> 00:46:41,980 | |
and I'm gonna tell you how you train this, | |
826 | |
00:46:41,980 --> 00:46:45,700 | |
is that it finds a trade-off between extracting | |
827 | |
00:46:45,700 --> 00:46:48,040 | |
as much information as possible from the input, | |
828 | |
00:46:48,040 --> 00:46:50,340 | |
but only extracting things that it can predict. | |
829 | |
00:46:53,260 --> 00:46:55,100 | |
And there is an issue with those kinds of architectures. | |
830 | |
00:46:55,100 --> 00:47:01,100 | |
Here is a contrast between the generative architecture that tried to reproduce Y directly | |
831 | |
00:47:01,100 --> 00:47:06,640 | |
and the joint embedding architecture which only tries to do prediction in representation | |
832 | |
00:47:06,640 --> 00:47:09,560 | |
space on the right. | |
833 | |
00:47:09,560 --> 00:47:14,480 | |
There's a problem with the joint embedding architecture and this is why we've only been | |
834 | |
00:47:14,480 --> 00:47:16,100 | |
working on this in recent years. | |
835 | |
00:47:16,100 --> 00:47:21,100 | |
It is the fact that if you just train the parameters of those neural nets to minimize | |
836 | |
00:47:21,100 --> 00:47:23,940 | |
the prediction error, it collapses. | |
837 | |
00:47:23,940 --> 00:47:27,340 | |
basically ignores the inputs X and Y. | |
838 | |
00:47:27,340 --> 00:47:29,400 | |
It makes SX and SY, | |
839 | |
00:47:29,400 --> 00:47:32,260 | |
the two representations, constant. | |
840 | |
00:47:32,260 --> 00:47:34,180 | |
And then the prediction problem is trivial. | |
841 | |
00:47:37,220 --> 00:47:39,200 | |
And that's not a good thing. | |
842 | |
00:47:39,200 --> 00:47:43,240 | |
So that's an example of this energy-based framework | |
843 | |
00:47:43,240 --> 00:47:44,960 | |
that I was describing earlier. | |
844 | |
00:47:46,060 --> 00:47:50,200 | |
It gives zero energy to every pair of XY, essentially. | |
845 | |
00:47:50,200 --> 00:47:51,420 | |
But what you want is zero energy | |
846 | |
00:47:51,420 --> 00:47:53,160 | |
for the pairs of XY you're training on, | |
847 | |
00:47:53,160 --> 00:47:55,940 | |
but higher energy for things that you don't train it on, | |
848 | |
00:47:55,940 --> 00:47:57,820 | |
and that's the hard part. | |
849 | |
00:47:57,820 --> 00:48:01,780 | |
So next I'm going to explain how you make that possible, | |
850 | |
00:48:01,780 --> 00:48:05,280 | |
how you make sure that the pairs of XY | |
851 | |
00:48:05,280 --> 00:48:07,480 | |
that are not compatible have a higher energy. | |
852 | |
00:48:09,740 --> 00:48:12,140 | |
There's variations of those architectures, | |
853 | |
00:48:12,140 --> 00:48:14,220 | |
some of which can have latent variables, | |
854 | |
00:48:14,220 --> 00:48:17,140 | |
or be action-conditioned if you want | |
855 | |
00:48:17,140 --> 00:48:18,680 | |
the predictor to be a world model. | |
856 | |
00:48:19,720 --> 00:48:22,240 | |
And there's been papers on this for many years now. | |
857 | |
00:48:22,240 --> 00:48:24,200 | |
The earliest joint embedding architecture actually | |
858 | |
00:48:24,200 --> 00:48:25,320 | |
is from the early 90s. | |
859 | |
00:48:25,320 --> 00:48:28,000 | |
It's a paper of mine about Siamese networks. | |
860 | |
00:48:30,060 --> 00:48:31,720 | |
But we're gonna have to train | |
861 | |
00:48:31,720 --> 00:48:34,240 | |
those sort of generic architectures. | |
862 | |
00:48:34,240 --> 00:48:36,400 | |
So how do we do this? | |
863 | |
00:48:37,440 --> 00:48:38,680 | |
So remember this picture, right? | |
864 | |
00:48:38,680 --> 00:48:41,260 | |
We wanna give low energy to stuff that are compatible, | |
865 | |
00:48:41,260 --> 00:48:43,260 | |
things that we observe, training sets, | |
866 | |
00:48:43,260 --> 00:48:44,940 | |
training samples, X and Y, | |
867 | |
00:48:44,940 --> 00:48:46,440 | |
higher energy to everything else. | |
868 | |
00:48:47,740 --> 00:48:48,860 | |
So there are two sets of methods, | |
869 | |
00:48:48,860 --> 00:48:51,840 | |
contrastive methods and what I call regularized methods. | |
870 | |
00:48:51,840 --> 00:49:00,640 | |
So contrastive methods consist in basically generating contrastive pairs of X and Y that are not in the training set. | |
871 | |
00:49:01,520 --> 00:49:04,560 | |
So pick an X and pick another Y that's not compatible with it. | |
872 | |
00:49:04,560 --> 00:49:06,920 | |
And that gives you one of those green dots that you see flashing. | |
873 | |
00:49:08,040 --> 00:49:13,860 | |
And your loss function is going to consist in pushing down on the energy of the blue dots, which are the training samples, | |
874 | |
00:49:14,040 --> 00:49:17,760 | |
and then pushing up on the energy of the green dots, which are those contrastive samples. | |
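A minimal sketch of a contrastive loss of that flavor, an InfoNCE-style loss on toy embeddings; this is an illustrative stand-in, not the exact loss of any particular method named below. Matched pairs are pushed together, mismatched pairs within the batch are pushed apart.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
s_x = torch.randn(8, 4)                     # embeddings of the X's in a batch
s_y = s_x + 0.1 * torch.randn(8, 4)         # embeddings of their compatible Y's

z_x = F.normalize(s_x, dim=1)
z_y = F.normalize(s_y, dim=1)
sim = z_x @ z_y.T                           # 8x8 cosine similarities; the diagonal holds the true pairs
labels = torch.arange(8)
loss = F.cross_entropy(sim / 0.1, labels)   # pull true pairs together, push the rest of the batch apart
print(float(loss))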
875 | |
00:49:17,760 --> 00:49:24,120 | |
Okay, this is a good idea and there's a bunch of algorithms that people have used to train this. | |
876 | |
00:49:24,120 --> 00:49:29,200 | |
Some of them, for example, for joint embedding between images and text, are things like Clip | |
877 | |
00:49:29,200 --> 00:49:36,960 | |
from OpenAI. They use contrastive methods. SimCLR, from a team at Google that includes Geoff | |
878 | |
00:49:36,960 --> 00:49:43,980 | |
Hinton. And then Siamese nets back from the 90s that I used to advocate. The issue with contrastive | |
879 | |
00:49:43,980 --> 00:49:47,980 | |
methods is that the intrinsic dimension of the embedding that they produce is | |
880 | |
00:49:47,980 --> 00:49:53,460 | |
usually fairly low and so the representations that are learned by it | |
881 | |
00:49:53,460 --> 00:49:57,480 | |
are kind of degenerate a little bit. So I prefer the regularized method. What is | |
882 | |
00:49:57,480 --> 00:50:02,100 | |
the idea behind the regularized method? The idea is that you minimize the volume | |
883 | |
00:50:02,100 --> 00:50:07,980 | |
of space that can take low energy. So you have some sort of regularizer term in | |
884 | |
00:50:07,980 --> 00:50:11,580 | |
your loss function and that term basically measures the volume of stuff | |
885 | |
00:50:11,580 --> 00:50:17,180 | |
that has low energy and you try to minimize it. So what that means is that whenever you push down | |
886 | |
00:50:17,180 --> 00:50:22,140 | |
the energy of one region of that space, the rest has to go up because there's only a limited amount | |
887 | |
00:50:22,140 --> 00:50:29,740 | |
of low energy volume to go around. And you know that sounds a little abstract and mysterious, | |
888 | |
00:50:29,740 --> 00:50:35,660 | |
but in practice the way you do this is there's like a handful of methods to do this, | |
889 | |
00:50:35,660 --> 00:50:39,660 | |
which I'm going to explain in a second. Before that I'm going to tell you how you test how well | |
890 | |
00:50:39,660 --> 00:50:40,840 | |
those systems work, right? | |
891 | |
00:50:40,840 --> 00:50:43,640 | |
So in the context of image recognition, | |
892 | |
00:50:43,640 --> 00:50:46,240 | |
you give two images that you know are the same image, | |
893 | |
00:50:46,240 --> 00:50:48,740 | |
either, so you take an image and you corrupt it, | |
894 | |
00:50:48,740 --> 00:50:50,980 | |
or you transform it in some way. | |
895 | |
00:50:50,980 --> 00:50:52,660 | |
You change the scale, you rotate it, | |
896 | |
00:50:52,660 --> 00:50:53,820 | |
you change the colors a little bit, | |
897 | |
00:50:53,820 --> 00:50:56,060 | |
maybe you mask parts of it, okay? | |
898 | |
00:50:56,060 --> 00:50:58,840 | |
And then you train an encoder and a predictor | |
899 | |
00:50:58,840 --> 00:51:01,020 | |
so that the predictor predicts the representation | |
900 | |
00:51:01,020 --> 00:51:03,220 | |
of the full image from the representation | |
901 | |
00:51:03,220 --> 00:51:05,940 | |
of the corrupted one. | |
902 | |
00:51:05,940 --> 00:51:07,600 | |
And then once the system is trained, | |
903 | |
00:51:07,600 --> 00:51:09,020 | |
you chop off the predictor, | |
904 | |
00:51:09,020 --> 00:51:11,520 | |
you use the encoder as input to a classifier, | |
905 | |
00:51:11,520 --> 00:51:14,440 | |
and you train a supervised classifier to do things | |
906 | |
00:51:14,440 --> 00:51:17,020 | |
like object recognition or something of that type. | |
907 | |
00:51:17,020 --> 00:51:19,940 | |
So that's a way of measuring the quality of the features | |
908 | |
00:51:19,940 --> 00:51:24,060 | |
that have been learned by the system. | |
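A minimal sketch of that evaluation protocol, assuming a tiny pretend-pretrained encoder and random toy data: freeze the encoder, discard the predictor, and train a linear probe on top of the features.

import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(16, 8)                  # stand-in for an encoder pretrained with a JEPA
for p in encoder.parameters():
    p.requires_grad_(False)                 # freeze the encoder; the predictor is discarded

probe = nn.Linear(8, 10)                    # supervised linear classifier on top of the features
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
x, y = torch.randn(256, 16), torch.randint(0, 10, (256,))
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))                          # probe performance measures feature quality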
909 | |
00:51:24,060 --> 00:51:28,560 | |
There's been a number of papers on this, | |
910 | |
00:51:28,560 --> 00:51:33,220 | |
and what has been transpiring is that those methods work really | |
911 | |
00:51:33,220 --> 00:51:36,180 | |
well to train a system to extract | |
912 | |
00:51:36,180 --> 00:51:37,660 | |
generic features from images, | |
913 | |
00:51:37,660 --> 00:51:39,660 | |
the joint embedding architectures. | |
914 | |
00:51:39,660 --> 00:51:42,020 | |
There's been a lot of work also on | |
915 | |
00:51:42,020 --> 00:51:45,080 | |
generative architectures like autoencoders, | |
916 | |
00:51:45,080 --> 00:51:47,500 | |
variational autoencoders, VQVAEs, | |
917 | |
00:51:47,500 --> 00:51:49,620 | |
masked autoencoders, denoising autoencoders, | |
918 | |
00:51:49,620 --> 00:51:51,780 | |
all kinds of techniques of this type that basically, | |
919 | |
00:51:51,780 --> 00:51:53,860 | |
you give a corrupted version of an image, | |
920 | |
00:51:53,860 --> 00:51:55,460 | |
and then you train the system to | |
921 | |
00:51:55,460 --> 00:51:57,780 | |
recover the full image at the pixel level. | |
922 | |
00:51:57,780 --> 00:52:00,180 | |
Those methods do not work nearly as | |
923 | |
00:52:00,180 --> 00:52:02,260 | |
well as the joint embedding methods. | |
924 | |
00:52:02,260 --> 00:52:04,700 | |
We discovered this five or six years ago, | |
925 | |
00:52:04,700 --> 00:52:09,460 | |
not just us, but there was an accumulating amount of evidence showing that joint embedding | |
926 | |
00:52:09,460 --> 00:52:17,300 | |
was really superior to reconstruction based systems, so to generative architectures. | |
927 | |
00:52:17,300 --> 00:52:20,760 | |
And at the time, the methods for training were only contrastive. | |
928 | |
00:52:20,760 --> 00:52:25,260 | |
But now we've found some other techniques, and one technique in particular that, or one | |
929 | |
00:52:25,260 --> 00:52:30,940 | |
set of techniques that attempt to maximize some measure of information, information content | |
930 | |
00:52:30,940 --> 00:52:32,640 | |
coming out of the encoder. | |
931 | |
00:52:32,640 --> 00:52:36,320 | |
So one of the criteria used for training is this minus i, | |
932 | |
00:52:36,320 --> 00:52:38,040 | |
the measure of information content. | |
933 | |
00:52:38,040 --> 00:52:39,660 | |
Since we minimize a cost function, | |
934 | |
00:52:39,660 --> 00:52:40,720 | |
there is a minus sign in front, | |
935 | |
00:52:40,720 --> 00:52:42,760 | |
so you maximize information content. | |
936 | |
00:52:42,760 --> 00:52:44,500 | |
How do we do this? | |
937 | |
00:52:44,500 --> 00:52:47,060 | |
So one simple trick that we've used is something called | |
938 | |
00:52:47,060 --> 00:52:49,640 | |
variance covariance regularization. | |
939 | |
00:52:49,640 --> 00:52:52,540 | |
Or in the case where you don't have predictor, | |
940 | |
00:52:52,540 --> 00:52:55,880 | |
it's VICReg, variance-invariance-covariance regularization. | |
941 | |
00:52:55,880 --> 00:52:57,900 | |
And there the idea is you take | |
942 | |
00:52:57,900 --> 00:53:00,260 | |
the representation coming out of the encoder and you say, | |
943 | |
00:53:00,260 --> 00:53:04,900 | |
first of all, it should not collapse to a fixed set of values. | |
944 | |
00:53:04,900 --> 00:53:07,500 | |
So the variance of each variable coming out of | |
945 | |
00:53:07,500 --> 00:53:10,600 | |
the encoder should be at least one, let's say. | |
946 | |
00:53:10,600 --> 00:53:13,400 | |
Okay. Now the system can still cheat and not produce | |
947 | |
00:53:13,400 --> 00:53:16,020 | |
very informative outputs by basically producing | |
948 | |
00:53:16,020 --> 00:53:18,860 | |
the same variable or very correlated variable for | |
949 | |
00:53:18,860 --> 00:53:22,620 | |
all the dimensions of the output representation. | |
950 | |
00:53:22,620 --> 00:53:26,900 | |
So another criterion tries to decorrelate those variables. | |
951 | |
00:53:26,900 --> 00:53:29,760 | |
And in fact, we use a trick where we expand the dimension. | |
952 | |
00:53:29,760 --> 00:53:32,200 | |
We take the representation, run it through a neural net | |
953 | |
00:53:32,200 --> 00:53:33,680 | |
that expands the dimension, | |
954 | |
00:53:33,680 --> 00:53:35,000 | |
and then decorrelate in that space, | |
955 | |
00:53:35,000 --> 00:53:37,000 | |
and that has the effect of actually making | |
956 | |
00:53:37,000 --> 00:53:39,620 | |
the original variables more independent of each other, | |
957 | |
00:53:39,620 --> 00:53:41,080 | |
not just uncorrelated. | |
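A minimal sketch of the variance and covariance terms of such a regularizer, in the spirit of VICReg; this is a simplified illustration, not the exact published loss. The variance term keeps the per-dimension standard deviation above a threshold, and the covariance term decorrelates the dimensions.

import torch

def variance_covariance_penalty(z, eps=1e-4):
    # z: a batch of representations, shape (N, D)
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()           # keep the std of each dimension at least 1
    cov = (z.T @ z) / (z.shape[0] - 1)                 # covariance matrix across dimensions
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]      # decorrelate the dimensions
    return var_loss + cov_loss

z = torch.randn(256, 32)
print(float(variance_covariance_penalty(z)))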
958 | |
00:53:41,960 --> 00:53:43,920 | |
So it's a bit of a hack, | |
959 | |
00:53:43,920 --> 00:53:46,000 | |
because what we're trying to do here | |
960 | |
00:53:46,000 --> 00:53:47,680 | |
is maximizing information content, | |
961 | |
00:53:47,680 --> 00:53:49,620 | |
and what we should have to be able to do this | |
962 | |
00:53:49,620 --> 00:53:52,280 | |
is a lower bound on information content. | |
963 | |
00:53:52,280 --> 00:53:54,040 | |
But what I'm describing here | |
964 | |
00:53:54,040 --> 00:53:56,680 | |
is an upper bound on information content. | |
965 | |
00:53:56,680 --> 00:53:58,280 | |
So we're maximizing an upper bound, | |
966 | |
00:53:58,280 --> 00:54:05,720 | |
and then we cross our fingers that the actual information content will follow. | |
967 | |
00:54:05,720 --> 00:54:06,520 | |
Okay. | |
968 | |
00:54:06,520 --> 00:54:09,720 | |
And it works. | |
969 | |
00:54:09,720 --> 00:54:13,880 | |
So that's one set of techniques. | |
970 | |
00:54:13,880 --> 00:54:15,160 | |
I'm going to skip the theory. | |
971 | |
00:54:15,160 --> 00:54:18,200 | |
There is another set of method called distillations, | |
972 | |
00:54:18,200 --> 00:54:19,880 | |
and those have proved to be extremely efficient. | |
973 | |
00:54:21,080 --> 00:54:25,160 | |
And there, it's another hack, and we only have partial, | |
974 | |
00:54:25,160 --> 00:54:29,400 | |
at least in my opinion, partial theoretical understanding of why it works, but it does work. | |
975 | |
00:54:30,760 --> 00:54:35,640 | |
In there we share the weights between the two encoders with a technique called exponential | |
976 | |
00:54:35,640 --> 00:54:40,440 | |
moving average. So one encoder has the weights that are basically a temporal average of the | |
977 | |
00:54:40,440 --> 00:54:44,680 | |
weights of the other one for mysterious reasons. And we train the whole thing but we don't back | |
978 | |
00:54:44,680 --> 00:54:50,280 | |
propagate gradient to the one that gets this moving average that gets the full input. | |
979 | |
00:54:50,280 --> 00:54:54,180 | |
And somehow this does not collapse and it works really well. | |
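A minimal sketch of the distillation trick, with toy linear encoders: the teacher's weights are an exponential moving average of the student's, and no gradient is backpropagated through the teacher branch. The details of the methods named below differ; this only shows the EMA and stop-gradient mechanics under assumed toy shapes.

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
student = nn.Linear(16, 8)
teacher = copy.deepcopy(student)            # same architecture; its weights will be an EMA of the student's
for p in teacher.parameters():
    p.requires_grad_(False)                 # no gradient is backpropagated into the teacher

@torch.no_grad()
def ema_update(teacher, student, tau=0.99):
    # The teacher's weights are a temporal (exponential moving) average of the student's.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_((1 - tau) * ps)

x_full = torch.randn(32, 16)
x_masked = x_full * (torch.rand_like(x_full) > 0.5).float()
loss = ((student(x_masked) - teacher(x_full)) ** 2).mean()
loss.backward()                             # gradients only reach the student branch
ema_update(teacher, student)
print(float(loss))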
980 | |
00:54:54,180 --> 00:54:56,020 | |
It's called a distillation method. | |
981 | |
00:54:56,020 --> 00:54:58,020 | |
There's various versions of it. | |
982 | |
00:54:58,020 --> 00:55:05,780 | |
SimSiam, BYOL from DeepMind, DINOv2 from my colleagues in Paris at Meta, I-JEPA and V-JEPA | |
983 | |
00:55:05,780 --> 00:55:09,780 | |
from the people at Meta who work with me. | |
984 | |
00:55:09,780 --> 00:55:10,780 | |
This works amazingly well. | |
985 | |
00:55:10,780 --> 00:55:16,300 | |
It works so well, in fact, that the DINOv2 version works incredibly well. | |
986 | |
00:55:16,300 --> 00:55:18,780 | |
It's a generic feature extractor for images. | |
987 | |
00:55:18,780 --> 00:55:21,580 | |
If you have some random computer vision problem, | |
988 | |
00:55:21,580 --> 00:55:23,540 | |
and no one has trained a system for that, | |
989 | |
00:55:23,540 --> 00:55:26,020 | |
just download DINOv2, it will extract features | |
990 | |
00:55:26,020 --> 00:55:28,280 | |
from your images, and then train a very simple | |
991 | |
00:55:28,280 --> 00:55:30,780 | |
classifier head on top of it with just a few examples, | |
992 | |
00:55:30,780 --> 00:55:33,960 | |
and it will likely solve your vision problem. | |
993 | |
00:55:33,960 --> 00:55:36,320 | |
An example of this is, I'm not gonna bore you | |
994 | |
00:55:36,320 --> 00:55:39,200 | |
with tables of results, but example of this | |
995 | |
00:55:39,200 --> 00:55:42,180 | |
is a collaborator at Meta, Camille Couprie, | |
996 | |
00:55:42,180 --> 00:55:47,020 | |
who got satellite images of the entire world, | |
997 | |
00:55:47,020 --> 00:55:50,020 | |
you know, in various frequency bands. | |
998 | |
00:55:50,020 --> 00:55:52,020 | |
And she also got LiDAR data. | |
999 | |
00:55:52,020 --> 00:55:55,020 | |
So the LiDAR data gives you, for a little piece of the world, | |
1000 | |
00:55:55,020 --> 00:56:00,020 | |
LiDAR data gives you the height of the canopy of vegetation. | |
1001 | |
00:56:00,020 --> 00:56:05,020 | |
And so she took the DINO features, applied them to the entire world, | |
1002 | |
00:56:05,020 --> 00:56:09,020 | |
and then used a classifier that was trained on the LiDAR data, | |
1003 | |
00:56:09,020 --> 00:56:12,020 | |
on the small amount of data, but applied it to the entire world. | |
1004 | |
00:56:12,020 --> 00:56:16,020 | |
And now what she has is an estimate of the height of the canopy for the entire Earth. | |
1005 | |
00:56:16,020 --> 00:56:23,220 | |
What that allows you to compute is an estimate of the amount of carbon captured in vegetation, | |
1006 | |
00:56:23,220 --> 00:56:29,220 | |
which is a very interesting piece of data for climate change. So that's an example. There's | |
1007 | |
00:56:29,220 --> 00:56:34,340 | |
other examples in medical imaging, in biological imaging, where DINO has been used with some success. | |
1008 | |
00:56:35,060 --> 00:56:39,940 | |
But this distillation method called I-JEPA that I briefly described earlier works extremely well | |
1009 | |
00:56:39,940 --> 00:56:45,620 | |
to learn visual features. Again, I'm not going to bore you with details. It's really much better than | |
1010 | |
00:56:45,620 --> 00:56:48,560 | |
the methods that are based on reconstruction. | |
1011 | |
00:56:48,560 --> 00:56:52,860 | |
Of course, the next thing we did was try to apply this to video. | |
1012 | |
00:56:52,860 --> 00:56:54,120 | |
Can we apply this to video? | |
1013 | |
00:56:54,120 --> 00:56:56,360 | |
So it turns out if you train a system of this type to make | |
1014 | |
00:56:56,360 --> 00:56:57,660 | |
temporal prediction in video, | |
1015 | |
00:56:57,660 --> 00:56:58,880 | |
it doesn't work very well. | |
1016 | |
00:56:58,880 --> 00:57:02,420 | |
You have to make it do spatial prediction, | |
1017 | |
00:57:02,420 --> 00:57:04,000 | |
which is very strange. | |
1018 | |
00:57:04,000 --> 00:57:06,840 | |
There, the features that are learned are really great. | |
1019 | |
00:57:06,840 --> 00:57:10,640 | |
You get good performance for that system when you use the | |
1020 | |
00:57:10,640 --> 00:57:13,560 | |
representation to classify actions in | |
1021 | |
00:57:13,560 --> 00:57:16,060 | |
videos and things of that type. | |
1022 | |
00:57:17,120 --> 00:57:21,540 | |
We even have tests now (the paper is being completed) | |
1023 | |
00:57:21,540 --> 00:57:24,520 | |
that show that those systems have some level of common sense | |
1024 | |
00:57:24,520 --> 00:57:25,460 | |
and physical intuition. | |
1025 | |
00:57:25,460 --> 00:57:27,880 | |
You show them videos that are impossible because, | |
1026 | |
00:57:27,880 --> 00:57:30,260 | |
for example, an object spontaneously disappears | |
1027 | |
00:57:30,260 --> 00:57:31,300 | |
or something like that. | |
1028 | |
00:57:31,300 --> 00:57:32,940 | |
They say, whoa, something strange happened. | |
1029 | |
00:57:32,940 --> 00:57:34,160 | |
Their prediction error goes up. | |
1030 | |
00:57:34,160 --> 00:57:37,660 | |
And so those systems really are able to learn | |
1031 | |
00:57:37,660 --> 00:57:39,960 | |
some basic concepts about the world. | |
1032 | |
00:57:39,960 --> 00:57:50,280 | |
But then the last thing I want to say is that systems of this type are ones that basically | |
1033 | |
00:57:50,280 --> 00:57:53,240 | |
we can use to train a world model, and we can use those world models for planning. | |
1034 | |
00:57:53,240 --> 00:57:54,240 | |
So this is new. | |
1035 | |
00:57:54,240 --> 00:57:57,240 | |
I haven't presented this yet. | |
1036 | |
00:57:57,240 --> 00:58:03,760 | |
The paper has been submitted, but this is the first time I talk publicly in English about | |
1037 | |
00:58:03,760 --> 00:58:04,760 | |
it. | |
1038 | |
00:58:09,960 --> 00:58:16,680 | |
the preview. So this is work by a student, a PhD student at NYU, | |
1039 | |
00:58:16,680 --> 00:58:21,880 | |
Gaoyue Zhou, who is co-advised by myself and Lerrel Pinto, and she did a lot of this work | |
1040 | |
00:58:21,880 --> 00:58:31,080 | |
while she was an intern at Meta, and Hengkai Pan, who's also a student. And the basic architecture | |
1041 | |
00:58:31,080 --> 00:58:37,240 | |
here is that we use the features from DINOv2, okay, pre-trained, and we train a world model | |
1042 | |
00:58:37,240 --> 00:58:39,440 | |
on top of it, which is action conditioned. | |
1043 | |
00:58:39,440 --> 00:58:44,240 | |
So basically, we take a picture of the world, | |
1044 | |
00:58:44,240 --> 00:58:46,740 | |
or the environment, whatever it is, | |
1045 | |
00:58:46,740 --> 00:58:50,540 | |
and then feed an action that we're going to take in | |
1046 | |
00:58:50,540 --> 00:58:53,240 | |
that environment and then observe | |
1047 | |
00:58:53,240 --> 00:58:57,540 | |
the result in the environment in terms of DINO features, | |
1048 | |
00:58:57,540 --> 00:59:00,500 | |
and then train the predictor to predict | |
1049 | |
00:59:00,500 --> 00:59:03,860 | |
the representation after the action as | |
1050 | |
00:59:03,860 --> 00:59:05,380 | |
a function of the input, | |
1051 | |
00:59:05,380 --> 00:59:07,700 | |
the previous state and the action. | |
1052 | |
00:59:07,700 --> 00:59:10,220 | |
So the predictor function takes | |
1053 | |
00:59:10,220 --> 00:59:11,860 | |
the previous state and the action and predicts | |
1054 | |
00:59:11,860 --> 00:59:13,700 | |
the next state essentially. | |
1055 | |
00:59:13,700 --> 00:59:15,420 | |
Then once we have that system, | |
1056 | |
00:59:15,420 --> 00:59:18,620 | |
we can do this optimization procedure I was telling you about, | |
1057 | |
00:59:18,620 --> 00:59:22,460 | |
to plan a sequence of actions to arrive at a particular result. | |
1058 | |
00:59:22,460 --> 00:59:25,700 | |
The cost is simply a Euclidean distance | |
1059 | |
00:59:25,700 --> 00:59:27,220 | |
between the predicted | |
1060 | |
00:59:27,220 --> 00:59:29,700 | |
end state and a target state. | |
1061 | |
00:59:29,700 --> 00:59:32,060 | |
The way we compute the target state is that we show | |
1062 | |
00:59:32,060 --> 00:59:33,740 | |
an image to the encoder and we tell it, | |
1063 | |
00:59:33,740 --> 00:59:37,220 | |
you know, this representation is your target representation. | |
1064 | |
00:59:37,220 --> 00:59:40,060 | |
Take a sequence of actions so that the predicted state | |
1065 | |
00:59:40,060 --> 00:59:42,540 | |
matches that state. | |
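A minimal sketch of that planning step, with made-up dimensions and untrained stand-ins for the encoder and the predictor; it is purely illustrative of the mechanics described here, not of the actual system. A sequence of actions is optimized so that the predicted final representation gets close, in Euclidean distance, to the representation of the goal image.

import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, act_dim, horizon = 16, 4, 5

encoder = nn.Linear(64, feat_dim)                     # stand-in for a frozen pre-trained feature extractor
predictor = nn.Linear(feat_dim + act_dim, feat_dim)   # world model: (state, action) -> next state

current_obs, goal_obs = torch.randn(64), torch.randn(64)
with torch.no_grad():
    s0, s_goal = encoder(current_obs), encoder(goal_obs)   # the target representation comes from a goal image

actions = torch.zeros(horizon, act_dim, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.1)
for step in range(100):
    s = s0
    for t in range(horizon):                          # roll out the world model in representation space
        s = predictor(torch.cat([s, actions[t]]))
    loss = ((s - s_goal) ** 2).sum()                  # Euclidean distance to the target state
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))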
1066 | |
00:59:43,480 --> 00:59:45,360 | |
So we've tried this on several tasks. | |
1067 | |
00:59:45,360 --> 00:59:46,940 | |
So one of them is just, you know, | |
1068 | |
00:59:46,940 --> 00:59:49,620 | |
moving a dot through a simple maze. | |
1069 | |
00:59:49,620 --> 00:59:52,260 | |
Another one is moving a little, | |
1070 | |
00:59:52,260 --> 00:59:53,500 | |
let me repeat this video, | |
1071 | |
00:59:54,760 --> 00:59:59,260 | |
moving a little T object by pushing on it in various places | |
1072 | |
00:59:59,260 --> 01:00:01,180 | |
so that it's in a particular position. | |
1073 | |
01:00:01,180 --> 01:00:02,640 | |
That's called the push-T problem. | |
1074 | |
01:00:02,640 --> 01:00:07,640 | |
And then other task of navigating through the environment, | |
1075 | |
01:00:07,640 --> 01:00:09,200 | |
going through a door in a wall, | |
1076 | |
01:00:09,200 --> 01:00:12,480 | |
and then pushing on sort of deformable objects | |
1077 | |
01:00:12,480 --> 01:00:14,220 | |
so they adopt a particular shape. | |
1078 | |
01:00:14,220 --> 01:00:16,100 | |
Okay, and I'll show you a more impressive example | |
1079 | |
01:00:16,100 --> 01:00:16,860 | |
in this one. | |
1080 | |
01:00:16,860 --> 01:00:20,660 | |
Okay, so the task, we can collect artificial data | |
1081 | |
01:00:20,660 --> 01:00:23,760 | |
because those are virtual environments | |
1082 | |
01:00:23,760 --> 01:00:25,160 | |
that we can simulate. | |
1083 | |
01:00:25,160 --> 01:00:26,780 | |
And then we experimented with various systems | |
1084 | |
01:00:26,780 --> 01:00:30,640 | |
that have been proposed in the past to solve that problem. | |
1085 | |
01:00:30,640 --> 01:00:36,000 | |
DreamerV3 is probably one of the most advanced ones, from DeepMind, | |
1086 | |
01:00:36,000 --> 01:00:39,000 | |
from Danijar Hafner at DeepMind. | |
1087 | |
01:00:39,000 --> 01:00:42,200 | |
And what you see here is a visualization through | |
1088 | |
01:00:42,200 --> 01:00:45,600 | |
a decoder of the predicted state for a sequence of actions. | |
1089 | |
01:00:45,600 --> 01:00:47,240 | |
So at the top is a ground truth. | |
1090 | |
01:00:47,240 --> 01:00:53,240 | |
You execute a sequence of actions and see the result in the simulator. | |
1091 | |
01:00:53,240 --> 01:00:58,280 | |
And then each row is the result of a prediction by one of those models. | |
1092 | |
01:00:58,280 --> 01:01:01,280 | |
And what you see is some predictions become blurry, | |
1093 | |
01:01:01,280 --> 01:01:04,280 | |
some predictions become kind of weird. | |
1094 | |
01:01:04,280 --> 01:01:08,280 | |
Ours is pretty good, IRIS is pretty good, | |
1095 | |
01:01:08,280 --> 01:01:12,280 | |
DreamerV3 not so great. | |
1096 | |
01:01:12,280 --> 01:01:14,280 | |
This is the most interesting task. | |
1097 | |
01:01:14,280 --> 01:01:17,280 | |
It's called the granular environment, | |
1098 | |
01:01:17,280 --> 01:01:21,280 | |
and it's basically a bunch of blue chips on the table. | |
1099 | |
01:01:21,280 --> 01:01:24,280 | |
And an action is a motion by a robot arm, | |
1100 | |
01:01:24,280 --> 01:01:26,280 | |
which goes down on the table, | |
1101 | |
01:01:26,280 --> 01:01:29,740 | |
moves by some Delta X, Delta Y, and then lifts. | |
1102 | |
01:01:29,740 --> 01:01:31,620 | |
That's an action, it's four numbers. | |
1103 | |
01:01:31,620 --> 01:01:38,520 | |
X, Y, where you touch the table, Delta X, Delta Y, lift. | |
1104 | |
01:01:38,520 --> 01:01:41,600 | |
Okay. The question is, | |
1105 | |
01:01:41,600 --> 01:01:45,000 | |
so you can train a world model by just putting | |
1106 | |
01:01:45,000 --> 01:01:47,180 | |
a bunch of chips in random positions and then taking | |
1107 | |
01:01:47,180 --> 01:01:49,000 | |
a random action and then observing the result, | |
1108 | |
01:01:49,000 --> 01:01:50,980 | |
and you train the predictor this way. | |
1109 | |
01:01:50,980 --> 01:01:53,960 | |
Once the predictor is trained, | |
1110 | |
01:01:53,960 --> 01:01:58,280 | |
So those are results of various techniques of planning. | |
1111 | |
01:01:58,280 --> 01:02:00,960 | |
So you can use the world model for planning a sequence of | |
1112 | |
01:02:00,960 --> 01:02:03,400 | |
actions to arrive at a particular goal. | |
1113 | |
01:02:03,400 --> 01:02:05,680 | |
So this is for the point maze environment, | |
1114 | |
01:02:05,680 --> 01:02:08,380 | |
but you might want to look at the other one, the granular. | |
1115 | |
01:02:08,380 --> 01:02:14,580 | |
So this is what's called a Chamfer distance between | |
1116 | |
01:02:14,580 --> 01:02:22,900 | |
the end state of all the grains, in image space, if you want, | |
1117 | |
01:02:22,900 --> 01:02:27,320 | |
and the target. | |
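For reference, a Chamfer distance between two point sets can be computed as the average nearest-neighbor distance in both directions; here is a minimal sketch on made-up point sets.

import numpy as np

def chamfer_distance(a, b):
    # a, b: point sets of shape (N, 2) and (M, 2)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()           # nearest neighbors in both directions

end_state = np.random.rand(50, 2)    # e.g., positions of the grains at the end of the plan
target = np.random.rand(50, 2)       # target configuration
print(chamfer_distance(end_state, target))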
1118 | |
01:02:27,320 --> 01:02:29,080 | |
And what you see is that our method, | |
1119 | |
01:02:29,080 --> 01:02:30,340 | |
which is the blue one, | |
1120 | |
01:02:30,340 --> 01:02:32,740 | |
has much, much lower final error | |
1121 | |
01:02:32,740 --> 01:02:34,760 | |
than the other methods that we compared it with, | |
1122 | |
01:02:34,760 --> 01:02:37,180 | |
DreamerV3 and TD-MPC2. | |
1123 | |
01:02:37,180 --> 01:02:40,660 | |
And TD-MPC2 is a method that actually requires, | |
1124 | |
01:02:40,660 --> 01:02:42,100 | |
needs to be task specific, | |
1125 | |
01:02:42,100 --> 01:02:45,400 | |
so it's not as general as the DINO world model. | |
1126 | |
01:02:46,700 --> 01:02:49,820 | |
So here's a little demo of the system in action | |
1127 | |
01:02:49,820 --> 01:02:52,120 | |
for the various tasks. | |
1128 | |
01:02:52,120 --> 01:02:53,800 | |
Let me play this again. | |
1129 | |
01:02:53,800 --> 01:02:55,380 | |
Look at the push T. | |
1130 | |
01:02:55,380 --> 01:03:00,380 | |
Okay, so you see the dot moving in discrete steps | |
1131 | |
01:03:01,580 --> 01:03:04,840 | |
because for every tick of the simulation, | |
1132 | |
01:03:04,840 --> 01:03:07,540 | |
the same action is repeated five times. | |
1133 | |
01:03:07,540 --> 01:03:09,560 | |
So the actions are only produced | |
1134 | |
01:03:09,560 --> 01:03:11,220 | |
like every five time steps. | |
1135 | |
01:03:11,220 --> 01:03:13,240 | |
But it gets to the target. | |
1136 | |
01:03:13,240 --> 01:03:17,480 | |
The target is represented on the right, | |
1137 | |
01:03:17,480 --> 01:03:19,300 | |
and it actually kind of presents. | |
1138 | |
01:03:19,300 --> 01:03:22,500 | |
So this is for the granular in particular. | |
1139 | |
01:03:22,500 --> 01:03:26,400 | |
So the target is represented at the right. | |
1140 | |
01:03:26,400 --> 01:03:28,600 | |
And let me play this again. | |
1141 | |
01:03:28,600 --> 01:03:32,020 | |
We start from a random configuration of the chips, | |
1142 | |
01:03:32,020 --> 01:03:33,620 | |
and the system kind of pushes | |
1143 | |
01:03:33,620 --> 01:03:35,080 | |
the chips using those actions. | |
1144 | |
01:03:35,080 --> 01:03:35,980 | |
You don't see the actions, | |
1145 | |
01:03:35,980 --> 01:03:38,940 | |
but you only see the result of pushing | |
1146 | |
01:03:38,940 --> 01:03:40,220 | |
them so that they look like a square. | |
1147 | |
01:03:40,220 --> 01:03:42,380 | |
Now what's interesting about this is that it's | |
1148 | |
01:03:42,380 --> 01:03:43,680 | |
completely open loop. | |
1149 | |
01:03:43,680 --> 01:03:48,300 | |
So the system basically looks at the initial condition, | |
1150 | |
01:03:48,300 --> 01:03:49,820 | |
imagines the sequence of actions, | |
1151 | |
01:03:49,820 --> 01:03:52,280 | |
and then executes those actions blindly. | |
1152 | |
01:03:52,280 --> 01:03:54,080 | |
What you see here is a result of | |
1153 | |
01:03:54,080 --> 01:03:56,500 | |
executing those actions, open loop, | |
1154 | |
01:03:56,500 --> 01:03:58,360 | |
closing your eyes. | |
1155 | |
01:03:58,360 --> 01:04:00,360 | |
It's pretty cool. | |
1156 | |
01:04:00,360 --> 01:04:03,220 | |
All right, coming to the end now. | |
1157 | |
01:04:03,220 --> 01:04:07,160 | |
So I have five recommendations. | |
1158 | |
01:04:07,160 --> 01:04:12,180 | |
Abandon generative models in favor of those JEPA architectures. | |
1159 | |
01:04:12,180 --> 01:04:14,580 | |
Abandon probabilistic models in favor of | |
1160 | |
01:04:14,580 --> 01:04:15,620 | |
those energy-based models. | |
1161 | |
01:04:15,620 --> 01:04:17,900 | |
So something I haven't said is that in this context, | |
1162 | |
01:04:17,900 --> 01:04:20,460 | |
you can't really do probabilistic modeling, | |
1163 | |
01:04:20,460 --> 01:04:21,360 | |
it's intractable. | |
1164 | |
01:04:22,640 --> 01:04:24,820 | |
Abandon contrastive methods | |
1165 | |
01:04:24,820 --> 01:04:28,720 | |
in favor of those regularized methods. | |
1166 | |
01:04:28,720 --> 01:04:30,200 | |
And of course, abandon reinforcement learning, | |
1167 | |
01:04:30,200 --> 01:04:32,140 | |
but that I've been saying for 10 years. | |
1168 | |
01:04:33,480 --> 01:04:36,180 | |
And so if you're interested in human level AI, | |
1169 | |
01:04:36,180 --> 01:04:37,940 | |
don't work on LLMs. | |
1170 | |
01:04:37,940 --> 01:04:40,800 | |
If you're a grad student, studying for a PhD in AI, | |
1171 | |
01:04:40,800 --> 01:04:42,220 | |
do not work on LLMs. | |
1172 | |
01:04:44,180 --> 01:04:45,240 | |
It's not interesting. | |
1173 | |
01:04:45,240 --> 01:04:51,640 | |
I mean, first of all, it's not that interesting because it's not going to be the next revolution in AI. | |
1174 | |
01:04:51,640 --> 01:04:55,640 | |
It's not going to help systems understand the physical world and everything. | |
1175 | |
01:04:55,640 --> 01:05:05,840 | |
But it's also a very dangerous thing to do because there are enormous teams in industry with billions of dollars of resources working on this. | |
1176 | |
01:05:05,840 --> 01:05:09,040 | |
There's nothing you can bring to the table. Absolutely nothing. | |
1177 | |
01:05:15,240 --> 01:05:20,780 | |
working on LLMs, but the lifetime of this is going to be three years. | |
1178 | |
01:05:20,780 --> 01:05:26,420 | |
Three to five years from now, my prediction is that no one in their right mind would use LLMs | |
1179 | |
01:05:26,420 --> 01:05:27,880 | |
in the form that they exist today. | |
1180 | |
01:05:27,880 --> 01:05:30,360 | |
I mean, they would be used as a component of a bigger system, | |
1181 | |
01:05:30,360 --> 01:05:33,820 | |
but the main architecture would be different. | |
1182 | |
01:05:35,760 --> 01:05:38,320 | |
There's a lot of problems to solve with this, | |
1183 | |
01:05:38,320 --> 01:05:41,700 | |
which I kind of swept under the rug, | |
1184 | |
01:05:41,700 --> 01:05:43,700 | |
and I'm not going to go through the laundry list, | |
1185 | |
01:05:43,700 --> 01:05:45,740 | |
but we don't know how to do hierarchical planning, | |
1186 | |
01:05:45,740 --> 01:05:47,880 | |
for example. So here is a good PhD topic, | |
1187 | |
01:05:47,880 --> 01:05:49,520 | |
if you're interested in this. | |
1188 | |
01:05:49,520 --> 01:05:54,240 | |
Just try to crack the nut of hierarchical planning. | |
1189 | |
01:05:56,340 --> 01:05:59,340 | |
There's all kinds of foundational, | |
1190 | |
01:05:59,340 --> 01:06:01,720 | |
theoretical issues with what I talked about here, | |
1191 | |
01:06:01,720 --> 01:06:03,600 | |
and energy-based models and things like this. | |
1192 | |
01:06:03,600 --> 01:06:05,840 | |
How to design objectives for SSL so | |
1193 | |
01:06:05,840 --> 01:06:08,600 | |
that the systems are driven to learn the right thing. | |
1194 | |
01:06:08,600 --> 01:06:11,760 | |
I've only talked about information maximization, | |
1195 | |
01:06:11,760 --> 01:06:13,560 | |
but there are all kinds of other things. | |
1196 | |
01:06:13,560 --> 01:06:17,720 | |
There's a little bit of RL you might need to do for adjusting the world model in real time. | |
1197 | |
01:06:18,840 --> 01:06:24,200 | |
But then, if we succeed in this program, which may take the better part of the next decade, | |
1198 | |
01:06:25,080 --> 01:06:34,200 | |
we might have virtual assistants that have human-level AI. What I think, though, is that those | |
1199 | |
01:06:34,200 --> 01:06:39,080 | |
platforms need to be open source. And so this is the political part of the talk, which is going to | |
1200 | |
01:06:39,080 --> 01:06:45,480 | |
be very short. You know, those platforms, LLMs or future AI | |
1201 | |
01:06:45,480 --> 01:06:51,800 | |
systems are incredibly expensive to train, the basic foundation models. So only a few companies | |
1202 | |
01:06:51,800 --> 01:06:58,120 | |
in the world can do it. And the problem that we're facing now is that the publicly available | |
1203 | |
01:06:58,120 --> 01:07:04,360 | |
data on the internet is not what we want, because it's mostly English. I mean, there are | |
1204 | |
01:07:04,360 --> 01:07:09,400 | |
other languages obviously, but for various reasons, regulatory reasons, all kinds of problems, | |
1205 | |
01:07:09,400 --> 01:07:17,860 | |
you do not have access to all the data in the world. Of every language in the world, | |
1206 | |
01:07:17,860 --> 01:07:23,740 | |
there are 4,000 languages or something like that that people use. All the cultures, all |
1207 | |
01:07:23,740 --> 01:07:31,140 | |
the value systems, all the centers of interest, you just don't have all the data available. | |
1208 | |
01:07:31,140 --> 01:07:35,740 | |
So the future is one in which those systems would not be trained by a single company. | |
1209 | |
01:07:35,740 --> 01:07:40,980 | |
They will be trained in a distributed manner, so that you have big data centers in various |
1210 | |
01:07:40,980 --> 01:07:41,980 | |
parts of the world. | |
1211 | |
01:07:41,980 --> 01:07:47,640 | |
They have access to local data, but they all contribute to training a large model that | |
1212 | |
01:07:47,640 --> 01:07:54,140 | |
will be worldwide and will eventually constitute the repository of all human knowledge. | |
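To make that distributed-training picture concrete, here is a minimal sketch of one common scheme it resembles, federated averaging, in which each data center trains on its own local data and only model updates are aggregated into the shared model. The specifics below (a linear model, five regions, plain averaging) are illustrative assumptions, not details given in the talk.

```python
# Minimal federated-averaging sketch: local data stays local, only updates travel.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
global_model = np.zeros(dim)                      # the shared model parameters

# Each "data center" holds its own local data (toy linear-regression problems here).
regions = [(rng.normal(size=(100, dim)), rng.normal(size=100)) for _ in range(5)]

for _ in range(20):                               # one communication round per iteration
    local_models = []
    for X, y in regions:
        w = global_model.copy()
        for _ in range(10):                       # a few local gradient steps on local data only
            grad = X.T @ (X @ w - y) / len(y)
            w -= 0.01 * grad
        local_models.append(w)
    global_model = np.mean(local_models, axis=0)  # aggregate the updates, never the raw data

print(np.round(global_model, 3))
```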
1213 | |
01:07:54,140 --> 01:07:58,460 | |
This is a very lofty goal to try to attain, right? | |
1214 | |
01:07:58,460 --> 01:08:02,040 | |
Having a system that basically constitutes a repository of all human knowledge, but it's | |
1215 | |
01:08:02,040 --> 01:08:06,900 | |
a system you can talk to, you can ask questions to, it can serve as a tutor, as a professor | |
1216 | |
01:08:06,900 --> 01:08:13,140 | |
maybe, put a lot of us here out of our jobs. |
1217 | |
01:08:13,140 --> 01:08:15,460 | |
It's a thing that we should really work towards. | |
1218 | |
01:08:15,460 --> 01:08:21,600 | |
It will amplify human intelligence, improve rational thought perhaps. | |
1219 | |
01:08:21,600 --> 01:08:23,080 | |
But it needs to be diverse also. | |
1220 | |
01:08:28,460 --> 01:08:31,060 | |
It cannot come from just a handful of companies on the West Coast of the US. |
1221 | |
01:08:31,060 --> 01:08:32,260 | |
That's completely unacceptable | |
1222 | |
01:08:32,260 --> 01:08:34,060 | |
to a lot of governments in the world, | |
1223 | |
01:08:35,060 --> 01:08:37,040 | |
democratic governments, right? | |
1224 | |
01:08:37,040 --> 01:08:39,680 | |
You need a diversity of AI assistants |
1225 | |
01:08:39,680 --> 01:08:41,420 | |
for the same reason you need a diversity | |
1226 | |
01:08:41,420 --> 01:08:44,720 | |
of newspapers, magazines and the press. | |
1227 | |
01:08:44,720 --> 01:08:47,360 | |
You need a free press with diversity. | |
1228 | |
01:08:48,380 --> 01:08:51,980 | |
And we need free AI with diversity as well. | |
1229 | |
01:08:58,460 --> 01:09:01,780 | |
Some people in AI are worried about the dangers |
1230 | |
01:09:01,780 --> 01:09:04,880 | |
of making AI technology available to everyone. | |
1231 | |
01:09:04,880 --> 01:09:09,100 | |
I think the benefits far outweigh the dangers and the risks. | |
1232 | |
01:09:10,040 --> 01:09:13,940 | |
In fact, I think the main risk of AI in the future | |
1233 | |
01:09:13,940 --> 01:09:17,720 | |
is what will happen if AI is controlled |
1234 | |
01:09:17,720 --> 01:09:19,900 | |
by a small number of commercial companies | |
1235 | |
01:09:19,900 --> 01:09:22,900 | |
that don't reveal how their AI systems work. | |
1236 | |
01:09:22,900 --> 01:09:24,340 | |
I think that's very dangerous. | |
1237 | |
01:09:24,340 --> 01:09:33,940 | |
So attempts to minimize the risk of AI by basically making open source AI illegal, | |
1238 | |
01:09:33,940 --> 01:09:39,700 | |
I think are completely misdirected, and they will actually achieve the opposite of the intended result. |
1239 | |
01:09:39,700 --> 01:09:42,580 | |
It will make AI less safe. | |
1240 | |
01:09:42,580 --> 01:09:50,780 | |
So open research, open source AI must not be regulated out of existence. | |
1241 | |
01:09:50,780 --> 01:09:53,460 | |
A lot of politicians need to understand this. | |
1242 | |
01:09:53,660 --> 01:09:57,700 | |
There's an alliance of various companies that are really kind of subscribed to this model, | |
1243 | |
01:09:57,700 --> 01:10:03,620 | |
Meta, IBM, Intel, Sony, a lot of people in academia, a lot of startups, venture capitalists, | |
1244 | |
01:10:03,620 --> 01:10:10,800 | |
etc. And then there are a few companies who are advocating for the opposite, who will |
1245 | |
01:10:10,800 --> 01:10:18,820 | |
remain nameless. So, you know, perhaps if we do it right, we'll have systems that will | |
1246 | |
01:10:18,820 --> 01:10:23,420 | |
amplify human intelligence, as I was saying at the beginning of the talk. And this may | |
1247 | |
01:10:23,420 --> 01:10:29,980 | |
bring about a new renaissance for humanity, you know, similar to what happened with the printing press in the 15th century. |
1248 | |
01:10:30,800 --> 01:10:35,180 | |
And on this cosmic conclusion, I will thank you very much. | |
1249 | |
01:10:47,380 --> 01:10:50,720 | |
And by the way, these are pictures I took from my backyard in New Jersey. | |
1250 | |
01:10:50,720 --> 01:10:59,040 | |
Thank you, Yann. So Yann will take a few questions now. And for people who are leaving, |
1251 | |
01:10:59,320 --> 01:11:04,760 | |
please leave from the Broadway entrance. Do not leave from the campus entrance. But yeah, | |
1252 | |
01:11:04,760 --> 01:11:10,820 | |
questions? Please line up at the mics if you have questions. |
1253 | |
01:11:20,720 --> 01:11:29,260 | |
No sound. | |
1254 | |
01:11:30,400 --> 01:11:31,240 | |
Yeah, it works. | |
1255 | |
01:11:38,760 --> 01:11:39,320 | |
Hi. | |
1256 | |
01:11:40,200 --> 01:11:42,260 | |
Yann, thank you so much for coming. |
1257 | |
01:11:42,900 --> 01:11:46,960 | |
I wanted to ask, for 3D vision models, |
1258 | |
01:11:46,960 --> 01:11:48,600 | |
what business applications do you see |
1259 | |
01:11:48,600 --> 01:11:50,180 | |
in the next seven, eight years? | |
1260 | |
01:11:50,180 --> 01:11:56,240 | |
Yeah, I haven't talked about 3D. | |
1261 | |
01:11:56,240 --> 01:12:01,220 | |
I mean, some of my colleagues think there is something very special about 3D. | |
1262 | |
01:12:01,220 --> 01:12:03,320 | |
I don't necessarily think that's the case. | |
1263 | |
01:12:03,320 --> 01:12:08,840 | |
I mean, we're hoping that the next generation of these V-JEPA models will basically understand |
1264 | |
01:12:08,840 --> 01:12:12,720 | |
the fact that the world is three-dimensional and there are objects in front of others and | |
1265 | |
01:12:12,720 --> 01:12:13,720 | |
things like that. | |
1266 | |
01:12:13,720 --> 01:12:19,580 | |
Now, there are applications for which you need 3D inference and reconstruction in 3D | |
1267 | |
01:12:19,580 --> 01:12:22,520 | |
if you want to have virtual objects in virtual environments |
1268 | |
01:12:22,520 --> 01:12:24,500 | |
and things like this. | |
1269 | |
01:12:24,500 --> 01:12:26,780 | |
But frankly, I'm not a specialist. | |
1270 | |
01:12:26,780 --> 01:12:29,000 | |
I think there are specialists of that question here | |
1271 | |
01:12:29,000 --> 01:12:31,380 | |
at Columbia, actually. | |
1272 | |
01:12:31,380 --> 01:12:32,700 | |
Just one more question. | |
1273 | |
01:12:32,700 --> 01:12:37,100 | |
Do you really see V-JEPA models and DINOv2 |
1274 | |
01:12:37,100 --> 01:12:40,080 | |
having hierarchical planning like the kind you mentioned | |
1275 | |
01:12:40,080 --> 01:12:41,240 | |
earlier? | |
1276 | |
01:12:41,240 --> 01:12:43,740 | |
So it doesn't exist yet. | |
1277 | |
01:12:43,740 --> 01:12:47,340 | |
So this is something we're working on. | |
1278 | |
01:12:47,340 --> 01:12:52,780 | |
I hope we will get some results on this, you know, in the next year or two, something like that. |
1279 | |
01:12:53,580 --> 01:13:01,660 | |
Thank you so much. Okay, one question here. You talked about, sorry, |
1280 | |
01:13:06,220 --> 01:13:11,900 | |
you talked about the benefits of AI and you think it's more beneficial than there are risks to it | |
1281 | |
01:13:17,340 --> 01:13:25,900 | |
but a few companies on the West Coast control the most advanced models. So why do you feel that the benefits outweigh the risks? |
1282 | |
01:13:25,900 --> 01:13:32,060 | |
So that's not entirely true. Meta actually does not subscribe to this model that AI should be | |
1283 | |
01:13:32,060 --> 01:13:38,620 | |
proprietary and kept in its own hands. It releases a series of models called Llama, right? So Llama 1, |
1284 | |
01:13:38,620 --> 01:13:45,660 | |
2, 3, 3.1, 3.2, which are state of the art, or really close to it, or better on certain measures. |
1285 | |
01:13:45,660 --> 01:13:51,900 | |
And this is open source. It can be used freely by a lot of people around the world. It can be | |
1286 | |
01:13:51,900 --> 01:14:01,740 | |
fine-tuned for various languages or vertical applications. And Llama 3 has been |
1287 | |
01:14:01,740 --> 01:14:06,140 | |
downloaded, I think, 400 million times or something like this. It's just insane. And | |
1288 | |
01:14:06,140 --> 01:14:15,580 | |
every single company I talk to has either deployed it or is about to deploy products based on Llama. |
1289 | |
01:14:15,580 --> 01:14:22,580 | |
There are people in Africa who are using it and training it to provide medical assistance, for example. | |
1290 | |
01:14:22,580 --> 01:14:31,580 | |
There are people in India that Meta is collaborating with so that future versions of Llama will speak all 22 official languages of India, |
1291 | |
01:14:31,580 --> 01:14:34,580 | |
and perhaps at some point all the 1500 dialects or whatever. | |
1292 | |
01:14:34,580 --> 01:14:40,580 | |
So, you know, I think that's the way to make AI widely accessible to everyone in the world. | |
1293 | |
01:14:40,580 --> 01:14:44,180 | |
I mean, I'm really happy to be part of that effort. | |
1294 | |
01:14:44,180 --> 01:14:47,800 | |
I really wouldn't like to be part of kind of a closed effort. | |
1295 | |
01:14:51,840 --> 01:14:52,340 | |
Hi, Yan. | |
1296 | |
01:14:52,340 --> 01:14:53,820 | |
My name is Srikant. | |
1297 | |
01:14:53,820 --> 01:14:55,560 | |
I want to ask you, I'm curious to know | |
1298 | |
01:14:55,560 --> 01:14:58,400 | |
what you think about the capabilities of time series | |
1299 | |
01:14:58,400 --> 01:15:01,460 | |
foundation models, because I see that Amazon, Google, | |
1300 | |
01:15:01,460 --> 01:15:04,400 | |
Meta, everyone's trying to work in that domain. | |
1301 | |
01:15:04,400 --> 01:15:07,100 | |
But to me, intuitively, it feels like time series predictions | |
1302 | |
01:15:07,100 --> 01:15:09,760 | |
are a harder problem than language modeling. | |
1303 | |
01:15:09,760 --> 01:15:12,860 | |
What are your thoughts on the capabilities and limitations on this? | |
1304 | |
01:15:12,860 --> 01:15:14,060 | |
Yeah, okay. | |
1305 | |
01:15:14,060 --> 01:15:18,580 | |
I think you put your finger on an important point, which I forgot to mention. | |
1306 | |
01:15:18,580 --> 01:15:24,580 | |
The reason why language modeling works, why those predictive models that predict the next | |
1307 | |
01:15:24,580 --> 01:15:27,980 | |
word, the reason why they work for natural language and they don't work for images and | |
1308 | |
01:15:27,980 --> 01:15:31,680 | |
video, for example, is because language is discrete. | |
1309 | |
01:15:31,680 --> 01:15:38,280 | |
So to represent uncertainty in the prediction, when you have a discrete choice with a few |
1310 | |
01:15:38,280 --> 01:15:40,280 | |
possible outcomes, it's easy. |
1311 | |
01:15:40,280 --> 01:15:45,100 | |
You just produce a distribution, a probability distribution over all the possible outcomes. |
1312 | |
01:15:45,100 --> 01:15:46,100 | |
And this is how LLMs work. | |
1313 | |
01:15:46,100 --> 01:15:47,100 | |
They are trained. | |
1314 | |
01:15:47,100 --> 01:15:51,160 | |
They actually produce a distribution over the next token. | |
1315 | |
01:15:51,160 --> 01:15:57,300 | |
You can't do this with continuous variables, particularly high dimensional continuous variables | |
1316 | |
01:15:57,300 --> 01:16:00,180 | |
like video pixels. | |
1317 | |
01:16:00,180 --> 01:16:06,480 | |
So there, we're not able to represent distributions efficiently in high dimensional continuous | |
1318 | |
01:16:06,480 --> 01:16:10,800 | |
spaces, beyond simple ones like Gaussians, right? |
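To make that contrast concrete, here is a toy sketch (illustrative numbers, not from the talk): for a discrete vocabulary, a softmax turns scores into an explicit distribution over the next token, whereas minimizing squared error on a continuous, multimodal target just yields the average of the possible outcomes, which may itself never occur.

```python
# Toy contrast: categorical distribution over tokens vs. regression on a multimodal target.
import numpy as np

rng = np.random.default_rng(0)

# Discrete case: scores over a small vocabulary; softmax gives a usable distribution.
vocab = ["cat", "sat", "mat", "ran"]
logits = np.array([2.0, 0.5, 0.1, -1.0])        # hypothetical scores for the next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))          # explicit uncertainty over the outcomes

# Continuous case: the "true" next value is either +1 or -1 with equal probability.
targets = rng.choice([-1.0, 1.0], size=10_000)
best_mse_prediction = targets.mean()             # the MSE-optimal constant prediction
print(best_mse_prediction)                       # ~0.0: the blurry average, not a real outcome
```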
1319 | |
01:16:10,800 --> 01:16:15,720 | |
So my answer to this is don't do it. | |
1320 | |
01:16:15,720 --> 01:16:18,100 | |
Do prediction in representation space. | |
1321 | |
01:16:18,100 --> 01:16:20,160 | |
And then if you need to have actual prediction | |
1322 | |
01:16:20,160 --> 01:16:22,540 | |
of the time series, have a decoder that does that | |
1323 | |
01:16:22,540 --> 01:16:23,100 | |
separately. | |
1324 | |
01:16:23,100 --> 01:16:26,100 | |
But actually training a system to predict | |
1325 | |
01:16:26,100 --> 01:16:28,680 | |
a high-dimensional continuous thing by regression |
1326 | |
01:16:28,680 --> 01:16:32,380 | |
when you have uncertainty simply doesn't work. | |
1327 | |
01:16:32,380 --> 01:16:34,760 | |
That's the evidence we have by trying to, | |
1328 | |
01:16:34,760 --> 01:16:38,200 | |
There was a huge project at Meta called Video MAE. | |
1329 | |
01:16:38,200 --> 01:16:40,280 | |
So the idea was, you know, take a video, | |
1330 | |
01:16:40,280 --> 01:16:41,920 | |
mask some parts of it, |
1331 | |
01:16:41,920 --> 01:16:43,300 | |
and then train some gigantic neural net | |
1332 | |
01:16:43,300 --> 01:16:45,060 | |
to predict the parts that are missing. | |
1333 | |
01:16:45,060 --> 01:16:46,520 | |
It was a complete failure. |
1334 | |
01:16:46,520 --> 01:16:49,700 | |
We abandoned that project. | |
1335 | |
01:16:49,700 --> 01:16:52,440 | |
We canceled it, because it was going nowhere, okay? | |
1336 | |
01:16:52,440 --> 01:16:54,860 | |
And this was really very large scale. | |
1337 | |
01:16:54,860 --> 01:16:56,860 | |
A lot of computing resources were devoted to this. | |
1338 | |
01:16:56,860 --> 01:16:58,580 | |
It just didn't work. | |
1339 | |
01:16:58,580 --> 01:17:01,040 | |
The JEPA stuff, though, does work. |
1340 | |
01:17:01,040 --> 01:17:03,660 | |
So my hunch is that for time series, | |
1341 | |
01:17:03,660 --> 01:17:07,900 | |
there's probably a way to use a similar kind of idea. |
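As a rough sketch of that hunch, here is what prediction in representation space might look like for a time series in PyTorch: an encoder, a predictor trained against latent targets, and a separate decoder for when a forecast in observation space is actually needed. The sizes, modules, and losses are illustrative assumptions, and a real system would also need a collapse-preventing term of the information-maximization kind mentioned earlier.

```python
# Sketch: predict the next *representation*, not the raw values; decode separately if needed.
import torch
import torch.nn as nn

obs_dim, latent_dim, window = 16, 32, 8

encoder = nn.Sequential(nn.Flatten(), nn.Linear(window * obs_dim, latent_dim))
predictor = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.Tanh(),
                          nn.Linear(latent_dim, latent_dim))
decoder = nn.Linear(latent_dim, obs_dim)         # trained separately if forecasts are needed

x_past = torch.randn(4, window, obs_dim)         # a batch of past windows (stand-in data)
x_next = torch.randn(4, window, obs_dim)         # the windows one step later

z_past = encoder(x_past)
with torch.no_grad():                            # target representation, no gradient through it
    z_next = encoder(x_next)

pred_loss = nn.functional.mse_loss(predictor(z_past), z_next)  # loss in representation space
# (a real system also needs a collapse-preventing regularizer, e.g. variance/covariance terms)

forecast = decoder(predictor(z_past))            # optional read-out back to observation space
print(pred_loss.item(), forecast.shape)
```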
1342 | |
01:17:07,900 --> 01:17:08,780 | |
SPEAKER 1 | |
1343 | |
01:17:08,780 --> 01:17:09,380 | |
OK, thank you. | |
1344 | |
01:17:12,580 --> 01:17:12,620 | |
SPEAKER 1 | |
1345 | |
01:17:12,620 --> 01:17:14,540 | |
Great talk. | |
1346 | |
01:17:14,540 --> 01:17:17,200 | |
So my question is, I think I agree with your framework | |
1347 | |
01:17:17,200 --> 01:17:18,840 | |
where you have some world model and you |
1348 | |
01:17:18,840 --> 01:17:20,900 | |
want to optimize via that world model | |
1349 | |
01:17:20,900 --> 01:17:22,200 | |
and how you train the world model. | |
1350 | |
01:17:22,200 --> 01:17:24,980 | |
But my question is, how do you get intelligence | |
1351 | |
01:17:24,980 --> 01:17:28,720 | |
when the world model is inconsistent with the truth? | |
1352 | |
01:17:28,720 --> 01:17:31,440 | |
So as an example, let's say your world model only | |
1353 | |
01:17:31,440 --> 01:17:33,240 | |
has classical mechanics. | |
1354 | |
01:17:33,240 --> 01:17:35,460 | |
How do you discover special relativity? |
1355 | |
01:17:35,460 --> 01:17:38,220 | |
Humans have somehow broken that boundary, | |
1356 | |
01:17:38,220 --> 01:17:39,440 | |
but I don't know how you do that | |
1357 | |
01:17:39,440 --> 01:17:42,360 | |
when your world model is only based on observed data. | |
1358 | |
01:17:43,320 --> 01:17:45,260 | |
Well, I mean, the type of world model | |
1359 | |
01:17:45,260 --> 01:17:46,840 | |
we're talking about here is, | |
1360 | |
01:17:48,120 --> 01:17:50,700 | |
what I would be happy with before I retire | |
1361 | |
01:17:50,700 --> 01:17:52,800 | |
or before my brain turns into béchamel sauce |
1362 | |
01:17:52,800 --> 01:17:57,800 | |
is world models that are of the level of complexity | |
1363 | |
01:17:58,020 --> 01:18:01,860 | |
of a cat's world model, right, of the physical world, | |
1364 | |
01:18:01,860 --> 01:18:03,400 | |
which is pretty sophisticated, actually. |
1365 | |
01:18:03,400 --> 01:18:06,260 | |
I mean, you can plan really complex actions. | |
1366 | |
01:18:06,260 --> 01:18:07,360 | |
So that's what we're talking about. | |
1367 | |
01:18:07,360 --> 01:18:09,400 | |
Now, you put your finger on something | |
1368 | |
01:18:09,400 --> 01:18:11,340 | |
that's really interesting, which is |
1369 | |
01:18:12,540 --> 01:18:15,940 | |
a philosophical motivation behind JEPA, |
1370 | |
01:18:15,940 --> 01:18:20,140 | |
and this idea that you need to lift the abstraction level | |
1371 | |
01:18:21,580 --> 01:18:23,340 | |
to be able to make predictions, right? | |
1372 | |
01:18:24,540 --> 01:18:27,420 | |
You cannot make predictions at the level of observation. | |
1373 | |
01:18:27,420 --> 01:18:31,620 | |
You have to find a good representation of reality | |
1374 | |
01:18:31,620 --> 01:18:33,260 | |
within which you can make predictions. | |
1375 | |
01:18:33,260 --> 01:18:35,800 | |
And that's the hardest problem really, | |
1376 | |
01:18:35,800 --> 01:18:37,620 | |
is to find that good representation space | |
1377 | |
01:18:37,620 --> 01:18:38,940 | |
that allows you to make predictions. | |
1378 | |
01:18:38,940 --> 01:18:40,480 | |
We do this all the time in science. | |
1379 | |
01:18:40,480 --> 01:18:42,900 | |
We do this all the time in everyday life without realizing, | |
1380 | |
01:18:42,900 --> 01:18:45,160 | |
but we do this all the time in science. | |
1381 | |
01:18:46,620 --> 01:18:47,760 | |
If we didn't need to do this, | |
1382 | |
01:18:47,760 --> 01:18:52,760 | |
we could explain human society with quantum field theory. | |
1383 | |
01:18:54,180 --> 01:18:55,020 | |
Right? | |
1384 | |
01:18:55,020 --> 01:18:55,860 | |
Right. | |
1385 | |
01:18:55,860 --> 01:18:56,880 | |
But we can't, right? | |
1386 | |
01:18:56,880 --> 01:19:00,340 | |
Because the gap, you know, in abstraction is so large, right? | |
1387 | |
01:19:00,340 --> 01:19:05,380 | |
So we go from quantum field theory to particle physics and from particles to atoms and from | |
1388 | |
01:19:05,380 --> 01:19:11,060 | |
atoms to molecules, from molecules to materials and chemistry, and, you know, blah blah blah, |
1389 | |
01:19:11,060 --> 01:19:17,780 | |
right? And we go up the chain of abstraction, so that at some level we have a representation |
1390 | |
01:19:17,780 --> 01:19:24,340 | |
of physical objects and Newtonian mechanics, and, you know, for large scales it would be relativity, |
1391 | |
01:19:30,340 --> 01:19:34,880 | |
human behavior, animal behavior, ecology, you know, this kind of stuff, right? | |
1392 | |
01:19:34,880 --> 01:19:38,340 | |
So we have all those levels of representation, |
1393 | |
01:19:38,340 --> 01:19:42,400 | |
for which the crucial insight is to actually find a representation. | |
1394 | |
01:19:42,400 --> 01:19:45,780 | |
For example, let's take a planet. Let's take Jupiter, okay? | |
1395 | |
01:19:45,780 --> 01:19:47,700 | |
Jupiter is an incredibly complex object. | |
1396 | |
01:19:47,700 --> 01:19:52,080 | |
It's got, you know, a complicated composition. |
1397 | |
01:19:52,080 --> 01:19:55,480 | |
It's got weather. It's got all kinds of gases swirling around. | |
1398 | |
01:19:55,480 --> 01:20:00,480 | |
And, you know, very complex object, right? | |
1399 | |
01:20:02,180 --> 01:20:05,840 | |
Now, who would have thought that the only thing you need | |
1400 | |
01:20:05,840 --> 01:20:10,420 | |
to predict the trajectory of Jupiter is six numbers? | |
1401 | |
01:20:10,420 --> 01:20:13,300 | |
You need three positions, three velocities, |
1402 | |
01:20:13,300 --> 01:20:16,480 | |
and you can predict the trajectory of Jupiter for centuries. | |
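As a toy illustration of that point, here is a sketch (two-body Newtonian gravity only, approximate values) showing that three positions and three velocities, plus the dynamics, are enough to propagate Jupiter's orbit stably for a very long time.

```python
# Toy two-body propagation of Jupiter around the Sun from six numbers.
import numpy as np

GM = 4 * np.pi**2                      # Sun's gravitational parameter in AU^3 / yr^2
pos = np.array([5.2, 0.0, 0.0])        # approximate position (AU)
vel = np.array([0.0, 2 * np.pi / np.sqrt(5.2), 0.0])   # roughly circular orbital velocity (AU/yr)

dt, years = 0.001, 100.0               # time step and total integration time (years)
for _ in range(int(years / dt)):       # simple leapfrog (kick-drift-kick) integration
    acc = -GM * pos / np.linalg.norm(pos) ** 3
    vel += 0.5 * dt * acc
    pos += dt * vel
    acc = -GM * pos / np.linalg.norm(pos) ** 3
    vel += 0.5 * dt * acc

print(np.linalg.norm(pos))             # stays near 5.2 AU: the orbit is stable over the run
```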
1403 | |
01:20:18,460 --> 01:20:19,920 | |
You know, that's a problem of learning | |
1404 | |
01:20:19,920 --> 01:20:22,140 | |
a good representation, right? | |
1405 | |
01:20:22,140 --> 01:20:23,740 | |
So, is the proposal essentially | |
1406 | |
01:20:23,740 --> 01:20:26,500 | |
to do this hierarchical planning with hierarchical world | |
1407 | |
01:20:26,500 --> 01:20:27,500 | |
models as well? | |
1408 | |
01:20:27,500 --> 01:20:28,000 | |
Yeah. | |
1409 | |
01:20:28,000 --> 01:20:28,500 | |
OK. | |
1410 | |
01:20:28,500 --> 01:20:29,000 | |
Exactly. | |
1411 | |
01:20:29,000 --> 01:20:29,500 | |
Awesome. | |
1412 | |
01:20:29,500 --> 01:20:31,900 | |
Have a system that can build multiple levels of abstractions. | |
1413 | |
01:20:31,900 --> 01:20:32,480 | |
Great. | |
1414 | |
01:20:32,480 --> 01:20:32,940 | |
Thanks. | |
1415 | |
01:20:32,940 --> 01:20:36,040 | |
Which is really the idea behind deep learning, by the way. | |
1416 | |
01:20:36,040 --> 01:20:36,540 | |
OK. | |
1417 | |
01:20:36,540 --> 01:20:38,120 | |
We'll have two more questions, then we'll stop. | |
1418 | |
01:20:38,120 --> 01:20:40,600 | |
So we'll take one from there and one from there. | |
1419 | |
01:20:40,600 --> 01:20:40,880 | |
Yeah. | |
1420 | |
01:20:40,880 --> 01:20:41,700 | |
Hi. | |
1421 | |
01:20:41,700 --> 01:20:45,640 | |
My question is about the one type of generative model | |
1422 | |
01:20:45,640 --> 01:20:49,820 | |
that you haven't covered, which is the diffusion models, which | |
1423 | |
01:20:49,820 --> 01:20:56,560 | |
I believe are quite different from the generative models | |
1424 | |
01:20:56,560 --> 01:21:00,200 | |
that you mentioned, because they are more implicit and | |
1425 | |
01:21:00,200 --> 01:21:04,360 | |
they don't predict the explicit probability distribution | |
1426 | |
01:21:04,360 --> 01:21:09,180 | |
like the LLMs or VAEs or all the other generative ones that you |
1427 | |
01:21:09,180 --> 01:21:13,880 | |
mentioned. What is your perspective on the potential of |
1428 | |
01:21:13,880 --> 01:21:19,820 | |
those models, and especially their relation |
1429 | |
01:21:19,820 --> 01:21:26,540 | |
to hierarchical planning, as you said? Because when you use it for generating an image, like |
1430 | |
01:21:26,540 --> 01:21:32,580 | |
in the first few time steps, it actually generates the very high-level details, and then in the |
1431 | |
01:21:32,580 --> 01:21:37,060 | |
later time steps, it fills in the details, like the smaller details. |
1432 | |
01:21:37,060 --> 01:21:38,060 | |
Yeah. | |
1433 | |
01:21:38,060 --> 01:21:39,060 | |
Okay. | |
1434 | |
01:21:39,060 --> 01:21:41,480 | |
So diffusion models can be seen as generative or not. | |
1435 | |
01:21:41,480 --> 01:21:45,840 | |
But the way to understand them, I think, is the following. | |
1436 | |
01:21:45,840 --> 01:21:53,240 | |
In a space of representation or images or whatever it is, you have, let's say, a manifold | |
1437 | |
01:21:53,240 --> 01:21:56,080 | |
of data. | |
1438 | |
01:21:56,080 --> 01:22:00,420 | |
Let's say natural images if you want to train an image generation system. | |
1439 | |
01:22:00,420 --> 01:22:04,860 | |
Or perhaps representations that are extracted by an encoder of the type that I talked about. | |
1440 | |
01:22:04,860 --> 01:22:10,800 | |
And that is basically a subset within the full space. |
1441 | |
01:22:10,800 --> 01:22:14,660 | |
What a diffusion model does is that you give it a random vector in that space and it will | |
1442 | |
01:22:14,660 --> 01:22:16,980 | |
bring you back to that manifold. | |
1443 | |
01:22:16,980 --> 01:22:21,800 | |
Okay, and it will do this by training a vector field | |
1444 | |
01:22:21,800 --> 01:22:26,300 | |
so that at every location, random location in that space, | |
1445 | |
01:22:26,300 --> 01:22:29,560 | |
there is a vector that basically takes you back | |
1446 | |
01:22:29,560 --> 01:22:32,900 | |
to that manifold, perhaps in multiple steps. | |
1447 | |
01:22:32,900 --> 01:22:34,760 | |
Okay, that's what it does in the end. | |
1448 | |
01:22:34,760 --> 01:22:36,860 | |
It's trained in a particular way by reversing, | |
1449 | |
01:22:38,960 --> 01:22:43,620 | |
you know, a noising chain, but that's what it does. |
1450 | |
01:22:43,620 --> 01:22:49,740 | |
Now that's actually a particular way of implementing energy-based models of the types that I described. | |
1451 | |
01:22:49,740 --> 01:22:53,520 | |
Because you can think of this manifold of data as being kind of the minimum of an energy | |
1452 | |
01:22:53,520 --> 01:22:54,600 | |
function. | |
1453 | |
01:22:54,600 --> 01:22:58,740 | |
And if you had an energy function, you could compute the gradient of that energy function, | |
1454 | |
01:22:58,740 --> 01:23:02,900 | |
that gradient of the energy function will take you back to that manifold. | |
1455 | |
01:23:02,900 --> 01:23:09,020 | |
So that's the energy-based view of inference or denoising or restoration or whatever you | |
1456 | |
01:23:09,020 --> 01:23:11,760 | |
want. | |
1457 | |
01:23:11,760 --> 01:23:17,760 | |
And diffusion models basically instead of having an energy function that you compute | |
1458 | |
01:23:17,760 --> 01:23:22,440 | |
the gradient of, they directly learn the vector field that basically would be the gradient | |
1459 | |
01:23:22,440 --> 01:23:24,440 | |
of that energy function. | |
1460 | |
01:23:24,440 --> 01:23:25,620 | |
That's the way to understand it. | |
1461 | |
01:23:25,620 --> 01:23:27,880 | |
So it's not disconnected from what I talked about. | |
1462 | |
01:23:27,880 --> 01:23:32,160 | |
It can be used usefully in the context of what I talked about. | |
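Here is a toy sketch of that "vector field back to the manifold" picture (PyTorch, with the data manifold taken to be the unit circle purely for illustration): a small network is trained by denoising to predict the displacement from a corrupted point back toward the data, which plays the role of a scaled gradient of an energy whose minimum is the manifold, and following that field pulls points back onto it.

```python
# Toy denoising sketch: learn a vector field that points back to the data manifold.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
sigma = 0.75                                        # noise level used to corrupt the data

for _ in range(3000):                               # denoising training loop
    theta = torch.rand(256, 1) * 2 * math.pi
    clean = torch.cat([theta.cos(), theta.sin()], dim=1)   # data manifold: the unit circle
    noisy = clean + sigma * torch.randn_like(clean)
    loss = ((net(noisy) - (clean - noisy)) ** 2).mean()    # predict the step back to the data
    opt.zero_grad()
    loss.backward()
    opt.step()

x = torch.tensor([[2.0, 0.0], [0.0, -1.8], [-1.3, 1.3]])   # points off the manifold
for _ in range(30):                                 # follow the learned field in small steps
    x = x + 0.2 * net(x).detach()
print(x.norm(dim=1))                                # radii move toward ~1: pulled back to the circle
```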
1463 | |
01:23:32,160 --> 01:23:34,520 | |
And what about nature? | |
1464 | |
01:23:34,520 --> 01:23:35,520 | |
Yeah. | |
1465 | |
01:23:35,520 --> 01:23:38,520 | |
My name is Leon. | |
1466 | |
01:23:38,520 --> 01:23:40,960 | |
I really want to thank you for the talk. | |
1467 | |
01:23:40,960 --> 01:23:44,960 | |
My question was sort of about these world models you were talking about, | |
1468 | |
01:23:44,960 --> 01:23:50,960 | |
especially in terms of trying to get to actual like cat level or animal type intelligence. | |
1469 | |
01:23:50,960 --> 01:24:01,960 | |
So like in terms of like a giraffe, as soon as it's born, something is in its mind that lets it be able to run or even walk within moments. | |
1470 | |
01:24:01,960 --> 01:24:07,960 | |
And I think part of it is because the world model it has constrains the type of actions it takes, | |
1471 | |
01:24:07,960 --> 01:24:14,460 | |
That kind of thing seems to be what you're almost doing with DINO, trying to do these rule-based approaches. |
1472 | |
01:24:15,000 --> 01:24:18,900 | |
I'm just wondering how do these world models evolve over time? | |
1473 | |
01:24:19,040 --> 01:24:22,680 | |
Like how much variability does it have? | |
1474 | |
01:24:22,680 --> 01:24:30,000 | |
Yeah, I mean so clearly you need the world model to be adjusted as you go, right? | |
1475 | |
01:24:37,960 --> 01:24:42,200 | |
For example, if I pick up an object I think is full, I apply a particular force to grab it, but then as I grab it, I realize it's not that full, so |
1476 | |
01:24:42,200 --> 01:24:43,640 | |
it's lighter. | |
1477 | |
01:24:43,640 --> 01:24:49,580 | |
I can adjust my world model of that system and then adjust my actions as a function of |
1478 | |
01:24:49,580 --> 01:24:50,580 | |
this very quickly. | |
1479 | |
01:24:50,580 --> 01:24:51,580 | |
It's not learning, actually. | |
1480 | |
01:24:51,580 --> 01:24:53,560 | |
It's just a few parameter adjustments. |
1481 | |
01:24:53,560 --> 01:24:57,060 | |
But in other situations, you need to learn. | |
1482 | |
01:24:57,060 --> 01:25:01,740 | |
You need to adapt your world model for the situation. |
1483 | |
01:25:01,740 --> 01:25:06,600 | |
If you have a powerful world model, you're not going to be able to train it for all possible |
1484 | |
01:25:06,600 --> 01:25:10,780 | |
situations and all possible configurations of the world. | |
1485 | |
01:25:10,780 --> 01:25:14,920 | |
And so there are parts of the state space | |
1486 | |
01:25:14,920 --> 01:25:18,120 | |
where your model is going to be inaccurate. |
1487 | |
01:25:18,120 --> 01:25:20,780 | |
And the system, if you want the system to plan accurately, | |
1488 | |
01:25:20,780 --> 01:25:23,580 | |
it needs to be able to detect when that happens. | |
1489 | |
01:25:23,580 --> 01:25:26,940 | |
So basically only plan within regions of the space | |
1490 | |
01:25:26,940 --> 01:25:29,720 | |
where the prediction of its own model is good, | |
1491 | |
01:25:29,720 --> 01:25:31,660 | |
and then adjust its model as it goes | |
1492 | |
01:25:31,660 --> 01:25:33,800 | |
if it's not the case. | |
1493 | |
01:25:33,800 --> 01:25:36,400 | |
That's where you need reinforcement learning basically. | |
1494 | |
01:25:36,400 --> 01:25:38,500 | |
Can I just ask a clarification question? | |
1495 | |
01:25:38,900 --> 01:25:43,640 | |
I think there's a lot of understanding of, "I'm really confident in what I'm able to do," |
1496 | |
01:25:43,900 --> 01:25:49,640 | |
but as soon as, let's say, I throw a ball, the physics of that ball is something really unpredictable. | |
1497 | |
01:25:50,100 --> 01:25:52,280 | |
How would you differentiate that in your world model? | |
1498 | |
01:25:52,520 --> 01:25:53,340 | |
Are there parameters? | |
1499 | |
01:25:53,520 --> 01:25:57,100 | |
Yeah, so this is adaptation on the fly of your world model | |
1500 | |
01:25:57,100 --> 01:26:01,080 | |
or perhaps adjustment of a few latent variables that represent what you don't know about the world, | |
1501 | |
01:26:01,200 --> 01:26:02,960 | |
like the wind speed and things like that. | |
1502 | |
01:26:02,960 --> 01:26:06,500 | |
So, I mean, there are various mechanisms for this. |
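A small sketch of that last mechanism (PyTorch, everything here is a stand-in): keep the world model's weights frozen and adjust only a handful of latent variables, here a "wind" vector z, by gradient descent on the error between predicted and observed next states; planning then uses the inferred z.

```python
# Sketch: on-the-fly adjustment of a few latent variables of a frozen world model.
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim = 4, 2, 2

world_model = nn.Sequential(nn.Linear(state_dim + action_dim + latent_dim, 64),
                            nn.Tanh(), nn.Linear(64, state_dim))
for p in world_model.parameters():
    p.requires_grad_(False)                          # pretrained and frozen; weights untouched

z = torch.zeros(latent_dim, requires_grad=True)      # unknown factors, e.g. wind speed/direction
opt = torch.optim.Adam([z], lr=0.05)

# A handful of observed transitions (random stand-ins for real data).
states = torch.randn(8, state_dim)
actions = torch.randn(8, action_dim)
next_states = torch.randn(8, state_dim)

for _ in range(100):                                  # on-the-fly adaptation loop
    inp = torch.cat([states, actions, z.expand(8, -1)], dim=1)
    loss = nn.functional.mse_loss(world_model(inp), next_states)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(z.detach())                                     # the inferred latent to plan with next
```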
1503 | |
01:26:07,780 --> 01:26:09,600 | |
Okay, let's thank the speaker again. | |
Lecture Series in AI: “How Could Machines Reach Human-Level Intelligence?” by Yann LeCun
https://www.youtube.com/watch?v=xL6Y0dpXEwc
lecun-20241018-columbia.pdf
https://drive.google.com/file/d/1F0Q8Fq0h2pHq9j6QIbzqhBCfTXJ7Vmf4/view?fbclid=IwY2xjawGHh3tleHRuA2FlbQIxMAABHcLjFAQz00GHBdJ-pTMzs8UkGiykjlbQ4wurigytkf3nrf2ROdZ0c7GQXA_aem_M9Td1Gud4fVupEGYTw57Hw
whisper-large-v3-turbo generated transcript