@ricklamers
Created September 3, 2025 12:45
[00:00.000]
Ten years ago, I visited the Singapore office.
[00:04.000]
Actually, exactly ten years ago, I visited the Google Singapore office.
[00:10.000]
Had a really great lunch.
[00:12.000]
I said, you know, the lunch is so great, I have to come back.
[00:20.000]
So, I'm preparing for this talk, I said, I need to do something special.
[00:23.000]
And this is the first time ever that I give a talk that has an emoji in my slides.
[00:30.000]
So you're the lucky ones, yeah?
[00:34.080]
So
[00:36.840]
Last December, Ilya, Oriol and I won the Test of Time award at NeurIPS for our work on sequence to sequence.
[00:45.380]
Woo!
[00:50.380]
So many people came to me and asked, you know, what is the story behind this?
[00:55.140]
What is the story behind some of the work around sequence to sequence?
[00:58.720]
So I just wanna give you maybe a little bit of background behind it and
[01:05.020]
then some of the chaotic history behind some of the work that was done, you know, at Google, at Google Brain, under the leadership of Jeff, who gave an amazing talk today, and just to tell you
[01:24.508]
why we did what we did, and then how these innovations came out, some of them by accident.
[01:29.848]
So my journey in the last decade, I would say, breaks into three phases.
[01:43.048]
The first phase is the discovery of pre-training.
[01:46.728]
I'm going to talk a little bit about that.
[01:50.788]
And then the second phase is the discovery of scaling and then also the discovery of
[01:59.448]
instruction tuning, and how to use LLMs for program synthesis.
[02:03.348]
It's one of the key applications that people use language models for today.
[02:09.088]
And then the third phase is something around reasoning, the reasoning paradigm and the
[02:14.468]
inference scaling laws.
[02:15.468]
So I'm going to touch some of these ideas in this talk and why we went for some of these ideas.
[02:26.456]
So, where did it all begin?
[02:31.456]
When I started in Brain, I was doing this cat neuron project.
[02:35.456]
The cat neuron project was to scale up the neural network,
[02:40.456]
and then to probe the network, and one of the neurons would activate for cats.
[02:46.456]
But after that, a lot of people started working on computer vision,
[02:52.456]
neural nets for computer vision.
[02:55.216]
But then I became very interested in the idea of neural
[02:58.596]
nets for language understanding.
[03:00.836]
Because I believe that if you can get neural networks to
[03:03.736]
understand language, it will basically unlock the real
[03:08.396]
potential of neural nets for AI.
[03:15.336]
So back then, there was this amazing work called word2vec
[03:19.616]
by Tomas Mikolov, Jeff, Greg, and other people on the team.
[03:25.136]
So I was playing around with it.
[03:26.816]
And then what I observed was that if you train word2vec on
[03:34.396]
English, and you also train word2vec on Spanish, and then you just project them into 2D, you will see something like this. You will see that
[03:45.624]
the word vectors in English, say one, two, three, four, five. If you project the
[03:54.064]
word vectors in Spanish, you will see that they are actually arranged in a very similar
[03:59.964]
shape. And that's interesting, because it means it actually captures structures
[04:07.524]
of the language. So I said, if this is possible, if this is what we can
[04:15.564]
see by projecting, how about we learn a rotation matrix between English and
[04:23.364]
Spanish? Right, so with some dictionary lookup,
[04:28.164]
you can basically translate a first set of words,
[04:31.624]
and from those you can learn a rotation matrix, let's say.
[04:35.364]
And it actually turned out it worked well.
[04:37.684]
So it can actually learn to translate words,
[04:41.524]
you know, rare words very, very quickly and very easily.
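As a rough sketch of that trick (with made-up placeholder vectors rather than real word2vec embeddings), you can fit a linear map from the English embedding space to the Spanish one using a small seed dictionary, then translate a new word by nearest neighbour in the mapped space:

```python
import numpy as np

# Hypothetical pre-trained word vectors (placeholders, not real word2vec output).
dim = 50
rng = np.random.default_rng(0)
en_vecs = {w: rng.normal(size=dim) for w in ["one", "two", "three", "four", "five"]}
es_vecs = {w: rng.normal(size=dim) for w in ["uno", "dos", "tres", "cuatro", "cinco"]}

# A small seed dictionary of known translation pairs.
pairs = [("one", "uno"), ("two", "dos"), ("three", "tres"), ("four", "cuatro")]
X = np.stack([en_vecs[e] for e, _ in pairs])   # English side
Z = np.stack([es_vecs[s] for _, s in pairs])   # Spanish side

# Learn a linear map W minimizing ||X W - Z||^2 (least squares rather than a strict rotation).
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(en_word):
    """Map an English vector into the Spanish space and return the nearest Spanish word."""
    v = en_vecs[en_word] @ W
    sims = {s: v @ u / (np.linalg.norm(v) * np.linalg.norm(u)) for s, u in es_vecs.items()}
    return max(sims, key=sims.get)

print(translate("five"))  # with real embeddings this would ideally print "cinco"
```

In the actual work the map was fit on real word2vec embeddings with a much larger seed dictionary, but the mechanics are essentially this.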
[04:45.464]
So I say, okay, now that you can translate,
[04:49.444]
so this paper, we wrote this paper with Tomas and
[04:52.912]
Ilya. So after this project I said, but translation is not just single
[05:00.612]
words, it should be phrases, or actually it should be sentences, or maybe paragraphs,
[05:06.192]
right? So how about we try to figure out how to do this rotation matrix, in some
[05:13.612]
sense a rotation matrix between English, like a sentence in English, and a
[05:19.772]
sentence in Spanish, let's say. So let's do that. Actually, we
[05:25.992]
had many designs and a lot of them failed, but actually the simplest
[05:30.012]
design turned out to work, and the design looked like this. The idea is,
[05:35.952]
by the way, in hindsight this is actually a wrong idea, but
[05:40.672]
the idea is this: you take ABC and then you compress it into a vector, and then
[05:46.172]
you learn a rotation matrix inside the model, and that rotation
[05:51.752]
matrix somehow would be decoded out into words in another language. And that's
[06:00.332]
actually the foundation of this idea of sequence to sequence. And why is it a wrong idea? Because actually, when we did this project, a lot of people thought that the idea
[06:12.020]
that you compressed all the knowledge into that,
[06:16.000]
jam all the knowledge into that vector right over there,
[06:19.180]
is actually a bad idea.
[06:20.600]
And they were correct about it.
[06:23.720]
But we were basically, we said,
[06:25.040]
we don't have any alternatives.
[06:26.960]
The only option that we have here
[06:31.240]
is to scale up, try to train like a giant network.
[06:33.700]
Maybe that hidden vector right there
[06:37.440]
has to be like a billion, right?
[06:39.880]
We would say it has to be like a billion dimensions
[06:42.140]
long, and it will remember everything.
[06:44.300]
So that was a foolish idea.
[06:46.220]
Actually, the better way to do it today
[06:48.680]
is to do this through attention, right?
[06:51.160]
Like the transformer.
[06:53.080]
But back then, that was the belief, right?
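To make that bottleneck concrete, here is a minimal PyTorch sketch (a toy illustration, not the original LSTM model or training setup): the encoder squeezes the whole source sequence into a single hidden vector, and the decoder has to generate the target from that vector alone.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Toy encoder-decoder: the entire source sequence is squeezed into one hidden vector."""
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))            # h: the single bottleneck vector
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)   # decode conditioned only on h
        return self.out(dec_out)                              # logits over the target vocabulary

# Toy usage: batch of 2 "sentences", 5 source tokens in, 4 target tokens out.
model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 5))
tgt = torch.randint(0, 1000, (2, 4))
print(model(src, tgt).shape)  # torch.Size([2, 4, 1000])
```

Attention, as in the Transformer, removes exactly this constraint by letting the decoder look back at every encoder state instead of one compressed vector.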
[06:55.820]
But I think the consequence of this idea
[07:00.140]
is very important, because actually,
[07:04.440]
before this idea, most people thought about
[07:06.980]
machine learning or supervised learning
[07:09.100]
as just decoding only one output, right?
[07:11.520]
You see an image, you decode it, cat, right?
[07:14.400]
You see an image, you decode it, dog.
[07:16.700]
But for most problems in AI, you have to decode into a structure,
[07:19.368]
let's say you decode it into a graph, decode it into a sentence, a paragraph.
[07:26.168]
So what we believed is that all these problems can be mapped into tokens in and tokens out.
[07:34.008]
So actually we wrote in the paper that what we believe the future is going to look like is
[07:39.288]
tokens in and tokens out. And we would say that the tokens in could be, let's say, images,
[07:46.088]
and the tokens out could be images too.
[07:48.088]
And that's a really bold idea.
[07:50.788]
But now, in hindsight, that was actually correct.
[07:53.788]
Nowadays, people are actually doing this.
[07:56.088]
They're doing multimodal image and language models
[08:01.188]
and so on in the same model, which is pretty cool.
[08:06.088]
So I was very interested in chatbots,
[08:11.588]
even when I was in high school.
[08:13.188]
So I was building my own chatbots and so on, using some Ruby lookup table to build a chatbot.
[08:26.188]
But then, based on this sequence-to-sequence idea, our colleague Oriol was actually the first to train this network. And the way that he did it, he went to a database of internal helpdesk data. So
[08:38.276]
internally, if we have any technical issues at Google, you go to a website and
[08:43.336]
then you chat with someone behind the scenes. And then we collected that data and
[08:47.716]
stored it in a database, and Oriol actually trained sequence to
[08:52.396]
sequence on that database. And then the next thing we did was we chatted with it,
[08:58.696]
and it would say, describe your problem, right? And, you know, when you come into
[09:03.716]
the box you say, I have an issue with accessing VPN, and then the machine would
[09:09.356]
say hi, you say hello, and the machine would say, could you please let me know the
[09:14.896]
problem and so on, right? And, you know, the operating system, and so on.
[09:19.056]
So throughout this interaction, the machine was able to identify the problem and point
[09:24.656]
you correctly to this URL right here.
[09:29.796]
And actually, we redacted the URL, but internally, when we played with it, the URL was actually
[09:36.356]
the accurate link that you can click on and then go to that and then solve the problem
[09:40.676]
for you.
[09:41.676]
Which, I think, back then at the time was a kind of mind-blowing experience to observe
[09:45.824]
firsthand. This is a network that you would think is just pattern matching,
[09:50.924]
because most of the time when I thought about AI back in the day, this is 2014,
[09:56.504]
we thought about 2014 AI as pattern matching, but this thing was able to
[10:02.744]
reason through all this chat and then figure out, point you to
[10:07.784]
a link to fix your problem, which is pretty cool.
[10:13.544]
So another important line of work that I was doing
[10:17.464]
back at the time as well is the idea that
[10:21.604]
can we do, now that these neural networks
[10:25.244]
have the potential to solve all supervised learning problems,
[10:29.144]
can you solve unsupervised learning too?
[10:32.524]
And the idea was actually very simple.
[10:34.524]
And this paper was written with Andrew Dai.
[10:39.184]
And the idea is that, I highlighted it here,
[10:41.964]
in this paper we actually show two ideas.
[10:44.344]
One idea is basically using the language model
[10:48.844]
as a pre-trained representation.
[10:52.124]
And then another idea is to do a sequence autoencoder, where you have an input sentence and you try to get the model to decode exactly that input sentence. And in this paper we actually showed that this method can be used as
[11:10.852]
a pre-trained representation and then you can fine tune it on sentiment analysis and
[11:15.152]
so on and the result is much better than the state of the art.
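A minimal sketch of that recipe (an illustrative skeleton with placeholder data, not the original LSTM setup): pre-train a small language model on unlabeled text, then reuse the same backbone and fine-tune a classification head on a labeled task such as sentiment.

```python
import torch
import torch.nn as nn

class LSTMBackbone(nn.Module):
    """Shared encoder: embeddings + LSTM, used first for language modeling, then for classification."""
    def __init__(self, vocab, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, ids):
        out, _ = self.lstm(self.emb(ids))
        return out  # (batch, seq, dim)

vocab = 5000
backbone = LSTMBackbone(vocab)

# Phase 1: pre-train as a language model (predict the next token of unlabeled text).
lm_head = nn.Linear(256, vocab)
ids = torch.randint(0, vocab, (8, 20))            # placeholder unlabeled batch
lm_logits = lm_head(backbone(ids[:, :-1]))
lm_loss = nn.functional.cross_entropy(lm_logits.reshape(-1, vocab), ids[:, 1:].reshape(-1))

# Phase 2: fine-tune the same backbone on a labeled task (e.g. sentiment).
clf_head = nn.Linear(256, 2)
labels = torch.randint(0, 2, (8,))                # placeholder sentiment labels
clf_logits = clf_head(backbone(ids)[:, -1])       # classify from the last hidden state
clf_loss = nn.functional.cross_entropy(clf_logits, labels)
print(lm_loss.item(), clf_loss.item())
```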
[11:18.472]
But unfortunately this idea actually got lost in millions of papers during that time because
[11:23.872]
during those years, everybody was using GANs.
[11:29.992]
So nobody was actually thinking about learning representations anymore,
[11:33.992]
about using generative models to learn representations.
[11:37.432]
Most people were just interested in image generation and so on.
[11:42.552]
But this idea, I think, is actually foundational to...
[11:46.272]
I would argue, foundational to many of today's
[11:50.872]
you know, pre-trained language models like ChatGPT and Gemini and so on.
[11:57.172]
Because I think after this paper, I would argue after this paper, there was a paper at OpenAI
[12:04.972]
that used very much the same idea, but they discovered a sentiment neuron
[12:12.280]
in the network.
[12:15.580]
So you could just follow the same training,
[12:17.400]
and then you can poke in one of the neurons,
[12:21.520]
and then one of the neurons in the network
[12:23.320]
actually was able to discover positive or negative sentiment
[12:28.600]
from reviews.
[12:32.260]
So after this work, I started to develop this belief
[12:37.000]
that compression will give rise to intelligence.
[12:41.260]
When you compress, once you can actually compress your training data and compress the
[12:45.520]
internet, the model starts to understand.
[12:51.060]
And so that's basically... And, you know, around 2017, some of our colleagues
[12:59.020]
at Google Brain developed the Transformer, which actually solved some of the problems in
[13:06.920]
sequence to sequence, where there's a single hidden representation that you jam all the information
[13:16.160]
into. But in parallel, I was also very interested in the idea of scaling. Can you scale these neural networks with sparsity? And, you know, I worked with Jeff, Noam, and other people, including Geoff Hinton, on the
[13:35.168]
idea of mixture of experts. So my belief was that once we can do compression,
[13:42.968]
the next thing we should do is compress the internet. And if you
[13:46.268]
want to compress the internet, you need a really huge network, and if you
[13:51.508]
need a huge network, you need to do mixture of experts. And actually, in this paper we
[13:59.948]
also discovered, like Jeff showed earlier today, the idea of a scaling law.
[14:06.008]
If you look at this figure on the right here, the x-axis is the
[14:11.148]
computational budget and the y-axis is perplexity,
[14:14.968]
which is basically a form of scaling law.
[14:17.468]
And actually, this is a paper that describes a scaling law
[14:19.928]
before scaling laws were invented.
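A toy sketch of the mixture-of-experts idea just described (top-1 routing, tiny sizes; the real sparsely-gated layers add noisy top-k gating and load balancing): a gating network decides which expert processes each token, so total parameters can grow far faster than per-token compute.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy sparsely-gated mixture-of-experts layer with top-1 routing."""
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        scores = torch.softmax(self.gate(x), dim=-1)
        best = scores.argmax(dim=-1)            # which expert each token is routed to
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                # Only the selected expert runs on each token, scaled by its gate probability.
                out[mask] = scores[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```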
[14:24.308]
And then, so once I realized that we can compress
[14:28.728]
the internet, then I would say,
[14:30.548]
maybe it actually has common sense in it,
[14:33.028]
because there's a lot of information on the internet.
[14:35.668]
So it must know common sense.
[14:38.736]
So one benchmark that I was interested in is
[14:45.216]
this benchmark called the Winograd Schema. In the Winograd Schema, the idea is that you have a
[14:54.756]
seemingly easy example like: the trophy doesn't fit in the suitcase because it is too big.
[15:02.956]
And then, what is "it"? Right, so you're supposed to resolve what "it" refers to
[15:09.916]
in that sentence. So there are two possible choices: one is the trophy and
[15:16.636]
the other is the suitcase. Right? And, you know, most of the AI back in
[15:24.436]
the day, this is about 2018 or so, most of the AI back in the day could not solve
[15:29.176]
this problem, but humans actually can easily say that "it" here refers to the
[15:37.996]
trophy. And what we did was actually very, very simple. We took the
[15:46.856]
largest language model we trained at Google, and then we made two substitutions: one is the trophy and the other is the suitcase. So you replace "it"
[16:02.584]
with the trophy and "it" with the suitcase, and then you just use the large language model to score the
[16:10.184]
probability of the sentence with the trophy and the sentence with the suitcase.
[16:16.464]
And then you would say which one has a higher probability.
[16:19.984]
Is this higher or is this higher?
[16:22.264]
And in this case, the large language model somehow actually knows, inside the model,
[16:26.684]
knows that this has higher probability than that.
[16:29.544]
So this achieved the state of the art of common sense reasoning in 2018.
[16:36.664]
This is a very simple technique, just basically using
[16:39.304]
the large language model to score the substitution.
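A minimal sketch of that scoring trick, using a small public model (GPT-2 here, purely as a stand-in for the internal Google model used in the original work): substitute each candidate for "it", score each full sentence by its total log-probability under the language model, and pick the higher one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text):
    """Total log-probability of the sentence under the language model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # The model's loss is the mean negative log-likelihood per predicted token.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

candidates = [
    "The trophy doesn't fit in the suitcase because the trophy is too big.",
    "The trophy doesn't fit in the suitcase because the suitcase is too big.",
]
best = max(candidates, key=sentence_logprob)
print(best)  # ideally the model prefers the "trophy is too big" reading
```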
[16:46.864]
So with all the knowledge that I had back in the day,
[16:50.944]
and by the way, looking back, for us it's quite funny.
[16:54.384]
We basically go back in time, we see, oh,
[16:58.004]
sequence to sequence can solve a lot of supervised
[17:01.704]
learning problem, you can talk to it,
[17:05.192]
can have a lot of common sense.
[17:08.612]
We also have mixture of experts.
[17:10.432]
So maybe we should start scaling, right?
[17:12.912]
It seems about the right time to scale,
[17:15.512]
you know, train a bigger model.
[17:16.852]
Because if we hang out with Jeff, he would say,
[17:18.992]
train a bigger model, add more data,
[17:21.372]
and then you have much better results.
[17:23.352]
And we see this trend actually throughout,
[17:25.852]
like in all the, you know, all the research that we do.
[17:28.712]
So maybe it's time to scale.
[17:30.392]
So now enter the second phase of my last 10 years: it's time to scale.
[17:39.192]
So one project that we worked on is this project called Meena, right?
[17:47.052]
And the idea was we train the language model on all the internet data, including Reddit.
[17:55.952]
And then after that, you can just talk to it, right?
[17:59.512]
And we developed this metric called SSA, sensibleness and specificity, which is basically how good the bot is during the interaction.
[18:09.512]
And what we found is that as we train a larger and larger model, the bot gets better and better in how it interacts with us. So the lower the perplexity, the better the interaction.
[18:25.600]
In other words, a better model gives us a better interaction.
[18:29.420]
And it would do these things that is actually quite incredible.
[18:32.400]
So this is one of the dialogues that we found when we played with the model.
[18:38.780]
So in this example here, the human said, I like how cute cows are.
[18:48.320]
And then Meena would say, ah, interesting.
[18:51.620]
You like how they are cute, right?
[18:54.420]
And then the human would say, yeah, they are really smart.
[18:57.360]
And then Meena would say, I heard they go to college.
[19:00.100]
And then the human said, cows go to college?
[19:02.400]
And then Meena said, I heard that a cow went to Harvard.
[19:08.780]
And then the human says, what did the cow study?
[19:11.980]
And Meena says, bovine sciences.
[19:13.420]
And then the human asks, so how about, do horses go to Harvard?
[19:16.160]
And then Meena would say, horses go to Hayvard.
[19:20.300]
So it was able to actually just, you know,
[19:22.660]
like through all these internet interaction,
[19:26.700]
was able to invent like a joke.
[19:30.160]
And you would say a joke is,
[19:31.648]
well, you would say a joke is actually quite hard to invent, but it got there somehow. And we were
[19:40.788]
looking into the Reddit data and so on, and we never found any example that looked like
[19:46.448]
this. So I think genuine creativity actually is possible even through compression,
[19:53.888]
which I found very interesting.
[19:57.728]
And then after that, we scaled it even further.
[20:02.128]
So you heard about this project called LaMDA, the famous chatbot that Google never released.
[20:12.508]
So the story behind this is that we trained a 137B dialogue model around 2020.
[20:22.108]
And we actually put up a demo internally at Google,
[20:28.748]
like a link internally at Google.
[20:31.188]
And 2020 was pandemic time.
[20:35.008]
And actually we just put up a demo and then it went viral.
[20:40.748]
People shared the link with each other and they played with it, and the server would crash at lunchtime, because at lunchtime, you know, when they were
[20:51.056]
working from home, they would eat lunch and then they would play
[20:55.856]
with the bot. I think mostly they played with the bot, not lunch. And actually, you
[21:04.556]
know, it's so amazing, if you look at many innovations in history,
[21:09.376]
like, you know, the airplane, the telephone, you always see two or three groups that actually
[21:15.376]
did the same thing, right? Actually, it happened like this as well. There's a
[21:20.236]
parallel between us and, you know, some of the development at OpenAI. And around
[21:24.796]
the same time GPT-3 was published, which is also based on the idea of scaling
[21:29.596]
language models. And during that time, actually, thanks to the release of the
[21:35.556]
bot internally, we learned a lot.
[21:38.216]
We learned about concerns about safety and hallucination
[21:41.756]
and so on.
[21:43.456]
And for that reason, we started to collect data
[21:47.016]
to fine-tune the model.
[21:49.316]
And we fine-tuned the model so that,
[21:52.736]
you know, the interaction is nicer and so on.
[21:55.636]
And we found out that just a small amount of
[21:58.104]
fine-tuning data was able to improve safety, groundedness, and quality
[22:04.704]
significantly. And for that reason, we arrived at an idea that nowadays people call
[22:13.824]
instruction fine-tuning. It's just by accident: we released this bot, a lot
[22:19.484]
of people complained about it, and then we said, okay, now let's make
[22:23.304]
fewer people complain, so let's fine-tune it. And then, for that
[22:31.764]
reason, we did this project called FLAN, which is also the discovery of
[22:40.224]
instruction tuning. So basically, we found that if you do instruction fine-tuning,
[22:47.244]
the model was able to improve significantly on a lot of our benchmarks
[22:51.424]
in terms of performance. The model starts to, you know, reply to your
[22:56.464]
instructions and so on much better than before. Now, one thing that we missed: we
[23:01.104]
didn't do it with reinforcement learning, which was a critical idea, but
[23:06.504]
actually we were almost in the ballpark of how we can get large language models to follow your instructions. Right? Because most of the time you go to the chatbot
[23:17.872]
and you say, hey, can you solve this homework for me?
[23:20.352]
And it would say, yeah, sure.
[23:21.512]
And then you ask it again and then it would say,
[23:23.212]
oh, I already sent the homework to you.
[23:25.812]
So it behaves like internet data.
[23:29.052]
It really doesn't solve your problem.
[23:31.792]
So we know that we have to fix it with fine-tuning.
[23:34.352]
And that's the beginning of what now people call it
[23:37.592]
the SFT stage of language model fine-tuning.
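A rough sketch of what that SFT stage looks like mechanically (illustrative only, not the FLAN recipe or data): format each instruction-response pair into one sequence and mask out the instruction tokens, so the next-token loss is computed only on the response the model should learn to produce.

```python
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def build_sft_example(instruction, response):
    """Concatenate instruction and response; mask the instruction so loss falls only on the response."""
    prompt = f"Instruction: {instruction}\nResponse: "
    prompt_ids = tok(prompt).input_ids
    response_ids = tok(response + tok.eos_token).input_ids
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100     # -100 is ignored by the cross-entropy loss
    return input_ids, labels

ids, labels = build_sft_example(
    "Give me three uses for a paperclip.",
    "Hold papers together, reset small electronics, act as an improvised hook.",
)
print(ids.shape, (labels != -100).sum().item(), "response tokens contribute to the loss")
```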
[23:44.472]
And then during that time, we released the chatbot
[23:47.432]
and a lot of people actually played with it
[23:49.432]
and they say, hey, by the way,
[23:50.592]
the bot can actually do program synthesis.
[23:54.172]
So, and for that reason, in around 2021,
[23:57.752]
we published this paper, but we did this project around 2020.
[24:01.652]
We have someone in the audience.
[24:06.552]
So Martin was actually involved in this project.
[24:10.352]
And he collected a dataset for MBPP, which internally
[24:15.712]
we called Martin's Basic Python Programs, or something
[24:20.792]
like that.
[24:22.112]
But externally, we call it Martin's, I'm sorry,
[24:24.560]
sorry, mostly basic Python programs.
[24:30.140]
Basically, mostly basic programs, right?
[24:32.160]
Like for example, in this example, I show you,
[24:36.320]
you know, in Python, asking
[24:39.140]
the model to write merge sort, right?
[24:42.600]
One of the examples.
[24:43.820]
And the model was able to do this amazing program.
[24:48.520]
A lot of this is just basically pattern matching,
[24:50.880]
but it just shows that the model was able to
[24:53.920]
recall precisely what the program should be.
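For reference, here is a standard merge sort in Python, the kind of program an MBPP-style prompt asks for (an illustrative solution, not the model's actual output from the talk):

```python
def merge_sort(items):
    """Sort a list by recursively splitting it and merging the sorted halves."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```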
[25:01.580]
So again, this was almost exactly the same time as the GPT Codex paper as well.
[25:11.800]
So there's a lot of parallel development between these companies.
[25:19.920]
So what I learned in the first five, six years after we started doing language models is that pre-training actually encodes a lot of knowledge. It can encode a lot of common sense,
[25:43.448]
it can encode program synthesis and so on and so on.
[25:47.208]
So that's the first lesson.
[25:48.368]
The second lesson is that instruction fine-tuning,
[25:52.648]
it was like an accidental development,
[25:56.008]
but turned out to be really useful.
[25:58.368]
It can actually increase usability.
[26:01.688]
And I didn't know that later people would call it post-training.
[26:06.368]
We really thought this was just like a tiny stage
[26:09.448]
where we can just make the model easy to use.
[26:13.288]
But by accident, we also invented post-training too.
[26:17.708]
So the question is, what's next?
[26:19.568]
Right? And then basically the next chapter,
[26:25.048]
you know, with Denny and some of his friends,
[26:28.008]
colleagues working on the reasoning paradigm.
[26:30.888]
So the story behind this is a little bit complicated.
[26:34.928]
But the first story was that I started looking
[26:38.148]
into language models and asked,
[26:40.568]
what can a language model not do?
[26:45.468]
And then we said, actually, maybe it cannot do geometry.
[26:51.016]
Geometry should be something that a language model cannot do.
[26:57.156]
Because, you know, you have to look at the figure and tokenize it. Not so
[27:02.296]
easy, not easy to tokenize. So it should be like a corner case of
[27:08.976]
something a large language model cannot do. And that's the reason we embarked on a
[27:14.416]
journey with AlphaGeometry. So we worked on AlphaGeometry. I'm gonna get
[27:19.276]
back to that in a moment. So we worked on AlphaGeometry, and, you know, that's its own
[27:24.716]
journey. So we were working on a specialized version of AI to just solve geometry. But by accident,
[27:33.116]
you know, I was also involved in some of the other development, because we trained LaMDA, and then
[27:39.836]
Denny and Jason were also very interested in the idea of reasoning. Right? And
[27:46.716]
there's this dataset called the GSM8K dataset,
[27:51.156]
basically grade school math problems and so on.
[27:53.616]
And it turns out that basically if you prompt the model
[27:59.176]
to give the answer directly, it doesn't do very well. But if you prompt the model to just think step by step first, right, give it an example so that it can think step by step, and
[28:10.004]
then after that give an answer, it turns out that it gives a much better result. So
[28:14.504]
this idea is now what people call chain-of-thought reasoning.
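A minimal sketch of what such a chain-of-thought prompt looks like (a made-up few-shot example in the spirit of GSM8K-style problems, not the exact prompt from the paper): the worked example reasons step by step before its final answer, and the model is expected to imitate that format on the new question.

```python
# Few-shot chain-of-thought prompt: one worked example with explicit steps, then the new question.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A baker makes 3 trays of 12 muffins and sells 20 of them. How many muffins are left?
A:"""

# The model is expected to continue with its own step-by-step reasoning, e.g.
# "3 trays of 12 muffins is 36 muffins. 36 - 20 = 16. The answer is 16."
print(cot_prompt)
```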
[28:24.004]
We also discovered this idea called self-consistency, but I prefer to think of it as the idea of parallel
[28:31.924]
thoughts.
[28:32.924]
The idea is that if you sample the model many, many times and then you aggregate the answers somehow,
[28:41.564]
then the model will be able to give you the answer with majority voting. And, you know,
[28:49.184]
the correct name is self-consistency,
[28:51.044]
and the model would be able to do much better on
[28:57.084]
these math problems.
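A minimal sketch of self-consistency, assuming you already have some way to draw one chain-of-thought sample and extract its final answer (simulated below by a placeholder `sample_answer` function): sample many times in parallel and take the majority vote over the final answers.

```python
import random
from collections import Counter

def sample_answer(question, rng):
    """Stand-in for one stochastic chain-of-thought sample from a language model.
    Here it's simulated: the 'model' gets the right answer 60% of the time."""
    return 16 if rng.random() < 0.6 else rng.choice([14, 15, 18])

def self_consistency(question, num_samples=20, seed=0):
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(num_samples)]
    # Aggregate the sampled final answers by majority vote.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("A baker makes 3 trays of 12 muffins and sells 20. How many are left?"))
```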
[29:00.184]
And then, recently, in around December last year,
[29:04.944]
we actually explored the idea of inference scaling law.
[29:13.464]
And the idea is very simple.
[29:14.924]
You know, in software engineering,
[29:16.304]
like, let's say you take SWE-bench,
[29:17.472]
you sample repeatedly, many, many times.
[29:21.792]
And let's say you have a judge, like an oracle, to pick the best answer for you.
[29:27.432]
So the more you sample, actually, you get better and better results.
[29:36.252]
So just basically using a very weak baseline, it was able to surpass all the state of the
[29:43.892]
art by this repeated sampling.
[29:45.852]
So there's something going on.
[29:47.012]
So these models actually turn out to really like being sampled many, many times.
[29:56.212]
And when you plot it out on a log scale, it is actually a straight line, which is very
[30:06.252]
astonishing.
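A rough sketch of that repeated-sampling behaviour, under the toy assumption that each independent attempt succeeds with some small probability and that a verifier (the oracle or judge) can recognize a correct attempt: coverage grows with the number of samples roughly as 1 - (1 - p)^k, which is the kind of smooth inference-scaling curve described above.

```python
import random

def attempt(rng, p_success=0.05):
    """Stand-in for one sampled solution attempt; succeeds with probability p_success."""
    return rng.random() < p_success

def solved_with_k_samples(k, trials=2000, seed=0):
    """Fraction of problems solved when an oracle verifier picks any successful attempt out of k."""
    rng = random.Random(seed)
    solved = sum(any(attempt(rng) for _ in range(k)) for _ in range(trials))
    return solved / trials

for k in (1, 4, 16, 64, 256):
    print(f"k={k:4d}  coverage={solved_with_k_samples(k):.3f}")
# In this toy model coverage follows 1 - (1 - p)^k, climbing steadily as samples increase.
```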
[30:07.572]
So thanks to this development of chain of thought and then parallel thoughts, we built
[30:12.572]
this thing called DeepThink.
[30:15.552]
And Deep Think is a reasoning model that combines reinforcement learning, long thoughts,
[30:21.372]
and also parallel thoughts. This year we participated in the IMO 2025, and just using informal math, we were able to solve five of the six problems.
[30:40.360]
Most people didn't know this but actually
[30:44.560]
last year, in 2024, we also used Gemini
[30:50.420]
to participate,
[30:52.540]
and it was not mentioned
[30:54.540]
in here, but we used a natural language version too,
[30:59.680]
and it was able to solve half of a problem.
[31:03.120]
Half a problem in 2024, but five problems in 2025.
[31:11.660]
So that's an astonishing development, right?
[31:14.200]
And behind it is a lot of reinforcement learning
[31:17.360]
and the discovery of these inference scaling laws.
[31:21.560]
And then also I want to connect the dots a little bit about AlphaGeometry.
[31:27.860]
AlphaGeometry, we started working on AlphaGeometry about six years ago, to basically develop an AI to solve geometry.
[31:37.860]
And we just believed that geometry is very hard to be solved by language
[31:43.928]
models. Now, curiously, this year, with the geometry problems,
[31:51.488]
there was actually only one geometry problem. The IMO committee now knows that
[32:00.048]
we can solve geometry so well that they went from two problems per year, and
[32:04.628]
now they only have one. But anyway, actually, now they're giving us more
[32:11.168]
combinatorics problems because they know that we suck at combinatorics.
[32:16.168]
So let's see. But anyway, the thing that I find really intriguing is the
[32:25.388]
geometry problem was actually solved by the LLM just fine. And the solution is
[32:34.088]
very beautiful. Did you see? Look at that. It looks like a journal.
[32:41.168]
You can see it in this link, but it looks like a journal.
[32:46.768]
But, you know, the thing that I find really fascinating is that all this, like,
[32:52.588]
line drawing and so on, it just did by thinking internally. This language model was able to think internally, because when
[33:02.816]
you solve this IMO problem, right, you're supposed to draw some lines,
[33:06.336]
some dots and so on, and actually the model was able to do it
[33:10.256]
without drawing anything. So internally it actually thinks of all
[33:14.736]
these things and then tries to solve it.
[33:18.336]
So earlier I said compression gave rise to intelligence.
[33:27.336]
Now I change my mind and I say compression together with reasoning will give rise to intelligence.
[33:35.336]
That's a question mark.
[33:37.336]
Anyway, so that's the journey. I shared with you the personal journey through these, you
[33:46.756]
know, three phases.
[33:48.876]
First, the discovery of pre-training.
[33:52.576]
And then the second phase, around 2018 to 2021, is scaling up.
[33:59.736]
Let's see what everybody can do.
[34:01.876]
And then discovering instruction fine-tuning by accident, and also program synthesis.
[34:10.384]
And then since 2021, working on the reasoning paradigm, and then the discovery of the inference scaling laws.
[34:18.384]
I want to say what comes next.
[34:23.384]
So, when I give a lot of talks, people say, did you get a lot of bitter lessons?
[34:30.384]
I say, yeah, I got a lot of bitter lessons too.
[34:33.384]
But I say there's something more; I say nature is giving us some sweet lessons too.
[34:41.164]
And so at NeurIPS we presented this thing.
[34:44.584]
So basically the argument is the following.
[34:49.744]
Whales and elephants have bigger brains than us.
[34:53.384]
But we certainly can do math better than them.
[34:57.784]
So there's something going on about our intelligence.
[35:00.464]
It's not just about scale.
[35:01.444]
And I think there's room for a better algorithm. Apparently in nature,
[35:10.944]
and this is one interesting piece of evidence, apparently
[35:18.244]
nature actually found an algorithm for us that is superior, that has a better exponent. Actually, that's kind of interesting. It has a better exponent. So what is an algorithm that has a better exponent?
[35:32.952]
It means that, you know, maybe at small scale it doesn't work as well, but at
[35:36.112]
large scale it works so much better.
[35:37.932]
It's so amazing.
[35:39.612]
So I think in the next few years, if we need a better
[35:45.352]
exponent, throwing in a lot of compute might not necessarily solve the problem. So I think
[35:52.072]
there could be a change in how we approach the architecture, the training
[35:57.232]
algorithms and so on in the coming decade. I want to give you another
[36:04.132]
example. So around 2015 I started two projects. One is 626 and the
[36:11.032]
other is neural architecture search. A lot of people don't talk about our
[36:14.692]
neural architecture search these days anymore.
[36:17.332]
But I think the idea of,
[36:22.552]
I still believe in the idea of AI building AI.
[36:26.412]
At some point, you want AI to actually rewrite the code
[36:31.652]
for AI.
[36:32.672]
So one thing that we did many years ago
[36:34.932]
is to search for a better architecture for Transformer.
[36:36.840]
and we were able to find one. Actually, it's just a
[36:43.140]
better intercept, it's not a better slope, right? It's the same slope, but just
[36:49.500]
better. And it discovered a new form of using a convolution inside a transformer,
[36:56.160]
and it uses a squared activation. Actually, nowadays some of the groups in the world
[37:02.200]
have already started using squared activations for large-scale LLMs, which I find very interesting.
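A tiny sketch of that activation, assuming "squared activation" here means a squared ReLU inside the transformer feed-forward block:

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    """Transformer feed-forward block with a squared-ReLU activation instead of plain ReLU/GELU."""
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        h = torch.relu(self.up(x))
        return self.down(h * h)   # square the ReLU output element-wise

ffn = SquaredReLUFFN()
print(ffn(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```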
[37:11.000]
But hopefully in the next decade or so we will find a better exponent; that would be truly
[37:15.640]
amazing. With that, I think I can welcome questions and thank you for your attention.
[37:32.200]
Thank you.
[37:33.200]
Oh, where's the mic?
[37:34.200]
Thank you, amazing talk.
[37:35.200]
I'm Isakiy, I'm at the US, I work on mechanized proofs.
[37:40.200]
I have a question about mathematics Olympiads 24 and 25.
[37:45.200]
Why did you go from Lean to informal reasoning in '25? Did it make your problem harder or simpler? So in 2025, we played with the model. Obviously, we played with the model right before all
[38:02.428]
these competitions, you know, months before that. And we know that we have an internal benchmark.
[38:09.348]
Back then, we didn't think we had any shot at all.
[38:13.488]
We knew that we had AlphaProof and AlphaGeometry,
[38:16.528]
and we translated the problems into Lean, we solved them,
[38:19.588]
and we had a much better success rate.
[38:22.908]
So we had internal data, so we knew that the LLM was not ready.
[38:29.708]
But in the last year or so, with the rapid development
[38:34.648]
of fine-tuning, reinforcement learning and so on,
[38:38.028]
we used the same benchmark and then we saw rapid improvement, to the point that we felt it was ready to go.
[38:46.428]
Now, it turns out that even one week before, we still thought that we were between silver and gold, I think.
[38:52.628]
But we got a bit lucky, so we got gold.
[38:58.228]
Thank you for the talk. My question would be, what's your opinion on visual CoT?
[39:03.296]
So basically, you just mentioned that when you guys were working on the IMO problems, you realized that
[39:10.296]
even when you're solving the geometry questions, you don't actually need to draw the diagrams.
[39:15.296]
I'm curious what your opinion is, because when people think about problems, especially, for example, spatial problems,
[39:23.296]
we can't really describe them in words, and we probably have something like a mental image to describe the reasoning process.
[39:31.296]
Yeah. Do you think this would be a promising direction?
[39:35.296]
I think so. So the two IMO problems that we couldn't solve last year and this year
[39:41.296]
were actually spatial reasoning. So you're supposed to draw some diagrams.
[39:46.296]
One problem is about the monster travelling in a grid, right?
[39:50.296]
And then you're supposed to move the monster in a grid.
[39:53.296]
And basically a lot of this is basically counting,
[39:56.296]
counting the number of squares in the grid and so on.
[40:00.296]
And this model seems to have a hard time.
[40:03.616]
I don't know what it is about this problem that is so hard for LRMs.
[40:07.576]
It could be, like you said, some spatial reasoning techniques, some specialties, some training data, maybe changing some architecture, I don't know, right? But yeah, it's a very interesting area. Hi, thank you for the meaningful talk, I'm Frank. Okay, so I have basically two
[40:28.404]
questions. At the beginning of the year, a company called magic.dev came to me,
[40:34.344]
and I think the former Google CEO Eric also invested in this company. They
[40:39.684]
are doing AI coding, but they are focused on repository-level code generation.
[40:48.244]
I think that is a very challenging task. Especially, I think a lot of us are users of Cursor, Claude
[40:54.584]
Code, also Gemini Code.
[40:56.964]
So all these LLMs perform quite badly, because I was also working on some of the code
[41:05.124]
generation projects, and I think the performance does not match what, I mean, a normal, typical
[41:11.524]
engineer would expect if they want to produce repository-level code completely using AI.
[41:17.364]
So I'm just wondering what's your take on this, like what is the big challenge if we
[41:21.964]
really want to have an AI which can write lots of code to make it like a full repository
[41:29.752]
and work at the efficiency, at the scale, that people expect it can.
[41:37.752]
And the second question is that...
[41:39.752]
Could we do one question per person?
[41:41.752]
Ok, yeah, sure, sure.
[41:43.752]
So my answer is that I think for a lot of the problems that we find it hard to make progress on, the issue is verification.