[00:00.000] | |
So every time I talk about reasoning, there are actually a lot of debates about whether LLMs can reason or not. | |
[00:11.000] | |
I'm interested to see how many people believe LLMs can reason. If you believe, please raise your hand. | |
[00:20.000] | |
Oh, awesome, wow. | |
[00:23.000] | |
Actually, that's a hard question to answer. | |
[00:29.000] | |
That really depends on the definition of reasoning. | |
[00:32.000] | |
To me, actually, I have a very simple definition of reasoning. | |
[00:41.000] | |
It just means intermediate tokens between input and output. | |
[00:46.000] | |
That's it. | |
[00:48.000] | |
And so actually from the literature in 2017, | |
[00:54.240] | |
DeepMind published a paper; actually they showed how to use intermediate tokens to solve math problems. | |
[00:59.840] | |
I think that paper was overshadowed by AlphaGo and AlphaZero. | |
[01:03.280] | |
That's an amazing groundbreaking paper. | |
[01:05.680] | |
And in the neuro-symbolic literature, people have actually used intermediate tokens to solve problems for a long time, but those intermediate tokens are just programs in formal languages. | |
[01:18.728] | |
So here's a simple example about using intermediate tokens. | |
[01:22.728] | |
When I set up the reasoning team in Google Brain, I constructed this toy problem, | |
[01:27.728] | |
and I wanted to see how to solve this problem. | |
[01:31.728] | |
So given two words, we need to output the concatenation of their last letters. | |
[01:39.728] | |
If there is no reasoning, the model just outputs the final answer directly. | |
[01:43.728] | |
If there is reasoning, that means we need intermediate steps. | |
[02:01.728] | |
Okay, so those intermediate steps are called reasoning steps. | |
[02:07.528] | |
You may wonder why I constructed such a problem | |
[02:10.028] | |
because actually I worked on neuro-symbolic problems. | |
[02:14.148] | |
So it is a typical problem about string manipulation. | |
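To make the toy task concrete, here is a minimal sketch in Python; the reasoning wording is an illustration, not taken from the talk.

```python
# Minimal sketch of the last-letter concatenation toy task described above.
# The reasoning text is illustrative; the talk only specifies the task itself.

def last_letter_concat(words: list[str]) -> str:
    """Ground-truth answer: concatenate the last letter of each word."""
    return "".join(w[-1] for w in words)

def reasoning_trace(words: list[str]) -> str:
    """The kind of intermediate tokens a model emits when it 'reasons'."""
    steps = [f'The last letter of "{w}" is "{w[-1]}".' for w in words]
    answer = last_letter_concat(words)
    return " ".join(steps) + f' Concatenating them gives "{answer}".'

print(last_letter_concat(["artificial", "intelligence"]))  # direct answer: "le"
print(reasoning_trace(["artificial", "intelligence"]))     # answer preceded by intermediate steps
```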
[02:23.348] | |
So why do intermediate tokens matter? | |
[02:26.456] | |
So we did this work in 2023 and published in 2024. | |
[02:32.856] | |
And for any problem solvable by a Boolean circuit of size T, | |
[02:38.056] | |
where the size is the number of logic gates, | |
[02:42.456] | |
a constant-size transformer can solve it by generating O(T) intermediate tokens. | |
[02:48.456] | |
That is the same order of magnitude as T, | |
[02:53.056] | |
not exponential in T. | |
[02:55.056] | |
That's very important. | |
[02:57.056] | |
If we directly generate the final answer, | |
[02:59.056] | |
it either requires a huge depth | |
[03:01.056] | |
or cannot solve it at all. | |
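Stated informally, the claim above can be written as follows (my paraphrase; see the original 2024 paper for the precise statement and assumptions):

```latex
% Informal paraphrase of the claim described above, not the exact theorem statement.
\textbf{Claim.} If a function $f$ is computable by a Boolean circuit of size $T$
(i.e.\ with $T$ logic gates), then a transformer of constant size (independent of $T$)
can compute $f$ by first emitting $O(T)$ intermediate tokens and then the answer.
A transformer that must emit the answer directly, with no intermediate tokens,
needs depth that grows with $T$, or cannot compute $f$ at all.
```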
[03:07.056] | |
So does anyone have a question | |
[03:09.056] | |
about this result here? | |
[03:11.056] | |
It's a mathematical theorem. | |
[03:19.056] | |
These days, | |
[03:21.056] | |
I heard many people talk about, okay, only data and scaling matters. | |
[03:25.776] | |
Actually, the model architecture doesn't matter. | |
[03:31.496] | |
Does anyone agree with that? | |
[03:33.936] | |
So if you agree with that, actually I will let you know something surprising: these results only hold for transformer models, not for RNNs. | |
[03:45.704] | |
That means if you use RNNs, all my results here | |
[03:48.984] | |
will not be applicable anymore. | |
[03:53.284] | |
So when I joined Google in 2017, actually when I was | |
[03:58.664] | |
interviewed by Samy Bengio, who told me, | |
[04:02.184] | |
did you see the paper today? | |
[04:03.704] | |
called Attention Is All You Need. I was very impressed by the title. | |
[04:09.704] | |
That's so amazing. And attention is indeed all we need here to prove these results. | |
[04:19.704] | |
And there's actually a common belief that pre-trained LLMs cannot reason without further prompt engineering or fine-tuning. | |
[04:26.704] | |
I don't know if anyone agrees with that. | |
[04:28.704] | |
I know these days you have probably heard a lot about RL fine- | |
[04:31.704] | |
tuning or something, or chain-of-thought prompting. | |
[04:35.704] | |
Actually, this belief is wrong. | |
[04:37.704] | |
And pre-trained models are actually ready for reasoning. | |
[04:41.704] | |
Yeah. | |
[04:45.704] | |
The message here is that pre-trained models | |
[04:48.704] | |
are ready for reasoning. | |
[04:49.704] | |
So why couldn't we see reasoning from pre-trained models? | |
[04:52.912] | |
The trouble is on the decoding part. | |
[04:55.912] | |
Let's see a simple example here. | |
[04:58.912] | |
I have been using this example for about three or four years. | |
[05:08.912] | |
That's a problem we can let language models solve. | |
[05:12.912] | |
It's a simple grade-school math problem. | |
[05:16.912] | |
I have three apples. My dad has two more apples than me. | |
[05:19.912] | |
how many apples do we have? | |
[05:24.152] | |
If you just use greedy decoding, | |
[05:25.372] | |
the first answer will be five apples. | |
[05:29.032] | |
The interesting thing is that, okay, | |
[05:30.152] | |
instead of just looking at the first token, | |
[05:31.952] | |
we could look at the second candidate from the token space | |
[05:36.972] | |
and it starts with 'I'. | |
[05:39.232] | |
That would be interesting, right? | |
[05:40.312] | |
If the model's response starts with 'I', | |
[05:42.632] | |
what would be next? | |
[05:43.792] | |
You'd be curious, right? | |
[05:46.092] | |
If it starts with 'I', that would be interesting. | |
[05:47.812] | |
Okay, I have three apples, | |
[05:49.052] | |
my dad has two more apples than me, so he has five apples. | |
[05:52.672] | |
And three plus five, we have eight. | |
[05:55.292] | |
That's surprising, right? | |
[05:56.572] | |
That really depends on how the model generates the first token. | |
[06:00.172] | |
Okay, let's see more. The model can also start from 'We': we get 'We have eight apples in total.' And it can start from 'You'; let's see what happens there. | |
[06:11.640] | |
'You have three apples and your dad has two more apples than you.' | |
[06:14.640] | |
Again, correct answer. | |
[06:16.640] | |
For another candidate, the model will generate 'The answer is five,' and it's wrong. | |
[06:21.640] | |
Okay. Do you see that? | |
[06:23.640] | |
It really depends on the first token generated. | |
[06:26.640] | |
There isn't any fine-tuning here. | |
[06:28.640] | |
If you have a chance to get access to a pre-trained language model, try it and see what happens. | |
[06:35.640] | |
We don't need any fine-tuning, no prompting here. | |
[06:38.640] | |
And you can see that, okay, if the output starts with 'I', it is followed by chain-of-thought reasoning, | |
[06:45.640] | |
or if it starts with 'You', it also works. | |
[06:47.640] | |
Now here's an interesting problem: okay, if the model really, automatically, has chain | |
[06:56.560] | |
of thought reasoning in the output space, how do we select the best response, right, | |
[07:01.960] | |
by length, that's a good choice, right? | |
[07:05.680] | |
So because if the model has chain-of-thought reasoning, the response will be longer. | |
[07:09.680] | |
It works, but actually we have an even better choice: | |
[07:15.980] | |
by the model's confidence on the answer. | |
[07:19.368] | |
So if a model uses chain of thought, it will have much larger confidence on the final answer. | |
[07:27.048] | |
For example, since there is a huge vocabulary space, you can imagine that each token has a very small probability. | |
[07:33.308] | |
However, if the model has a chain of thought reasoning before the final answer, the confidence here can be up to 99%. | |
[07:40.588] | |
That's really surprising. | |
[07:43.648] | |
So we call this approach chain-of-thought decoding, not chain-of-thought prompting. | |
[07:49.408] | |
Because this works for pre-trained models only; there's no prompt engineering, there are no examples or anything. | |
[07:56.288] | |
So just go beyond greedy decoding by checking more generation candidates, | |
[08:01.248] | |
and then choose the candidate which has the highest confidence on the final answer. | |
[08:06.928] | |
That's how to get reasoning. | |
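A minimal sketch of this chain-of-thought decoding procedure, assuming a Hugging Face causal LM; the model choice, prompt, and confidence heuristic are illustrative assumptions, not the exact recipe from the paper.

```python
# Sketch of chain-of-thought decoding as described above: branch on the top-k
# candidates for the first token, continue each branch greedily, then keep the
# branch whose final answer the model is most confident about.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # any pre-trained causal LM (placeholder)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Q: I have 3 apples. My dad has 2 more apples than me. How many apples do we have?\nA:"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

with torch.no_grad():
    first_logits = model(**inputs).logits[0, -1]     # distribution over the first new token
top_first = torch.topk(first_logits, k=5).indices    # greedy token plus four alternatives

for tid in top_first:
    branch = torch.cat([inputs["input_ids"], tid.view(1, 1)], dim=-1)
    out = model.generate(branch, do_sample=False, max_new_tokens=60,
                         pad_token_id=tok.eos_token_id)
    continuation = tok.decode(out[0][prompt_len:], skip_special_tokens=True)
    print(repr(continuation))
    # In the full method, one would locate the final-answer tokens in each
    # continuation, score the branch by the model's probability margin on those
    # tokens, and return the most confident branch.
```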
[08:08.048] | |
Okay, now you can see that, okay, if the model already has reasoning paths in the final output | |
[08:15.808] | |
space, what should we do? | |
[08:18.548] | |
So can we reshape the model's output distribution so that the sought-for response ranks as the | |
[08:23.808] | |
first one? | |
[08:24.808] | |
If the chain-of-thought response is ranked as the first one, then we can find it by greedy decoding, right? Because by default, when we use large language models, the output comes from greedy decoding. | |
[08:40.656] | |
So now we can look at chain-of-thought prompting, | |
[08:42.616] | |
which we published at NeurIPS 2022, | |
[08:45.676] | |
and Jason and the co-authors discussed this paper | |
[08:48.336] | |
in the talk. | |
[08:49.816] | |
So basically we just need to put one example in the beginning | |
[08:53.616] | |
and then ask the question we want to ask. | |
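For concreteness, a chain-of-thought prompt looks roughly like this; the wording is an illustrative sketch, not the exact prompt from the paper.

```python
# A minimal sketch of a chain-of-thought prompt: one worked example with
# intermediate steps, followed by the question we actually want answered.
cot_prompt = """\
Q: I have 3 apples. My dad has 2 more apples than me. How many apples do we have?
A: I have 3 apples. My dad has 3 + 2 = 5 apples. Together we have 3 + 5 = 8 apples. The answer is 8.

Q: There are 15 trees in the grove. Workers plant trees until there are 21 trees. How many trees did they plant?
A:"""
# Feeding cot_prompt to a pre-trained LLM makes a step-by-step answer far more
# likely, because the in-context example shifts probability mass toward
# responses that contain reasoning before the final answer.
```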
[08:56.776] | |
Why it works? | |
[08:58.016] | |
Now you should see that, why it works. | |
[09:00.836] | |
Because, you know, these models, | |
[09:03.296] | |
actually they're just probabilistic models, | |
[09:05.636] | |
they are machine learning models, they are not humans. | |
[09:08.136] | |
We put those prompts at the very beginning, | |
[09:10.416] | |
then we change the probability of the output. | |
[09:14.236] | |
Now, changing the prompt can lead to a larger probability | |
[09:21.176] | |
on the response with reasoning paths. | |
[09:24.516] | |
And actually the most striking thing is another paper in the same year, called 'Let's | |
[09:30.516] | |
Think Step by Step'. | |
[09:31.516] | |
So they don't use any examples. | |
[09:33.516] | |
They just follow the question with 'Let's think step by step'. | |
[09:37.516] | |
This phrase can change the output distribution and make the chain-of-thought reasoning pop | |
[09:43.516] | |
up to the top, | |
[09:45.824] | |
and then it can be found by greedy decoding. So prompting was quite popular | |
[09:55.484] | |
for a while, but there are pros and cons of prompting. The pros: the approach is | |
[10:01.204] | |
simple and it works. The cons: few-shot prompting is task-specific and hard to scale | |
[10:08.444] | |
across tasks, and 'let's think step by step' is generic but performs much worse. If you try, | |
[10:14.424] | |
you'll see it. | |
[10:18.424] | |
And more interestingly, you know, | |
[10:21.424] | |
the prompting approach is actually quite weird, especially few-shot prompting. | |
[10:26.424] | |
For example, if I want to ask someone a question, | |
[10:31.424] | |
I would first say, okay, yeah, here are similar problems and solutions, | |
[10:34.424] | |
and then ask the question. That's kind of weird, right? | |
[10:36.424] | |
If you ask someone a question, just ask that. | |
[10:39.424] | |
You wouldn't first show similar examples, | |
[10:42.424] | |
or say 'let's think step by step' first. | |
[10:51.504] | |
So we have to fix that problem, because I don't want to first show similar examples before asking a question, right? How to fix it? A widely used approach is supervised fine-tuning (SFT). Step one: we just collect a set of problems and their step-by-step solutions from human annotators. | |
[11:10.152] | |
And then step two: maximize the likelihood of the human solutions. | |
[11:14.252] | |
And then just apply the model everywhere. | |
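A minimal sketch of this supervised fine-tuning step, assuming a Hugging Face causal LM; the model, the data, and the masking of prompt tokens are illustrative assumptions.

```python
# Sketch of the SFT objective described above: next-token cross-entropy on a
# human-written (problem, step-by-step solution) pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

problem = "Q: Take the last letter of each word in 'artificial intelligence'.\nA:"
human_solution = (" The last letter of 'artificial' is 'l'. "
                  "The last letter of 'intelligence' is 'e'. The answer is 'le'.")

ids = tok(problem + human_solution, return_tensors="pt")["input_ids"]
labels = ids.clone()
prompt_tokens = tok(problem, return_tensors="pt")["input_ids"].shape[1]
labels[:, :prompt_tokens] = -100   # score only the solution tokens (approximate boundary)

loss = model(ids, labels=labels).loss   # step 2: maximize likelihood = minimize cross-entropy
loss.backward()
optimizer.step()
```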
[11:17.032] | |
So that approach was quite popular for a while, and we can see it in the paper | |
[11:22.092] | |
from 2017, and also the paper from OpenAI on how they used the model to solve math word problems. | |
[11:29.092] | |
And also the scratchpad paper, a similar work: scratchpads for intermediate computation with language models. | |
[11:40.092] | |
If the reasoning problem could be solved in this way, I wouldn't need to talk about it anymore, right? | |
[11:46.092] | |
Because it's so straightforward. You just collect a lot of problems and answers and do fine-tuning. | |
[11:51.092] | |
That's it. | |
[11:52.232] | |
That's called supervised fine-tuning. | |
[11:56.812] | |
And let me show you how supervised fine-tuning really works | |
[11:59.572] | |
if you didn't do that before. | |
[12:01.172] | |
You know, just show some examples here. | |
[12:03.352] | |
The first problem is, for example, | |
[12:04.872] | |
you can collect examples with the desired output | |
[12:06.772] | |
when concatenating the last letter of each word | |
[12:09.512] | |
in artificial intelligence, okay? | |
[12:11.172] | |
We show step-by-step solutions. | |
[12:12.280] | |
And then you follow with another example, say one person has three apples and another has two more, and again, | |
[12:18.980] | |
you can give a demonstration of how to solve this problem. | |
[12:22.940] | |
And then use the examples as training data to fine-tune your LLM. | |
[12:29.040] | |
Any pre-trained LLM will be fine here. | |
[12:31.180] | |
And finally, you can use a test problem. | |
[12:33.620] | |
How many r's are in 'strawberry'? | |
[12:35.620] | |
Well, I chose this problem here because so many people on social media see this problem | |
[12:41.320] | |
as a test to see if AGI has come yet or not. | |
[12:51.640] | |
So SFT actually is quite generic. | |
[12:53.920] | |
Once you have a fine-tuned model, | |
[12:55.180] | |
you just apply it everywhere. | |
[12:57.160] | |
The downside, actually, is that it doesn't generalize well. | |
[13:02.640] | |
So if you look at the literature, | |
[13:04.760] | |
so many people tried this; | |
[13:07.160] | |
people all talk about data and scaling, okay. | |
[13:09.780] | |
And they exactly use the same method here. | |
[13:13.020] | |
You know, if they found that SFT doesn't work well, | |
[13:15.880] | |
oh yeah, we need to collect more high-quality SFT data | |
[13:19.460] | |
and then do the fine-tuning. Actually, that doesn't help much. So the lesson here: don't scale blindly. | |
[13:32.508] | |
If the methodology is wrong, no matter how you scale, | |
[13:35.008] | |
it won't work. | |
[13:41.908] | |
So now you know SFT doesn't work well for reasoning; | |
[13:47.068] | |
that's a good thing to know. | |
[13:49.248] | |
And don't waste your time here. | |
[13:54.148] | |
How to fix it? | |
[13:56.528] | |
How to fix it? | |
[13:58.208] | |
Let's review the process of SFT, okay? | |
[14:02.908] | |
Step one, we collect a set of problems | |
[14:05.868] | |
and their step-by-step solutions from human annotators. | |
[14:09.968] | |
Okay, step two, and maximize the likelihood | |
[14:12.368] | |
of human solutions. | |
[14:14.588] | |
That's just maximum likelihood; | |
[14:17.968] | |
actually this is a normal procedure in machine learning. | |
[14:20.068] | |
And here, it's just next-token prediction. | |
[14:22.808] | |
Okay. | |
[14:25.428] | |
What's wrong here? | |
[14:27.668] | |
It's from human annotators. | |
[14:30.648] | |
It's strange, right? | |
[14:31.688] | |
People say, yeah, we need high quality data from experts. | |
[14:35.428] | |
It's wrong here. | |
[14:37.408] | |
We shouldn't rely on human annotators. | |
[14:38.736] | |
So in 2022, there was actually a very nice paper by Stanford and the Google Brain team, and they | |
[14:54.216] | |
called it STaR: Bootstrapping Reasoning With Reasoning. | |
[14:57.716] | |
So instead of using data from human annotators, we collect data from the model, which is | |
[15:07.096] | |
actually just a slight change: | |
[15:10.456] | |
collect a set of problems and the step-by-step | |
[15:13.016] | |
solutions generated from the model, | |
[15:16.336] | |
and then maximize the likelihood of correct solutions. | |
[15:19.296] | |
That's it. | |
[15:21.416] | |
And there's also a similar paper in that year | |
[15:23.576] | |
called 'Large Language Models Can Self-Improve', | |
[15:26.416] | |
and by people from my team. | |
[15:31.736] | |
That's a change, okay. | |
[15:32.976] | |
Change the data generator from humans to the model. | |
[15:36.136] | |
It's weird, right? | |
[15:37.296] | |
You may say: | |
[15:38.436] | |
we need to train the model to reason, | |
[15:40.676] | |
so why are we collecting data from the model? | |
[15:44.496] | |
That's a tricky part. | |
[15:48.896] | |
Okay, data from the model, of course, sounds weird, right? That's fine, we just repeat this process. You just repeat that again: after we | |
[16:00.384] | |
maximize the likelihood, | |
[16:02.044] | |
the model's weights are changed and updated, | |
[16:04.364] | |
and then you can repeat the step one and step two again. | |
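A schematic sketch of this iterate-and-filter loop; the sampling, verification, and fine-tuning calls are stand-ins, not real implementations.

```python
# STaR-style loop as described above: sample solutions from the current model,
# keep only the ones a verifier marks correct, fine-tune on those, and repeat.
import random

def sample_solutions(model, problem, n=4):
    # Stand-in for sampling n candidate solutions from the current model.
    return [f"reasoning... so the answer is {random.choice(['6', '8'])}" for _ in range(n)]

def is_correct(solution, gold_answer):
    # Stand-in verifier: for math-style tasks, just check the final answer.
    return solution.strip().endswith(gold_answer)

def finetune(model, examples):
    # Stand-in for maximizing the likelihood of the correct solutions
    # (the same next-token objective as SFT, but on model-generated data).
    return model

model = "pretrained-llm"  # placeholder handle
dataset = [("I have 3 apples, my dad has 2 more apples than me. Total?", "8")]

for _round in range(3):                      # repeat step 1 and step 2
    correct = []
    for problem, gold in dataset:            # step 1: generate solutions from the model
        for sol in sample_solutions(model, problem):
            if is_correct(sol, gold):        # keep only verified-correct solutions
                correct.append((problem, sol))
    model = finetune(model, correct)         # step 2: maximize likelihood of correct solutions
```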
[16:08.104] | |
So that paper, so I put a paper here, | |
[16:12.024] | |
I know people will say, okay, | |
[16:13.364] | |
you should put a DeepSeek paper there. | |
[16:16.204] | |
Actually, before DeepSeek there were other papers. | |
[16:18.664] | |
This paper, published in January 2024, is called ReFT: | |
[16:24.364] | |
Reasoning with Reinforced Fine-Tuning. | |
[16:28.424] | |
The terms sound similar, right? | |
[16:31.004] | |
And the paper was put on arXiv and no one cared. | |
[16:36.404] | |
So actually academic people can do good work | |
[16:39.224] | |
and people just didn't find it. | |
[16:46.204] | |
And you could wonder why it works, right? Why this is important? | |
[16:56.204] | |
Actually, you can google the Alpha-20 techniques, it's covered by two engineers on my team, Jonathan Lai and Jim Tsang. | |
[17:05.192] | |
So if you look at the RL fine-tuning procedure, actually a critical part is the verifier. | |
[17:12.192] | |
How do you know the solution is correct? | |
[17:15.192] | |
As Rich Sutton said: verification, the key to AI. | |
[17:21.192] | |
So a reliable verifier is the most crucial part of RL fine-tuning, not the RL algorithms. | |
[17:27.192] | |
I know actually when the RL fine-tuning term was popular on social media, many people | |
[17:35.472] | |
got hyped: oh yeah, we should try more advanced RL algorithms. | |
[17:39.372] | |
Actually, I didn't think it was related to RL algorithms that much. | |
[17:44.392] | |
The essential part is about, about what? | |
[17:48.052] | |
Anyone can say that? | |
[17:49.052] | |
The essential part here is about where your solutions are generated. | |
[17:54.392] | |
They're not from human annotators; they're from the model. | |
[17:59.292] | |
That's the question, that's the part, yes. | |
[18:02.752] | |
Yeah, you see, why are they generated from the model | |
[18:06.052] | |
instead of from humans, right? | |
[18:09.452] | |
I believe Scale AI doesn't want me to say that. | |
[18:14.392] | |
They want the data to come from humans, not from the model. Oh yeah, I think that's related to the first principle in machine learning. Actually, I started my career in machine learning. | |
[18:28.360] | |
The first principle in machine learning | |
[18:29.700] | |
actually is very simple. | |
[18:31.600] | |
If you want to get a performance, | |
[18:33.540] | |
directly optimize what you want. | |
[18:38.180] | |
If you care about classification accuracy, | |
[18:40.900] | |
optimize it, okay? | |
[18:43.920] | |
If you care about some other metric, optimize that directly. | |
[18:49.680] | |
Do it, yeah. | |
[18:50.840] | |
So now the problem, okay. | |
[18:52.520] | |
So what do we want to optimize here | |
[18:55.020] | |
for training large language models? | |
[18:57.120] | |
Of course, we want to optimize generation quality, right? | |
[19:01.820] | |
That's how users can see it, generation quality. | |
[19:04.340] | |
So we need a metric of measuring generation quality, okay? | |
[19:09.440] | |
So there are a lot of ways to measure the generation quality. | |
[19:12.520] | |
For solving math problems, like AIME problems, | |
[19:15.500] | |
we care about correctness. | |
[19:17.800] | |
If we care about machine translation, | |
[19:20.140] | |
that's the BLEU score, right? | |
[19:21.720] | |
For competitive programming, | |
[19:24.500] | |
the unit tests will be your metric. | |
[19:27.840] | |
Okay, once you can define the metric, | |
[19:31.648] | |
And then all the rest is to compute gradients and update the weights. | |
[19:40.288] | |
Did you see any RL idea here? | |
[19:43.368] | |
No RL here, right? | |
[19:44.928] | |
Just about how to compute gradient. | |
[19:50.948] | |
Okay, we can do a bit of formal mathematical description here. | |
[20:03.608] | |
So R means the measurement of quality of generation. | |
[20:08.008] | |
Given the problem and given the weights, indicated by theta here, the model will | |
[20:13.308] | |
generate a response, and we need to measure the quality of the response. | |
[20:17.188] | |
We use R; R because people say it's a reward or something. | |
[20:20.688] | |
Okay, I don't care the name, it could be reward or anything. | |
[20:24.688] | |
Like a BLEU score, or correctness on a problem, | |
[20:28.188] | |
and you want to compute the, | |
[20:31.508] | |
you want to maximize this objective function here. | |
[20:34.888] | |
In machine learning, if you want to maximize, | |
[20:36.508] | |
if you want to optimize something, | |
[20:38.248] | |
what we should do? | |
[20:39.468] | |
Compute the gradient. Now we need to compute the gradient of this objective function. It's a little bit tricky to compute the gradient here. Anyone see why it's tricky to compute the gradient here? | |
[20:52.376] | |
Because there is an expectation. | |
[20:56.376] | |
There is an expectation here. How do we compute the gradient of an expectation? | |
[21:00.376] | |
How to do it? | |
[21:04.376] | |
Yeah, just by sampling, right? | |
[21:08.376] | |
For example, suppose I flip a coin and you want to know the probability of tails or heads. | |
[21:14.856] | |
You just need to toss the coin multiple times and look at the frequency. | |
[21:19.176] | |
The same here, we need to sample multiple times. | |
[21:23.256] | |
And then you use the sample to compute a gradient. | |
[21:25.736] | |
And then this gradient has a special name, it's called a policy gradient. | |
[21:29.896] | |
That's it. | |
[21:30.696] | |
Once you compute a policy gradient, and then just do back propagation, | |
[21:34.296] | |
just like any normal training procedure. | |
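In standard notation, the objective and its sampled gradient described here look like this (my own write-up of what the talk describes; the second identity is the usual score-function / policy-gradient estimator):

```latex
% x: problem, y: sampled response, \theta: model weights, R(x, y): quality metric.
\max_{\theta}\; J(\theta) = \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[\, R(x, y) \,\big],
\qquad
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)}\big[\, R(x, y)\, \nabla_{\theta} \log p_{\theta}(y \mid x) \,\big]
  \approx \frac{1}{n} \sum_{i=1}^{n} R(x, y_i)\, \nabla_{\theta} \log p_{\theta}(y_i \mid x),
\quad y_i \sim p_{\theta}(\cdot \mid x).
```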
[21:38.376] | |
You will see, okay, that's why in pre-training stage, | |
[21:41.496] | |
you don't need to use RL. | |
[21:42.816] | |
Because in pre-training stage, | |
[21:43.776] | |
we clearly know which is the next token, | |
[21:45.496] | |
so we can directly compute the gradient. | |
[21:47.236] | |
There's no expectation in there. | |
[21:48.716] | |
But here we have an expectation, so we have to do something. | |
[21:51.816] | |
Yeah, so the process is actually the same as pre-training. | |
[21:58.104] | |
Once we identify the right approach for training, | |
[22:02.264] | |
we have to think about how to do scaling. | |
[22:05.224] | |
That's why both are important. | |
[22:07.664] | |
If the method is wrong, no matter how we do scaling, | |
[22:10.744] | |
it won't work. | |
[22:11.764] | |
Just like if we use SFT for reasoning, | |
[22:14.744] | |
no matter how we do scaling, that won't work. | |
[22:16.784] | |
However, if we use this approach, | |
[22:18.584] | |
using RL fine-tuning, that's the right approach, | |
[22:20.704] | |
and then we think about how to do scaling. | |
[22:24.364] | |
So on the first slide, I talked about the theoretical result | |
[22:27.964] | |
we obtained in 2023. | |
[22:30.924] | |
For any problem solved by Boolean circuits of size t, | |
[22:35.024] | |
constant size transformers can solve it | |
[22:36.764] | |
by generating O(T) intermediate tokens. | |
[22:40.464] | |
From this theorem, you can clearly see | |
[22:42.104] | |
how to scale the model, right? | |
[22:45.404] | |
We need to scale in the... | |
[22:53.384] | |
[inaudible audience response] | |
[23:01.104] | |
Yeah, that's right. You just need to | |
[23:02.804] | |
scale in the number of tokens; that means | |
[23:04.804] | |
let T go to infinity; that's how to scale, right? And also, the second result means, okay, if we don't scale in output length, | |
[23:19.792] | |
we can instead scale in model depths. | |
[23:23.092] | |
There are two ways for scaling, yeah. | |
[23:28.852] | |
So these days people talk about | |
[23:31.292] | |
test-time scaling, right? | |
[23:33.752] | |
So this is the theoretical foundation for test-time scaling. | |
[23:37.292] | |
So that's why we need to use long chain-of-thought here. | |
[23:43.192] | |
So now we see that actually, so far I didn't talk | |
[23:48.432] | |
about anything about search or MCTS, right? | |
[23:51.632] | |
That's the beauty of LLM reasoning. | |
[23:54.152] | |
Human language reasoning process emerged | |
[23:56.872] | |
from token to token generation. | |
[23:59.412] | |
rather than relying on some sort of search | |
[24:02.132] | |
as in the classical era. | |
[24:04.432] | |
I think in 1997, so, | |
[24:09.572] | |
Garry Kasparov said something very funny. | |
[24:12.252] | |
After losing to Deep Blue, | |
[24:14.112] | |
I don't know if anyone still knows Deep Blue by IBM, | |
[24:17.412] | |
at that time, and he said Deep Blue was only intelligent | |
[24:22.432] | |
the way your programmable alarm clock is intelligent. | |
[24:24.560] | |
But in the current days, LLMs are quite different. | |
[24:32.160] | |
Let me show an example of how an LLM solved this problem. I used the Gemini 2.0 | |
[24:40.820] | |
thinking model, released in December of 2024, and I did this test around Christmas time. | |
[24:48.940] | |
So I just wanted to make sure the problem is not on the web. I saw that | |
[24:55.300] | |
the model was released in 2024, so I used the number 2025 in the problem: using the | |
[25:01.200] | |
numbers from 1 to 10, make 2025, using each number once and only basic arithmetic | |
[25:09.480] | |
operations, plus and multiplication. Okay, and then the left side is the solution, | |
[25:15.920] | |
and the right side is the thinking process. | |
[25:22.420] | |
You can see that the model doesn't do brute-force search. | |
[25:27.480] | |
For example, the model could write a Python program to | |
[25:29.360] | |
search all possible solutions, right? | |
[25:31.860] | |
So let me say, it's | |
[25:33.000] | |
really interesting to look at the thinking process. You can first see that the model said, okay, this is a relatively large number, suggesting multiplication will be heavily involved. | |
[25:44.168] | |
I said that's meaningful insight, right? That's why I believe this is AI, not just by search. | |
[25:52.168] | |
And it's also worth noting that 2025 is 45 times 45. | |
[26:02.168] | |
Actually when I looked at this problem, I didn't realize that. | |
[26:06.168] | |
45 times 45 equals 2025. That's amazing. | |
[26:11.168] | |
And since the target is large, the model starts thinking about how to get large intermediate products | |
[26:22.328] | |
using multiplication. | |
[26:25.408] | |
And finally it aims to produce products that get closer to the square root of 2025. | |
[26:30.208] | |
You see that? | |
[26:31.208] | |
It's not kind of a search. | |
[26:33.888] | |
It's really by thinking. | |
[26:40.808] | |
And usually people talk about the bitter lesson | |
[26:46.668] | |
from Richard Sutton. | |
[26:47.528] | |
Actually, in the original essay, | |
[26:50.368] | |
Richard Sutton talks about two things, | |
[26:51.016] | |
scalable learning and search. | |
[26:55.016] | |
Actually, for me, I'm not interested about the search part. | |
[27:00.016] | |
I'm really interested in the last paragraph of the essay. | |
[27:04.016] | |
So we want AI agents that can discover like we can | |
[27:08.016] | |
not which contain what we have discovered. | |
[27:11.016] | |
Building in our discoveries only makes it harder to see | |
[27:14.016] | |
how the discovery process can be done. | |
[27:17.016] | |
Now you can see that the RL fine-tuning part is consistent with Richard Sutton's statement here. | |
[27:25.376] | |
We don't use human annotated data. | |
[27:27.296] | |
We don't use data from the experts at Scale AI; we use the solutions generated | |
[27:33.616] | |
not by humans but by the model itself, and we do it again and again. | |
[27:44.856] | |
So in practice, the RL fine-tuned model | |
[27:49.056] | |
generalizes really well for automatically | |
[27:52.956] | |
verifiable tasks. | |
[27:54.916] | |
The downside is that not all tasks are automatically | |
[27:58.056] | |
verifiable. Can anyone give an example of a non-verifiable task? Difficult to grade? Of course, yeah. More examples? | |
[28:11.744] | |
Actually, I want to give a fun example | |
[28:15.744] | |
from my little one. I have two kids actually. | |
[28:19.744] | |
Today I just talked about my daughter. Actually, I have a son. | |
[28:23.744] | |
So during lunch time | |
[28:27.744] | |
on a weekend, my son cried so much and asked us to just go to lunch as soon as possible. | |
[28:35.744] | |
And the kid was so hungry, he said he was dying. | |
[28:39.744] | |
And his mom said, okay, I don't think you're going to die. Let's bet $100 that you won't die. | |
[28:46.744] | |
And my son said, that's nonsense. I'm going to die. Why would I care about $100? | |
[28:52.744] | |
And, oh, I was so shocked, okay. | |
[28:56.244] | |
And then I put the scenario into GPT-4o | |
[28:59.184] | |
and see how GPT can answer. | |
[29:02.844] | |
And GPT answers, I asked GPT, | |
[29:05.524] | |
what do you think my son will say? | |
[29:09.844] | |
And GPT answered: | |
[29:11.164] | |
Your son will say, ha ha, yeah, that's a bet. | |
[29:13.064] | |
Well, I won $100. | |
[29:15.924] | |
I said, this kind of task is not automatically verifiable. | |
[29:17.472] | |
But for humans, if you look at the answers, you can clearly see which answer is better. Yeah. | |
[29:28.032] | |
Um, okay, so now I've talked about, okay, the definition and everything about what I care about, | |
[29:35.232] | |
and again I have to say, I don't care about those debates on whether LLMs can reason or not. As a person in | |
[29:42.272] | |
industry I just care about performance improvement. If I use intermediate tokens | |
[29:47.372] | |
I can see much better performance that's all I need. If someone says there's no reasoning | |
[29:52.832] | |
I don't care. That's why I never get involved in the debates with Yann LeCun or | |
[29:57.692] | |
Gary Marcus. And of course I care about how to further improve the | |
[30:04.512] | |
performance on reasoning. So one is about aggregation and the other is about | |
[30:10.212] | |
retrieval. Yeah, that's about it. Okay, after we introduce intermediate tokens, the decoding process becomes a little different. | |
[30:23.212] | |
Remember, reasoning is powerful. Do you see any decoding issues there? Generating reasoning tokens and then finding an answer: that's different from directly generating a final answer. | |
[30:38.200] | |
So did anyone see any issue here when you use reasoning? | |
[30:45.660] | |
So okay, I know currently there are so many people who just simply, | |
[30:50.560] | |
you know, think of LLMs as humans. Actually, you should always keep in mind, | |
[30:55.960] | |
large models are not humans. They are machine learning models. They are even | |
[31:02.400] | |
they are probabilistic models. If you always keep that in mind, it will be helpful for you | |
[31:07.240] | |
to understand the methods. So let's see, on the mathematical side, let's see, okay: | |
[31:16.640] | |
What am I doing in decoding? | |
[31:18.840] | |
So we're given a problem. | |
[31:21.420] | |
If the model generates a reasoning path | |
[31:23.400] | |
and a final answer, then for the decoding part, | |
[31:26.780] | |
the model will just generate the response | |
[31:29.580] | |
with maximum probability. | |
[31:33.960] | |
Any problem here? | |
[31:38.120] | |
For humans, what do we care about here? | |
[31:43.000] | |
What we want here is, | |
[31:43.928] | |
we want to solve the problem, | |
[31:46.428] | |
we want to see the answer with the maximum probability. | |
[31:53.488] | |
They are not aligned, right? | |
[31:57.168] | |
Mathematically, they are not the same thing. | |
[32:00.948] | |
How to fix it? | |
[32:03.888] | |
We just need a simple math here. | |
[32:06.328] | |
Because given this problem, | |
[32:08.828] | |
we have many reasoning paths | |
[32:12.048] | |
for the final answer; we just need to sum over all reasoning paths. | |
[32:18.048] | |
Of course, the sample space is huge; we cannot precisely compute the probability sum here, | |
[32:28.048] | |
so we just need to sample from the output | |
[32:35.048] | |
and use the samples to approximate the sum here. | |
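Written out, the argument above is (my own formalization, with r a latent reasoning path, a a final answer, x the problem, and theta the model weights):

```latex
\hat{a} = \arg\max_{a} \; p_{\theta}(a \mid x)
        = \arg\max_{a} \sum_{r} p_{\theta}(r, a \mid x)
        \;\approx\; \arg\max_{a} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\,[\,a_i = a\,],
\qquad (r_i, a_i) \sim p_{\theta}(\cdot \mid x),
```

which is exactly majority voting over sampled final answers.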
[32:38.708] | |
The sampling approach is quite normal in machine learning. | |
[32:42.928] | |
So this approach is called self-consistency, | |
[32:45.528] | |
and we use that to improve chain-of-thought reasoning. | |
[32:50.188] | |
You know, it's such a simple idea. | |
[32:52.348] | |
So when we submitted the paper to a conference, it got rejected. And so, why publish such a simple idea? Everyone knows that; it's called majority voting, right? | |
[33:04.656] | |
It generates multiple responses by random sampling and chooses the answer | |
[33:08.656] | |
that appears most frequently. So now you see that | |
[33:12.656] | |
the underlying principle is about | |
[33:16.656] | |
choosing the answer with the maximum probability. | |
[33:20.656] | |
And in terms of machine learning, | |
[33:23.656] | |
it's a way to marginalize out latent variables. | |
[33:26.496] | |
Here the latent variables are basically the reasoning paths. | |
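A minimal sketch of self-consistency as described; the answer-extraction rule and the sampling wrapper are assumptions.

```python
# Self-consistency: sample several reasoning paths, extract each final answer,
# and return the most frequent answer (the reasoning paths themselves are
# marginalized out, never compared directly).
from collections import Counter

def extract_answer(response: str) -> str:
    # Assumes responses end with "... The answer is X."
    return response.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistency(sample_fn, prompt: str, n: int = 20) -> str:
    answers = [extract_answer(sample_fn(prompt)) for _ in range(n)]  # temperature sampling
    return Counter(answers).most_common(1)[0][0]                     # majority vote

# `sample_fn` is any callable returning one sampled chain-of-thought response
# for the prompt, e.g. a wrapper around an LLM API with temperature > 0.
```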
[33:33.496] | |
So actually I have two simple questions here | |
[33:36.136] | |
for you to check if you really understand | |
[33:38.936] | |
the underlying mathematical principle of self-consistency. | |
[33:42.196] | |
It's really powerful. | |
[33:43.276] | |
Actually, in the course talk today, | |
[33:46.336] | |
it was mentioned, okay, how self-consistency | |
[33:48.476] | |
is applied | |
[33:50.656] | |
to solve math problems. | |
[33:53.496] | |
So, let me show you how | |
[33:56.036] | |
self-consistency works here. | |
[33:58.476] | |
Given this problem, you sample the first answer, | |
[34:01.536] | |
you sample the second one, | |
[34:03.976] | |
and you sample again, you see that | |
[34:06.656] | |
the most frequent answer here is $18. | |
[34:09.276] | |
Then we use that as the final answer. | |
[34:10.384] | |
We choose that as the final answer. And note, here we choose the most frequent answer; | |
[34:16.164] | |
it's not about the most frequent reasoning path. The reasoning paths here are the latent variables; | |
[34:21.564] | |
they are marginalized out. So self-consistency really can lead to huge improvements. | |
[34:29.564] | |
So that's the mysterious part: you know, mathematically I can understand why we need self-consistency, | |
[34:35.464] | |
But I didn't expect the improvement can be huge. | |
[34:40.664] | |
So for GSM8K problems, in 2023, | |
[34:44.004] | |
we just use 80 examples. | |
[34:46.824] | |
And with self-consistency, | |
[34:51.024] | |
we can see huge gains from 58% to 75%. | |
[34:55.564] | |
The relative gain is about 50%. | |
[35:00.664] | |
And, you know, I use the old results, | |
[35:04.864] | |
but this is not just for old models; it holds for new models as well. | |
[35:08.944] | |
If you look at OpenAI's o1 technical report, | |
[35:12.104] | |
they use aggregation as well. | |
[35:17.124] | |
And people are also talking about | |
[35:18.924] | |
the confidence on the final answer. And here I want to say, okay, higher consistency indicates higher accuracy; they are just kind of linearly correlated. Okay, yeah, now I have two quiz questions here, and | |
[35:37.532] | |
let's see if you really understand the principles underlying self-consistency. The first | |
[35:43.172] | |
one is: when the LLM outputs a direct answer without intermediate steps, would you sample | |
[35:48.692] | |
several times and then choose the most common answer? Can anyone give an answer? | |
[36:00.212] | |
Yes. | |
[36:04.212] | |
Anyone say no? | |
[36:07.092] | |
Oh, thank you, no. Yes, the answer is no, right? If it's just one token for your final answer, | |
[36:13.812] | |
you don't need to sample multiple times, right? | |
[36:16.692] | |
Because you just need to look at the output probability. | |
[36:20.432] | |
That's directly the probability from next-token prediction. | |
[36:23.872] | |
Yeah, that's why you don't see | |
[36:27.892] | |
self-consistency used in normal machine learning processes. | |
[36:30.572] | |
When you use logistic regression, | |
[36:32.452] | |
and you just maximize the log likelihood, | |
[36:34.872] | |
that's exactly the maximum probability there. | |
[36:36.840] | |
you don't need to do any sampling. | |
[36:39.180] | |
If you use sampling, you'll just get the same results. | |
[36:42.940] | |
Okay, for the second question: | |
[36:44.680] | |
what if we ask the LLM to generate multiple responses, | |
[36:48.260] | |
instead of sampling multiple times, | |
[36:50.480] | |
okay, do you think that makes sense? | |
[36:57.020] | |
Yes, no? | |
[37:00.440] | |
Right, so you see that we just maximize the probability. | |
[37:03.920] | |
It simply follows from the mathematical formulation. | |
[37:07.960] | |
Yeah, the basic idea is just about this: | |
[37:10.560] | |
if you have some background on machine learning, | |
[37:12.660] | |
that's called maximum marginal inference. | |
[37:18.300] | |
That's the mathematical principle underlying self-consistency. | |
[37:22.160] | |
Okay, people ask, okay, what about free-form answers? | |
[37:25.560] | |
So that's called universal self-consistency, | |
[37:27.540] | |
and it's from our work here. | |
[37:31.100] | |
And we just simply ask LLMs to select the | |
[37:33.780] | |
most consistent answer, and then you can apply this approach to any problem like | |
[37:39.180] | |
summarization, machine translation, all of those things. So here's a simple example: | |
[37:45.180] | |
you have several responses here and you can look at the answers, and it's very interesting. Okay, from response one, they say Japan, China, and the United Kingdom, and the second one says | |
[37:57.388] | |
Japan, China, and India, right? And what would be the most consistent answer here? If you ask the | |
[38:01.988] | |
LLM, it will say, okay, that's Japan, China, and Saudi Arabia, and so on. | |
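A minimal sketch of universal self-consistency; the selection-prompt wording is an illustrative assumption, not the exact prompt from the paper.

```python
# Universal self-consistency for free-form answers: instead of exact-match
# voting, ask the LLM itself to pick the most consistent response.
def usc_prompt(question: str, responses: list[str]) -> str:
    numbered = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(responses))
    return (
        f"Question: {question}\n\n{numbered}\n\n"
        "Evaluate these responses and select the single most consistent response, "
        "i.e. the one whose answer agrees with the majority of the responses. "
        "Reply with the number of that response."
    )
# The returned prompt is sent back to the LLM; the chosen response is used as
# the final output. This works for summarization, translation, and other tasks
# where answers cannot be matched exactly.
```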
[38:08.608] | |
Okay, now I'll talk about another interesting topic: retrieval. | |
[38:15.608] | |
Okay, great. So I will be quick here. | |
[38:20.608] | |
And it's about reasoning versus retrieval. There's always a debate here, | |
[38:24.608] | |
and people always say that LLMs just do retrieval, not reasoning. | |
[38:26.608] | |
So my answer here, you know, we should do retrieval plus reasoning. | |
[38:30.608] | |
It's not just one or the other. Again, I'm not interested in any debate about those positions. | |
[38:35.608] | |
I just care about performance. | |
[38:39.208] | |
And then the problem here, you know, | |
[38:40.888] | |
you want to solve this problem, | |
[38:42.648] | |
and then we let the model recall a related problem, | |
[38:45.508] | |
and then solve it, yeah. | |
[38:47.928] | |
And there's another work, it's called a step-back prompting, | |
[38:50.488] | |
we did in 2024. | |
[38:53.108] | |
Actually, all the work we did in the 2023, | |
[38:55.168] | |
we didn't publish anymore after that. | |
[38:58.528] | |
And actually, what we did is like deep research: | |
[39:01.628] | |
you know, deep research | |
[39:02.848] | |
would essentially just, like, let | |
[39:03.296] | |
the model search for relevant examples of problems or knowledge before solving the problem. | |
[39:09.596] | |
You can put all the knowledge in the context to solve it. | |
[39:14.236] | |
Yeah, that's the final summary, and that's it. Thank you. | |
[39:18.036] | |
Thank you. |