[00:00.000] | |
I'll give you a whirlwind overview of a bunch of important developments that have happened in maybe the last decade or so of AI development. | |
[00:10.280] | |
Just to give you a sense of all the different things that have been developed independently and then all put together to make the models of today really shine in all the different capabilities that they have. | |
[00:24.160] | |
and this is joint work with many, many, many people. So can we just dismiss the | |
[00:31.980] | |
little funny permission thing? I think it's in the slides we skipped, because... | |
[00:36.280] | |
Okay, so a few observations about why we are where we are. So machine learning has | |
[00:42.780] | |
really changed our expectations over what is possible with computers. I think | |
[00:46.680] | |
if you looked back ten years and thought about, okay, are computers smart? | |
[00:51.560] | |
Can they do interesting kinds of reasoning? | |
[00:53.840] | |
Can they really understand visual inputs | |
[00:56.740] | |
and sort of make use of them or generate visual outputs? | |
[01:00.860] | |
We would say that seems pretty far off, | |
[01:03.380] | |
but in the past decade, | |
[01:04.780] | |
we've actually made a whole bunch of advances | |
[01:08.420] | |
and the advances are really founded on a few different things. So one is that increasing scale has been really, really useful: training larger models on more data. We'd been doing work in the Google Brain team maybe even 12, 13 years ago, and we had | |
[01:23.488] | |
a really good saying: bigger model, more data, better results. | |
[01:27.128] | |
And we would often repeat that. | |
[01:29.348] | |
I guess that's like the summary of what scaling was, but we weren't very scientific about | |
[01:33.668] | |
it. | |
[01:34.668] | |
We just used bigger models and more data, and that generally would work. | |
[01:39.408] | |
But algorithmic and model architecture improvements have really been | |
[01:42.148] | |
massive drivers of the improved capabilities of these models as well. | |
[01:46.788] | |
And then the kinds of computations we want to run compared to, say, 15 years ago | |
[01:50.428] | |
are really quite different than traditional twisty C++ code or something. | |
[01:56.648] | |
And that really has changed the kinds of computing hardware that we're building. | |
[02:00.808] | |
and advances in that have helped with increasing the scale | |
[02:05.588] | |
and also have helped with sort of thinking about algorithms | |
[02:08.628] | |
that work especially well on the kinds of new hardware | |
[02:11.568] | |
that we're building. | |
[02:14.948] | |
Okay, so 15 years of machine learning advances | |
[02:18.368] | |
or how did today's models come to be? | |
[02:21.248] | |
I mean, this is obvious to everyone here, | |
[02:23.668] | |
but neural nets have been like this | |
[02:26.456] | |
kind of rediscovered thing. In the late 80s and early 90s, there was a lot of excitement | |
[02:32.496] | |
about neural nets because they seemed like the right abstraction for how do you build really | |
[02:37.536] | |
flexible systems that can learn from very raw forms of input. And so they've turned | |
[02:43.296] | |
out to be a key building block, but there was a bit of a bump in the road in the early | |
[02:47.696] | |
90s to about 2008 or 2009 where people were kind of unexcited about neural nets and only | |
[02:55.376] | |
a dedicated few kind of actually focused on them and really kind of tried to push them | |
[03:00.196] | |
forward. And then obviously back propagation has been the key way in which these models | |
[03:07.616] | |
can learn. So those two things really have existed for a long time and it's only since | |
[03:13.416] | |
maybe 2010 or so that we've really rediscovered and sort of reinvigorated them with scale. | |
[03:19.536] | |
it turns out that what we needed to make them work really well was large-scale. | |
[03:25.456] | |
But even in 1990, when I saw neural networks, I'm like, oh, this is great, this | |
[03:31.116] | |
is the right abstraction for everything. So I got exposed to them in, you know, a three-lecture part of a course I took, and I said, oh, maybe we just need to make really, really big neural networks. So let's try to train them on the 32-processor machine in the department instead of one processor. | |
[03:49.584] | |
Then we could have massive neural networks. It would be great. | |
[03:52.264] | |
Look at these models. They're like three layers, 10, 21, and 10 neurons per layer. | |
[04:00.624] | |
It's going to be amazing. | |
[04:01.984] | |
So I actually did an undergrad thesis on parallel training of neural networks. | |
[04:06.984] | |
And I came up with two different approaches, which I called the pipelined approach and the pattern partitioned approach. | |
[04:13.984] | |
So it turns out pattern partitioning is what we now call data parallelism and pipelined approach is what we call model parallelism. | |
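As a rough sketch of those two ideas in modern terms (a toy NumPy illustration written for this transcript, not the original thesis code; the 10-21-10 shapes echo the tiny network mentioned above, and the "workers" are just simulated with a Python loop):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(10, 21)), rng.normal(size=(21, 10))  # tiny 10-21-10 net
X = rng.normal(size=(32, 10))                                   # a batch of 32 examples

# "Pattern partitioning" ~ data parallelism: each worker holds a full copy of the
# model, processes a slice of the batch, and the gradients are averaged (all-reduce).
grads = []
for shard in np.array_split(X, 4):        # 4 simulated workers
    h = np.tanh(shard @ W1)
    out = h @ W2
    grads.append(h.T @ out / len(shard))  # grad of 0.5*||out||^2 w.r.t. W2, averaged over the shard
dW2 = np.mean(grads, axis=0)              # average the per-worker gradients

# "Pipelined" ~ model parallelism: each worker owns different layers and the
# activations flow from one worker to the next.
h = np.tanh(X @ W1)                       # computed on worker 0, then sent onward
out = h @ W2                              # computed on worker 1
```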
[04:22.984] | |
But it turned out it didn't work very well. And the reason is, well, I made two mistakes. | |
[04:27.984] | |
One is I didn't try to make the model bigger | |
[04:30.284] | |
as we increased the number of processors. | |
[04:32.024] | |
That would have been a really good idea | |
[04:33.184] | |
because then it would have gone much closer to that | |
[04:35.584] | |
nice 45 degree line. | |
[04:38.324] | |
But really we just needed like a million times | |
[04:41.064] | |
more compute processing than we had, not 32 times. | |
[04:44.344] | |
And so we needed to wait for the semiconductor industry | |
[04:47.604] | |
and computer architects to really work their magic | |
[04:50.304] | |
for two decades. | |
[04:51.484] | |
And then starting in 2008 or nine, | |
[04:52.912] | |
we then started to have enough processing power that these sort of very general | |
[04:58.192] | |
learning-based approaches of multi-layered neural networks, made deeper and deeper, | |
[05:03.692] | |
could actually start to do really amazing things. | |
[05:09.012] | |
And so in 2012, we started a project at Google kind of based around the idea of let's just | |
[05:16.152] | |
make really big neural networks with lots of computers and see what happens. | |
[05:20.532] | |
This was the Google Brain effort, and we actually combined model parallelism and data parallelism | |
[05:26.432] | |
into one system, so we'd use both of these approaches, and we would try training a really | |
[05:32.812] | |
big neural network. | |
[05:34.352] | |
Quoc was involved in this work and actually drove one of the first uses of the system, | |
[05:38.612] | |
which was to do unsupervised learning on YouTube video frames, randomly selected YouTube video | |
[05:43.972] | |
frames. | |
[05:44.972] | |
But we were actually able to use this system to train a neural network that was 50 times | |
[05:49.072] | |
larger than what anyone had trained in the past. And we got surprisingly good results. | |
[05:56.372] | |
In fact, when we did unsupervised learning on 10 million single frames taken from 10 million different random YouTube videos and used an unsupervised autoencoder algorithm to reconstruct things, what we found was that at the highest levels of this vision model we actually had neurons | |
[06:15.900] | |
that had sort of learned very high-level features, even though they'd never been told what a | |
[06:21.640] | |
cat is. | |
[06:23.000] | |
This is the response that would optimally activate a particular neuron at the highest | |
[06:27.700] | |
level. | |
[06:28.700] | |
And so essentially, if you show an unsupervised algorithm | |
[06:32.880] | |
lots of YouTube video frames, it will learn about cats, | |
[06:35.940] | |
which was nice. | |
[06:37.860] | |
So in fact, we actually got a 70% relative improvement | |
[06:42.820] | |
on the harder ImageNet benchmark | |
[06:46.220] | |
with 22,000 categories instead of 1,000 | |
[06:49.700] | |
using this unsupervised algorithm as a pre-training method | |
[06:55.080] | |
and then did supervised training | |
[06:56.720] | |
on the smaller labeled data set with the 22,000 categories. | |
[07:03.920] | |
So that was a sign that if you really push the scale | |
[07:06.500] | |
of these models, you can actually learn | |
[07:08.840] | |
pretty interesting things. | |
[07:10.040] | |
Unsupervised learning was a nice way to sort of be able | |
[07:13.460] | |
to train on lots of data because you may not have labels | |
[07:16.220] | |
for everything. | |
[07:18.700] | |
We also kind of realized | |
[07:19.368] | |
that you could do this in the text domain by building these kinds | |
[07:24.708] | |
of distributed representations of words and those turned out to be quite | |
[07:28.728] | |
powerful where you now have the ability to have a vector representation that | |
[07:33.608] | |
represents kind of the meaning or the context in which a word appears and then | |
[07:37.868] | |
you can use that as the representation for the word rather than kind of the | |
[07:42.948] | |
syntactic characteristics of the word and we discovered a couple of interesting | |
[07:47.628] | |
things. So when you have like 100 or 1000 dimensional representations for words, it | |
[07:52.848] | |
turns out that obviously nearby words in this high dimensional space are related. So you | |
[07:57.048] | |
get cat and puma and tiger are all kind of nearby in this high dimensional space, but | |
[08:02.148] | |
also that directions are meaningful. So if you go in a particular direction, that's the | |
[08:08.208] | |
same direction you go to turn the word king into queen as you turn man into woman as you | |
[08:14.688] | |
turn bull into cow. | |
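As a toy illustration of that "directions are meaningful" property, here is a minimal sketch with hand-set three-dimensional vectors made up purely to show the arithmetic (real systems learn 100- to 1000-dimensional embeddings from data):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.3]),
    "woman": np.array([0.1, 0.2, 0.3]),
}

def nearest(v):
    # cosine similarity against every word in the toy vocabulary
    return max(emb, key=lambda w: v @ emb[w] / (np.linalg.norm(v) * np.linalg.norm(emb[w])))

# king - man + woman lands near queen: the same direction encodes the analogy.
print(nearest(emb["king"] - emb["man"] + emb["woman"]))  # -> queen
```

Quoc will probably talk about this in more detail, but | |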
[08:23.688] | |
sequence-to-sequence learning turned out to be another important innovation where | |
[08:27.948] | |
you can essentially have an encoder that learns to accept input one token or one word at a time and then builds a distributed representation of that input. And that can then be used to seed a decoding phase that can then predict a sequence. | |
[08:44.016] | |
So you have an input sequence and an output sequence. | |
[08:47.016] | |
The easiest framing for this is translation, where you have an input sentence in one language | |
[08:51.096] | |
and then you want to decode the corresponding sentence in another language that means the same thing. | |
[08:56.936] | |
And you just have lots and lots of training data that is matched pairs of English sentences and French sentences. | |
[09:04.256] | |
And if you scale up LSTMs, it works better and better. | |
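To make that encoder/decoder data flow concrete, here is a minimal sketch with plain RNN cells instead of LSTMs and random, untrained weights (the token ids and shapes are made up, so this only shows the mechanics, not a working translator):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 50, 16                            # toy vocabulary size and hidden size
E = rng.normal(0.0, 0.1, (V, H))         # token embeddings
W_enc = rng.normal(0.0, 0.1, (2 * H, H))
W_dec = rng.normal(0.0, 0.1, (2 * H, H))
W_out = rng.normal(0.0, 0.1, (H, V))

def rnn_step(state, token_id, W):
    x = np.concatenate([state, E[token_id]])
    return np.tanh(x @ W)

# Encode: fold the whole source sentence into one fixed-size state vector.
source = [3, 17, 42, 8]                  # token ids of the input sentence
state = np.zeros(H)
for tok in source:
    state = rnn_step(state, tok, W_enc)

# Decode: seed with the encoder state, then greedily emit output tokens one at a time.
token, target = 1, []                    # 1 = a hypothetical <start> token
for _ in range(6):
    state = rnn_step(state, token, W_dec)
    token = int(np.argmax(state @ W_out))
    target.append(token)
print(target)
```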
[09:10.856] | |
Around 2013 | |
[09:13.436] | |
I'd actually done a little bit of a sort of thought experiment, because we were seeing really, really good results with | |
[09:21.136] | |
neural speech-based models and also neural vision models. | |
[09:25.536] | |
So I did an experiment, sort of a mental exercise, | |
[09:29.956] | |
where I said, okay, what if 100 million Google users | |
[09:33.196] | |
want to now start talking to their phone | |
[09:34.976] | |
for three minutes a day? | |
[09:36.416] | |
And at that time, the models were big enough | |
[09:38.896] | |
that you couldn't run them directly on the phone. | |
[09:41.796] | |
So we'd have to run them in our data centers. | |
[09:43.856] | |
And I had said, okay, well, if they do that, | |
[09:45.824] | |
If they do that, here's what that would cost. | |
[09:48.564] | |
And it turned out we would need to double the number of computers Google had just to roll out this neural speech encoder we developed that was much higher quality. | |
[09:57.344] | |
And we wanted to do that, but it seemed sort of infeasible to double the number of data centers we had. | |
[10:03.404] | |
And so I got together with some of my colleagues who thought more about hardware than I did. | |
[10:08.964] | |
I said let's just build some specialized hardware for doing the kinds of computations we want to do for machine learning only | |
[10:15.724] | |
and not for the general purpose things and | |
[10:18.784] | |
So that was the genesis of our tensor processing units | |
[10:23.084] | |
And the first one we built was just a single card that you see in the picture there | |
[10:27.004] | |
It was really designed just for inference. So it did 8-bit integer arithmetic only | |
[10:32.684] | |
it had basically a matrix unit that was | |
[10:35.204] | |
quite good at doing matrix operations and it was very heavily pipelined as a systolic | |
[10:41.764] | |
array. | |
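To make "8-bit integer arithmetic" concrete, here is a toy symmetric quantization sketch (an illustrative scheme chosen for simplicity, not the TPU's actual recipe): weights and activations are scaled into int8, the matrix multiply happens on integers, and the result is rescaled back to floating point.

```python
import numpy as np

def quantize(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)
x = rng.normal(size=4).astype(np.float32)

Wq, w_scale = quantize(W)
xq, x_scale = quantize(x)

y_int = Wq.astype(np.int32) @ xq.astype(np.int32)  # the cheap integer matmul
y = y_int * (w_scale * x_scale)                    # rescale back to float
print(np.abs(y - W @ x).max())                     # small quantization error vs. float32
```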
[10:42.764] | |
And what we found after we actually built this was that the TPU was 15 to 30x faster | |
[10:50.584] | |
than contemporary CPUs or GPUs for these kinds of computations, and 30 to 80x more energy efficient. So that was like a massive leap forward. That's something that really jump-started our use of more specialized hardware for these kinds of models. | |
[11:06.632] | |
This one was for inference, and then the subsequent generation of TPUs, | |
[11:10.992] | |
V2 and onwards, really was targeting both inference and training. | |
[11:17.732] | |
By the way, this is now the most cited paper | |
[11:19.932] | |
in the International Symposium on Computer Architecture's history. | |
[11:24.552] | |
2016, that was when we developed the second generation of TPUs, which is really how do | |
[11:34.472] | |
you build not just a single card that does inference for a particular model, but how | |
[11:38.532] | |
do you build a very large scale connected system of many of these chips in order to | |
[11:43.372] | |
do training at high speed. | |
[11:46.732] | |
And so one of the things that TPU pods have is they have a whole bunch of chips and then | |
[11:51.512] | |
they're connected together with a custom high-speed interconnect usually either a | |
[11:56.512] | |
2D or a 3D mesh with wraparound links, so it's a 3D torus typically. And that has | |
[12:05.172] | |
really helped us push forward the scale of models we can train. So our | |
[12:09.752] | |
latest generation pods are now you know | |
[12:12.280] | |
I think 9,000 chips. | |
[12:14.180] | |
So the Ironwood pod, which we announced this year, | |
[12:17.800] | |
is 9,216 chips and offers 45 exaflops of compute in those chips. | |
[12:25.100] | |
And that's roughly 3,600 times the peak performance of the TPUv2 pod | |
[12:30.720] | |
that we introduced in 2017, I think. | |
[12:39.540] | |
And it's also much more energy efficient | |
[12:41.720] | |
because we've been working hard on how do we make these systems as efficient as possible | |
[12:47.560] | |
for the training tasks we care about. | |
[12:55.260] | |
The other thing that has really happened in the community is that open source tools | |
[12:59.140] | |
have really enabled many, many organizations and people to sort of participate, | |
[13:03.240] | |
to use the same tooling, to develop models, to kind of contribute to the underlying tooling | |
[13:10.440] | |
in an open source ecosystem. | |
[13:13.740] | |
And the three main ones here that I'll highlight | |
[13:16.380] | |
are TensorFlow, which we kind of developed | |
[13:19.640] | |
as our second-generation training system, compared to the first one that we built in the first four years of our Google Brain effort, where we sort of took the lessons from the first system | |
[13:31.188] | |
called DistBelief, which was not open sourced, and then we decided we would | |
[13:34.328] | |
open source the second generation system and improve all the things we didn't | |
[13:39.848] | |
like about the first system. And a little bit after that, PyTorch came along that | |
[13:46.868] | |
sort of adapted the Torch system in Lua | |
[13:51.108] | |
to use Python instead of Lua. | |
[13:52.988] | |
And I think that was a big sort of thing | |
[13:54.788] | |
that sort of opened it and made it more accessible | |
[13:57.168] | |
to a lot more people, | |
[13:58.068] | |
because there's a lot more Python programmers | |
[13:59.668] | |
than Lua programmers. | |
[14:02.488] | |
And then JAX has been a really nice kind of very clean, | |
[14:05.128] | |
elegant abstraction for certain kinds | |
[14:08.708] | |
of machine learning workloads with auto diff built in | |
[14:11.788] | |
as kind of one of the key things that it works on. | |
[14:14.188] | |
Okay, so like the sequence to sequence work, which really pushed forward the model architectures | |
[14:23.368] | |
that we used for a lot of the kinds of language modeling tasks, in 2017 a group of my colleagues | |
[14:31.368] | |
came up with this observation that when you have an LSTM and you're doing the kind of | |
[14:38.736] | |
you're taking the state in the LSTM, | |
[14:41.056] | |
you're getting the next input word or token, | |
[14:44.236] | |
and you're advancing the LSTM step | |
[14:45.956] | |
with the recurrent state update | |
[14:49.336] | |
to get the new state after this word, | |
[14:51.916] | |
that has a very sequential dependency on the steps | |
[14:56.416] | |
because you need to process that word | |
[14:58.156] | |
before you can go on to the next word. | |
[15:01.556] | |
And it also has the problem that you're trying to cram | |
[15:05.096] | |
all the information you went through | |
[15:07.396] | |
in order to get to this state into a single vector. | |
[15:11.456] | |
And so the two observations that really led to this | |
[15:14.356] | |
attention is all you need transformer breakthrough | |
[15:16.776] | |
were let's not try to force all of our state | |
[15:20.516] | |
into a single vector that we update | |
[15:21.956] | |
and throw away the old vector. | |
[15:23.576] | |
Let's keep all the state we go through, | |
[15:25.896] | |
save all the past representations, | |
[15:27.516] | |
and then we can attend to those | |
[15:29.436] | |
when we're sort of trying to generate the next token. | |
[15:31.756] | |
This also means that we can generate them all in parallel, | |
[15:34.536] | |
which is kind of nice. | |
[15:35.596] | |
So we have this nice, really highly parallelizable architecture compared to the LSTM that was | |
[15:42.676] | |
kind of the state of the art at that time. | |
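A bare-bones sketch of that "save every past state and attend over them" operation (single head, no learned projections or masking, NumPy only; a simplification of the full Transformer layer):

```python
import numpy as np

def attend(q, past):
    # q: (d,) query for the current token; past: (t, d) all saved past states.
    scores = past @ q / np.sqrt(q.shape[0])   # similarity to every past position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the t past positions
    return weights @ past                     # weighted mix of the saved states

rng = np.random.default_rng(0)
past = rng.normal(size=(5, 8))   # 5 previously seen tokens, 8-dim representations
query = rng.normal(size=8)
context = attend(query, past)    # one vector summarizing the relevant history
```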
[15:45.776] | |
And you can actually see this in the data here. You actually got higher accuracy with 10 to 100 times less compute and 10 times smaller models. So, you know, a lot of people think a lot of the improvements are about just raw compute, | |
[16:01.964] | |
raw scale and raw data sizes that we're training on, but architectural improvements like the | |
[16:07.764] | |
transformer make huge differences. | |
[16:09.264] | |
All of a sudden we're now one or two orders of magnitude better for the same amount of | |
[16:14.664] | |
compute flops invested than we were before. | |
[16:19.024] | |
So don't give up on new architectures and good ideas | |
[16:22.724] | |
because they make a huge difference. | |
[16:28.244] | |
The other thing that happened was people started to realize | |
[16:31.784] | |
that you could do language modeling at scale | |
[16:33.664] | |
with self-supervised data. | |
[16:35.624] | |
There's tons and tons of texts in the world | |
[16:37.864] | |
and self-supervised learning on this text | |
[16:40.084] | |
can give you lots of training data | |
[16:42.204] | |
where the right answer is known | |
[16:43.784] | |
because you essentially take all the text, | |
[16:46.164] | |
you hide little bits of it in some way, | |
[16:48.424] | |
and then you can have the model try to guess | |
[16:51.004] | |
what the answer is, and you have the right answer | |
[16:53.504] | |
because you have the actual text. | |
[16:55.864] | |
This has been hugely important because now you get | |
[16:58.104] | |
this really rich training signal of like, | |
[17:00.404] | |
well, I didn't get that right, or I did get that right, | |
[17:02.944] | |
and the model learned very quickly from that. | |
[17:05.164] | |
So there are different kinds of training objectives. | |
[17:07.072] | |
There's autoregressive, where you look at the prefix of words | |
[17:09.712] | |
that came before you, and then you try to guess | |
[17:11.732] | |
what the next word is. | |
[17:13.152] | |
That's really useful, and that has | |
[17:14.612] | |
been the core of some of the most important models | |
[17:18.232] | |
that we have today. | |
[17:22.372] | |
And then there's fill in the blank, | |
[17:23.992] | |
where you hide some fraction of the words, | |
[17:26.372] | |
and the model is able to look around at the context | |
[17:28.832] | |
and try to guess the missing words in that. | |
[17:34.112] | |
And both of those have been quite useful for slightly different types of problems. | |
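As a toy illustration of how both objectives manufacture "free" labeled examples out of raw text (whitespace tokenization and a single sentence are simplifications, but the targets really do come straight from the text itself):

```python
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# Autoregressive: predict the next token from the prefix that came before it.
autoregressive = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (['the', 'quick', 'brown'], 'fox')

# Fill-in-the-blank: hide a token and predict it from the context on both sides.
masked = []
for i in range(len(tokens)):
    context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    masked.append((context, tokens[i]))
# e.g. (['the', 'quick', '[MASK]', 'fox', ...], 'brown')
```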
[17:41.392] | |
In 2021, people started to look at how can you use Transformer for vision problems as well. And the | |
[17:48.832] | |
important aspect of this was that, like in text, the vision system using this Transformer architecture | |
[17:56.672] | |
actually scales really well. | |
[17:57.872] | |
So you actually get much, much better results, | |
[18:03.572] | |
but with much less computation. | |
[18:05.452] | |
Again, probably a factor of 10x reduction in computation | |
[18:08.232] | |
needed to get to a given level of accuracy | |
[18:11.452] | |
compared to the model on the right there. Sparse models are again another big architectural innovation. So this is some work that Quoc and I and others did, where we basically said we | |
[18:29.920] | |
want to have a very very large model with a tremendous amount of capacity to | |
[18:33.600] | |
remember large amounts of training data but we don't want to pay the cost of | |
[18:39.000] | |
activating the entire model for every token | |
[18:42.480] | |
or example or word. | |
[18:43.900] | |
So instead, we're gonna have a large capacity, | |
[18:47.240] | |
but we're gonna learn different experts to call upon. | |
[18:50.660] | |
So think of every layer as now being 2,000 separate experts, | |
[18:55.160] | |
and we're gonna learn to activate | |
[18:57.100] | |
just a handful of those experts, | |
[18:58.420] | |
one or two or eight or something like that. | |
[19:01.920] | |
And what this means is that you now have | |
[19:03.940] | |
this tremendous capacity, but you only pay | |
[19:06.620] | |
a tiny fraction of the overall cost in total parameters | |
[19:09.960] | |
to activate the fraction of the model. | |
[19:12.360] | |
So you get an 8x reduction in training compute cost | |
[19:14.780] | |
for the same accuracy. | |
[19:16.620] | |
That's the difference between M and L there. | |
[19:20.060] | |
Or you could choose to spend that | |
[19:22.780] | |
and get a much more accurate model | |
[19:24.480] | |
for the same amount of training flops on the x-axis there. | |
[19:29.500] | |
So this has been quite useful, and | |
[19:31.648] | |
most of the sort of most advanced models that you see in the market today are based on some | |
[19:36.508] | |
form of sparsity. | |
[19:37.508] | |
And there's a whole litany of sort of subtle design choices in how do you do the routing, | |
[19:44.108] | |
how does the gating model learn, which experts are most useful, do you pick one or do you | |
[19:50.148] | |
pick two, and so on. | |
[19:52.108] | |
But the basic idea is how do you have a very, very large sparse model, activate only a small | |
[19:56.428] | |
part of it, and then use that to great advantage to make bigger models that are computationally | |
[20:03.548] | |
efficient. There's been a whole line of continued research on sparse models and our Gemini models | |
[20:11.108] | |
use mixture-of-experts architectures, building on this long line of efforts. | |
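A minimal sketch of the top-k routing idea (random untrained weights, a softmax gate, and tanh "experts" chosen purely for illustration; real systems add load-balancing losses and many of the subtle routing choices mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
experts = [rng.normal(0.0, 0.1, (d, d)) for _ in range(n_experts)]  # one weight matrix per expert
W_gate = rng.normal(0.0, 0.1, (d, n_experts))

def moe_layer(x):
    logits = x @ W_gate                                        # gating scores for every expert
    top = np.argsort(logits)[-k:]                              # indices of the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()    # renormalize over the top-k
    # Only the chosen k experts run; the other n_experts - k are skipped entirely.
    return sum(g * np.tanh(x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d)
out = moe_layer(token)
```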
[20:21.268] | |
One of the other things we focused on was how do we build the right abstractions for doing | |
[20:25.968] | |
very large scale distributed ML computations. And as an ML researcher, you don't really | |
[20:31.888] | |
want to think about the physical hardware that's underneath you. You want to think in | |
[20:36.728] | |
terms of, like, okay, I have this cool algorithm and I want it to run on 10 chips, please. You don't really want to know intimately all 10 of those chips and know where they are and, you know, which data center they're in and all this | |
[20:51.196] | |
kind of thing. So one of the things we wanted to build was a system that could abstract | |
[20:55.696] | |
that away from you. And so we built a system, starting in maybe 2017, I guess, called | |
[21:02.756] | |
Pathways, in 2018, that really is meant to take large collections of chips and kind of give | |
[21:12.656] | |
you the illusion that you're addressing just that single large collection. | |
[21:16.976] | |
And underneath the covers, what it will do is use kind of the most efficient networking | |
[21:21.396] | |
transport protocols depending on where the chips are. | |
[21:25.336] | |
So if, like, say the chips in just one of these squares need to talk to each other, | |
[21:30.476] | |
and they're all running in the same TPU pod, | |
[21:32.776] | |
it'll use the really high speed TPU network | |
[21:36.236] | |
to communicate between those chips. | |
[21:37.776] | |
But if you then need to communicate to chips | |
[21:40.156] | |
in an adjacent pod that doesn't have | |
[21:41.996] | |
this sort of custom network, | |
[21:43.496] | |
it'll use the data center network within that building. | |
[21:46.876] | |
And if you need to talk to a pod across buildings, | |
[21:49.836] | |
we'll use the purple link there to communicate | |
[21:52.336] | |
between the pods in, say, building one and building two. | |
[21:57.456] | |
And you can even extend this to train in multiple regions in, say, different parts of the United States | |
[22:04.024] | |
with the large red link there, kind of long latency, long distance. | |
[22:10.604] | |
And what Pathways does, and then JAX built on top of the Pathways client, | |
[22:17.764] | |
is the entire training process can be driven by a single Python process on one host. | |
[22:22.704] | |
So you get this illusion that your Python process, instead of having four chips like a normal JAX process running on a single TPU machine, instead has 10,000 chips. | |
[22:34.524] | |
And you can talk to those different devices and say, I want to run this computation on these 64 devices and this other computation on these 64 devices and communicate the results between them. | |
[22:44.844] | |
And Pathways takes care of managing all the details. | |
[22:48.104] | |
if one of those underlying hardware machines goes down, | |
[22:53.244] | |
it will go ahead and reschedule it and bring it back up | |
[22:57.164] | |
and sort of make this all transparent to the client. | |
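As a small taste of the "one Python process driving many chips" programming model, here is an ordinary open-source JAX sketch (plain jax.pmap over whatever local devices happen to be attached; this is not the internal Pathways system itself):

```python
import jax
import jax.numpy as jnp

print(jax.devices())   # the single Python client sees every attached accelerator

# Shard a batch across all local devices, run the same computation on each shard,
# and combine results with a cross-device mean.
def step(x):
    return jax.lax.pmean(jnp.sum(x ** 2), axis_name="devices")

x = jnp.ones((jax.local_device_count(), 8))   # one row of data per device
print(jax.pmap(step, axis_name="devices")(x))
```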
[23:02.504] | |
Okay, Denny will talk about this in much more detail, | |
[23:05.744] | |
I think. But really, another thing that has really happened is doing inference-time compute to get better answers. So instead of just trying to sort of take the model and ask it what the answer is for a | |
[23:19.432] | |
particular problem, if you instead coax it in lots of different ways to spend more compute | |
[23:26.552] | |
time, essentially using more inference time floating point operations to arrive at the | |
[23:32.272] | |
right answer, you can in many cases actually get substantially improved | |
[23:37.372] | |
performance out of the same underlying model. And one of the key things that | |
[23:41.272] | |
Denny and others came up with, and Quoc and others came up with, is this | |
[23:44.872] | |
chain of thought prompting where you elicit the model to instead of just | |
[23:48.892] | |
trying to give you the answer, you, like your fourth grade math teacher, try to | |
[23:53.512] | |
get the model to work through the sequence of steps that you got to take, | |
[23:58.132] | |
in order to arrive at the answer. And so if you demonstrate that to | |
[24:02.112] | |
to the model as, you know, in the input problem we've given the model one example of working | |
[24:10.492] | |
through this. | |
[24:11.492] | |
You know, Shawn started with five toys. | |
[24:13.492] | |
Instead of just trying to say the answer is nine, you know, your fourth grade math teacher | |
[24:17.192] | |
would be very proud. | |
[24:18.192] | |
You've now worked through exactly how you should work out the problem. | |
[24:22.232] | |
And then the model actually will model this behavior, as it were, and do this on subsequent problems. | |
[24:29.220] | |
And this actually gives you a substantial improvement | |
[24:31.540] | |
in accuracy in these sort of mathematical-oriented problems. | |
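A hypothetical one-shot chain-of-thought prompt in the style being described (the wording of the worked example and the follow-up question below is a paraphrase, not the exact slide content):

```python
prompt = """\
Q: Shawn started with 5 toys. He got 2 more toys each from his mom and his dad.
   How many toys does he have now?
A: Shawn started with 5 toys. He got 2 from his mom and 2 from his dad, which is
   4 more toys. 5 + 4 = 9. The answer is 9.

Q: There are 15 trees in the grove. Workers plant trees until there are 21 trees.
   How many trees did the workers plant?
A:"""
# Imitating the worked example, the model is now much more likely to write out the
# intermediate steps (21 - 15 = 6) before stating the final answer.
```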
[24:35.780] | |
And this general idea of using a lot more compute | |
[24:38.560] | |
and thinking at sort of inference time | |
[24:41.640] | |
is the source of a lot of improvements in the last sort | |
[24:44.320] | |
of four or five years in these models. | |
[24:46.500] | |
It really has made it so that they have yet another dimension | |
[24:50.640] | |
we can scale on. | |
[24:51.720] | |
We could previously scale on training flops and model size and so on, but now we can take | |
[24:57.880] | |
the same model and choose to scale inference compute and get better answers. | |
[25:06.520] | |
Another really important technique for scaling and particularly for serving really capable | |
[25:11.280] | |
models to make them sort of lighter weight is distillation, which I and Geoff Hinton and | |
[25:17.400] | |
Oriol Vinyals kind of worked on in 2014. | |
[25:23.020] | |
And the idea is really you want to have, | |
[25:26.500] | |
suppose you have a really good neural network, | |
[25:28.200] | |
but it's kind of expensive. | |
[25:31.420] | |
You would like to be able to get the knowledge of it; you know, the model knows a lot of stuff. So how could you get what that model knows transferred into a lighter-weight model? And really, if you use next-token prediction | |
[25:44.968] | |
for training, you know, perform the concerto for blank, you know, the model will try to predict | |
[25:49.688] | |
what the next word is. Maybe the real next word is violin. That's great. That's how we train these | |
[25:55.728] | |
models. But if you ask the teacher model what is the next word, it says, okay, well, there's | |
[26:04.468] | |
probability, you know, 0.4 that it's violin, 0.2 that it's piano, 0.01 that it's a trumpet, | |
[26:11.328] | |
because trumpet concertos don't happen as often as violin and piano concertos, but airplane | |
[26:15.748] | |
concertos really, really don't happen very much at all. And so using that probability | |
[26:21.348] | |
distribution, you actually get this really rich gradient signal to train the | |
[26:25.848] | |
student model from the things that the teacher model knows. You actually get | |
[26:30.108] | |
much more information than just the raw text that you're training on. And so | |
[26:35.268] | |
using that rich signal and propagating that back through the whole set of | |
[26:39.248] | |
layers in the model really gives you quite impressive improvements. And in the | |
[26:44.628] | |
original paper, you know, we looked at a baseline of, this is a speech problem, so | |
[26:49.668] | |
So we're trying to predict what sound | |
[26:51.016] | |
is being uttered in a training frame of audio. | |
[26:55.716] | |
The training frame accuracy was 63.4% | |
[26:58.136] | |
for the baseline without distillation | |
[27:00.196] | |
using 100% of the training set. | |
[27:02.536] | |
And the test frame accuracy was 58.9%. | |
[27:06.396] | |
If you use the baseline approach without distillation | |
[27:09.716] | |
but you only use 3% of the training data, | |
[27:12.536] | |
then your test frame accuracy plummets to 44% | |
[27:17.416] | |
because you've only used one thirty-third of the data. | |
[27:21.016] | |
But if you use soft targets from distillation, all of a sudden you get this incredible amount of information from just the 3% of the training examples. | |
[27:30.636] | |
And your test frame accuracy is almost as high as using 33 times more data in the original model. | |
[27:37.516] | |
So this is a hugely useful technique for making really small models work really well if you have a big model that works really well. | |
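A minimal sketch of that soft-target loss in code (made-up logits, a temperature of 2, and an arbitrary 50/50 mix of soft and hard losses; the original paper also suggests scaling the soft term by T squared and tuning the mix):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.exp((logits - logits.max()) / T)
    return z / z.sum()

teacher_logits = np.array([4.0, 3.3, 0.5, -2.0])   # e.g. violin, piano, trumpet, airplane
student_logits = np.array([1.0, 0.2, 0.1, 0.0])

T = 2.0                                            # temperature softens the teacher's distribution
p_teacher = softmax(teacher_logits, T)             # a rich distribution, not a one-hot label
p_student = softmax(student_logits, T)

distill_loss = -np.sum(p_teacher * np.log(p_student))   # cross-entropy against soft targets
hard_loss = -np.log(softmax(student_logits)[0])         # usual loss on the true label (index 0)
loss = 0.5 * distill_loss + 0.5 * hard_loss
```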
[27:46.416] | |
Okay, so I've gone through a whole bunch of different things. | |
[27:55.156] | |
There's been innovations at lots of these different levels, and all of these things | |
[27:58.436] | |
are really, really important to make the models that we have today come together: improved inference algorithms, improved training algorithms, improved model architectures, software abstractions so that we can all express things that we want efficiently, | |
[28:11.184] | |
and improved hardware. | |
[28:15.264] | |
Okay, so Gemini is kind of where we really, | |
[28:18.744] | |
as a company, came together and we combined lots of people | |
[28:22.004] | |
who were working on these problems somewhat separately | |
[28:24.724] | |
into a single effort. | |
[28:25.844] | |
So people from the Google Brain team, | |
[28:27.764] | |
from Google Research and from what we call legacy DeepMind | |
[28:31.524] | |
came together to form the Gemini effort. | |
[28:35.244] | |
And how do we put these advances together? | |
[28:38.444] | |
So in February, 2023, we started with collaborators | |
[28:41.804] | |
from all these different places | |
[28:43.204] | |
and other parts of Google as well. | |
[28:44.944] | |
And the goal was really, | |
[28:45.804] | |
how do we train the world's best multimodal models | |
[28:49.044] | |
and then use them all across Google? | |
[28:51.724] | |
So we had a 1.0 release in December 2023. | |
[28:55.764] | |
I followed that up fairly quickly with a 1.5 release in February 2024 where we showed much longer context capability. | |
[29:04.764] | |
We had our first Flash model, which was distilled from a larger-scale model, in December of that year, where we showed that the 2.0 Flash model was as good as the 1.5 Pro model. | |
[29:17.472] | |
So one of the things we've been really happy with so far is every | |
[29:22.232] | |
generation we've been able to make the much lighter weight and cheaper model | |
[29:26.512] | |
not only way cheaper but also as good in quality terms or even slightly better | |
[29:33.732] | |
than the previous generation's large-scale model. And I think that's really helped, you know, | |
[29:40.212] | |
adoption of these things because sometimes the expensive model makes | |
[29:43.332] | |
sense for a particular application but if you have a model that's ten times | |
[29:46.452] | |
cheaper, there's just way more things you'll find that you can put the model to use for. | |
[29:52.212] | |
And then we sort of showed a thinking model that could sort of use inference time compute | |
[29:58.372] | |
to really scale and get better answers. And then the 2.5 series in March where we released 2.5 Pro | |
[30:06.852] | |
and then the 2.5 Flash model in April and then in August the 2.5 Pro DeepThink model | |
[30:13.252] | |
which uses a lot more reasoning capability in order to solve harder problems. | |
[30:21.852] | |
So one of the things we wanted was the Gemini models to be multimodal from the start because | |
[30:25.692] | |
we felt like focusing on text models really wasn't the right thing. You want models to be able to take in input in all different kinds of modalities, because the world is messy and you sometimes have text, you sometimes have | |
[30:37.840] | |
images or audio or video input and you also want to be able to produce all of | |
[30:43.600] | |
those modalities as output and initially we didn't produce that many but over | |
[30:48.580] | |
time we're adding more and more native decoding modalities, as well as | |
[30:55.660] | |
all the input modalities. The context length has been a big thing that we've | |
[31:02.320] | |
focused on: how do we make the model able to take a lot of input data? A | |
[31:08.900] | |
million tokens is actually quite a lot. It's like a thousand pages of text, so | |
[31:13.740] | |
you can you know put in 20 or 30 research papers, you can put in a long | |
[31:18.220] | |
book, you can put in two hours of video or ten hours of audio or some | |
[31:22.740] | |
combination of those things and then the model can actually attend to that in a reasonable | |
[31:27.180] | |
way. And importantly, the information in the context window is quite clear to the model. | |
[31:32.900] | |
Unlike the, I like to say the training process is you take trillions of tokens of training | |
[31:39.420] | |
data and you stir it all together and it updates all the parameters of the model, but you really | |
[31:43.928] | |
kind of stirred it all together into hundreds of billions of parameters and | |
[31:48.668] | |
the information in the training data is really useful for the | |
[31:52.828] | |
model's basic capabilities, but it's not, like, crystal clear, unlike the | |
[31:56.888] | |
information in the context window. You know, Gemini 2.0 really took a lot of the | |
[32:05.688] | |
things I talked about and put all those things together into sort of our | |
[32:09.968] | |
next 2.0 series of models: TPUs, cross-data-center training, Pathways, JAX, Transformers, | |
[32:18.508] | |
sparse models, distillation, lots of other things. | |
[32:24.668] | |
And 2.5 Pro, when it was released, was a very big improvement in model quality for us. | |
[32:31.408] | |
We were really happy to see that. | |
[32:35.088] | |
And at the time of release, it was actually the top model in many, many different things, | |
[32:41.488] | |
and it still is in many of them. | |
[32:44.168] | |
So that's been really nice to see. | |
[32:46.328] | |
People have used that model for all kinds of things, and people can tell when they're | |
[32:51.688] | |
using a model that is better than the models they used to use. They can really sort of experience that, because it's able to solve problems that they threw at these models before and they maybe weren't able to do what they wanted, | |
[33:05.116] | |
and now when you give them a model that's higher quality, | |
[33:08.076] | |
they try those things and sometimes that works | |
[33:10.016] | |
and that's really nice. | |
[33:14.936] | |
So one of the things we really want to do | |
[33:18.396] | |
is push the Pareto frontier of quality | |
[33:22.336] | |
on the y-axis and then cost on the x-axis. | |
[33:26.896] | |
And the cost here is a logarithmic scale. | |
[33:30.176] | |
So actually big cost differences don't look that big. | |
[33:34.716] | |
But we really wanna have a sort of series of models | |
[33:38.416] | |
that enable people to choose the right trade off. | |
[33:40.436] | |
This graph was accurate at the time of the release | |
[33:43.916] | |
of 2.5 Pro. | |
[33:46.616] | |
Sorry, I didn't update it for the talk. | |
[33:49.276] | |
But really, we want you to have really high quality models | |
[33:54.016] | |
that are at state of the art quality level | |
[33:57.656] | |
at a reasonable cost, but then we also want to make, | |
[34:00.316] | |
using distillation and other techniques, | |
[34:02.496] | |
really lightweight models that are really fast | |
[34:05.856] | |
and much cheaper than the state of the art models | |
[34:09.356] | |
and can be used for many more things. | |
[34:10.384] | |
And that's our Flash series of models. | |
[34:15.224] | |
We also think it's important to have a set of open source models and | |
[34:20.564] | |
the reason being that | |
[34:22.264] | |
you know, there are a wide variety of reasons for wanting to have an open source model. | |
[34:26.764] | |
People want to be able to fine-tune the models on their own data. | |
[34:29.444] | |
They want to be able to run it in on-prem or other situations where they can't necessarily use a cloud-based API. | |
[34:35.804] | |
And | |
[34:37.804] | |
So we've worked on putting out a high-quality series of open source models, and we really kind of tried to focus on the sweet spots for developers, which are ones that are not huge resource footprints, because that's much more useful for most people, where they can run the model on a single higher-end GPU card rather than having 32 GPU cards or something. | |
[35:01.184] | |
And so really, Gemma 3 is our set of open source models that are using sort of most of the | |
[35:08.484] | |
innovations from the Gemini 2.0 series in an open source footprint. | |
[35:13.944] | |
And the graph here shows you quality on the bars, but also below it shows you the number of GPUs needed to serve that model. So you have a model that's quite high quality, but it's only a 27-billion-parameter model and you can serve it on a single | |
[35:29.832] | |
GPU card, which has been a big hit with developers. It's been downloaded more than 200 million times | |
[35:36.312] | |
since it was released, and supports 140 languages. That's another reason to have | |
[35:41.512] | |
a good multilingual model, because people can use it in many more circumstances. | |
[35:52.592] | |
We've also focused on, from a reasoning perspective, the mathematical domain. | |
[35:58.312] | |
And so last year in July, we worked on a model that could compete on IMO-level mathematical | |
[36:08.572] | |
problems. | |
[36:09.992] | |
And so in July last year, we presented AlphaProof, and AlphaGeometry for geometry-focused | |
[36:18.872] | |
problems. | |
[36:19.872] | |
And those systems together solved four out of the six problems from the 2024 IMO Olympiad | |
[36:26.172] | |
and got the same level as a silver medalist in the competition for the first time. | |
[36:31.312] | |
So this was like a really good litmus test for are these models capable of doing IMO | |
[36:36.812] | |
level mathematics. | |
[36:36.840] | |
But there were a few caveats. | |
[36:40.380] | |
So first, it was kind of two separate models and you'd say, oh, well this one looks like | |
[36:44.780] | |
geometry, we should give it to the geometry model, and this one doesn't look like geometry, | |
[36:48.960] | |
let's give it to the other model. | |
[36:51.060] | |
The IMO rules allow you to get the problems translated into the language of your choice | |
[36:57.440] | |
so that it's accessible to people who speak other languages. | |
[37:01.480] | |
We speak Lean, which is a formal language for mathematical proofs. | |
[37:07.120] | |
And so the starting point for last year's competition was human-translated variants | |
[37:16.440] | |
of the problems translated into Lean, so that we could then use that as a starting point | |
[37:21.000] | |
and then use sort of Lean verification of some of the problems. | |
[37:25.160] | |
And we used two days of computation on quite a lot of chips, whereas IMO competitors were | |
[37:30.360] | |
given four and a half hours. But still, with those caveats, this system actually performed | |
[37:35.600] | |
pretty well and did a pretty good job in the 2024 IMO. | |
[37:41.360] | |
And this year in July, and I'm sure Denny or Quoc will talk more about this, or maybe Yi, tackling more and more advanced math, we actually had a single model that solved five of the six IMO problems perfectly, | |
[37:59.608] | |
getting 35 out of the possible 42 points and achieving gold-level performance. | |
[38:05.128] | |
We published the solutions. | |
[38:07.148] | |
The IMO president said, you know, we did it. | |
[38:10.468] | |
And the solutions were astonishing in many respects, | |
[38:12.788] | |
and the graders found them to be clear, precise, and most of them easy to follow. | |
[38:16.548] | |
So that's pretty good. | |
[38:18.968] | |
And the differences versus last year, | |
[38:21.148] | |
the problems were solved by a single model | |
[38:22.948] | |
with deep thinking and reasoning capabilities. | |
[38:26.388] | |
The input was provided as informal math, not Lean, | |
[38:29.728] | |
so we didn't do this funny translation into our language of our choice. | |
[38:34.588] | |
And all problems were solved within that four and a half hour time limit. | |
[38:38.588] | |
So, you know, although we went from silver to gold, | |
[38:41.328] | |
it's actually a bigger leap, I think, | |
[38:42.748] | |
because of all those caveats that we've removed. | |
[38:45.828] | |
Now it's the single model taking the exact problems | |
[38:48.468] | |
that the competitors also got and solving them | |
[38:52.608] | |
in the time frame. | |
[38:55.848] | |
But one of the things we wanted to do | |
[38:57.428] | |
was not build a specialized mathematical system. | |
[39:01.348] | |
So the 2024 IMO set of models, | |
[39:03.296] | |
and the Lean theorem-proving thing that would use Lean in order to verify some of | |
[39:10.116] | |
the outputs from that model felt more like a specialized mathematical system. Whereas this | |
[39:16.336] | |
year's system, we really just wanted to have a single model that was good at solving these math | |
[39:20.836] | |
problems, but also would extend those capabilities to other domains, like other kinds of reasoning | |
[39:26.396] | |
or coding. And so one important technique in the DeepThink model variant that was developed | |
[39:33.056] | |
is really exploring many possible directions in parallel for solving a problem | |
[39:37.916] | |
so that you can sort of have this parallel exploration, | |
[39:41.276] | |
use more inference time compute in order to sort of come up with better answers | |
[39:45.436] | |
for the problem or more sort of exploratory directions for solving the problem. | |
[39:51.696] | |
And you see this in a couple of the different benchmarks that are not math related. | |
[39:56.316] | |
There's Humanity's Last Exam, which is something with a whole bunch of different kinds of problem | |
[40:02.856] | |
domain areas, and compared to Gemini 2.5 Pro, Gemini 2.5 with the deep thinking and more inference | |
[40:10.756] | |
time compute actually makes a significant improvement in the capability and the score on that. And similarly, in coding, the DeepThink model is a lot better than the base model. | |
[40:28.284] | |
The other thing about these multimodal models, | |
[40:30.644] | |
and as we start to put more multimodal generation | |
[40:35.684] | |
on the back end of the Gemini models | |
[40:37.944] | |
so that it can generate not just text, | |
[40:39.784] | |
but also generate images and have conversations | |
[40:41.904] | |
about images you're generating and input images and text | |
[40:46.564] | |
combined in order to have you naturally work | |
[40:52.884] | |
with all the different modalities you care about, | |
[40:55.344] | |
you can see we have an update to our image generation | |
[41:02.924] | |
model called NanoBanana as a code name on the web | |
[41:06.704] | |
before it was actually released into the not as excitingly | |
[41:09.904] | |
named Gemini Flash 2.5 Advanced Image Generation and Editing. | |
[41:16.524] | |
Oh, thank you. | |
[41:18.104] | |
But you can sort of take two images, | |
[41:20.044] | |
like the mountain and the nice whales there, | |
[41:22.904] | |
and turn it into whales jumping by Mount Everest. | |
[41:27.704] | |
You can ask, you can give it an input image, | |
[41:29.752] | |
and then ask it to imagine what's happening next. | |
[41:34.752] | |
And in visual form, it can imagine | |
[41:37.912] | |
what happens next, which seems like one pretty probable | |
[41:42.992] | |
scenario. | |
[41:45.952] | |
You can create your business cards. | |
[41:52.872] | |
Yeah, I mean, these models are kind of fun | |
[41:54.452] | |
because you can sort of imagine things that don't really exist. | |
[41:58.292] | |
And the quality of these models is really getting better. | |
[42:01.252] | |
I think if I click this, it will. | |
[42:03.152] | |
So this is a video where you're given an image of a box, | |
[42:08.792] | |
and then there's a text prompt, which | |
[42:10.492] | |
is not shown in the video. | |
[42:12.412] | |
But it's lots of different text prompts, | |
[42:13.952] | |
so it sort of shows you what happens | |
[42:16.792] | |
when you take this image of a box | |
[42:19.052] | |
plus an appropriate text prompt. | |
[42:20.612] | |
A bunch of different things you can do with that. | |
[42:22.612] | |
And if we can play the video, I don't know if I click, | |
[42:24.852] | |
will it play it? | |
[42:26.092] | |
If I click, it'll play. | |
[42:27.172] | |
Great. | |
[42:27.672] | |
Okay, so there you go. There's our box. | |
[42:38.480] | |
I think that last one's my favorite. | |
[42:58.780] | |
Okay. | |
[42:59.660] | |
So you can see, like, these are interesting creative tools | |
[43:02.740] | |
because you can sort of take things in the real world | |
[43:04.960] | |
but then sort of use your imagination and text prompting to kind of get things that are kind of real but kind of not to happen. | |
[43:13.380] | |
And I think you're seeing kind of a new generation of creative filmmakers or video makers using these tools to sort of create little short clips. | |
[43:23.100] | |
You know, we've seen some five minute videos created with a whole sequence of, you know, | |
[43:28.920] | |
eight or 10 second generations to create a whole story | |
[43:33.600] | |
of some interesting set of characters | |
[43:37.200] | |
that the filmmaker has made up. | |
[43:39.500] | |
And this is gonna make really high-quality video generation | |
[43:44.840] | |
dramatically more accessible to more people. | |
[43:47.740] | |
Because normally if you're making a movie | |
[43:49.480] | |
or something like that, you have to get a cast | |
[43:51.820] | |
and you have to go shoot on location. | |
[43:53.680] | |
And all of a sudden now, with like $500 worth of prompting, | |
[43:56.208] | |
you can do what previously would have been tens of thousands or $100,000 of | |
[44:03.728] | |
effort to go out and actually physically create that. | |
[44:06.448] | |
And you can even do effects that would be really hard to create otherwise. | |
[44:11.408] | |
Okay. | |
[44:13.648] | |
There's been a bunch of discussion about the environmental impact of AI on both training | |
[44:19.488] | |
and more recently on inference as more and more people are using these models, they want | |
[44:24.448] | |
to know what is the sort of environmental impact of my typing in a Gemini prompt and having | |
[44:30.508] | |
it respond. | |
[44:32.988] | |
And there's actually been some work that's looked at that in various contexts, | |
[44:42.228] | |
but it's made a bunch of assumptions that aren't really very realistic for how these | |
[44:47.128] | |
models are actually deployed and used. | |
[44:48.608] | |
So in particular the past work by outsiders often uses batch size one, which is hugely | |
[44:55.608] | |
inefficient because all of a sudden you bring in all your parameters from HBM in order to | |
[44:59.968] | |
do batch size one. | |
[45:02.508] | |
The computation doesn't consider quantization, which is again another multiple-integer factor for performance improvement. It doesn't consider techniques like speculative decoding, which means you have a very small model that's trying to predict the next four or eight tokens, and then your large | |
[45:18.016] | |
model in one pass verifies that rather than making sequential passes through | |
[45:22.716] | |
the large model for every one of those four or eight tokens. Instead, you | |
[45:26.776] | |
generate eight tokens, you have the large model verify how many of them it | |
[45:31.036] | |
agrees with, and then you say, okay, great, I agree with the first five of those. | |
[45:34.816] | |
So I'm going to just generate those five with one pass through my large model and then advance and now have my tiny | |
[45:42.896] | |
decoding model generate another eight tokens. | |
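A toy sketch of that accept-and-advance loop (the draft and verify functions below are fake stand-ins with random behavior, not any real model API; a real system would also emit the large model's own token at the first disagreement):

```python
import random
random.seed(0)

def draft(prefix, k=8):
    # small model proposes k candidate next tokens (faked with random toy tokens)
    return [random.randint(0, 9) for _ in range(k)]

def verify(prefix, proposed):
    # large model checks, in ONE pass, how many leading proposals it agrees with
    # (agreement faked here with a coin flip per position)
    n = 0
    for _ in proposed:
        if random.random() < 0.7:
            n += 1
        else:
            break
    return n

sequence = [1, 2, 3]
while len(sequence) < 30:
    proposed = draft(sequence)
    accepted = verify(sequence, proposed)
    sequence += proposed[:accepted] or proposed[:1]   # always advance by at least one token
print(sequence)
```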
[45:45.836] | |
And that again is a multiple-integer-factor performance improvement. But then, some of this work also | |
[45:54.116] | |
misses some costs. So it doesn't take into account the full set of energy costs, things like the host machines that the accelerators are on, | |
[46:01.696] | |
or idle machines you actually need for a real production setting. | |
[46:06.496] | |
So we wanted to actually take a look at what is the actual inference cost of Gemini models | |
[46:13.156] | |
in our production environment. | |
[46:15.016] | |
And so that's what this paper that we released a couple weeks ago, I guess, looks at. | |
[46:21.116] | |
And so you can see some of the previous work | |
[46:22.664] | |
and the different kinds of methodology that they used. And so most of them | |
[46:29.144] | |
focus on the accelerator power, usually with GPUs, so either A100s or H100s typically. The utilization | |
[46:38.664] | |
they look at, you know, some of these look at just batch size one, some look at larger batch sizes. | |
[46:46.584] | |
I forget exactly, but I don't think most of them look at quantization, and they don't look at speculative | |
[46:51.464] | |
decoding and then they also generally don't include the CPU and host machine costs, energy costs or | |
[46:59.544] | |
idle machines or other overhead in the data center power distribution system. So we wanted to look at | |
[47:06.104] | |
all of those things and see what we came up with. And this is the answer and so basically |