Created
June 23, 2015 18:22
Robert Tomasulo transcript
Transcript of the talk at https://www.youtube.com/watch?v=S6weTM1tNzQ | |
Thank you very much for that kind welcome! | |
What I intend to do with my time today | |
is divide it into two pieces | |
not necessarily equal. | |
The first piece will be a very cursory | |
examination of 20, 30 years of computer design | |
from the model 91 on forward to when it finally got replaced | |
as an out-of-order execution machine | |
and then I'll spend the rest of my | |
time answering your questions. | |
By the way interrupt me any time you have a | |
question or a comment that you think is appropriate. | |
So, to come way ahead, to roughly 1990 or thereabouts | |
IBM finally brought out an out of order machine | |
and this was such a shocking thing that | |
they felt it necessary to offer | |
an explanation or two as to why it | |
took them 30 years to do this | |
and they did | |
and the main explanation which was | |
quite a good one | |
was that once you have a cache | |
with a very rapid access to memory | |
(not counting misses of course which we'll discuss a little bit later) | |
and you don't have to worry about floating point | |
and you don't care about long execution | |
instructions like floating point then there's really no need | |
for what the model 91 offered | |
and so naturally they got rid of it. | |
But as time marched on it became apparent that | |
they were going to need something. | |
Now, IBM moves in mysterious ways - | |
they probably move a lot slower than | |
I would have liked or other people would | |
have liked but ultimately they get it done. | |
So they brought out this machine | |
and they provided a reasonable rationale | |
which the only flaw in it was that as time went on | |
(because we're talking now about a 30 year span) | |
it became less and less applicable | |
to think that way because memories kept getting slower | |
as they always have | |
I mean you can think of it as kind of a Golden Age | |
of the cache | |
that we had this period where you could get away with | |
two cycle access to main memory | |
and Gee, if you missed the main memory you had like | |
five cycle access to whatever backed it up | |
instead of, in today's machines, | |
you've got 70 cycle access to whatever backed it up. | |
5:14 | |
so | |
after... let me get straight what I want to say... | |
I'm going to jump around somewhat... | |
So with all these caveats we didn't do an out of order machine | |
and maybe we could have a little bit sooner | |
than otherwise. | |
Now we jump back to the Model 91 timeframe | |
and why didn't they do an OOO execution machine then? | |
Well, they did, they did. It was a machine - I forget | |
what the nomenclature was now - it was a full OOO | |
machine. It was completely logically designed | |
it wasn't physically designed. So it's hard facts. | |
And, it featured some very nice innovations. | |
Including a branch prediction table which was | |
a new, a relatively new thing. | |
What else did they have? They had something in there specially | |
for RAS purposes, it's slipped my mind, but no-one cares | |
about RAS anyway. Although we should. | |
So, this machine was carried through to everything | |
except final physical design. It was good as gold, it | |
was completely debugged and conformed to the | |
architecture and all that other good stuff. | |
It was deemed that it was too expensive | |
which it probably was. | |
By the way, it had a 7ns cycle, which for | |
this period of time was pretty good. Pretty good. | |
Contemporary IBM machines were like 40, 50ns | |
cycle at that time. | |
7:36 | |
So, now we have nothing. We don't have the model 91 any more | |
and we didn't bring out a successor machine. | |
So in a sense nothing happened, from that point of view, for machine | |
design, for some 30-some odd years. Which was sad, I think | |
but that's OK. | |
> Any questions? | |
... just leapfrog to the rest of my talk. Or rather, the rest of your talk. | |
I don't have to do too much talking - as you see I don't have too much to | |
do because IBM made it too easy, right? They made one - I wouldn't | |
say it was a halfhearted attempt - they made a good attempt, to make | |
a really good machine, but it wasn't on the cards, and that was it. | |
Then 30 years go by, and then it's in the stars, and they can start over | |
again. | |
So that's really the main thing I wanted to cover. Obviously I can cover | |
more things if you want, obviously machine design didn't stand still. | |
During this time there were improvements in RAS, and all kinds of things. | |
But for our purposes I didn't think I'd want to devote too much | |
time to those things. | |
9:40 | |
> Now onto questions. | |
[The questions are difficult to hear clearly enough to transcribe] | |
[Roughly when was this OOO design done?] | |
My memory isn't that good... early 70s, 72. | |
[Something about instructions] | |
Yes, in a limited way. What it did was classify the instructions | |
into fixed point, floating point, and decimal. And one instruction | |
in each class could be executed along with an instruction | |
from another class. | |
So it didn't rely really heavily on OOO, because you have to | |
remember that it's still true that, with the cache, a lot of the | |
gain of OOO evaporates. So there's not much point pursuing | |
it just for the sake of pursuing it. | |
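[Editor's sketch, not part of the talk: the class-based issue rule described in the answer above could be modelled roughly as below. This is a minimal illustration that assumes in-order grouping and ignores data dependencies entirely. The function name `issue_groups`, the class labels, and the instruction names are all hypothetical; the talk gives no detail beyond "one instruction in each class".]

```python
# Hypothetical model: instructions are grouped in program order, and a group
# may hold at most one instruction from each class (fixed, float, decimal).
# Data dependencies between instructions are deliberately ignored here.

def issue_groups(program):
    """Split a list of (name, cls) pairs into groups that could issue together."""
    groups = []
    current = []      # instruction names in the group being built
    used = set()      # classes already present in the current group
    for name, cls in program:
        if cls in used:
            # A second instruction of the same class closes the current group.
            groups.append(current)
            current, used = [], set()
        current.append(name)
        used.add(cls)
    if current:
        groups.append(current)
    return groups


program = [
    ("fx1", "fixed"),
    ("fp1", "float"),
    ("dec1", "decimal"),
    ("fx2", "fixed"),
    ("fx3", "fixed"),
    ("fp2", "float"),
]
print(issue_groups(program))
```

[Under these assumptions, a stream that alternates classes groups well, while a run of instructions from one class degenerates to one instruction per group.]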
[How large a team of people] | |
Well, the model 91 was special in the sense that they did | |
a whole new technology, they did a whole new design | |
automation system, and they did a whole new machine. | |
So they had a lot of change on their plates. | |
[How many people?] | |
The model 91, because it was developing the design | |
automation system and the software and everything, it had a lot of | |
people, altogether. It didn't have that many, if you focus in on | |
the designers, people like I was, there might have been | |
twenty, maybe, order of magnitude. | |
[laughter] | |
Why is order of magnitude funny? You think I'm trying to | |
hide a hundred? I wouldn't do that! | |
[Most trouble in 91?] | |
We had trouble until we discovered the OOO algorithms, | |
that cleared up one source of trouble. | |
The whole machine was stretching. | |
It was a strange machine because it had | |
really pitiful memory access. This was before | |
caches, so it had like 10 cycles, minimum | |
memory access. Now of course it had 16-way interleaved | |
memory, so you're not necessarily waiting 10 cycles | |
for every single access, but it's | |
pretty pitiful. | |
So memory was a big bottleneck for that machine. | |
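[Editor's sketch, not part of the talk: a toy model of why 16-way interleaving softens a 10-cycle memory access. It assumes that consecutive addresses map to consecutive banks, that at most one request issues per cycle in order, and that a bank is busy for the full access time. The function name `total_cycles` and these modelling choices are assumptions for illustration, not a description of the actual Model 91 storage system.]

```python
# Toy interleaved-memory model: `banks` independent banks, each busy for
# `access_cycles` after a request starts; requests issue in order, at most
# one per cycle, and stall if the target bank is still busy.

def total_cycles(addresses, banks=16, access_cycles=10):
    """Return the cycle at which the last of the given accesses completes."""
    bank_free_at = [0] * banks   # cycle at which each bank becomes free
    cycle = 0                    # earliest cycle the next request may issue
    finish = 0
    for addr in addresses:
        bank = addr % banks      # assumed mapping: low-order interleaving
        start = max(cycle, bank_free_at[bank])
        bank_free_at[bank] = start + access_cycles
        finish = max(finish, start + access_cycles)
        cycle = start + 1
    return finish


sequential = total_cycles(range(32))          # every access hits a new bank
same_bank = total_cycles(range(0, 512, 16))   # every access hits bank 0
print(sequential, same_bank)
```

[In this model, 32 sequential accesses overlap almost completely, while 32 accesses that all land in one bank each pay the full 10 cycles.]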
When they finally brought out... part of the 90 line... | |
they may have brought out a high speed version. | |
I seem to remember they made two versions | |
of thin-film memory machine, which was very fast | |
memory. Unfortunately it couldn't have the huge | |
number of megabytes you can get with conventional | |
memories. | |
[You went on to work with STC on one of the first microprocessor based... | |
how did those servers differ from PCs] | |
As is often the case, a big company like IBM, they're not | |
necessarily first out of the gate with new things. It may take them | |
a while, especially if they don't have competition. | |
So part of what happened is due to that kind | |
of phenomenon. | |
But I don't know if the rest is due to that. | |
16:22 | |
Understand, this machine, the first one we did | |
at STC - the only one we did at STC - was | |
not supposed to be a high performance machine. | |
It was supposed to be below the IBM machines. | |
Because it was supposed to serve as a server, or a | |
lead-in to the IBM machine. | |
Reality intruded. The first thing that happened, the | |
technology was 2x slower than we thought it was. | |
But fortunately it was also 2x faster than we thought | |
it was, so we recouped back. | |
But we were still in the hole, and we end up | |
pulling some pretty sophisticated tricks | |
and you don't want to do that | |
you're dealing with a group - not all - of neophytes | |
and you don't want to be tackling complicated | |
things if you can avoid it. | |
And that was, partially, what did them in. | |
It was too complicated, for them. | |
17:51 | |
[Something about marketing as the PC?] | |
Oh, Yeah! God love marketing people, I wanted to strangle them. | |
[laughter] | |
We were going for three years on this project, | |
sweating bullets, to try to wring out | |
every last bit of performance | |
that these guys, the marketing people, were telling us | |
"we absolutely need that performance, we're going to | |
put you in a certain environment, you're not going to | |
sell that many machines" | |
and what happens, is that | |
"oh no, we don't really have to be in that environment | |
we want a cheaper machine that doesn't go as fast" | |
Sheesh, I wanted to strangle them, all of them. | |
It happens. | |
18:43 | |
BTW this is not the end of the programming, it's the | |
second part where you get to ask questions. | |
[question about moving on from the system/360 to consulting | |
how did you find your job had changed] | |
Oddly enough, I don't think, that much. | |
Except that it was a mistake, striving for too much | |
performance, was bad. We got ourselves a lot | |
of grief from doing that. | |
Because, you know, the architecture of the machine | |
is semi-stable. It changes, you have to upgrade with | |
the times, but it is semi-stable. | |
20:14 | |
[what was the original motivation for the OOO architecture | |
and how come it took so long to be used in production. | |
Why was the idea ahead of its time] | |
The short answer, we've already discussed it. One, there was | |
a machine, a successor, which had got scrapped. Without that | |
machine, and with the advances in cache, there was no point, | |
really, in out of order execution. At least not until the 80s. | |
You can argue about at what point it might have made | |
sense. | |
[Given that, why try to do OOO] | |
At the time, we were young | |
[laughter] | |
young and bold and we wanted to go for everything we can | |
get. And if we hadn't had the idea, we would have built a | |
perfectly good machine which in the best case | |
might be 20 or 30% faster thanks to the floating point | |
guy. Because he speeded up the floating point. | |
So it wasn't that big a deal. But it was a coup. | |
It was something Seymour Cray didn't have. | |
No-one had it. So IBM could get some bragging rights | |
out of the whole thing. | |
[Back to 60s-70s what was it like to convince designers | |
and architects that OOO was good] | |
Pretty smart team! They didn't take any convincing. | |
I had the idea on a weekend, went in Monday morning | |
sketched out the bulk of the important parts of the idea. | |
They were thorough, they made sure they covered things, | |
they made sure there were no serious quibbles. | |
And then we were off and running. | |
Now, you know that you can do more with OOO | |
than we did. And in particular one thing which is very | |
important for the operating system is to be able to do | |
loads out of order. So that you can stack up loads which | |
might be delayed for whatever reason, and then maybe | |
get some other instruction through. Because that's the | |
main ... once you get rid of floating point - let's say long | |
instructions - then that's basically it. Even the OOO that | |
I was just talking about is not really a barn-burner kind of thing. | |
It's good to have. Like all of these things, you get to build | |
on them. We didn't get to build on them for like 30 years | |
but you get to build on them. | |
And out of the closet they come and you find you can do | |
something you couldn't do before. | |
25:00 | |
[What kind of design automation did you have] | |
Ha ha, none! | |
[laughter] | |
We were the first machine in IBM to simulate the logic of | |
the machine. And we could simulate 1000 gates at a time. | |
[laughter] | |
It's pitiful! I mean, the model 91 isn't by any means a huge | |
machine but it's like 40 or 50 thousand gates. | |
So that's nothing. | |
And this was something that developed towards the end of | |
the project. | |
Like I say, I give loads of credit, we had smart people. | |
Including this one guy and gal who were really DA people | |
and they were pushing what they could do, to improve | |
the performance of the machine. | |
26:28 | |
[Debugging efforts and kinds of problems/bugs, and the process | |
to fix] | |
Yes, And an interesting sideline to that, we had this programmer | |
who eventually wrote a simulation of the machine. | |
And he discovered two bugs, in the machine - I think it had to do | |
with fetching - because in those days we had a pretty sophisticated | |
fetching algorithm, where you start out with nothing and you try and | |
catch a loop. So we really didn't have much of anything in the way | |
of debugging. It was coming, but not for us. Which is too bad. | |
[Followup, how to debug the machine] | |
Well, that's somewhat of an art. Especially in those days. | |
28:38 | |
We had an interesting little experience. We were plagued by something | |
called the 'cracked stripe' | |
which none of you probably ever heard of. But it was a fault in the | |
technology | |
due to the extremely high current density that they had pushed the | |
circuits to, such that the wind, the electron wind, going through these very | |
fine circuits, was blowing the atoms away. And you would get faults. | |
You would get open circuits. | |
So that was an interesting problem that we had to deal with. | |
[How to find that this was going on? How to deal with it?] | |
The 'cracked stripe' was special. We were experiencing one | |
failure every day. Now, most of you don't have experience debugging | |
a machine, but you can't debug a complex machine if you have one failure | |
a day. There are just too many things to find. | |
We were in real trouble. | |
And the answer was technology, and they had to fix the technology. | |
And in the case of the 91, they remade all the technology, in the case of | |
some of the slower machines in the 360 line they only partially remade them | |
and some they didn't remake at all. | |
Because it was a time-dependent thing. The faster your circuit was, the more | |
it was prone to this problem. | |
30:52 | |
[how long were these systems under development, from | |
Thomas Watson saying I want a fast computer] | |
Well, what I have to do, which doesn't make a nice clean | |
picture, is the following. | |
We commenced on the model 91 in about 1963 | |
possibly late 62. | |
Because of the cracked stripe problem, it took us a very long | |
time to debug the machine. | |
If there'd been no cracked stripe problem we probably would have | |
brought up the machine two years earlier than we ultimately | |
brought it up. | |
So that was a real devastating blow to us. | |
And you know, it's really hard, when your hardware is failing under you, | |
it's hard to make progress | |
and it's failing in random ways | |
and sometimes you take it out of the machine | |
and it doesn't fail! | |
Now what do you do? | |
Thank your lucky stars if it fails next time. | |
[During development of 91 did you think about compiler optimisations] | |
We didn't have much to say about compilers | |
I was in the hardware group | |
we were conversant with some of their problems | |
later on there was more back and forth | |
because I always had - after the initial model 91 thing - a dual role of architect | |
in the early phases which is really | |
software architecture and then implementing the machine. | |
Does that answer your question? | |
[You mentioned Seymour Cray, he was still at Control Data | |
was there a lot of competition | |
did you take his machines apart?] | |
We did a little more of that later on | |
[laughter] | |
but no we didn't do that that much. | |
He had, it's very interesting how these things work out | |
because he had a jump start on us, okay | |
because he was already working on | |
his machine | |
and we were just starting on ours - we didn't even | |
have the technology to build our machine | |
So we were in real trouble. | |
So, what are we to do in this circumstance? How can we | |
make up lost ground. We tried all kinds of things, perhaps not | |
well-founded, to try and make up this lost ground | |
34:38 | |
But it's difficult. And what happened was, we were saved. We had an | |
assessment of how fast the Cray machine was. So we said | |
okay, we think it's this fast, they are going to be two years after us | |
so we've got to be twice as fast as them so we can ... come out. | |
That sounds feasible. | |
Well, it turns out the Cray is four times faster | |
[laughter] | |
than we thought. Meanwhile our machine is two times faster | |
than we thought it was. | |
The net result of all this fiddling around was rough parity. | |
There were some things we did faster | |
some kinds of problems they did faster. | |
But they still had - perhaps undeserved - the reputation | |
for raw compute speed. | |
And I think all of the customers, like the Atomic Energy Commission | |
laboratories, who were supposedly interested in that, they all went | |
to Cray, and IBM sold to - how to characterise it - database kind | |
of applications, that need a lot of memory and concurrency, a lot | |
of I/O running and all that kind of stuff. Doesn't particularly need a lot of floating point performance, | |
computational performance in general | |
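[Editor's sketch, not part of the talk: the "rough parity" arithmetic in the answer above, written out. It assumes the plan was a machine twice as fast as the estimated Cray, and that "faster than we thought" scales each side's speed by the stated factor. The function name `speed_ratio` and its parameters are hypothetical.]

```python
# Planned advantage, corrected by how far each estimate was off.

def speed_ratio(planned_ratio, own_factor, rival_factor):
    """Actual own/rival speed ratio after both estimates are corrected."""
    return planned_ratio * own_factor / rival_factor


# Planned 2x advantage; own machine 2x faster than thought; rival 4x faster.
print(speed_ratio(planned_ratio=2.0, own_factor=2.0, rival_factor=4.0))
```

[Under these assumptions the planned 2x advantage becomes 2 x 2 / 4 = 1, which matches the rough parity described.]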
36:21 | |
[Tell us about how IBM culture changed over your career] | |
That's a tough one for me inside. | |
First, I have to divide my career into two parts. First, the five or six | |
years when I did my machine design - not my machine design - and then after that. | |
How did IBM change? It became less interesting and cutting edge. | |
In the beginning it was wide open, crazy ideas and if you could implement them | |
and it would buy some performance, you got it. | |
As time went on, you get more and more constrained, by the architecture | |
and by the necessity of other machines. You're not just allowed to design | |
for the model 91 class, you have to design a machine, you know, the next class | |
down, we had three or four [classes] by that time. So there were all kinds of | |
things standing in the way of pure performance. And you just had to | |
live with those. There's no way around it. | |
38:38 | |
[How about backward compatibility?] | |
Oh yeah, that was a must, that was a no-brainer, we weren't allowed to touch | |
that with a ten foot pole. In fact we had to get a special dispensation, because | |
the model 91, because of its out of order floating point, the effect on interrupts | |
was actually in violation of the architecture, and they had to get a special | |
dispensation for the model 91. | |
I don't think anyone ever suffered from this dispensation in those days, but | |
nevertheless you had to get it. | |
39:20 | |
[Any specific ideas from your team that wouldn't have worked] | |
You're asking me if any ideas sort of died? | |
That's really hard to answer. You like to think that you wring out the most | |
performance you can get out of your technology, from what you've got, | |
but that didn't really... really and truly there are all manner of compromises | |
that have to be made - not have to be made, some have to be made, others | |
don't have to be made, but you're not omniscient, you don't know everything, | |
so it's really hard to say how much that affects machine design. | |