nuts-n-bits/Lec 3 notes.md

Last active March 8, 2022 09:56

CMU Comp arch 15' Notes

Lec 3 notes.md

Lecture 3 ISA

Last time

Von Neumann Model (Stored program + Sequential instruction) as opposed to dataflow
Algorithm
ISA
Moore's law
What is comp arch
Dataflow
ISA vs microarch

Microarchitecture

is a specific impl of the ISA
is not exposed to the software layer (we don't do that at this time)

e.g. pipelining NOT EXPOSED
e.g. out-of-order execution NOT EXPOSED
e.g. memory access scheduling policy NOT EXPOSED
e.g. speculative execution NOT EXPOSED (today)
e.g. superscalar processing NOT EXPOSED (mostly, see Spectre & Meltdown)
and many more...

Is it part of the ISA or the uarch?

Opcode "+" ....................................... ISA
# of gen purpose registers ....................... ISA
# of ports to the register file .................. uarch
# of cycles to execute the MUL instr. ............ uarch
pipelining ....................................... uarch

REMEMBER: uarch is an impl of the ISA under specific design constraints and goals.

A design point is a set of design constraints and their importance

design point ==> leads to tradeoffs in both ISA and uarch.

This lecture:

ISA-level tradeoffs
uarch-level tradeoffs
system and task level tradeoffs (how to divide labour between HW and SW)

MIPS, ARM, ALPHA are all ISAs.

The following is the LC-3b ADD instruction layout (two variants):

Layout 1:
15                    0
+----+---+---+-+--+---+
|0001|DR |SR1|0|00|SR2|
+----+---+---+-+--+---+

Layout 2:
15                    0
+----+---+---+-+------+
|0001|DR |SR1|1| imm5 |
+----+---+---+-+------+
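The two layouts above can be sketched as bit-packing code. This is a minimal illustration, assuming the field positions exactly as drawn (opcode in bits 15-12, DR in 11-9, SR1 in 8-6, mode bit 5) and a two's-complement imm5; the function names are made up for this sketch.

```python
ADD_OPCODE = 0b0001  # opcode field shown in both layouts

def encode_add_reg(dr, sr1, sr2):
    """Layout 1: 0001 | DR | SR1 | 0 | 00 | SR2."""
    return (ADD_OPCODE << 12) | (dr << 9) | (sr1 << 6) | sr2

def encode_add_imm(dr, sr1, imm5):
    """Layout 2: 0001 | DR | SR1 | 1 | imm5 (imm5 stored in 5 bits)."""
    return (ADD_OPCODE << 12) | (dr << 9) | (sr1 << 6) | (1 << 5) | (imm5 & 0x1F)

def decode_add(word):
    """Split a 16-bit ADD word back into its fields."""
    assert word >> 12 == ADD_OPCODE, "not an ADD instruction"
    dr = (word >> 9) & 0x7
    sr1 = (word >> 6) & 0x7
    if word & 0x20:                      # bit 5 set: immediate form
        imm = word & 0x1F
        if imm & 0x10:                   # sign-extend the 5-bit immediate
            imm -= 0x20
        return ("ADD", dr, sr1, "imm", imm)
    return ("ADD", dr, sr1, "reg", word & 0x7)

reg_word = encode_add_reg(1, 2, 3)       # ADD R1, R2, R3
imm_word = encode_add_imm(1, 2, -1)      # ADD R1, R2, #-1
```

Bit 5 is the only thing that tells the decoder which layout it is looking at, which is why both layouts keep it in the same position.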

Types of machines

0-address machine (stack machine). Example: compile (7+5)x8x9 = 864 for a stack machine:

push 9
push 8
push 5
push 7
add 
mul
mul
pop => 864
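To check the program above, here is a minimal sketch of a 0-address machine interpreter. The instruction names (push, add, mul, pop) are the ones used in the listing; the function name and the string-based program format are assumptions made for this sketch.

```python
def run_stack_program(program):
    """Run a list of stack-machine instructions; return the value popped last."""
    stack = []
    result = None
    for line in program:
        op, *args = line.split()
        if op == "push":
            stack.append(int(args[0]))
        elif op == "add":
            b, a = stack.pop(), stack.pop()   # operands are implicit: top two entries
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pop":
            result = stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    return result

program = ["push 9", "push 8", "push 5", "push 7", "add", "mul", "mul", "pop"]
result = run_stack_program(program)
```

Note that add and mul name no operands at all, which is what makes this a 0-address machine.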

1-address machine: accumulator machine
2-address machine: x86 and many others
3-address machine: MIPS, LC-3b

Elements of ISA

Instructions

E.g. opcode
E.g. operand specifiers (addressing modes)

Data Types

E.g. int, float, char, binary, decimal, BCD (binary coded decimal), doubly linked list, queue, str, bit, vec, string (implicit, explicit)

endianness of data is also an aspect of the ISA

"Semantic gap"

    Programming language
+---------------------------+  High
|  List / DoublyLinkedList  |  
|  struct / Queue / stack   | <-- ISA ?
+---------------------------+ 
|  string / float / decimal | <-- ISA ?
|  bigint                   |  
+---------------------------+
|  int / byte / char        | <-- ISA ?
|                           |
+---------------------------+  Low
      Control signals

Memory organization

Address space
Addressing granularity
byte addressable?
64-bit addressable? <= some supercomputers
bit addressable? <= rare
Support for virtual memory?

Registers

How many?
How long?

Why registers? Temporal locality of data => data gets reused

Instruction Classes

Operate instructions

arithmetic / logical
fetch / compute / store
implicit sequential ctrl flow

Data movement instr's

MV data between memory and register
PC++

Control flow instr's

Elements of ISA (Cont.)

Load/Store (L/S) vs Memory-to-memory (M2M)

L/S: operate only on registers, must load/store to interact with memory.
M2M: can operate directly on mem, can also load/store.

L/S: MIPS, ARM, other RISCs. M2M: x86, VAX, other CISCs.

Addressing Modes

Absolute: use immediate value as the address (LW rt, 10000)
Register indirect: reg as pointer (LW rt, r)
Displacement: reg as pointer + offset (LW rt, r[offset])
Indexed: LW rt, r, index, where r and index gen purpose
Mem indirect: reg -> mem[ptr] -> mem[data]
Auto inc/dec
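The modes listed above differ only in how the effective address is computed. Below is a minimal sketch on a toy machine; the memory contents, register names and the step of 4 for auto-increment are all made-up values for illustration.

```python
def absolute(mem, addr):
    return mem[addr]                       # address is the immediate itself

def register_indirect(mem, regs, r):
    return mem[regs[r]]                    # register holds the pointer

def displacement(mem, regs, r, offset):
    return mem[regs[r] + offset]           # pointer + constant offset

def indexed(mem, regs, r, index):
    return mem[regs[r] + regs[index]]      # pointer + another register

def memory_indirect(mem, regs, r):
    return mem[mem[regs[r]]]               # register -> pointer in memory -> data

def auto_increment(mem, regs, r, step=4):
    value = mem[regs[r]]                   # use the pointer, then bump it
    regs[r] += step
    return value

mem = {10000: 11, 200: 300, 204: 33, 208: 55, 300: 44}   # toy memory
regs = {"r": 200, "index": 8}                            # toy register file

results = [
    absolute(mem, 10000),
    register_indirect(mem, regs, "r"),
    displacement(mem, regs, "r", 4),
    indexed(mem, regs, "r", "index"),
    memory_indirect(mem, regs, "r"),
    auto_increment(mem, regs, "r"),
]
```

After the last call, `regs["r"]` has moved on to 204, so the next register-indirect load would walk to the following word.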

Why more mem addr modes? This is a programmer-uarch tradeoff. Pro:

better mapping of high-level instr' to machine code

reduced # of instr' and code size (thus less mem bus band requirement)

e.g. auto increment is good for memory traversal
e.g. double indirect is good for ** and linked lists etc.
e.g. sparse matrix access

better support for complex data structure.

Con:

compiler needs more reasoning to pick the right addr mode

uarch more impl pain

Orthogonal ISA

An orthogonal ISA allows all addressing modes to be used on all instr. types.

e.g. VAX: ~13 addr modes, >300 opcodes, 2 formats (int/float) => on the order of 13 x 300 x 2 = 7,800 addressing combinations for the uarch to implement

pro:

flexible
easy to write asm
compiler can pick whatever it likes

con:

uarch hard to impl

Other Elements of the ISA (cont.)

Interface with IO devices
- mem mapped IO
- special IO instructions (IN,OUT in x86) Tradeoffs?
Privilege modes
- user vs superuser
- who can exe what instr.
Exception & Interrupt handling

vectored vs. non-vectored interrupts
vectored = knows who interrupted
non-vectored = only knows it's interrupted
Virtual Memory
Access Protection (Segfault?)

and more....

Semantic gap

+---------------------------+  HLL
|    | Compiler             |
|    V                      |
+---------------------------+--- CISC ISA  
|    | uarch                |
|    |                      |
|    |                      |
|    V                      |
+---------------------------+  Control Signals

+---------------------------+  HLL
|    | Compiler             |
|    |                      |
|    |                      |
|    V                      |
+---------------------------+--- RISC ISA  
|    | uarch                |
|    V                      |
+---------------------------+  Control Signals

CISC: VAX INDEX instr. can index 5D array with bounds check with one instr.

Semantic gap tradeoffs

Compiler simplicity: CISC wins¹
Hardware simplicity: RISC wins
Less burden of backwards compatibility: RISC wins

Instr length

Fixed
Variable

Uniformity

Uniform
Non-uniform

Usually:

RISC

Simple instr
Fixed length
Uniform decode
Few addr modes

CISC

Complex instr
Variable length
Non-uniform decode
Many addr modes

References

¹ The compiler has more options to choose from to perform the same job, so implementing a correct compiler is easier. But the compiler has to weigh all the choices to see which one best fits the program, so writing an optimal compiler is not necessarily easier.

Lec 4 notes.md

Lecture 4: ISA tradeoffs (cont.) and the MIPS ISA

ISA tradeoffs (cont.)

Instruction lengths: Fixed vs variable tradeoff

Fixed:

Easier to decode
Easier alignment
Indexable
Can decode multiple instructions concurrently

Variable:

More compact code (ergo lower mem bus bandwidth requirement)
Better extensibility (if done right)

Intel: they profile programs and assign Huffman encodings to the instructions, so frequent instructions get shorter encodings!
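As a rough illustration of the Huffman idea, the sketch below derives variable code lengths from a profile. The opcode names and frequencies are invented for this example and are not Intel's actual profile or encoding.

```python
import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)    # two least frequent subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical instruction frequencies (percent of dynamic instructions).
freqs = {"mov": 40, "add": 25, "cmp": 15, "jmp": 12, "div": 8}
lengths = huffman_code_lengths(freqs)
average_bits = sum(freqs[s] * lengths[s] for s in freqs) / sum(freqs.values())
```

With five opcodes a fixed-length encoding needs 3 bits each; the variable-length code averages fewer bits on this profile, at the cost of a harder decode.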

Uniformity tradeoff

Uniform means that the same bits always represent the same meanings. I.e. opcodes always in the same location, so are operand specifiers, imm values, etc.

Uniform pro:

Easier to decode (=> simpler hardware)
Enables parallelism: can start decoding target address before opcode is decoded

Con:

Restricts instr format
Wastes bits (and ergo wastes mem bandwidth)

Uniform decode usually implies fixed length; a variable-length ISA probably can't have uniform decode.

Usually, RISC:

Simple instructions

Fixed length

Uniform decode

Few addressing modes

Usually, CISC:

Complex instructions

Variable length

Non-uniform decode

Many addressing modes

Number of registers tradeoff

The number of registers immediately decides how many bits you need to use to address the registers. More regs => more bits to reference a reg.
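The bit cost can be made concrete with a small calculation. This is a sketch; the helper names are made up, and the register counts are the LC-3b (8, per its 3-bit DR/SR fields) and MIPS (32) figures that appear elsewhere in these notes.

```python
def reg_specifier_bits(num_regs):
    """Bits needed to name one of num_regs registers, i.e. ceil(log2(num_regs))."""
    return (num_regs - 1).bit_length()

def operand_bits(num_regs, num_operands=3):
    """Bits spent on register specifiers in one instruction."""
    return num_operands * reg_specifier_bits(num_regs)

lc3b_bits = reg_specifier_bits(8)     # 3-bit fields, as in the LC-3b ADD layout
mips_bits = reg_specifier_bits(32)    # 5-bit fields, as in the MIPS R-type format
mips_three_operand = operand_bits(32) # bits used by rs, rt and rd together
```

Doubling the register count adds one bit per specifier, so a 3-address instruction pays three extra bits each time.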

Affects uarch: size, access time, power consumption of register file, etc.

Large number of registers Pro:

Better register allocation and optimization by compilers (because fewer saves and restores), essentially a larger "L0" cache
Potentially fewer instructions caused by spilling/filling*

Con:

Larger instr size per instr.
Larger register file size
More power consumption (a larger SRAM array has more cells to keep powered and longer wires to drive)

*: If there are not enough registers for some value, it is pushed onto the stack (in most compilers), and brought back when there is room. This is called "spilling"/"filling".

Addressing modes tradeoff

Immediate (data = imm, encoded in the instruction itself)
Register (data = reg)
Register indirect (data = mem[reg])
Memory indirect (data = mem[mem[reg]])
More

Displacement, indexed, absolute, autoincrement, autodecrement, ...

having lots of modes:

pro: better support for programming constructs; implements data structures easily and efficiently.

con: harder for the uarch; too many choices for the compiler?

Having many ways to do the same thing complicates compiler design, see ¹

(Index * Scale) + Displacement
Base + Index + Displacement
Base + (Index * Scale) + Displacement
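All three forms above are special cases of one effective-address formula. A minimal sketch, assuming x86-style scale factors of 1, 2, 4 or 8; the function and parameter names are chosen for this example.

```python
def effective_address(base=0, index=0, scale=1, displacement=0):
    """Compute Base + (Index * Scale) + Displacement."""
    if scale not in (1, 2, 4, 8):          # assumption: x86-style scale factors
        raise ValueError("scale must be 1, 2, 4 or 8")
    return base + index * scale + displacement

# Element 3 of an array of 4-byte ints that starts 8 bytes into a struct at 0x1000.
addr_full = effective_address(base=0x1000, index=3, scale=4, displacement=8)
# (Index * Scale) + Displacement: array at a fixed address, no base register.
addr_no_base = effective_address(index=3, scale=4, displacement=0x2000)
# Base + Index + Displacement: scale of 1.
addr_unscaled = effective_address(base=0x1000, index=3, displacement=8)
```

One instruction with this mode replaces a shift, two adds and a load, which is the "fewer instructions" argument; the cost is an address adder with more inputs in the uarch.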

Other tradeoffs

Condition codes vs. not

Condition codes, e.g. the x86 EFLAGS register
VLIW vs single instruction
Precise vs imprecise exceptions

Precise means that if and when an exception is raised, none of the code after the exception point has executed, and all of the code before it has executed. Pertains to out-of-order execution.
Virtual memory or not
Aligned accesses?
Hardware interlocks vs software-guaranteed interlocking? (inter-instruction dependency checking)

MIPS = Microprocessor without Interlocked Pipeline Stages
Software vs hardware managed page fault handling
Cache coherence (HW vs. SW)
etc.....

Programmers vs. (Micro)architecture

Many ISA features are designed to aid programmers, but complicate the uarch and HW design

E.g. virtual memory. Q: Should the programmer be concerned about whether their code and data fit into physical memory? If yes, then you need not support virtual mem. If no, then you support virtual mem.

MIPS requires mem accesses to be aligned at a 4-byte boundary. LW/SW instructions must follow this requirement.

It is not designed to fetch memory bytes that are not within a word boundary. It does not offer rotation of unaligned bytes into registers.

MIPS provides separate opcodes (LWL and LWR) for the "infrequent" case of cross-boundary access.

But LWL and LWR are slower, and each of them can still only fetch within a word boundary.

x86 allows unaligned accesses, including cross-boundary ones. LD/ST handle them automatically, so compilers need not worry. However, there is a caveat:

Image: the x86 manual warns compiler writers that they should still try to align data, because an unaligned mem access requires 2 separate underlying accesses, just like on MIPS. It's just that the uarch handles that for you.
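The cost of an unaligned access can be sketched as follows: the only primitive is an aligned 4-byte read, and an unaligned load is built out of one or two of them. The byte-array memory, the little-endian byte order and the function names are assumptions of this sketch, not a description of any real uarch.

```python
WORD = 4  # bytes per aligned memory word

def read_aligned_word(memory, addr):
    """The only access the toy memory supports: a 4-byte aligned read."""
    assert addr % WORD == 0, "memory only accepts aligned addresses"
    return memory[addr:addr + WORD]

def load_unaligned(memory, addr):
    """Load 4 bytes from any address; return (value, aligned accesses used)."""
    base = addr - addr % WORD
    chunk = read_aligned_word(memory, base)
    accesses = 1
    if addr % WORD:                        # the load straddles two words
        chunk += read_aligned_word(memory, base + WORD)
        accesses = 2
    offset = addr - base
    value = int.from_bytes(chunk[offset:offset + WORD], "little")
    return value, accesses

memory = bytes(range(16))                  # toy memory: byte i holds the value i
aligned = load_unaligned(memory, 4)
unaligned = load_unaligned(memory, 6)
```

Whether this merging is done by the compiler with extra instructions or hidden inside the uarch, the second access still has to happen.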

Exercise: What are the pros and cons for aligned/unaligned mem accesses?

Pros

Cons

Part 2: MIPS ISA

MIPS R2000 Program visible state:

[ PC ]
+---------------------+
| Program Counter     |
+---------------------+
32-bit

[ Memory ]
+---------------------+
| M[0]                |
| M[1]                |
| M[2]                |
| M[3]                |
| ......              |
| M[N-1]              |
+---------------------+
2^32 locations, 8 bits each,
represented by 32 bit address
(there's some magic going on)

[ Gen purpose regs ]
+---------------------+
| r0                  |
| r1                  |
| r2                  |
| ......              |
| r31                 |
+---------------------+
General purpose register file,
32 registers, 32 bits each

Data format

Most things are 32 bits
- instructions and data addrs
- signed and unsigned integers
There are also 16-bit halfwords and 8-bit bytes
Floating-point numbers
- IEEE 754
- float: 8-bit exponent, 23-bit significand
- double: 11-bit exponent, 52-bit significand
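The field widths above (plus one sign bit each) can be checked by pulling a value apart. A minimal sketch using Python's standard `struct` module; the function names are chosen for this example.

```python
import struct

def float_fields(x):
    """Split a 32-bit IEEE 754 float into (sign, 8-bit exponent, 23-bit fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def double_fields(x):
    """Split a 64-bit IEEE 754 double into (sign, 11-bit exponent, 52-bit fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

one_single = float_fields(1.0)       # exponent field holds the bias, 127
neg_single = float_fields(-2.5)      # -2.5 = -1.25 x 2^1
one_double = double_fields(1.0)      # exponent field holds the bias, 1023
```

The exponent is stored with a bias (127 for float, 1023 for double), and the leading 1 of the significand is implicit, so only the fraction bits are stored.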

Endianness

           Big Endian
MSB                           LSB
[ byte0 | byte1 | byte2 | byte3 ]

           Little Endian
MSB                           LSB
[ byte3 | byte2 | byte1 | byte0 ]

Most of the time, endianness is simply a matter of convention and interoperation.

Endianness could impact performance (rarely and subtly). E.g. to obtain the 16 least significant bits of a 32-bit value in memory, LE can read a halfword at the same address, whereas BE must first offset the address by 2.
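A short sketch of the two byte orders, using Python's standard `struct` module. The value 0x0A0B0C0D is an arbitrary example; the slices show where the low 16 bits of the word end up in memory under each convention.

```python
import struct

value = 0x0A0B0C0D

big = struct.pack(">I", value)       # most significant byte at the lowest address
little = struct.pack("<I", value)    # least significant byte at the lowest address

# Low 16 bits of the word: at offset 0 in little endian, at offset 2 in big endian.
low16_le = struct.unpack("<H", little[0:2])[0]
low16_be = struct.unpack(">H", big[2:4])[0]
```

Both reads recover the same number; only the address arithmetic differs.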

Instruction format

3 Simple formats

R-type, 3 register operands

[000000| rs  | rt  | rd  |shamt|funct ]
 6 bit   5     5     5    5     6

I-type, 2 register operands and 16-bit imm

[opcode| rs  | rt  | imm              ]
 6 bit   5     5     16

J-type, 26-bit imm

[opcode| imm                          ]
 6 bit   26

Simple Encoding
- 4 bytes per instruction
- Must be 4-byte aligned

ALU instructions

E.g. ADD rd, rs, rt

This uses Intel-style operand order (destination first), so the above asm translates to rd = rs + rt.

MIPS encoding of the above asm:

[000000| rs  | rt  | rd  |00000| ADD  ] R-type
 6       5     5     5    5      6

Semantics:
1. rd := rs + rt
2. pc := pc + 4
Will throw exception if overflow
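The encoding and the semantics above can be sketched together. The funct value 0x20 for ADD is the standard MIPS one; the function names, the list-based register file and the use of Python's OverflowError to stand in for the hardware exception are choices made for this sketch.

```python
ADD_FUNCT = 0x20  # funct field for ADD in the MIPS R-type format

def encode_r_type(rs, rt, rd, shamt, funct):
    """Pack an R-type word: opcode 000000 | rs | rt | rd | shamt | funct."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def execute_add(regs, pc, rd, rs, rt):
    """rd := rs + rt, pc := pc + 4; raise on signed overflow."""
    total = to_signed32(regs[rs]) + to_signed32(regs[rt])
    if not -(1 << 31) <= total < (1 << 31):
        raise OverflowError("integer overflow exception")
    if rd != 0:                            # on MIPS, r0 always reads as zero
        regs[rd] = total & 0xFFFFFFFF
    return pc + 4

word = encode_r_type(rs=1, rt=2, rd=3, shamt=0, funct=ADD_FUNCT)  # ADD $3, $1, $2
regs = [0] * 32
regs[1], regs[2] = 5, 7
next_pc = execute_add(regs, 0x00400000, rd=3, rs=1, rt=2)
```

Note that the destination comes first in the assembly text but third among the register fields of the encoded word.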

Unrelated sidenote

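Q: how to load a 32-bit immediate value if MIPS immediates are only 16 bits in I-type instructions (26 bits at most, in J-type)?

A: addiu $5, $5, 0xbeb0063d is broken down into
lui $1, -16720 // upper half: $1 = 0xbeb00000
ori $1, $1, 1597  // lower half: $1 |= 0x063d
addu $5, $5, $1 // the add itself, using $1 as the temporary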
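The constants in the breakdown above can be re-derived as follows. This is a sketch of the arithmetic only; the helper names are made up, and it relies on lui placing its operand in the upper 16 bits and ori zero-extending its immediate.

```python
def split_imm32(value):
    """Split a 32-bit constant into the 16-bit halves used by lui and ori."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF

def as_signed16(x):
    """Show a 16-bit field the way a disassembler prints it as a signed number."""
    return x - 0x10000 if x & 0x8000 else x

upper, lower = split_imm32(0xBEB0063D)
lui_operand = as_signed16(upper)       # the signed form of the upper half
after_lui = (upper << 16) & 0xFFFFFFFF # lui: upper half set, lower half cleared
after_ori = after_lui | lower          # ori: fill in the lower half
```

The negative lui operand is just 0xbeb0 printed as a signed 16-bit number; the bits are the same.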

References

¹ Wulf, "Compilers and Computer Architecture," IEEE Computer, 1981.
