Skip to content

Instantly share code, notes, and snippets.

@nuts-n-bits
Last active March 8, 2022 09:56
Show Gist options
  • Save nuts-n-bits/d7818ac728aff197ff4bb25718badbdb to your computer and use it in GitHub Desktop.
Save nuts-n-bits/d7818ac728aff197ff4bb25718badbdb to your computer and use it in GitHub Desktop.
CMU Comp arch 15' Notes

Lecture 3 ISA

Last time

  • Von Neumann Model (Stored program + Sequential instruction) as opposed to dataflow
  • Algorithm
  • ISA
  • Moore's law
  • What is comp arch
  • Dataflow
  • ISA vs microarch

Microarchitecture

  • is a specific impl of the ISA
  • is not exposed to the software layer (we don't do that at this time)

    e.g. pipelining NOT EXPOSED e.g. Out of order execution NOT EXPOSED e.g. memory access scheduling policy NOT EXPOSED e.g. speculative execution NOT EXPOSED (today) e.g. superscalar processing NOT EXPOSED (mostly, see sepctre & meltdown) and many more....

Is part of ISA or uarch?

Opcode "+" ....................................... ISA
# of gen purpose registers ....................... ISA
# of ports to the register file .................. uarch
# of cycles to execute the MUL instr. ............ uarch
pipelining ....................................... uarch

REMEMBER: uarch is an impl of the ISA under specific design constraints and goals.

A design point is a set of design constraints and their importance

design point ==> leads to tradeoffs in both ISA and uarch.


This lecture:

  • ISA-level tradeoffs
  • uarch-level tradeoffs
  • system and task level tradeoffs (how to divide labour between HW and SW)

MIPS, ARM, ALPHA are all ISAs.

The following is a LC-3b add instr layout:

Layout 1:
15                    0
+----+---+---+-+--+---+
|0001|DR |SR1|0|00|SR2|
+----+---+---+-+--+---+

Layout 2:
15                    0
+----+---+---+-+------+
|0001|DR |SR1|1| imm5 |
+----+---+---+-+------+

Types of machines

  • 0-address machine (stack machine) compile this to stack machine: (7+5)x8x9 = 864
push 9
push 8
push 5
push 7
add 
mul
mul
pop => 864
  • 1-address machine: accumulator machine
  • 2-address machine: x86 and a many others
  • 3-address machine: MIPS, Lc-3b

Elements of ISA

Instructions

  • E.g OPCODE
  • E.g operand specifiers (addressing modes)

Data Types

E.g. int, float, char, binary, decimal, BCD (binary coded decimal), doubly linked list, queue, str, bit, vec, string (implicit, explicit)

endianness of data is also an aspect of the ISA

"Semanic gap"

    Programming language
+---------------------------+  High
|  List / DoublyLinkedList  |  
|  struct / Queue / stack   | <-- ISA ?
+---------------------------+ 
|  string / float / decimal | <-- ISA ?
|  bigint                   |  
+---------------------------+
|  int / byte / char        | <-- ISA ?
|                           |
+---------------------------+  Low
      Control signals

Memory organization

  • Address space
  • Addressing granularity byte addressible? 64-bit addressible? <= some supercomputers bit addressible? <= rare
  • Support for virtual memory?

Registers

  • How many?
  • How long?

Why registers Data temporal locality => reuse of data

Instruction Classes

Operate tnstructions

  • arithmatic / logical
  • fetch / compute / store
  • implicit sequential ctrl flow

Data movement instr's

  • MV data between memory and register
  • PC++

Control flow instr's

  • JMP

Elements of ISA (Cont.)

Load/Store (L/S) vs Memoey-to-memory (M2M)

  • L/S: operate only on registers, must load/store to interact with memory.
  • M2M: can operate directly on mem, can also load/store.

L/S: MIPS, ARM, other RISCs. M2M: x86, VAX, other CISCs.

Addressing Modes

  • Absolute: use immediate value (LW rt 10000)
  • Register indirect: reg as pointer (LW rt, r)
  • Displacement: reg as pointer + offset (LW rt, r[offset])
  • Indexed: LW rt, r, index, where r and index gen purpose
  • Mem indirect: reg -> mem[ptr] -> mem[data]
  • Auto inc/dec

Why more mem addr modes? This is programmer-uarch tradeoff. pro:

  • better mapping of high-level instr' to machine code
  • reduced # of instr' and code size (thus less mem bus band requirement)

    e.g. auto increment is good for memory traverse e.g. double indirect is good for ** and linked lists etc. e.g. sparce matrix access

  • better support for complex data structure.

con:

  • compiler needs more reasoning to pick the right addr mode
  • uarch more impl pain

Orthogonol ISA

An orthogonol ISA allows all addressing modes to be used on all instr. types.

e.g. VAX: ~13 addr modes >300 opcodes 2 formats (int/float) =780 actual addressing impls for uarch

pro:

  • flexible
  • easy to write asm
  • compiler can pick whatever it likes

con:

  • uarch hard to impl

Other Elements of the ISA (cont.)

  • Interface with IO devices
    • mem mapped IO
    • special IO instructions (IN,OUT in x86) Tradeoffs?
  • Privilege modes
    • user vs superuser
    • who can exe what instr.
  • Exception & Interrupt handling

    vectored vs. non-vectored interrupts vectored = knows who interrupted non-vectored = only knows it's interrupted

  • Virtual Memory
  • Access Protection (Segfault?)

and more....

Semantic gap

+---------------------------+  HLL
|    | Compiler             |
|    V                      |
+---------------------------+--- CISC ISA  
|    | uarch                |
|    |                      |
|    |                      |
|    V                      |
+---------------------------+  Control Signals

+---------------------------+  HLL
|    | Compiler             |
|    |                      |
|    |                      |
|    V                      |
+---------------------------+--- RISC ISA  
|    | uarch                |
|    V                      |
+---------------------------+  Control Signals

CISC: VAX INDEX instr. can index 5D array with bounds check with one instr.

Semantic gap tradeoffs

  • Compiler simplicity: CISC wins1
  • Hardware simplicity: RISC wins
  • Less burden of backwards compatibility: RISC wins

Instr length

  • Fixed
  • Variable

Uniformity

  • Uniform
  • Non-uniform

Usually:

Risc

Simple instr Fixed length Uniform decode Few addr modes

Cisc

Complex instr Variable length Non-uniform decode Many addr modes


References

Footnotes

  1. Compiler has more options to choose from to perform the same job. So implementing a correct compiler is easier. But the compiler has to weigh all the choices to see which one best fits the program, so having a optimal compiler is not necessarily easier.

Lecture 4: ISA tradeoffs (cont.) and the MIPS ISA

ISA tradeoffs (cont.)

Instruction lengths: Fixed vs variable tradeoff

Fixed:

  • Easier codec
  • Easier alignment
  • Indexable
  • Can decode multiple instructions concurrently

Variable:

  • More compact code (ergo lower mem bus bandwith requirement)
  • Better extensibility (if done right)

Intel: they profiles their programs and assign huffman encodings on the instructions!

Uniformity tradeoff

Uniform means that the same bits always represent the same meanings. I.e. opcodes always in the same location, so are operand specifiers, imm values, etc.

Uniform pro:

  • Easier codec (=> simpler hardware)
  • Enables parallelism: can start decoding target address before opcode is decoded

Con:

  • Restricts instr format
  • Wastes bits (and ergo wastes mem bandwidth)

Uniform decode usually means fixed length, probably can't have uniform for variable

Usually, RISC:

  • Simple instructions
  • Fixed length
  • Uniform decode
  • Few addressing modes

Usually, CISC:

  • Complex instructions
  • Variable length
  • Non-uniform decode
  • Many addressing modes

Number of registers tradeoff

The number of registers immediately decides how many bits you need to use to address the registers. More regs => more bits to reference a reg.

Affects uarch: size, access time, power comsumption of register file, etc.

Large number of registers Pro:

  • Better register allocation and optimization by compilers (because fewer saves and restores), essentially a larger "L0" cache
  • Potentially fewer instructions caused by spilling/filling*

Con:

  • Larger instr size per instr.
  • Larger register file size
  • More power consumption (since SRAM is impl'd by oscillation circuit)

*: If there is not enough registers for some value, it is pushed onto the stack (most compilers), then when there is room, it is brought back. This is called "Spilling"/"Filling"

Addressing modes tradeoff

  • Immediate (data = reg)
  • Register indirect (data = mem[reg])
  • Memory indirect (data = mem[mem[reg]])
  • More

    Displacement, indexed, absolute, autoincrement, autodecrement, ...

having lots of modes:

pro better support for programming constructs Implements data structs easily, effeciently.

con harder for uarch too many choices for compiler?

Manyways to do the same thing complicates compiler design, see 1

  • (Index * Scale) + Displacement
  • Base + Index + Displacement
  • Base + (Index * Scale) + Displacement

Other tradeoffs

  • Condition code vs not

    Conditional code e.g. x86 e-flag

  • VLIW vs single instruction
  • Precise vs inprecise exceptions

    Precise means if and when exceptions are raised, non of the code after the exception point is ececuted, and all the code before that is executed. Pertains to OOO-E

  • Virtual memory or not
  • Aligned accesses?
  • Hardware interlocks vs software-guaranteed interlocking? (inter-instruction dependency checking)

    MIPS = Microprocessors without Interlocked Pipeline Stages

  • Software vs hardware managed page fault handling
  • Cache coherence (HW vs. SW)
  • etc.....

Programmers vs. (Micro)architecture

  • Many ISA features designed to aid programmers, but complicate uarch, HW design

    E.g. virtual memory. Q: Should the programmer be concerned about the size of his codeblocks fitting into physical memory? If yes, then you support no virtual mem If no, then you support virtual mem

Mips requires mem access be aligned at 4-byte boundary. LW/SW instructions must follow this requirement

Not designed to fetch memory bytes not within a word boundry Does not offer rotation of unalgined bytes into registers.

MIPS provides separate opcodes for "infrequent" case of cross-boundary access

1

But LWL and LWR are slower And they still could only fetch within boundary

x86 allows unaligned bytes, including cross boundary access. LD/ST automatically handles it, compilers need not worry. However with a caveat:

2

Image: x86 manual warning compilers: you should still try to align it because unaligned mem accesses require 2 separate undelying accesses just like mips. It's just that uarch handles that for you.

Exercise: What are the pros and cons for aligned/unaligned mem accesses?

  • Pros
       
  • Cons    

Part 2: MIPS ISA

MIPS R2000 Program visible state:

[ PC ]
+---------------------+
| Program Counter     |
+---------------------+
32-bit

[ Memory ]
+---------------------+
| M[0]                |
| M[1]                |
| M[2]                |
| M[3]                |
| ......              |
| M[N-1]              |
+---------------------+
2^32 locations, 8 bits each,
represented by 32 bit address
(there's some magic going on)

[ Gen purpose regs ]
+---------------------+
| r0                  |
| r0                  |
| r2                  |
| ......              |
| r31                 |
+---------------------+
General purpose register file,
32 integers, 32 bits each

Data format

  • Most things are 32 bits
    • instructions and data addrs
    • signed and unsigned integers
  • Also exists 16-bit words and 8-bit words (aka bytes)
  • Floating-point numbers
    • IEEE 754
    • float: 8-bit exponent, 23-bit significand
    • double: 11-bit exponent, 52-bit significand

Endianness

           Big Endian
MSB                           LSB
[ byte0 | byte1 | byte2 | byte3 ]

           Little Endian
MSB                           LSB
[ byte3 | byte2 | byte1 | byte0 ]

Most of the time, endianness is simply a matter of convention and interoperation.

Endianness could impact performance (rarely and subtly). E.g. if wishes to obtain 16 LSB, LE could just set 16 MSB to 0, but BE must shift.

Instruction format

  • 3 Simple formats
    • R-type, 3 register operands
      [000000| rs  | rt  | rd  |shamt|funct ]
       6 bit   5     5     5    5     6  
      
    • I type, 2 register operands and 16-bit imm
      [opcode| rs  | rt  | imm              ]
       6 bit   5     5     16 
      
    • J type, 26-bit imm
      [opcode| imm                          ]
       6 bit   26
      
  • Simple Encoding
    • 4 bytes per instruction
    • Must be 4-byte aligned

ALU instructions

E.g. ADD rd rs rt.

This is the intel syntax, where the above asm translates to rd = rs + rt.

MIPS encoding of the above asm:

[000000| rs  | rt  | rd  |00000| ADD  ] R-type
 6       5     5     5    5      6
  • Semantics:
    1. rd := rs + rt
    2. pc := pc + 4
  • Will throw exception if overflow
Unrelated sidenote

Q: how to load 32-bit immediate value if MIPS only supports 26 bit max immediate in its encoding?

A: addiu $5, $5, 0xbeb0063d is broken down into

lui $1, -16720 // 0xbeb00000
ori $1, $1, 1597  // 0x063d

3


References

Footnotes

  1. Wulf, Compilers and Computer Architecture, IEEE Computer, 1981, [PDF1 Fast], [PDF2 HD]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment