Skip to content

Instantly share code, notes, and snippets.

@erincandescent
Created July 25, 2019 23:32
Show Gist options
  • Save erincandescent/8a10eeeea1918ee4f9d9982f7618ef68 to your computer and use it in GitHub Desktop.
Save erincandescent/8a10eeeea1918ee4f9d9982f7618ef68 to your computer and use it in GitHub Desktop.

Foreward

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.

It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.

Mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2

Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions.

Consider the following C code, for example:

int readidx(int *p, size_t idx)
{ return p[idx]; }

This is a simple case of array indexing, a very common operation. Consider the compilation of this for x86_64:

mov eax, [rdi+rsi*4]
ret

or ARM:

ldr r0, [r0, r1, lsl #2]
bx lr // return

Meanwhile, the required code for RISC-V:

# apologies for any syntax nits - there aren't any online risc-v
# compilers
slli a1, a1, 2
add a0, a1, a1
lw a0, a0, 0
jalr r0, r1, 0 // return

RISC-V's simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its' numerous prefixes).

The simplification of an instruction set should not be pursued to its' limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its' constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations.

We should distinguish the "Complex" instructions of CISC CPUs - complicated, rarely used, and universally low performance, from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and high performance.

The Middling

  • Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.
  • Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)
    • Call: Rd = R1
    • Return: Rd = R0, Rs = R1
    • Indirect branch: Rd = R0, RsR1
    • (Weirdo branch: RdR0, RdR1)
  • Variable length encoding not self synchronizing (This is common - e.g x86 and Thumb-2 both have this issue - but it causes various problems both with implementation and security e.g. return-oriented-programming attacks)
  • RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accomodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)
  • Multiply is optional - while fast multipliers occupy non-negligible area on tiny implementations, small multipliers can be created which consume little area, and it is possible to make extensive re-use of the existing ALU for a multiple-cycle multiplications.
  • LR/SC has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache)
    • This appears to be a substitute for a CAS instruction, see comments on that
  • FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode
  • FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)
    • This could be easily rectified - size encoding 2'b10 is free
    • Update: V2.2 has a decimal FP extension placeholder, but no half-precision placeholder. The mind kinda boggles.
  • How FP values are represented in the FP register file is unspecified but observable (by load/store)
    • Emulator authors will hate you
    • VM migration may become impossible
    • Update: V2.2 requires NaN boxing wider values

The Bad

  • No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its' implications:
    • Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
    • No conditional selects (useful for highly unpredictable branches)
    • No add with carry/subtract with carry or borrow
    • (Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)
  • Highly precise counters seem to be required by the user level ISA. In practice, exposing these to applications is a great vector for sidechannel attacks
  • Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not
  • No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics inexpensive (only 1 bit of CPU state required for minimal single CPU implementations).
  • LR/SC are in the same extension as more complicated atomic instructions, which limits implementation flexibility for small implementations
  • General (non LR/SC) atomics do not include a CAS primitive
    • The motivation is to avoid the need for an instruction which reads 5 registers (Addr, CmpHi:CmpLo, SwapHi:SwapLo), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress LR/SC which is provided to replace it
  • Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit
  • For RV32I, no way to tranfer a DP FP value between the integer and FP register files except through memory
  • e.g. RV32I 32-bit ADD and RV64I 64-bit ADD share encodings, and RVI64 adds a different ADD.W encoding. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead
  • No MOV instruction. The MV assembler alias is implemted as MV rD, rS -> ADDI rD, rS, 0. MOV optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical MV requires oring a 12-bit immediate
    • Absent a MOV instruction, ADD rD, rS, r0 would actually be a preferable canonical MOV as it is easier to decode and CPUs normally have special case logic for recognizing the zero register

The Ugly

  • JAL wastes 5 bits encoding the link register, which will always be R1 (or R0 for branches)
    • This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)
    • This is a regression from the v1.0 ISA!
  • Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped)
    • It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation.
  • No loads with register offsets (Rbase+Roffset) or indexes (Rbase+Rindex << Scale).
  • FENCE.I implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer
  • In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation
    • Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue
  • No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later varients
    • Common examples of pure "NOP hints" are things like spinlock yields.
    • More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)
@Wren6991
Copy link

Wren6991 commented Oct 5, 2025

This was linked from Hacker News again so I thought it would be interesting to note some things that have changed in RISC-V since this post was written. This is meant as context, not a rebuttal.

Meanwhile, the required code for RISC-V:

The slli; add pair would now be sh2add (from Zba, included in all RVA profiles). Unfortunately there is still no scaled-indexed load.

RV64I requires sign extension of all 32-bit values

Zba also includes unsigned-word *.uw variants of common instructions like adds and shifts.

FP sticky bits and rounding mode are in the same register

fcsr is also aliased as fflags and frm, which breaks the false dependency.

FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)

Zfhmin is mandatory in RVA22U64. This adds transfer and conversion operations for FP16, so you can have FP16 in memory and do FP32 in registers (giving memory pressure benefits but not ALU throughput benefits).

V extension has FP16 format too, and this is mandatory in RVA23.

No conditional selects (useful for highly unpredictable branches)

We have some slightly better selection sequences now due to Zicond (mandatory in RVA23). P extension will improve this further with MVM, MVMN, MERGE.

No add with carry/subtract with carry or borrow

V extension has carry-through-mask, which covers the bignum use case. Mandatory in RVA23.

Highly precise counters seem to be required by the user level ISA. In practice, exposing these to applications is a great vector for sidechannel attacks

Zicntr is now optional, although frustratingly made mandatory by RVA profiles. However, M-mode controls whether these counters are exposed to U-mode, via mcounteren, so not sure about the security argument.

Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not

We have the Zmmul extension now (just the multiplies from the M extension).

LR/SC are in the same extension as more complicated atomic instructions, which limits implementation flexibility for small implementations

A is now separated into Zalrsc and Zaamo.

General (non LR/SC) atomics do not include a CAS primitive

Now exists in Zacas. Not mandatory in RVA23, but spec notes it is intended to become mandatory in a future profile.

Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit

Now exists in Zabha. Not mandatory in RVA23, but spec notes it is intended to become mandatory in a future profile.

No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later varients

As well as architecturally defined HINTs we now have maybe-ops (Zimop/Zcmop) which are hints on current implementations that may become non-hints in future (like CFI landing pads).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment