This chapter describes the RV32I base integer instruction set.
RV32I was designed to be sufficient to form a compiler target and to support modern operating system environments. The ISA was also designed to reduce the hardware required in a minimal implementation. RV32I contains 40 unique instructions, though a simple implementation might cover the ECALL/EBREAK instructions with a single SYSTEM hardware instruction that always traps and might be able to implement the FENCE instruction as a NOP, reducing base instruction count to 38 total. RV32I can emulate almost any other ISA extension (except the A extension, which requires additional hardware support for atomicity).
In practice, a hardware implementation including the machine-mode privileged architecture will also require the 6 CSR instructions.
Subsets of the base integer ISA might be useful for pedagogical purposes, but the base has been defined such that there should be little incentive to subset a real hardware implementation beyond omitting support for misaligned memory accesses and treating all SYSTEM instructions as a single trap.
The standard RISC-V assembly language syntax is documented in the Assembly Programmer’s Manual.
Most of the commentary for RV32I also applies to the RV64I base.
Figure [gprs] shows the unprivileged state for the base integer ISA. For RV32I, the 32 x registers are each 32 bits wide, i.e., XLEN=32. Register x0 is hardwired with all bits equal to 0. General purpose registers x1–x31 hold values that various instructions interpret as a collection of Boolean values, or as two’s complement signed binary integers or unsigned binary integers.
There is one additional unprivileged register: the program counter pc holds the address of the current instruction.
[Figure gprs: unprivileged integer register state. Registers x0 through x31 and pc are each XLEN bits wide.]
There is no dedicated stack pointer or subroutine return address link register in the Base Integer ISA; the instruction encoding allows any x register to be used for these purposes. However, the standard software calling convention uses register x1 to hold the return address for a call, with register x5 available as an alternate link register. The standard calling convention uses register x2 as the stack pointer.
Hardware might choose to accelerate function calls and returns that use x1 or x5. See the descriptions of the JAL and JALR instructions.
The optional compressed 16-bit instruction format is designed around the assumption that x1 is the return address register and x2 is the stack pointer. Software using other conventions will operate correctly but may have greater code size.
The number of available architectural registers can have large impacts on code size, performance, and energy consumption. Although 16 registers would arguably be sufficient for an integer ISA running compiled code, it is impossible to encode a complete ISA with 16 registers in 16-bit instructions using a 3-address format. Although a 2-address format would be possible, it would increase instruction count and lower efficiency. We wanted to avoid intermediate instruction sizes (such as Xtensa’s 24-bit instructions) to simplify base hardware implementations, and once a 32-bit instruction size was adopted, it was straightforward to support 32 integer registers. A larger number of integer registers also helps performance on high-performance code, where there can be extensive use of loop unrolling, software pipelining, and cache tiling.
For these reasons, we chose a conventional size of 32 integer registers for RV32I. Dynamic register usage tends to be dominated by a few frequently accessed registers, and regfile implementations can be optimized to reduce access energy for the frequently accessed registers. The optional compressed 16-bit instruction format mostly only accesses 8 registers and hence can provide a dense instruction encoding, while additional instruction-set extensions could support a much larger register space (either flat or hierarchical) if desired.
For resource-constrained embedded applications, we have defined the RV32E subset, which only has 16 registers (Chapter [rv32e]).
In the base RV32I ISA, there are four core instruction formats (R/I/S/U), as shown in Figure [fig:baseinstformats]. All are a fixed 32 bits in length. The base ISA has IALIGN=32, meaning that instructions must be aligned on a four-byte boundary in memory. An instruction-address-misaligned exception is generated on a taken branch or unconditional jump if the target address is not IALIGN-bit aligned. This exception is reported on the branch or jump instruction, not on the target instruction. No instruction-address-misaligned exception is generated for a conditional branch that is not taken.
The alignment constraint for base ISA instructions is relaxed to a two-byte boundary when instruction extensions with 16-bit lengths or other odd multiples of 16-bit lengths are added (i.e., IALIGN=16).
Instruction-address-misaligned exceptions are reported on the branch or jump that would cause instruction misalignment to help debugging, and to simplify hardware design for systems with IALIGN=32, where these are the only places where misalignment can occur.
The behavior upon decoding a reserved instruction is UNSPECIFIED.
Some platforms may require that opcodes reserved for standard use raise an illegal-instruction exception. Other platforms may permit reserved opcode space be used for non-conforming extensions.
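Format | Fields, most-significant first (instruction bit positions in parentheses) |
---|---|
R-type | funct7 (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
I-type | imm[11:0] (31:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
S-type | imm[11:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:0] (11:7), opcode (6:0) |
U-type | imm[31:12] (31:12), rd (11:7), opcode (6:0) |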
The RISC-V ISA keeps the source (rs1 and rs2) and destination (rd) registers at the same position in all formats to simplify decoding. Except for the 5-bit immediates used in CSR instructions (Chapter [csrinsts]), immediates are always sign-extended, and are generally packed towards the leftmost available bits in the instruction and have been allocated to reduce hardware complexity. In particular, the sign bit for all immediates is always in bit 31 of the instruction to speed sign-extension circuitry.
Decoding register specifiers is usually on the critical paths in implementations, and so the instruction format was chosen to keep all register specifiers at the same position in all formats at the expense of having to move immediate bits across formats (a property shared with RISC-IV, a.k.a. SPUR).
In practice, most immediates are either small or require all XLEN bits. We chose an asymmetric immediate split (12 bits in regular instructions plus a special load-upper-immediate instruction with 20 bits) to increase the opcode space available for regular instructions.
Immediates are sign-extended because we did not observe a benefit to using zero-extension for some immediates as in the MIPS ISA and wanted to keep the ISA as simple as possible.
There are a further two variants of the instruction formats (B/J) based on the handling of immediates, as shown in Figure [fig:baseinstformatsimm].
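Format | Fields, most-significant first (instruction bit positions in parentheses) |
---|---|
R-type | funct7 (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
I-type | imm[11:0] (31:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
S-type | imm[11:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:0] (11:7), opcode (6:0) |
B-type | imm[12] (31), imm[10:5] (30:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:1] (11:8), imm[11] (7), opcode (6:0) |
U-type | imm[31:12] (31:12), rd (11:7), opcode (6:0) |
J-type | imm[20] (31), imm[10:1] (30:21), imm[11] (20), imm[19:12] (19:12), rd (11:7), opcode (6:0) |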
The only difference between the S and B formats is that the 12-bit immediate field is used to encode branch offsets in multiples of 2 in the B format. Instead of shifting all bits in the instruction-encoded immediate left by one in hardware as is conventionally done, the middle bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest bit in S format (inst[7]) encodes a high-order bit in B format.
Similarly, the only difference between the U and J formats is that the 20-bit immediate is shifted left by 12 bits to form U immediates and by 1 bit to form J immediates. The location of instruction bits in the U and J format immediates is chosen to maximize overlap with the other formats and with each other.
Figure [fig:immtypes] shows the immediates produced by each of the base instruction formats, and is labeled to show which instruction bit (inst[y]) produces each bit of the immediate value.
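Immediate | imm[31:20] | imm[19:12] | imm[11] | imm[10:5] | imm[4:1] | imm[0] |
---|---|---|---|---|---|---|
I-immediate | inst[31] (replicated) | inst[31] (replicated) | inst[31] | inst[30:25] | inst[24:21] | inst[20] |
S-immediate | inst[31] (replicated) | inst[31] (replicated) | inst[31] | inst[30:25] | inst[11:8] | inst[7] |
B-immediate | inst[31] (replicated) | inst[31] (replicated) | inst[7] | inst[30:25] | inst[11:8] | 0 |
U-immediate | inst[31:20] | inst[19:12] | 0 | 0 | 0 | 0 |
J-immediate | inst[31] (replicated) | inst[19:12] | inst[20] | inst[30:25] | inst[24:21] | 0 |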
Sign-extension is one of the most critical operations on immediates (particularly for XLEN>32), and in RISC-V the sign bit for all immediates is always held in bit 31 of the instruction to allow sign-extension to proceed in parallel with instruction decoding.
Although more complex implementations might have separate adders for branch and jump calculations and so would not benefit from keeping the location of immediate bits constant across types of instruction, we wanted to reduce the hardware cost of the simplest implementations. By rotating bits in the instruction encoding of B and J immediates instead of using dynamic hardware muxes to multiply the immediate by 2, we reduce instruction signal fanout and immediate mux costs by around a factor of 2. The scrambled immediate encoding will add negligible time to static or ahead-of-time compilation. For dynamic generation of instructions, there is some small additional overhead, but the most common short forward branches have straightforward immediate encodings.
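As a rough illustration of the immediate layouts described above, the following Python sketch reassembles each immediate from a 32-bit instruction word. The helper names (`bits`, `sign_extend`, `imm_i`, and so on) are invented for this example and do not come from any RISC-V tool.

```python
def bits(inst, hi, lo):
    """Return inst[hi:lo] as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def imm_i(inst):
    return sign_extend(bits(inst, 31, 20), 12)

def imm_s(inst):
    return sign_extend(bits(inst, 31, 25) << 5 | bits(inst, 11, 7), 12)

def imm_b(inst):
    # inst[7] supplies imm[11]; imm[0] is always zero.
    raw = (bits(inst, 31, 31) << 12 | bits(inst, 7, 7) << 11
           | bits(inst, 30, 25) << 5 | bits(inst, 11, 8) << 1)
    return sign_extend(raw, 13)

def imm_u(inst):
    # The low 12 bits of a U-immediate are zero.
    return sign_extend(inst & 0xFFFFF000, 32)

def imm_j(inst):
    # inst[20] supplies imm[11]; imm[0] is always zero.
    raw = (bits(inst, 31, 31) << 20 | bits(inst, 19, 12) << 12
           | bits(inst, 20, 20) << 11 | bits(inst, 30, 21) << 1)
    return sign_extend(raw, 21)

print(imm_i(0xFFF00093))  # addi x1, x0, -1
print(imm_b(0xFE000EE3))  # beq x0, x0, -4
print(imm_j(0x008000EF))  # jal x1, 8
```

In every format the sign comes from inst[31], so the sign-extension step does not depend on which format is being decoded.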
Most integer computational instructions operate on XLEN bits of values held in the integer register file. Integer computational instructions are either encoded as register-immediate operations using the I-type format or as register-register operations using the R-type format. The destination is register rd for both register-immediate and register-register instructions. No integer computational instructions cause arithmetic exceptions.
We did not include special instruction-set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition:
add t0, t1, t2; bltu t0, t1, overflow
For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition:
addi t0, t1, +imm; blt t0, t1, overflow
This covers the common case of addition with an immediate operand.
For general signed addition, three additional instructions after the addition are required, leveraging the observation that the sum should be less than one of the operands if and only if the other operand is negative.
add t0, t1, t2
slti t3, t2, 0
slt t4, t0, t1
bne t3, t4, overflow
In RV64I, checks of 32-bit signed additions can be optimized further by comparing the results of ADD and ADDW on the operands.
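The branch-based checks above can be modeled in Python to see why they work. This is only a sketch for XLEN=32; the function names are hypothetical, and register values are represented as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def add32(a, b):
    """ADD: keep only the low 32 bits of the sum."""
    return (a + b) & MASK32

def unsigned_add_overflows(t1, t2):
    # add t0, t1, t2; bltu t0, t1, overflow
    t0 = add32(t1, t2)
    return t0 < (t1 & MASK32)

def signed_add_overflows(t1, t2):
    # add t0, t1, t2; slti t3, t2, 0; slt t4, t0, t1; bne t3, t4, overflow
    t0 = add32(t1, t2)
    t3 = to_signed(t2) < 0
    t4 = to_signed(t0) < to_signed(t1)
    return t3 != t4

print(unsigned_add_overflows(0xFFFFFFFF, 1))
print(signed_add_overflows(0x7FFFFFFF, 1))
```

The signed check relies on the wrapped sum being smaller than `t1` exactly when `t2` is negative; any disagreement between the two flags signals overflow.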
imm[11:0] | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12 | 5 | 3 | 5 | 7 |
I-immediate[11:0] | src | ADDI/SLTI[U] | dest | OP-IMM |
I-immediate[11:0] | src | ANDI/ORI/XORI | dest | OP-IMM |
ADDI adds the sign-extended 12-bit immediate to register rs1. Arithmetic overflow is ignored and the result is simply the low XLEN bits of the result. ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler pseudoinstruction.
SLTI (set less than immediate) places the value 1 in register rd if register rs1 is less than the sign-extended immediate when both are treated as signed numbers, else 0 is written to rd. SLTIU is similar but compares the values as unsigned numbers (i.e., the immediate is first sign-extended to XLEN bits then treated as an unsigned number). Note, SLTIU rd, rs1, 1 sets rd to 1 if rs1 equals zero, otherwise sets rd to 0 (assembler pseudoinstruction SEQZ rd, rs).
ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and XOR on register rs1 and the sign-extended 12-bit immediate and place the result in rd. Note, XORI rd, rs1, -1 performs a bitwise logical inversion of register rs1 (assembler pseudoinstruction NOT rd, rs).
imm[11:5] | imm[4:0] | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7 | 5 | 5 | 3 | 5 | 7 |
0000000 | shamt[4:0] | src | SLLI | dest | OP-IMM |
0000000 | shamt[4:0] | src | SRLI | dest | OP-IMM |
0100000 | shamt[4:0] | src | SRAI | dest | OP-IMM |
Shifts by a constant are encoded as a specialization of the I-type format. The operand to be shifted is in rs1, and the shift amount is encoded in the lower 5 bits of the I-immediate field. The right shift type is encoded in bit 30. SLLI is a logical left shift (zeros are shifted into the lower bits); SRLI is a logical right shift (zeros are shifted into the upper bits); and SRAI is an arithmetic right shift (the original sign bit is copied into the vacated upper bits).
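A small Python model may help contrast the three shift types for XLEN=32. The function names mirror the mnemonics but are otherwise just illustrative helpers; values are held as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF

def slli(rs1, shamt):
    """Logical left shift: zeros enter at the low end."""
    return (rs1 << (shamt & 0x1F)) & MASK32

def srli(rs1, shamt):
    """Logical right shift: zeros enter at the high end."""
    return (rs1 & MASK32) >> (shamt & 0x1F)

def srai(rs1, shamt):
    """Arithmetic right shift: the sign bit fills the vacated upper bits."""
    signed = (rs1 & MASK32) - ((rs1 & 0x80000000) << 1)
    return (signed >> (shamt & 0x1F)) & MASK32

print(hex(srli(0x80000000, 4)))
print(hex(srai(0x80000000, 4)))
```

Only the 5-bit shift amount is used, which is why each helper masks `shamt` with `0x1F`.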
imm[31:12] | rd | opcode |
---|---|---|
20 | 5 | 7 |
U-immediate[31:12] | dest | LUI |
U-immediate[31:12] | dest | AUIPC |
LUI (load upper immediate) is used to build 32-bit constants and uses the U-type format. LUI places the 32-bit U-immediate value into the destination register rd, filling in the lowest 12 bits with zeros.
AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the address of the AUIPC instruction, then places the result in register rd.
The assembly syntax for lui and auipc does not represent the lower 12 bits of the U-immediate, which are always zero.
The AUIPC instruction supports two-instruction sequences to access arbitrary offsets from the pc for both control-flow transfers and data accesses. The combination of an AUIPC and the 12-bit immediate in a JALR can transfer control to any 32-bit pc-relative address, while an AUIPC plus the 12-bit immediate offset in regular load or store instructions can access any 32-bit pc-relative data address.
The current pc can be obtained by setting the U-immediate to 0. Although a JAL +4 instruction could also be used to obtain the local pc (of the instruction following the JAL), it might cause pipeline breaks in simpler microarchitectures or pollute branch-target buffer structures in more complex microarchitectures.
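To make the two-instruction AUIPC sequences concrete, here is a sketch of how a 32-bit pc-relative offset might be split into a 20-bit upper part and a signed 12-bit lower part. Because the 12-bit immediate is sign-extended, the upper part has to be rounded up whenever the lower part comes out negative. The function names are hypothetical, and the splitting rule is stated as an assumption of this sketch rather than something the text above prescribes.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, width):
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def split_offset(offset):
    """Split a 32-bit offset into (hi20, lo12) with lo12 in [-2048, 2047]."""
    offset &= MASK32
    lo12 = sign_extend(offset, 12)
    # Subtracting the signed low part rounds hi20 up when lo12 is negative.
    hi20 = ((offset - lo12) >> 12) & 0xFFFFF
    return hi20, lo12

def auipc(pc, hi20):
    """AUIPC: pc plus the U-immediate (hi20 with 12 zero bits appended)."""
    return (pc + (hi20 << 12)) & MASK32

def pc_relative_address(pc, offset):
    """Address formed by AUIPC followed by a 12-bit signed immediate."""
    hi20, lo12 = split_offset(offset)
    return (auipc(pc, hi20) + lo12) & MASK32

print(split_offset(0x12345FFF))
print(hex(pc_relative_address(0x1000, -4)))
```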
RV32I defines several arithmetic R-type operations. All operations read the rs1 and rs2 registers as source operands and write the result into register rd. The funct7 and funct3 fields select the type of operation.
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7 | 5 | 5 | 3 | 5 | 7 |
0000000 | src2 | src1 | ADD/SLT[U] | dest | OP |
0000000 | src2 | src1 | AND/OR/XOR | dest | OP |
0000000 | src2 | src1 | SLL/SRL | dest | OP |
0100000 | src2 | src1 | SUB/SRA | dest | OP |
ADD performs the addition of rs1 and rs2. SUB performs the subtraction of rs2 from rs1. Overflows are ignored and the low XLEN bits of results are written to the destination rd. SLT and SLTU perform signed and unsigned compares respectively, writing 1 to rd if rs1 < rs2, 0 otherwise. AND, OR, and XOR perform bitwise logical operations.
SLL, SRL, and SRA perform logical left, logical right, and arithmetic right shifts on the value in register rs1 by the shift amount held in the lower 5 bits of register rs2.
imm[11:0] | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12 | 5 | 3 | 5 | 7 |
0 | 0 | ADDI | 0 | OP-IMM |
The NOP instruction does not change any architecturally visible state, except for advancing the pc and incrementing any applicable performance counters. NOP is encoded as ADDI x0, x0, 0.
NOPs can be used to align code segments to microarchitecturally significant address boundaries, or to leave space for inline code modifications. Although there are many possible ways to encode a NOP, we define a canonical NOP encoding to allow microarchitectural optimizations as well as for more readable disassembly output. The other NOP encodings are made available for HINT instructions (Section 1.9).
ADDI was chosen for the NOP encoding as this is most likely to take fewest resources to execute across a range of systems (if not optimized away in decode). In particular, the instruction only reads one register. Also, an ADDI functional unit is more likely to be available in a superscalar design as adds are the most common operation. In particular, address-generation functional units can execute ADDI using the same hardware needed for base+offset address calculations, while register-register ADD or logical/shift operations require additional hardware.
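The canonical NOP word can be derived by packing the I-type fields. The sketch below does so in Python; `encode_i` is a made-up helper, and the opcode and funct3 constants are the standard values for OP-IMM and ADDI, which this chapter does not list numerically.

```python
OP_IMM = 0b0010011      # major opcode for register-immediate operations
ADDI_FUNCT3 = 0b000     # funct3 value selecting ADDI

def encode_i(imm, rs1, funct3, rd, opcode):
    """Pack I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20 | (rs1 & 0x1F) << 15 | (funct3 & 0x7) << 12
            | (rd & 0x1F) << 7 | (opcode & 0x7F))

# NOP is ADDI x0, x0, 0.
NOP = encode_i(0, 0, ADDI_FUNCT3, 0, OP_IMM)
# MV x10, x11 is ADDI x10, x11, 0.
MV_X10_X11 = encode_i(0, 11, ADDI_FUNCT3, 10, OP_IMM)

print(f"{NOP:#010x}")
print(f"{MV_X10_X11:#010x}")
```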
RV32I provides two types of control transfer instructions: unconditional jumps and conditional branches. Control transfer instructions in RV32I do not have architecturally visible delay slots.
If an instruction access-fault or instruction page-fault exception occurs on the target of a jump or taken branch, the exception is reported on the target instruction, not on the jump or branch instruction.
The jump and link (JAL) instruction uses the J-type format, where the J-immediate encodes a signed offset in multiples of 2 bytes. The offset is sign-extended and added to the address of the jump instruction to form the jump target address. Jumps can therefore target a ±1 MiB range. JAL stores the address of the instruction that follows the JAL (pc+4) into register rd. The standard software calling convention uses x1 as the return address register and x5 as an alternate link register.
The alternate link register supports calling millicode routines (e.g., those to save and restore registers in compressed code) while preserving the regular return address register. The register x5 was chosen as the alternate link register as it maps to a temporary in the standard calling convention, and has an encoding that differs in only one bit from that of the regular link register.
Plain unconditional jumps (assembler pseudoinstruction J) are encoded as a JAL with rd=x0.
imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode |
---|---|---|---|---|---|
1 | 10 | 1 | 8 | 5 | 7 |
offset[20] | offset[10:1] | offset[11] | offset[19:12] | dest | JAL |
The indirect jump instruction JALR (jump and link register) uses the I-type encoding. The target address is obtained by adding the sign-extended 12-bit I-immediate to the register rs1, then setting the least-significant bit of the result to zero. The address of the instruction following the jump (pc+4) is written to register rd. Register x0 can be used as the destination if the result is not required.
imm[11:0] | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12 | 5 | 3 | 5 | 7 |
offset[11:0] | base | 0 | dest | JALR |
The unconditional jump instructions all use pc-relative addressing to help support position-independent code. The JALR instruction was defined to enable a two-instruction sequence to jump anywhere in a 32-bit absolute address range. A LUI instruction can first load rs1 with the upper 20 bits of a target address, then JALR can add in the lower bits. Similarly, AUIPC then JALR can jump anywhere in a 32-bit pc-relative address range.
Note that the JALR instruction does not treat the 12-bit immediate as multiples of 2 bytes, unlike the conditional branch instructions. This avoids one more immediate format in hardware. In practice, most uses of JALR will have either a zero immediate or be paired with a LUI or AUIPC, so the slight reduction in range is not significant.
Clearing the least-significant bit when calculating the JALR target address both simplifies the hardware slightly and allows the low bit of function pointers to be used to store auxiliary information. Although there is potentially a slight loss of error checking in this case, in practice jumps to an incorrect instruction address will usually quickly raise an exception.
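The JALR target calculation can be summarized in a few lines of Python. This is a sketch for XLEN=32 with a hypothetical `jalr` helper that returns the target address and the link value; it ignores the misaligned-target exception.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, width):
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def jalr(pc, rs1_value, imm12):
    """Return (target, link) for a JALR at address pc."""
    # Add the sign-extended immediate, wrap to 32 bits, then clear bit 0.
    target = (rs1_value + sign_extend(imm12, 12)) & MASK32 & ~1
    link = (pc + 4) & MASK32
    return target, link

print([hex(v) for v in jalr(0x100, 0x2001, 0)])
print([hex(v) for v in jalr(0x100, 0, 0x800)])
```

The second call uses a base of zero with a negative immediate, which lands in the highest 2 KiB of the address space.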
When used with a base rs1=x0, JALR can be used to implement a single-instruction subroutine call to the lowest or highest address region from anywhere in the address space, which could be used to implement fast calls to a small runtime library. Alternatively, an ABI could dedicate a general-purpose register to point to a library elsewhere in the address space.
The JAL and JALR instructions will generate an instruction-address-misaligned exception if the target address is not aligned to an IALIGN-bit boundary.
Instruction-address-misaligned exceptions are not possible on machines with IALIGN=16, such as those that support the compressed instruction-set extension, C.
Return-address prediction stacks are a common feature of high-performance instruction-fetch units, but require accurate detection of instructions used for procedure calls and returns to be effective. For RISC-V, hints as to the instructions’ usage are encoded implicitly via the register numbers used. A JAL instruction should push the return address onto a return-address stack (RAS) only when rd is x1 or x5. JALR instructions should push/pop a RAS as shown in Table 1.1.
rd is x1/x5 | rs1 is x1/x5 | rd=rs1 | RAS action |
---|---|---|---|
No | No | – | None |
No | Yes | – | Pop |
Yes | No | – | Push |
Yes | Yes | No | Pop, then push |
Yes | Yes | Yes | Push |
Return-address stack prediction hints encoded in the register operands of a JALR instruction.
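The table above translates directly into a small decision function. The Python sketch below is only a restatement of the table; `ras_action` and `LINK_REGS` are names chosen for this example, and registers are given by number.

```python
LINK_REGS = {1, 5}  # x1 and x5

def ras_action(rd, rs1):
    """Return the RAS hint for a JALR with the given register numbers."""
    rd_is_link = rd in LINK_REGS
    rs1_is_link = rs1 in LINK_REGS
    if not rd_is_link:
        return "pop" if rs1_is_link else "none"
    if not rs1_is_link or rd == rs1:
        return "push"
    return "pop, then push"

print(ras_action(0, 1))   # jalr x0, 0(x1), a typical function return
print(ras_action(1, 10))  # jalr x1, 0(x10), an indirect call
print(ras_action(1, 5))   # different link registers
```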
Some other ISAs added explicit hint bits to their indirect-jump instructions to guide return-address stack manipulation. We use implicit hinting tied to register numbers and the calling convention to reduce the encoding space used for these hints.
When two different link registers (x1 and x5) are given as rs1 and rd, then the RAS is both popped and pushed to support coroutines. If rs1 and rd are the same link register (either x1 or x5), the RAS is only pushed to enable macro-op fusion of the sequences:
lui ra, imm20; jalr ra, imm12(ra)
and
auipc ra, imm20; jalr ra, imm12(ra)
All branch instructions use the B-type instruction format. The 12-bit B-immediate encodes signed offsets in multiples of 2 bytes. The offset is sign-extended and added to the address of the branch instruction to give the target address. The conditional branch range is ±4 KiB.
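imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
---|---|---|---|---|---|---|---|
1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
offset[12] | offset[10:5] | src2 | src1 | BEQ/BNE | offset[4:1] | offset[11] | BRANCH |
offset[12] | offset[10:5] | src2 | src1 | BLT[U] | offset[4:1] | offset[11] | BRANCH |
offset[12] | offset[10:5] | src2 | src1 | BGE[U] | offset[4:1] | offset[11] | BRANCH |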
Branch instructions compare two registers. BEQ and BNE take the branch if registers rs1 and rs2 are equal or unequal respectively. BLT and BLTU take the branch if rs1 is less than rs2, using signed and unsigned comparison respectively. BGE and BGEU take the branch if rs1 is greater than or equal to rs2, using signed and unsigned comparison respectively. Note, BGT, BGTU, BLE, and BLEU can be synthesized by reversing the operands to BLT, BLTU, BGE, and BGEU, respectively.
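The reach of branches and jumps follows from the immediate widths: a signed field of n bits counted in 2-byte units. A short Python calculation, with a hypothetical `offset_range` helper, gives the exact byte ranges for the 12-bit B-immediate and the 20-bit J-immediate.

```python
def offset_range(field_bits, scale_bytes=2):
    """Byte range reachable by a signed field counted in scale_bytes units."""
    lowest = -(1 << (field_bits - 1)) * scale_bytes
    highest = ((1 << (field_bits - 1)) - 1) * scale_bytes
    return lowest, highest

print(offset_range(12))  # conditional branches (B-type)
print(offset_range(20))  # JAL (J-type)
```

The negative end of each range is one step larger in magnitude than the positive end, as with any two's complement field.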
Signed array bounds may be checked with a single BLTU instruction, since any negative index will compare greater than any nonnegative bound.
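The single-instruction bounds check works because a negative index, reinterpreted as unsigned, becomes a very large number. A minimal Python model for XLEN=32, using made-up helper names:

```python
MASK32 = 0xFFFFFFFF

def bltu(rs1, rs2):
    """BLTU condition: compare the 32-bit patterns as unsigned numbers."""
    return (rs1 & MASK32) < (rs2 & MASK32)

def index_in_bounds(index, bound):
    """True when 0 <= index < bound, for a nonnegative signed bound."""
    return bltu(index, bound)

print(index_in_bounds(3, 10))
print(index_in_bounds(-1, 10))
```

An index of -1 has the bit pattern 0xFFFFFFFF, which is never below a nonnegative bound, so the same comparison rejects both negative and too-large indices.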
Software should be optimized such that the sequential code path is the most common path, with less-frequently taken code paths placed out of line. Software should also assume that backward branches will be predicted taken and forward branches as not taken, at least the first time they are encountered. Dynamic predictors should quickly learn any predictable branch behavior.
Unlike some other architectures, the RISC-V jump (JAL with rd=x0) instruction should always be used for unconditional branches instead of a conditional branch instruction with an always-true condition. RISC-V jumps are also pc-relative and support a much wider offset range than branches, and will not pollute conditional-branch prediction tables.
The conditional branches were designed to include arithmetic comparison operations between two registers (as also done in PA-RISC, Xtensa, and MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or to only compare one register against zero (Alpha, MIPS), or two registers only for equality (MIPS). This design was motivated by the observation that a combined compare-and-branch instruction fits into a regular pipeline, avoids additional condition code state or use of a temporary register, and reduces static code size and dynamic instruction fetch traffic. Another point is that comparisons against zero require non-trivial circuit delay (especially after the move to static logic in advanced processes) and so are almost as expensive as arithmetic magnitude compares. Another advantage of a fused compare-and-branch instruction is that branches are observed earlier in the front-end instruction stream, and so can be predicted earlier. There is perhaps an advantage to a design with condition codes in the case where multiple branches can be taken based on the same condition codes, but we believe this case to be relatively rare.
We considered but did not include static branch hints in the instruction encoding. These can reduce the pressure on dynamic predictors, but require more instruction encoding space and software profiling for best results, and can result in poor performance if production runs do not match profiling runs.
We considered but did not include conditional moves or predicated instructions, which can effectively replace unpredictable short forward branches. Conditional moves are the simpler of the two, but are difficult to use with conditional code that might cause exceptions (memory accesses and floating-point operations). Predication adds additional flag state to a system, additional instructions to set and clear flags, and additional encoding overhead on every instruction. Both conditional move and predicated instructions add complexity to out-of-order microarchitectures, adding an implicit third source operand due to the need to copy the original value of the destination architectural register into the renamed destination physical register if the predicate is false. Also, static compile-time decisions to use predication instead of branches can result in lower performance on inputs not included in the compiler training set, especially given that unpredictable branches are rare, and becoming rarer as branch prediction techniques improve.
We note that various microarchitectural techniques exist to dynamically convert unpredictable short forward branches into internally predicated code to avoid the cost of flushing pipelines on a branch mispredict and have been implemented in commercial processors. The simplest techniques just reduce the penalty of recovering from a mispredicted short forward branch by only flushing instructions in the branch shadow instead of the entire fetch pipeline, or by fetching instructions from both sides using wide instruction fetch or idle instruction fetch slots. More complex techniques for out-of-order cores add internal predicates on instructions in the branch shadow, with the internal predicate value written by the branch instruction, allowing the branch and following instructions to be executed speculatively and out-of-order with respect to other code.
The conditional branch instructions will generate an instruction-address-misaligned exception if the target address is not aligned to an IALIGN-bit boundary and the branch condition evaluates to true. If the branch condition evaluates to false, the instruction-address-misaligned exception will not be raised.
Instruction-address-misaligned exceptions are not possible on machines with IALIGN=16, such as those that support the compressed instruction-set extension, C.
RV32I is a load-store architecture, where only load and store instructions access memory and arithmetic instructions only operate on CPU registers. RV32I provides a 32-bit address space that is byte-addressed. The EEI will define what portions of the address space are legal to access with which instructions (e.g., some addresses might be read only, or support word access only). Loads with a destination of x0 must still raise any exceptions and cause any other side effects even though the load value is discarded.
The EEI will define whether the memory system is little-endian or big-endian. In RISC-V, endianness is byte-address invariant.
In a system for which endianness is byte-address invariant, the following property holds: if a byte is stored to memory at some address in some endianness, then a byte-sized load from that address in any endianness returns the stored value.
In a little-endian configuration, multibyte stores write the least-significant register byte at the lowest memory byte address, followed by the other register bytes in ascending order of their significance. Loads similarly transfer the contents of the lesser memory byte addresses to the less-significant register bytes.
In a big-endian configuration, multibyte stores write the most-significant register byte at the lowest memory byte address, followed by the other register bytes in descending order of their significance. Loads similarly transfer the contents of the greater memory byte addresses to the less-significant register bytes.
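The two byte orders can be compared with a toy memory model in Python. The helpers `store_word` and `load_word` are hypothetical and stand in for SW and LW under a chosen endianness; they rely on Python's `int.to_bytes` and `int.from_bytes`.

```python
def store_word(memory, addr, value, byteorder):
    """Write a 32-bit value at addr using 'little' or 'big' byte order."""
    memory[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, byteorder)

def load_word(memory, addr, byteorder):
    """Read a 32-bit value at addr using 'little' or 'big' byte order."""
    return int.from_bytes(memory[addr:addr + 4], byteorder)

memory = bytearray(8)
store_word(memory, 0, 0x11223344, "little")
store_word(memory, 4, 0x11223344, "big")
print(memory.hex())

# A single byte has no internal order, so a byte store followed by a
# byte load at the same address returns the same value either way.
memory[2] = 0xAB
print(hex(memory[2]))
```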
imm[11:0] | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12 | 5 | 3 | 5 | 7 |
offset[11:0] | base | width | dest | LOAD |

imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
---|---|---|---|---|---|
7 | 5 | 5 | 3 | 5 | 7 |
offset[11:5] | src | base | width | offset[4:0] | STORE |
Load and store instructions transfer a value between the registers and memory. Loads are encoded in the I-type format and stores are S-type. The effective address is obtained by adding register rs1 to the sign-extended 12-bit offset. Loads copy a value from memory to register rd. Stores copy the value in register rs2 to memory.
The LW instruction loads a 32-bit value from memory into rd. LH loads a 16-bit value from memory, then sign-extends to 32-bits before storing in rd. LHU loads a 16-bit value from memory but then zero extends to 32-bits before storing in rd. LB and LBU are defined analogously for 8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and 8-bit values from the low bits of register rs2 to memory.
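The difference between the sign-extending and zero-extending loads is easy to see in a small model. This Python sketch assumes a little-endian execution environment and uses hypothetical helpers named after the mnemonics; results are shown as unsigned 32-bit register values.

```python
def sign_extend(value, width):
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def load(memory, addr, size, signed):
    """Load `size` bytes (little-endian assumed) and extend to 32 bits."""
    raw = int.from_bytes(memory[addr:addr + size], "little")
    value = sign_extend(raw, 8 * size) if signed else raw
    return value & 0xFFFFFFFF

def lb(memory, addr):  return load(memory, addr, 1, True)
def lbu(memory, addr): return load(memory, addr, 1, False)
def lh(memory, addr):  return load(memory, addr, 2, True)
def lhu(memory, addr): return load(memory, addr, 2, False)
def lw(memory, addr):  return load(memory, addr, 4, False)

def store(memory, addr, size, rs2_value):
    """SB/SH/SW: write the low `size` bytes of the register value."""
    low = rs2_value & ((1 << (8 * size)) - 1)
    memory[addr:addr + size] = low.to_bytes(size, "little")

memory = bytearray([0x80, 0xFF, 0x34, 0x12])
print(hex(lb(memory, 0)), hex(lbu(memory, 0)))
print(hex(lh(memory, 0)), hex(lhu(memory, 0)))
print(hex(lw(memory, 0)))
```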
Regardless of EEI, loads and stores whose effective addresses are naturally aligned shall not raise an address-misaligned exception. Loads and stores whose effective address is not naturally aligned to the referenced datatype (i.e., the effective address is not divisible by the size of the access in bytes) have behavior dependent on the EEI.
An EEI may guarantee that misaligned loads and stores are fully supported, and so the software running inside the execution environment will never experience a contained or fatal address-misaligned trap. In this case, the misaligned loads and stores can be handled in hardware, or via an invisible trap into the execution environment implementation, or possibly a combination of hardware and invisible trap depending on address.
An EEI may not guarantee misaligned loads and stores are handled invisibly. In this case, loads and stores that are not naturally aligned may either complete execution successfully or raise an exception. The exception raised can be either an address-misaligned exception or an access-fault exception. For a memory access that would otherwise be able to complete except for the misalignment, an access-fault exception can be raised instead of an address-misaligned exception if the misaligned access should not be emulated, e.g., if accesses to the memory region have side effects. When an EEI does not guarantee misaligned loads and stores are handled invisibly, the EEI must define if exceptions caused by address misalignment result in a contained trap (allowing software running inside the execution environment to handle the trap) or a fatal trap (terminating execution).
Misaligned accesses are occasionally required when porting legacy code, and help performance on applications when using any form of packed-SIMD extension or handling externally packed data structures. Our rationale for allowing EEIs to choose to support misaligned accesses via the regular load and store instructions is to simplify the addition of misaligned hardware support. One option would have been to disallow misaligned accesses in the base ISAs and then provide some separate ISA support for misaligned accesses, either special instructions to help software handle misaligned accesses or a new hardware addressing mode for misaligned accesses. Special instructions are difficult to use, complicate the ISA, and often add new processor state (e.g., SPARC VIS align address offset register) or complicate access to existing processor state (e.g., MIPS LWL/LWR partial register writes). In addition, for loop-oriented packed-SIMD code, the extra overhead when operands are misaligned motivates software to provide multiple forms of loop depending on operand alignment, which complicates code generation and adds to loop startup overhead. New misaligned hardware addressing modes take considerable space in the instruction encoding or require very simplified addressing modes (e.g., register indirect only).
Even when misaligned loads and stores complete successfully, these accesses might run extremely slowly depending on the implementation (e.g., when implemented via an invisible trap). Furthermore, whereas naturally aligned loads and stores are guaranteed to execute atomically, misaligned loads and stores might not, and hence require additional synchronization to ensure atomicity.
We do not mandate atomicity for misaligned accesses so execution environment implementations can use an invisible machine trap and a software handler to handle some or all misaligned accesses. If hardware misaligned support is provided, software can exploit this by simply using regular load and store instructions. Hardware can then automatically optimize accesses depending on whether runtime addresses are aligned.
fm | PI | PO | PR | PW | SI | SO | SR | SW | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 3 | 5 | 7 |
FM | predecessor | | | | successor | | | | 0 | FENCE | 0 | MISC-MEM |
The FENCE instruction is used to order device I/O and memory accesses as viewed by other RISC-V harts and external devices or coprocessors. Any combination of device input (I), device output (O), memory reads (R), and memory writes (W) may be ordered with respect to any combination of the same. Informally, no other RISC-V hart or external device can observe any operation in the successor set following a FENCE before any operation in the predecessor set preceding the FENCE. Chapter [ch:memorymodel] provides a precise description of the RISC-V memory consistency model.
The FENCE instruction also orders memory reads and writes made by the hart as observed by memory reads and writes made by an external device. However, FENCE does not order observations of events made by an external device using any other signaling mechanism.
A device might observe an access to a memory location via some external communication mechanism, e.g., a memory-mapped control register that drives an interrupt signal to an interrupt controller. This communication is outside the scope of the FENCE ordering mechanism and hence the FENCE instruction can provide no guarantee on when a change in the interrupt signal is visible to the interrupt controller. Specific devices might provide additional ordering guarantees to reduce software overhead but those are outside the scope of the RISC-V memory model.
The EEI will define what I/O operations are possible, and in particular, which memory addresses when accessed by load and store instructions will be treated and ordered as device input and device output operations respectively rather than memory reads and writes. For example, memory-mapped I/O devices will typically be accessed with uncached loads and stores that are ordered using the I and O bits rather than the R and W bits. Instruction-set extensions might also describe new I/O instructions that will also be ordered using the I and O bits in a FENCE.
fm field | Mnemonic | Meaning |
---|---|---|
0000 | none | Normal Fence |
1000 | TSO | With FENCE RW,RW: exclude write-to-read ordering. Otherwise: Reserved for future use. |
other | | Reserved for future use. |
Fence mode encoding.
The fence mode field fm defines the semantics of the FENCE. A FENCE with fm=0000 orders all memory operations in its predecessor set before all memory operations in its successor set.
The FENCE.TSO instruction is encoded as a FENCE instruction with fm=1000, predecessor=RW, and successor=RW. FENCE.TSO orders all load operations in its predecessor set before all memory operations in its successor set, and all store operations in its predecessor set before all store operations in its successor set. This leaves non-AMO store operations in the FENCE.TSO’s predecessor set unordered with non-AMO loads in its successor set.
Because FENCE RW,RW imposes a superset of the orderings that FENCE.TSO imposes, it is correct to ignore the fm field and implement FENCE.TSO as FENCE RW,RW.
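The superset relationship can be checked with a short Python sketch. It is a simplification under stated assumptions: each ordering is modeled as a pair (predecessor kind, successor kind) with R for loads and W for stores, AMOs are ignored, and the function names are illustrative only.

```python
from itertools import product


def fence_rw_rw_orderings() -> set:
    """Orderings imposed by FENCE RW,RW: every R/W pair is ordered."""
    return set(product("RW", repeat=2))


def fence_tso_orderings() -> set:
    """Orderings imposed by FENCE.TSO.

    Loads are ordered before all later memory operations, and stores are
    ordered before later stores. Store-to-load ordering is left out.
    """
    return {("R", "R"), ("R", "W"), ("W", "W")}


# The only pair that FENCE.TSO leaves unordered is store-then-load.
relaxed = fence_rw_rw_orderings() - fence_tso_orderings()
```

Because the FENCE.TSO set is contained in the FENCE RW,RW set, executing the stronger fence in its place never permits a behavior that FENCE.TSO forbids.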
The unused fields in the FENCE instructions—rs1 and rd—are reserved for finer-grain fences in future extensions. For forward compatibility, base implementations shall ignore these fields, and standard software shall zero these fields. Likewise, many fm and predecessor/successor set settings in Table 1.2 are also reserved for future use. Base implementations shall treat all such reserved configurations as normal fences with fm=0000, and standard software shall use only non-reserved configurations.
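As an illustration of the field layout, the following Python sketch assembles FENCE instruction words. The helper names are hypothetical; the sketch assumes the standard MISC-MEM major opcode (0001111) and funct3=000 for FENCE, places the I, O, R, W bits in that order from most to least significant within each set, and zeroes rs1 and rd as standard software is required to do.

```python
MISC_MEM_OPCODE = 0b0001111
FENCE_FUNCT3 = 0b000
# Bit weights within a 4-bit predecessor or successor set (I, O, R, W).
_SET_BITS = {"i": 8, "o": 4, "r": 2, "w": 1}


def access_set(spec: str) -> int:
    """Convert a string such as 'rw' or 'iorw' to a 4-bit set value."""
    bits = 0
    for ch in spec.lower():
        bits |= _SET_BITS[ch]
    return bits


def encode_fence(pred: str, succ: str, fm: int = 0b0000) -> int:
    """Return the 32-bit FENCE encoding with rs1 and rd set to zero."""
    return ((fm & 0xF) << 28
            | access_set(pred) << 24
            | access_set(succ) << 20
            | 0 << 15                # rs1, reserved, zeroed
            | FENCE_FUNCT3 << 12
            | 0 << 7                 # rd, reserved, zeroed
            | MISC_MEM_OPCODE)


fence_rw_rw = encode_fence("rw", "rw")               # 0x0330000F
fence_tso = encode_fence("rw", "rw", fm=0b1000)      # 0x8330000F
```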
We chose a relaxed memory model to allow high performance from simple machine implementations and from likely future coprocessor or accelerator extensions. We separate out I/O ordering from memory R/W ordering to avoid unnecessary serialization within a device-driver hart and also to support alternative non-memory paths to control added coprocessors or I/O devices. Simple implementations may additionally ignore the predecessor and successor fields and always execute a conservative fence on all operations.
SYSTEM instructions are used to access system functionality that might require privileged access and are encoded using the I-type instruction format. These can be divided into two main classes: those that atomically read-modify-write control and status registers (CSRs), and all other potentially privileged instructions. CSR instructions are described in Chapter [csrinsts], and the base unprivileged instructions are described in the following section.
The SYSTEM instructions are defined to allow simpler implementations to always trap to a single software trap handler. More sophisticated implementations might execute more of each system instruction in hardware.
funct12 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12 | 5 | 3 | 5 | 7 |
ECALL | 0 | PRIV | 0 | SYSTEM |
EBREAK | 0 | PRIV | 0 | SYSTEM |
These two instructions cause a precise requested trap to the supporting execution environment.
The ECALL instruction is used to make a service request to the execution environment. The EEI will define how parameters for the service request are passed, but usually these will be in defined locations in the integer register file.
The EBREAK instruction is used to return control to a debugging environment.
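For reference, the two encodings can be derived with a few lines of Python. This sketch assumes values taken from the standard instruction listings rather than from the table above: the SYSTEM major opcode is 1110011, PRIV is funct3=000, and funct12 is 0 for ECALL and 1 for EBREAK. The function name is illustrative.

```python
SYSTEM_OPCODE = 0b1110011
PRIV_FUNCT3 = 0b000


def encode_system_priv(funct12: int) -> int:
    """Encode an I-type SYSTEM instruction with rs1 = rd = 0."""
    return (funct12 & 0xFFF) << 20 | 0 << 15 | PRIV_FUNCT3 << 12 | 0 << 7 | SYSTEM_OPCODE


ECALL = encode_system_priv(0)    # 0x00000073
EBREAK = encode_system_priv(1)   # 0x00100073
```

The two words differ in a single bit, which is what allows a simple implementation to decode both as one trapping SYSTEM instruction.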
ECALL and EBREAK were previously named SCALL and SBREAK. The instructions have the same functionality and encoding, but were renamed to reflect that they can be used more generally than to call a supervisor-level operating system or debugger.
EBREAK was primarily designed to be used by a debugger to cause execution to stop and fall back into the debugger. EBREAK is also used by the standard gcc compiler to mark code paths that should not be executed.
Another use of EBREAK is to support “semihosting”, where the execution environment includes a debugger that can provide services over an alternate system call interface built around the EBREAK instruction. Because the RISC-V base ISAs do not provide more than one EBREAK instruction, RISC-V semihosting uses a special sequence of instructions to distinguish a semihosting EBREAK from a debugger-inserted EBREAK.
slli x0, x0, 0x1f # Entry NOP
ebreak # Break to debugger
srai x0, x0, 7 # NOP encoding the semihosting call number 7
Note that these three instructions must be 32-bit-wide instructions, i.e., they must not be among the compressed 16-bit instructions described in Chapter [compressed].
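A debugger could recognize the sequence by comparing the instruction words on either side of the EBREAK. The Python sketch below is a minimal illustration with hypothetical names; it assumes the code is available as a list of 32-bit instruction words and uses the standard encodings of the three instructions shown above.

```python
SLLI_ENTRY_NOP = 0x01F01013     # slli x0, x0, 0x1f
EBREAK = 0x00100073             # ebreak
SRAI_SEMIHOST_NOP = 0x40705013  # srai x0, x0, 7


def is_semihosting_ebreak(words: list, index: int) -> bool:
    """Return True if words[index] is an EBREAK wrapped in the semihosting NOPs.

    words is a list of 32-bit instruction words and index is the position
    of the candidate EBREAK within that list.
    """
    if index < 1 or index + 1 >= len(words):
        return False
    return (words[index - 1] == SLLI_ENTRY_NOP
            and words[index] == EBREAK
            and words[index + 1] == SRAI_SEMIHOST_NOP)
```

An EBREAK that fails this check would be treated as an ordinary debugger breakpoint.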
The shift NOP instructions are still considered available for use as HINTs.
Semihosting is a form of service call and would be more naturally encoded as an ECALL using an existing ABI, but this would require the debugger to be able to intercept ECALLs, which is a newer addition to the debug standard. We intend to move over to using ECALLs with a standard ABI, in which case, semihosting can share a service ABI with an existing standard.
We note that ARM processors have also moved to using SVC instead of BKPT for semihosting calls in newer designs.
RV32I reserves a large encoding space for HINT instructions, which are usually used to communicate performance hints to the microarchitecture. Like the NOP instruction, HINTs do not change any architecturally visible state, except for advancing the pc and any applicable performance counters. Implementations are always allowed to ignore the encoded hints.
Most RV32I HINTs are encoded as integer computational instructions with rd=x0. The other RV32I HINTs are encoded as FENCE instructions with a null predecessor or successor set and with fm=0.
These HINT encodings have been chosen so that simple implementations can ignore HINTs altogether, and instead execute a HINT as a regular instruction that happens not to mutate the architectural state. For example, ADD is a HINT if the destination register is x0; the five-bit rs1 and rs2 fields encode arguments to the HINT. However, a simple implementation can simply execute the HINT as an ADD of rs1 and rs2 that writes x0, which has no architecturally visible effect.
As another example, a FENCE instruction with a zero pred field and a zero fm field is a HINT; the succ, rs1, and rd fields encode the arguments to the HINT. A simple implementation can simply execute the HINT as a FENCE that orders the null set of prior memory accesses before whichever subsequent memory accesses are encoded in the succ field. Since the intersection of the predecessor and successor sets is null, the instruction imposes no memory orderings, and so it has no architecturally visible effect.
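The two examples can be expressed as decode predicates. The Python sketch below uses hypothetical helper names and assumes the standard field positions and opcodes: OP is 0110011 with funct3=000 and funct7=0000000 for ADD, and MISC-MEM is 0001111 with funct3=000 for FENCE.

```python
def field(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def is_add_hint(word: int) -> bool:
    """True if word is an ADD whose destination register is x0."""
    is_add = (field(word, 6, 0) == 0b0110011
              and field(word, 14, 12) == 0b000
              and field(word, 31, 25) == 0b0000000)
    return is_add and field(word, 11, 7) == 0


def is_fence_hint(word: int) -> bool:
    """True if word is a FENCE with fm=0 and a null pred or succ set."""
    is_fence = field(word, 6, 0) == 0b0001111 and field(word, 14, 12) == 0b000
    fm, pred, succ = field(word, 31, 28), field(word, 27, 24), field(word, 23, 20)
    return is_fence and fm == 0 and (pred == 0 or succ == 0)
```

A simple implementation needs neither predicate: it can execute both encodings as an ordinary ADD or FENCE and obtain the same architectural result.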
Table [tab:rv32i-hints] lists all RV32I HINT code points. 91% of the HINT space is reserved for standard HINTs. The remainder of the HINT space is designated for custom HINTs: no standard HINTs will ever be defined in this subspace.
We anticipate standard hints to eventually include memory-system spatial and temporal locality hints, branch prediction hints, thread-scheduling hints, security tags, and instrumentation flags for simulation/emulation.
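Instruction | Constraints | Code Points | Purpose |
---|---|---|---|
LUI | rd=x0 | $2^{20}$ | Designated for future standard use |
AUIPC | rd=x0 | $2^{20}$ | Designated for future standard use |
ADDI | rd=x0, and either rs1≠x0 or imm≠0 | $2^{17}-1$ | Designated for future standard use |
ANDI | rd=x0 | $2^{17}$ | Designated for future standard use |
ORI | rd=x0 | $2^{17}$ | Designated for future standard use |
XORI | rd=x0 | $2^{17}$ | Designated for future standard use |
ADD | rd=x0, rs1≠x0 | $2^{10}-32$ | Designated for future standard use |
ADD | rd=x0, rs1=x0, rs2≠x2–x5 | 28 | Designated for future standard use |
ADD | rd=x0, rs1=x0, rs2=x2–x5 | 4 | (rs2=x2) NTL.P1; (rs2=x3) NTL.PALL; (rs2=x4) NTL.S1; (rs2=x5) NTL.ALL |
SUB | rd=x0 | $2^{10}$ | Designated for future standard use |
AND | rd=x0 | $2^{10}$ | Designated for future standard use |
OR | rd=x0 | $2^{10}$ | Designated for future standard use |
XOR | rd=x0 | $2^{10}$ | Designated for future standard use |
SLL | rd=x0 | $2^{10}$ | Designated for future standard use |
SRL | rd=x0 | $2^{10}$ | Designated for future standard use |
SRA | rd=x0 | $2^{10}$ | Designated for future standard use |
FENCE | rd=x0, rs1≠x0, fm=0, and either pred=0 or succ=0 | $2^{10}-63$ | Designated for future standard use |
FENCE | rd≠x0, rs1=x0, fm=0, and either pred=0 or succ=0 | $2^{10}-63$ | Designated for future standard use |
FENCE | rd=rs1=x0, fm=0, pred=0, succ≠0 | 15 | Designated for future standard use |
FENCE | rd=rs1=x0, fm=0, pred≠W, succ=0 | 15 | Designated for future standard use |
FENCE | rd=rs1=x0, fm=0, pred=W, succ=0 | 1 | PAUSE |
SLTI | rd=x0 | $2^{17}$ | Designated for custom use |
SLTIU | rd=x0 | $2^{17}$ | Designated for custom use |
SLLI | rd=x0 | $2^{10}$ | Designated for custom use |
SRLI | rd=x0 | $2^{10}$ | Designated for custom use |
SRAI | rd=x0 | $2^{10}$ | Designated for custom use |
SLT | rd=x0 | $2^{10}$ | Designated for custom use |
SLTU | rd=x0 | $2^{10}$ | Designated for custom use |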
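The 91% figure quoted above can be cross-checked by tallying code points. The Python sketch below rests on two assumptions: the per-row counts are derived from the free field widths under each constraint (for example, $31 \times 31$ for a FENCE with one nonzero register field and either set null), and the custom subspace consists of the SLTI, SLTIU, SLLI, SRLI, SRAI, SLT, and SLTU rows.

```python
# (row label, number of code points); counts follow from the free fields.
standard_rows = [
    ("LUI", 2**20), ("AUIPC", 2**20),
    ("ADDI", 2**17 - 1),                 # excludes the canonical NOP
    ("ANDI", 2**17), ("ORI", 2**17), ("XORI", 2**17),
    ("ADD rs1!=x0", 2**10 - 32),
    ("ADD rs1=x0", 32),                  # 28 unassigned + 4 NTL hints
    ("SUB", 2**10), ("AND", 2**10), ("OR", 2**10), ("XOR", 2**10),
    ("SLL", 2**10), ("SRL", 2**10), ("SRA", 2**10),
    ("FENCE rd=x0, rs1!=x0", 31 * 31),   # 31 registers x 31 pred/succ combos
    ("FENCE rd!=x0, rs1=x0", 31 * 31),
    ("FENCE rd=rs1=x0", 15 + 15 + 1),
]
custom_rows = [
    ("SLTI", 2**17), ("SLTIU", 2**17),
    ("SLLI", 2**10), ("SRLI", 2**10), ("SRAI", 2**10),
    ("SLT", 2**10), ("SLTU", 2**10),
]

standard_total = sum(count for _, count in standard_rows)
custom_total = sum(count for _, count in custom_rows)
standard_fraction = standard_total / (standard_total + custom_total)
standard_percent = round(100 * standard_fraction)
```

Under these assumptions the standard share rounds to the percentage stated in the text.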
This chapter describes the standard integer multiplication and division instruction extension, which is named “M” and contains instructions that multiply or divide values held in two integer registers.
We separate integer multiply and divide out from the base to simplify low-end implementations, or for applications where integer multiply and divide operations are either infrequent or better handled in attached accelerators.
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7 | 5 | 5 | 3 | 5 | 7 |
MULDIV | multiplier | multiplicand | MUL/MULH[[S]U] | dest | OP |
MULDIV | multiplier | multiplicand | MULW | dest | OP-32 |
MUL performs an XLEN-bit×XLEN-bit multiplication of rs1 by rs2 and places the lower XLEN bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2×XLEN-bit product, for signed×signed, unsigned×unsigned, and signed rs1×unsigned rs2 multiplication, respectively. If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in the same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
MULHSU is used in multi-word signed multiplication to multiply the most-significant word of the multiplicand (which contains the sign bit) with the less-significant words of the multiplier (which are unsigned).
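The four multiply results can be modeled in Python, whose integers are unbounded. This is a minimal sketch for XLEN=32 with hypothetical function names; register values are represented as unsigned bit patterns, and the sketch relies on Python's right shift of a negative integer being an arithmetic (flooring) shift.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def to_signed(x: int) -> int:
    """Interpret the low XLEN bits of x as a two's complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def mul(rs1: int, rs2: int) -> int:
    """Lower XLEN bits of the product (the same for any signedness)."""
    return (rs1 * rs2) & MASK


def mulh(rs1: int, rs2: int) -> int:
    """Upper XLEN bits of the signed x signed product."""
    return ((to_signed(rs1) * to_signed(rs2)) >> XLEN) & MASK


def mulhu(rs1: int, rs2: int) -> int:
    """Upper XLEN bits of the unsigned x unsigned product."""
    return (((rs1 & MASK) * (rs2 & MASK)) >> XLEN) & MASK


def mulhsu(rs1: int, rs2: int) -> int:
    """Upper XLEN bits of the signed rs1 x unsigned rs2 product."""
    return ((to_signed(rs1) * (rs2 & MASK)) >> XLEN) & MASK
```

With both operands equal to 0xFFFFFFFF, the low word is 1 in every case, while the high word is 0 for MULH, 0xFFFFFFFE for MULHU, and 0xFFFFFFFF for MULHSU.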
MULW is an RV64 instruction that multiplies the lower 32 bits of the source registers, placing the sign-extension of the lower 32 bits of the result into the destination register.
In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit product, but signed arguments must be proper 32-bit signed values, whereas unsigned arguments must have their upper 32 bits clear. If the arguments are not known to be sign- or zero-extended, an alternative is to shift both arguments left by 32 bits, then use MULH[[S]U].
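The RV64 behavior described in the last two paragraphs can be sketched the same way. The function names are hypothetical, and registers are modeled as unsigned 64-bit patterns; `upper32_signed_product` assumes its arguments are first placed in registers as properly sign-extended 32-bit values, as the text requires.

```python
MASK64 = (1 << 64) - 1


def sext32(x: int) -> int:
    """Sign-extend the low 32 bits of x to a Python integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def mul64(rs1: int, rs2: int) -> int:
    """RV64 MUL: lower 64 bits of the product."""
    return (rs1 * rs2) & MASK64


def mulw(rs1: int, rs2: int) -> int:
    """RV64 MULW: sign-extension of the low 32 bits of the product."""
    return sext32(rs1 * rs2) & MASK64


def upper32_signed_product(a32: int, b32: int) -> int:
    """Upper 32 bits of a signed 32x32 product, obtained with MUL."""
    rs1 = sext32(a32) & MASK64   # register holds a proper 32-bit signed value
    rs2 = sext32(b32) & MASK64
    return (mul64(rs1, rs2) >> 32) & 0xFFFFFFFF
```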
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7 | 5 | 5 | 3 | 5 | 7 |
MULDIV | divisor | dividend | DIV[U]/REM[U] | dest | OP |
MULDIV | divisor | dividend | DIV[U]W/REM[U]W | dest | OP-32 |
DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned integer division of rs1 by rs2, rounding towards zero. REM and REMU provide the remainder of the corresponding division operation. For REM, the sign of the result equals the sign of the dividend.
For both signed and unsigned division, it holds that dividend = divisor × quotient + remainder.
If both the quotient and remainder are required from the same division, the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr, rs1, rs2 (rdq cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single divide operation instead of performing two separate divides.
DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of rs1 by the lower 32 bits of rs2, treating them as signed and unsigned integers respectively, placing the 32-bit quotient in rd, sign-extended to 64 bits. REMW and REMUW are RV64 instructions that provide the corresponding signed and unsigned remainder operations respectively. Both REMW and REMUW always sign-extend the 32-bit result to 64 bits, including on a divide by zero.
The semantics for division by zero and division overflow are summarized in Table 1.1. The quotient of division by zero has all bits set, and the remainder of division by zero equals the dividend. Signed division overflow occurs only when the most-negative integer is divided by −1. The quotient of a signed division with overflow is equal to the dividend, and the remainder is zero. Unsigned division overflow cannot occur.
Condition | Dividend | Divisor | DIVU[W] | REMU[W] | DIV[W] | REM[W] |
---|---|---|---|---|---|---|
Division by zero | x | 0 | $2^{L}-1$ | x | −1 | x |
Overflow (signed only) | $-2^{L-1}$ | −1 | – | – | $-2^{L-1}$ | 0 |
Semantics for division by zero and division overflow. L is the width of the operation in bits: XLEN for DIV[U] and REM[U], or 32 for DIV[U]W and REM[U]W.
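The division rules, including the two special cases in the table, can be captured in a short Python model. This is a sketch for L=XLEN=32 with hypothetical function names; register values are unsigned bit patterns, and the signed quotient is computed on magnitudes because Python's `//` rounds toward negative infinity rather than toward zero.

```python
XLEN = 32
MASK = (1 << XLEN) - 1
MOST_NEGATIVE = -(1 << (XLEN - 1))


def to_signed(x: int) -> int:
    """Interpret the low XLEN bits of x as a two's complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def div(rs1: int, rs2: int) -> int:
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return MASK                      # all bits set, i.e. -1
    if a == MOST_NEGATIVE and b == -1:
        return a & MASK                  # overflow: quotient equals dividend
    q = abs(a) // abs(b)                 # truncate toward zero
    return (-q if (a < 0) != (b < 0) else q) & MASK


def rem(rs1: int, rs2: int) -> int:
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return a & MASK                  # remainder equals dividend
    if a == MOST_NEGATIVE and b == -1:
        return 0
    r = abs(a) % abs(b)
    return (-r if a < 0 else r) & MASK   # sign follows the dividend


def divu(rs1: int, rs2: int) -> int:
    a, b = rs1 & MASK, rs2 & MASK
    return MASK if b == 0 else a // b


def remu(rs1: int, rs2: int) -> int:
    a, b = rs1 & MASK, rs2 & MASK
    return a if b == 0 else a % b
```

In every case, including division by zero and overflow, the identity dividend = divisor × quotient + remainder holds modulo $2^{L}$.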
We considered raising exceptions on integer divide by zero, with these exceptions causing a trap in most execution environments. However, this would be the only arithmetic trap in the standard ISA (floating-point exceptions set flags and write default values, but do not cause traps) and would require language implementors to interact with the execution environment’s trap handlers for this case. Further, where language standards mandate that a divide-by-zero exception must cause an immediate control flow change, only a single branch instruction needs to be added to each divide operation, and this branch instruction can be inserted after the divide and should normally be very predictably not taken, adding little runtime overhead.
The value of all bits set is returned for both unsigned and signed divide by zero to simplify the divider circuitry. The value of all 1s is both the natural value to return for unsigned divide, representing the largest unsigned number, and also the natural result for simple unsigned divider implementations. Signed division is often implemented using an unsigned division circuit and specifying the same overflow result simplifies the hardware.
The Zmmul extension implements the multiplication subset of the M extension. It adds all of the instructions defined in Section 1.1, namely: MUL, MULH, MULHU, MULHSU, and (for RV64 only) MULW. The encodings are identical to those of the corresponding M-extension instructions.
The Zmmul extension enables low-cost implementations that require multiplication operations but not division. For many microcontroller applications, division operations are too infrequent to justify the cost of divider hardware. By contrast, multiplication operations are more frequent, making the cost of multiplier hardware more justifiable. Simple FPGA soft cores particularly benefit from eliminating division but retaining multiplication, since many FPGAs provide hardwired multipliers but require dividers be implemented in soft logic.
RISC-V defines a separate address space of 4096 Control and Status registers associated with each hart. This chapter defines the full set of CSR instructions that operate on these CSRs.
While CSRs are primarily used by the privileged architecture, there are several uses in unprivileged code including for counters and timers, and for floating-point status.
The counters and timers are no longer considered mandatory parts of the standard base ISAs, and so the CSR instructions required to access them have been moved out of Chapter [rv32] into this separate chapter.
All CSR instructions atomically read-modify-write a single CSR, whose CSR specifier is encoded in the 12-bit csr field of the instruction held in bits 31–20. The immediate forms use a 5-bit zero-extended immediate encoded in the rs1 field.
csr | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12 | 5 | 3 | 5 | 7 |
source/dest | source | CSRRW | dest | SYSTEM |
source/dest | source | CSRRS | dest | SYSTEM |
source/dest | source | CSRRC | dest | SYSTEM |
source/dest | uimm[4:0] | CSRRWI | dest | SYSTEM |
source/dest | uimm[4:0] | CSRRSI | dest | SYSTEM |
source/dest | uimm[4:0] | CSRRCI | dest | SYSTEM |
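The layout can be exercised with a small Python encoder. The function name is hypothetical, and the funct3 values and SYSTEM opcode are assumptions taken from the standard instruction listings, since the table above gives only mnemonics.

```python
SYSTEM_OPCODE = 0b1110011
CSR_FUNCT3 = {
    "csrrw": 0b001, "csrrs": 0b010, "csrrc": 0b011,
    "csrrwi": 0b101, "csrrsi": 0b110, "csrrci": 0b111,
}


def encode_csr(mnemonic: str, rd: int, csr: int, src: int) -> int:
    """Encode a CSR instruction.

    src is the rs1 register number for the register forms, or the 5-bit
    uimm for the immediate forms; both occupy bits 19..15.
    """
    funct3 = CSR_FUNCT3[mnemonic.lower()]
    if not (0 <= rd < 32 and 0 <= src < 32 and 0 <= csr < 4096):
        raise ValueError("field out of range")
    return csr << 20 | src << 15 | funct3 << 12 | rd << 7 | SYSTEM_OPCODE


# csrrs x10, 0xF14, x0 and csrrw x0, 0x305, x5
read_word = encode_csr("csrrs", 10, 0xF14, 0)    # 0xF1402573
write_word = encode_csr("csrrw", 0, 0x305, 5)    # 0x30529073
```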
The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in the CSRs and integer registers. CSRRW reads the old value of the CSR, zero-extends the value to XLEN bits, then writes it to integer register rd. The initial value in rs1 is written to the CSR. If rd=x0, then the instruction shall not read the CSR and shall not cause any of the side effects that might occur on a CSR read.
The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be set in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be set in the CSR, if that CSR bit is writable. Other bits in the CSR are not explicitly written.
The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be cleared in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be cleared in the CSR, if that CSR bit is writable. Other bits in the CSR are not explicitly written.
For both CSRRS and CSRRC, if rs1=x0, then the instruction will not write to the CSR at all, and so shall not cause any of the side effects that might otherwise occur on a CSR write, nor raise illegal instruction exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always read the addressed CSR and cause any read side effects regardless of rs1 and rd fields. Note that if rs1 specifies a register holding a zero value other than x0, the instruction will still attempt to write the unmodified value back to the CSR and will cause any attendant side effects. A CSRRW with rs1=x0 will attempt to write zero to the destination CSR.
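The set and clear semantics can be sketched as pure functions in Python. The names are hypothetical, and `writable_mask` is an assumed way of describing which CSR bits are writable; each function returns the value delivered to rd together with the new CSR value, and does not model side effects or the rs1=x0 case in which no write occurs.

```python
def csrrs(csr_value: int, rs1_value: int, writable_mask: int) -> tuple:
    """Return (old value for rd, new CSR value) for a CSRRS that writes."""
    old = csr_value
    new = old | (rs1_value & writable_mask)    # set only writable bits
    return old, new


def csrrc(csr_value: int, rs1_value: int, writable_mask: int) -> tuple:
    """Return (old value for rd, new CSR value) for a CSRRC that writes."""
    old = csr_value
    new = old & ~(rs1_value & writable_mask)   # clear only writable bits
    return old, new
```

For example, setting bits 0b0011 in a CSR holding 0b0101 when only bit 1 is writable yields 0b0111, and rd receives the original 0b0101.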
The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and CSRRC respectively, except they update the CSR using an XLEN-bit value obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field encoded in the rs1 field instead of a value from an integer register. For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these instructions will not write to the CSR, and shall not cause any of the side effects that might otherwise occur on a CSR write, nor raise illegal instruction exceptions on accesses to read-only CSRs. For CSRRWI, if rd=x0, then the instruction shall not read the CSR and shall not cause any of the side effects that might occur on a CSR read. Both CSRRSI and CSRRCI will always read the CSR and cause any read side effects regardless of rd and rs1 fields.
Register operand | ||||
---|---|---|---|---|
Instruction | rd is x0 | rs1 is x0 | Reads CSR | Writes CSR |
CSRRW | Yes | – | No | Yes |
CSRRW | No | – | Yes | Yes |
CSRRS/CSRRC | – | Yes | Yes | No |
CSRRS/CSRRC | – | No | Yes | Yes |
Immediate operand | ||||
Instruction | rd is x0 | uimm=0 | Reads CSR | Writes CSR |
CSRRWI | Yes | – | No | Yes |
CSRRWI | No | – | Yes | Yes |
CSRRSI/CSRRCI | – | Yes | Yes | No |
CSRRSI/CSRRCI | – | No | Yes | Yes |
Conditions determining whether a CSR instruction reads or writes the specified CSR.
Table 1.1 summarizes the behavior of the CSR instructions with respect to whether they read and/or write the CSR.
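The same conditions can be written as a lookup function. This Python sketch uses a hypothetical name and signature: `rd` is the destination register number and `src` is the rs1 register number for the register forms or the uimm value for the immediate forms.

```python
def csr_access(mnemonic: str, rd: int, src: int) -> tuple:
    """Return (reads_csr, writes_csr) for a CSR instruction."""
    m = mnemonic.lower()
    if m in ("csrrw", "csrrwi"):
        # Always writes; the read is skipped only when rd is x0.
        return (rd != 0, True)
    if m in ("csrrs", "csrrc", "csrrsi", "csrrci"):
        # Always reads; the write is skipped when rs1 is x0 or uimm is 0.
        return (True, src != 0)
    raise ValueError(f"not a CSR instruction: {mnemonic}")
```

Note that for the register forms the test is on the register specifier, not the register's contents: CSRRS with a nonzero rs1 specifier still writes even if that register holds zero.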
For any event or consequence that occurs due to a CSR having a particular value, if a write to the CSR gives it that value, the resulting event or consequence is said to be an indirect effect of the write. Indirect effects of a CSR write are not considered by the RISC-V ISA to be side effects of that write.
An example of side effects for CSR accesses would be if reading from a specific CSR causes a light bulb to turn on, while writing an odd value to the same CSR causes the light to turn off. Assume writing an even value has no effect. In this case, both the read and write have side effects controlling whether the bulb is lit, as this condition is not determined solely from the CSR value. (Note that after writing an odd value to the CSR to turn off the light, then reading to turn the light on, writing again the same odd value causes the light to turn off again. Hence, on the last write, it is not a change in the CSR value that turns off the light.)
On the other hand, if a bulb is rigged to light whenever the value of a particular CSR is odd, then turning the light on and off is not considered a side effect of writing to the CSR but merely an indirect effect of such writes.
More concretely, the RISC-V privileged architecture defined in Volume II specifies that certain combinations of CSR values cause a trap to occur. When an explicit write to a CSR creates the conditions that trigger the trap, the trap is not considered a side effect of the write but merely an indirect effect.
Standard CSRs do not have any side effects on reads. Standard CSRs may have side effects on writes. Custom extensions might add CSRs for which accesses have side effects on either reads or writes.
Some CSRs, such as the instructions-retired counter, instret, may be modified as side effects of instruction execution. In these cases, if a CSR access instruction reads a CSR, it reads the value prior to the execution of the instruction. If a CSR access instruction writes such a CSR, the write is done instead of the increment. In particular, a value written to instret by one instruction will be the value read by the following instruction.
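A toy Python model makes the rule concrete. The class and method names are hypothetical, and the model tracks nothing but the counter itself.

```python
class InstretModel:
    """Toy model of the instret counter for a single hart."""

    def __init__(self) -> None:
        self.instret = 0

    def retire_plain(self) -> None:
        """Retire an instruction that does not access instret."""
        self.instret += 1

    def retire_csr_read(self) -> int:
        """Retire a CSR read: it sees the value from before this instruction."""
        value = self.instret
        self.instret += 1
        return value

    def retire_csr_write(self, value: int) -> None:
        """Retire a CSR write: the write replaces the increment."""
        self.instret = value
```

Writing 100 and then reading with the very next instruction returns 100; a further read returns 101.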
The assembler pseudoinstruction to read a CSR, CSRR rd, csr, is encoded as CSRRS rd, csr, x0. The assembler pseudoinstruction to write a CSR, CSRW csr, rs1, is encoded as CSRRW x0, csr, rs1, while CSRWI csr, uimm, is encoded as CSRRWI x0, csr, uimm.
Further assembler pseudoinstructions are defined to set and clear bits in the CSR when the old value is not required: CSRS/CSRC csr, rs1; CSRSI/CSRCI csr, uimm.
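The expansions can be tabulated in a few lines of Python. The function is a hypothetical illustration that operates on operand strings; it assumes the set and clear pseudoinstructions expand to the corresponding base instruction with rd=x0, in the same way as CSRW and CSRWI.

```python
def expand_csr_pseudo(mnemonic: str, a: str, b: str) -> tuple:
    """Expand a two-operand CSR pseudoinstruction to (op, rd, csr, src)."""
    m = mnemonic.lower()
    if m == "csrr":
        # csrr rd, csr  ->  csrrs rd, csr, x0
        return ("csrrs", a, b, "x0")
    bases = {
        "csrw": "csrrw", "csrs": "csrrs", "csrc": "csrrc",
        "csrwi": "csrrwi", "csrsi": "csrrsi", "csrci": "csrrci",
    }
    if m in bases:
        # e.g. csrw csr, rs1  ->  csrrw x0, csr, rs1
        return (bases[m], "x0", a, b)
    raise ValueError(f"unknown CSR pseudoinstruction: {mnemonic}")
```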
Each RISC-V hart normally observes its own CSR accesses, including its implicit CSR accesses, as performed in program order. In particular, unless specified otherwise, a CSR access is performed after the execution of any prior instructions in program order whose behavior modifies or is modified by the CSR state and before the execution of any subsequent instructions in program order whose behavior modifies or is modified by the CSR state. Furthermore, an explicit CSR read returns the CSR state before the execution of the instruction, while an explicit CSR write suppresses and overrides any implicit writes or modifications to the same CSR by the same instruction.
Likewise, any side effects from an explicit CSR access are normally observed to occur synchronously in program order. Unless specified otherwise, the full consequences of any such side effects are observable by the very next instruction, and no consequences may be observed out-of-order by preceding instructions. (Note the distinction made earlier between side effects and indirect effects of CSR writes.)
For the RVWMO memory consistency model (Chapter [ch:memorymodel]), CSR accesses are weakly ordered by default, so other harts or devices may observe CSR accesses in an order different from program order. In addition, CSR accesses are not ordered with respect to explicit memory accesses, unless a CSR access modifies the execution behavior of the instruction that performs the explicit memory access or unless a CSR access and an explicit memory access are ordered by either the syntactic dependencies defined by the memory model or the ordering requirements defined by the Memory-Ordering PMAs section in Volume II of this manual. To enforce ordering in all other cases, software should execute a FENCE instruction between the relevant accesses. For the purposes of the FENCE instruction, CSR read accesses are classified as device input (I), and CSR write accesses are classified as device output (O).
Informally, the CSR space acts as a weakly ordered memory-mapped I/O region, as defined by the Memory-Ordering PMAs section in Volume II of this manual. As a result, the order of CSR accesses with respect to all other accesses is constrained by the same mechanisms that constrain the order of memory-mapped I/O accesses to such a region.
These CSR-ordering constraints are imposed to support ordering main memory and memory-mapped I/O accesses with respect to CSR accesses that are visible to, or affected by, devices or other harts. Examples include the time, cycle, and mcycle CSRs, in addition to CSRs that reflect pending interrupts, like mip and sip. Note that implicit reads of such CSRs (e.g., taking an interrupt because of a change in mip) are also ordered as device input.
Most CSRs (including, e.g., the fcsr) are not visible to other harts; their accesses can be freely reordered in the global memory order with respect to FENCE instructions without violating this specification.
The hardware platform may define that accesses to certain CSRs are strongly ordered, as defined by the Memory-Ordering PMAs section in Volume II of this manual. Accesses to strongly ordered CSRs have stronger ordering constraints with respect to accesses to both weakly ordered CSRs and accesses to memory-mapped I/O regions.
The rules for the reordering of CSR accesses in the global memory order should probably be moved to Chapter [ch:memorymodel] concerning the RVWMO memory consistency model.