This is a comparison between an AVR softcore and the picorv32 when used in
a real application. The numbers are for the entire application, not for the cores in isolation.
Only minimal changes were made to the HDL and firmware code to adapt to the new core (the
peripherals were updated to handle a 32-bit bus, and the interrupt handling was adapted to
picorv32; otherwise the code is identical), and the same optimization flags (`-Os -mrelax`)
were used with both cores.
The following were the relevant design constraints:
- Synthesizable on an iCE40 HX8K
- fCPU = 17 MHz
- CPU memory size: 12 K (8K program + 4K data on AVR)
Numbers in parentheses are ratios relative to the reference (AVR) value. Lower is better for all metrics except fCPU_MAX.
Metric | AVR | picorv32 RV32I | picorv32 RV32IC | picorv32 RV32E | picorv32 RV32EC |
---|---|---|---|---|---|
LUTs | 5508 / 7680 | 3531 / 7680 (0.64) | 3789 / 7680 (0.69) | 3438 / 7680 (0.62) | 3816 / 7680 (0.69) |
BRAMs | 28 / 32 | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) |
FW size | 7594 + 3092 = 10686 | **12412 (1.16)** | 9828 (0.92) | 12096 (1.13) | 9648 (0.90) |
fCPU_MAX | 24.52 MHz | 24.11 MHz (0.98) | 19.24 MHz (0.78) | 23.14 MHz (0.94) | 18.40 MHz (0.75) |
CPI | 1.5 | 4 (2.66) | 4.5 (3.00) | 4 (2.66) | 4 (2.66) |
Runtime | 3437690 µs | 4848353 µs (1.40) | 4985692 µs (1.45) | 4808335 µs (1.40) | 4930170 µs (1.43) |
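
The parenthesized ratios can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the AVR and picorv32 RV32I columns copied from the table; the dictionary and function names are made up for illustration. The table appears to truncate or round to two decimals, so the last digit may differ slightly.

```python
# Values copied from the AVR and picorv32 RV32I columns of the table.
AVR = {
    "LUTs": 5508,
    "FW size": 7594 + 3092,   # program + data, in bytes
    "fCPU_MAX": 24.52,        # MHz
    "Runtime": 3437690,       # microseconds
}
PICORV32_RV32I = {
    "LUTs": 3531,
    "FW size": 12412,
    "fCPU_MAX": 24.11,
    "Runtime": 4848353,
}


def ratio_to_reference(value, reference):
    """Return value / reference, i.e. the number shown in parentheses."""
    return value / reference


ratios = {
    metric: ratio_to_reference(PICORV32_RV32I[metric], AVR[metric])
    for metric in AVR
}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.3f}")
```

A ratio below 1 means the picorv32 build is better than the AVR reference for every metric except fCPU_MAX, where higher is better.
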
Conclusions:
- picorv32 uses far fewer LUTs than the AVR core. On the downside, it uses more BRAMs, which are a scarce resource on the HX8K.
- Code density of RV32 is worse than AVR unless the "C" extension is used, in which case it is better.
- The RV32I build failed to meet the memory size design constraint.
- In order to reclaim the performance lost due to the higher CPI, it would probably be necessary to change the design to double the fCPU to 34 MHz. However, this is higher than the reported fCPU_MAX.
- It's rather surprising that the fCPU_MAX reported by `nextpnr` is lower for picorv32 than for the AVR core, considering that the picorv32 claims to be designed for high frequencies. Maybe using the look-ahead memory interface can help increase the max clock?
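
Two of the conclusions above are simple threshold checks against the table. The sketch below assumes that "12 K" means 12 × 1024 = 12288 bytes (an assumption, though it is consistent with RV32I failing and RV32E fitting) and uses hypothetical names throughout.

```python
# Assumption: the "12 K" constraint means 12 KiB of combined program + data.
MEMORY_LIMIT_BYTES = 12 * 1024

# Target clock if fCPU is doubled from 17 MHz.
DOUBLED_FCPU_MHZ = 34.0

# Firmware sizes (bytes) and fCPU_MAX (MHz) copied from the picorv32 columns.
FW_SIZE_BYTES = {
    "RV32I": 12412,
    "RV32IC": 9828,
    "RV32E": 12096,
    "RV32EC": 9648,
}
FCPU_MAX_MHZ = {
    "RV32I": 24.11,
    "RV32IC": 19.24,
    "RV32E": 23.14,
    "RV32EC": 18.40,
}

# Builds whose firmware does not fit in the CPU memory.
over_budget = [
    name for name, size in FW_SIZE_BYTES.items() if size > MEMORY_LIMIT_BYTES
]

# Builds for which the doubled clock exceeds the reported fCPU_MAX.
above_fmax = [
    name for name, fmax in FCPU_MAX_MHZ.items() if DOUBLED_FCPU_MHZ > fmax
]

print("Over memory budget:", over_budget)
print("34 MHz exceeds fCPU_MAX for:", above_fmax)
```
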
Part 2:
Next, the VexRiscv core was tested. This was more complicated to build because the Verilog code is autogenerated by a Scala program. In return, it supports many options to balance performance against core size. It turns out that by enabling all "bypass" and "early" options, a core with a size and CPI similar to those of the AVR can be obtained, while allowing for a higher frequency. A full barrel shifter and an iterative mul/div plugin with unroll factor 2 were also included in the cores tested below. The runtime figure is for an fCPU of 34 MHz.
Metric | AVR | VexRiscv RV32IM | VexRiscv RV32IMC | VexRiscv RV32EM | VexRiscv RV32EMC |
---|---|---|---|---|---|
LUTs | 5508 / 7680 | 4796 / 7680 (0.87) | 4982 / 7680 (0.90) | 4805 / 7680 (0.87) | 5004 / 7680 (0.91) |
BRAMs | 28 / 32 | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) | 32 / 32 (1.14) |
FW size | 7594 + 3092 = 10686 | 12056 (1.12) | 9604 (0.90) | 11780 (1.10) | 9444 (0.88) |
fCPU_MAX | 24.52 MHz | 40.78 MHz (1.66) | 41.70 MHz (1.70) | 40.69 MHz (1.66) | 39.48 MHz (1.61) |
CPI | 1.5 | 1.57 (1.04) | 1.57 (1.04) | 1.57 (1.04) | 1.57 (1.04) |
Runtime | 3437690 µs | 3159388 µs (0.92) | 3169157 µs (0.92) | 3154022 µs (0.92) | 3165122 µs (0.92) |
Conclusions:
- By doubling fCPU from 17 MHz to 34 MHz, a net performance gain was achieved compared to AVR. FW size (when using RVC) and FPGA resource utilization (apart from BRAMs) are also slightly better.
- Multiplication of 8-bit numbers is slower than on AVR, since the `MulDivIterativePlugin` takes 16 cycles at unroll factor 2 (corresponding to 8 AVR cycles, due to the higher fCPU), while the AVR has single-cycle multiplication. Increasing the unroll factor above 2 would have a negative effect on fCPU_MAX.
- The VexRiscv core supports the standard RISC-V interrupt model, which means that GCC's `__attribute__((interrupt))` can be used, resulting in a smaller code footprint than on picorv32.
- A higher fCPU_MAX could be achieved with a CPU memory whose size is a power of 2. However, the application does not fit in 8K, and there is not enough BRAM in the HX8K to make it 16K (since a few BRAMs are needed for other purposes).
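
The "16 cycles corresponds to 8 AVR cycles" figure in the multiplication bullet is a clock-domain conversion. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the single-cycle AVR multiply is taken from the text above rather than measured here.

```python
VEXRISCV_FCPU_MHZ = 34.0   # clock used for the VexRiscv runtime figures
AVR_FCPU_MHZ = 17.0        # clock used for the AVR reference

VEXRISCV_MUL_CYCLES = 16   # MulDivIterativePlugin at unroll factor 2 (from the text)
AVR_MUL_CYCLES = 1         # single-cycle multiply, as stated in the text


def cycles_to_us(cycles, f_mhz):
    """Wall-clock time in microseconds for a cycle count at f_mhz."""
    return cycles / f_mhz


def equivalent_cycles(cycles, f_from_mhz, f_to_mhz):
    """Cycles at f_to_mhz that take as long as `cycles` at f_from_mhz."""
    return cycles * f_to_mhz / f_from_mhz


mul_in_avr_cycles = equivalent_cycles(
    VEXRISCV_MUL_CYCLES, VEXRISCV_FCPU_MHZ, AVR_FCPU_MHZ
)
mul_slowdown = mul_in_avr_cycles / AVR_MUL_CYCLES

print(f"VexRiscv multiply: {cycles_to_us(VEXRISCV_MUL_CYCLES, VEXRISCV_FCPU_MHZ):.3f} us")
print(f"Equivalent AVR cycles: {mul_in_avr_cycles}")
print(f"Slowdown vs. AVR multiply: {mul_slowdown}x")
```
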
Update:
With the CPU memory using the look-ahead interface, fCPU_MAX doubled, which allowed me to reach fCPU = 34 MHz (2x).
The result is still slower than the AVR though. Ideally I'd like to run the CPU core at the same speed as the rest of the design, which is 68 MHz...