RVA23 vs ARMv9 a Small Experiment

I was curious to see how RISC-V and ARM compare in terms of dynamic instruction count and code density, so I devised a small experiment to compare the ISAs.

As a test codebase, I chose the chibicc C compiler, because it's a medium-sized project and is quite easy to compile. To benchmark chibicc, I just used it to compile itself, which should be a fairly realistic workload for simulating a complex, non-regular application. I merged all files into one and made some minor modifications; the code can be found at: https://godbolt.org/z/xr3nEW8Wf

You may notice that I added unoptimized scalar implementations of the mem* and str* functions from musl-libc. This is because I decided to not include SIMD code in this experiment, in an effort to remove more unknown variables and focus on comparing the base ISAs. Without these measures, the results seemed similar.

Without further ado, below is the table comparing the results:

| ISA | Compiler | Static bytes | QEMU bytes | QEMU insns | QEMU uops** | GEM5 sim-time | GEM5 insns | GEM5 uops |
|---|---|---|---|---|---|---|---|---|
| RVA22 | clang-19 | 772K | 1221M | 424M | +0M = 424M | 0.221s | 438M | +0M = 438M |
| RVA22 | gcc-15 | 772K | 1309M | 445M | +0M = 445M | 0.217s | 459M | +0M = 459M |
| RVA23* | clang-19 | 772K | 1185M | 423M | +0M = 423M | 0.243s | 438M | +0M = 438M |
| RVA23* | gcc-15 | 772K | 1265M | 441M | +0M = 441M | 0.217s | 456M | +0M = 456M |
| armv9* | clang-19 | 944K | 1543M | 386M | +39M = 424M | 0.225s | 399M | +70M = 469M |
| armv9* | gcc-15 | 936K | 1688M | 422M | +39M = 460M | 0.236s | 435M | +66M = 501M |

*excluding SIMD/vector instructions

**Derived from the "Apple Silicon CPU Optimization Guide": instructions matching the regex ((ld|st)\w.*#\w*(\]!)*$)|ldp|stp, that is, load/store pairs and pre-/post-index loads/stores, are counted as two uops.

As mentioned before, I'm explicitly excluding SIMD from this experiment, so I used the following compiler arguments to achieve this:

RVA22: -O3 -march=rv64gcb -static
RVA23: -O3 -march=rv64gcb_zcb_zfa_zicond -static
armv9: -O3 -march=armv9-a+nosimd+nosve -static

Let's go through the results from left to right.

Firstly, the static sizes of the RISC-V binaries are about 18% smaller than those of the ARM ones. The sizes are extremely similar for both RVA22 and RVA23, regardless of compiler. For the dynamic instruction size, however, that is, the sum of the lengths of all executed instructions, there is a clear improvement going from RVA22 to RVA23: RVA22 needs to fetch 21% fewer bytes than armv9, and RVA23 improves this to 24% fewer bytes than armv9. AFAIK, QEMU can't count executed-instruction lengths out of the box, so I needed to patch the tcg/plugins/insn.c plugin.
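The percentages above can be rechecked from the table with a few lines of Python. This is only a sketch: the numbers are the rounded values copied from the table, the helper names (`dynamic_bytes`, `percent_fewer`, `avg_insn_length`) are made up for illustration, and averaging over both compilers is an assumption about how the figures were derived. Because the inputs are rounded, the results land within about a point of the quoted figures rather than matching them exactly.

```python
# Rounded values copied from the results table; everything else is illustrative.
STATIC, DYN_BYTES, INSNS = 0, 1, 2
TABLE = {
    # (isa, compiler): (static bytes in K, QEMU bytes in M, QEMU insns in M)
    ("RVA22", "clang-19"): (772, 1221, 424),
    ("RVA22", "gcc-15"): (772, 1309, 445),
    ("RVA23", "clang-19"): (772, 1185, 423),
    ("RVA23", "gcc-15"): (772, 1265, 441),
    ("armv9", "clang-19"): (944, 1543, 386),
    ("armv9", "gcc-15"): (936, 1688, 422),
}


def dynamic_bytes(executed):
    """Sum of executed-instruction lengths.

    `executed` is an iterable of (length in bytes, times executed) pairs.
    """
    return sum(length * count for length, count in executed)


def percent_fewer(isa, column, baseline="armv9"):
    """Average over both compilers of how much smaller `isa` is than `baseline`."""
    savings = []
    for compiler in ("clang-19", "gcc-15"):
        ours = TABLE[(isa, compiler)][column]
        theirs = TABLE[(baseline, compiler)][column]
        savings.append(100 * (1 - ours / theirs))
    return sum(savings) / len(savings)


def avg_insn_length(isa, compiler):
    """Average executed instruction length in bytes (dynamic bytes / insns)."""
    row = TABLE[(isa, compiler)]
    return row[DYN_BYTES] / row[INSNS]


if __name__ == "__main__":
    for isa in ("RVA22", "RVA23"):
        print(isa,
              "static: %.1f%% smaller," % percent_fewer(isa, STATIC),
              "dynamic: %.1f%% fewer bytes" % percent_fewer(isa, DYN_BYTES))
    for isa, compiler in TABLE:
        print(isa, compiler, "%.2f bytes/insn" % avg_insn_length(isa, compiler))
```

As a sanity check, the armv9 rows come out at almost exactly 4 bytes per executed instruction, as expected for a fixed-width encoding, while the RISC-V rows end up below 3 bytes thanks to the compressed instructions.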

When it comes to dynamic instruction count, ARM ends up the clear winner, with on average 6.5% fewer instructions that need to be decoded. But the instructions themselves aren't what ends up executing in the backend. Due to the more complex addressing modes present in ARM, but not in RISC-V, the CPU needs to crack some instructions into multiple micro-operations (uops). The "Apple Silicon CPU Optimization Guide" tells us that Apple Silicon processors primarily crack load/store pairs and pre-/post-index loads/stores into two uops. To take this into account, I modified the tcg/plugins/insn.c plugin again, to allow me to count instructions matching a regex. Adjusted for uops, the field evens out again, and the clang codegen for each ISA ends up with roughly the same number of uops. With GCC, the RISC-V codegen even ends up with fewer uops.
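To illustrate the uop adjustment, here is a small Python sketch that applies the regex from the table footnote to a list of disassembled instructions. It is not the patched QEMU plugin; the function name `estimate_uops` and the exact disassembly text format are assumptions made for the example, and every matching instruction is simply counted as one extra uop.

```python
import re

# Regex quoted in the table footnote: pre-/post-index loads/stores and
# load/store pairs, which are assumed here to crack into two uops.
CRACKED = re.compile(r"((ld|st)\w.*#\w*(\]!)*$)|ldp|stp")


def estimate_uops(disassembly):
    """Return (insns, extra_uops, total_uops) for an iterable of
    disassembled instruction strings, one per executed instruction."""
    insns = 0
    extra = 0
    for text in disassembly:
        insns += 1
        if CRACKED.search(text):
            extra += 1  # cracked instruction: one additional uop
    return insns, extra, insns + extra


if __name__ == "__main__":
    # Hypothetical AArch64 disassembly lines, for illustration only.
    sample = [
        "ldr x0, [x1, #8]",         # plain immediate offset: not cracked
        "ldr x0, [x1, #8]!",        # pre-index: cracked
        "ldr x0, [x1], #8",         # post-index: cracked
        "ldp x29, x30, [sp], #16",  # load pair: cracked
        "stp x19, x20, [sp, #16]",  # store pair: cracked
        "add x0, x0, #1",           # ALU op: not cracked
    ]
    print(estimate_uops(sample))
```

Note that the pattern relies on `re.search` semantics (match anywhere in the line) and on the immediate being the last operand, so it depends on how the disassembler prints each instruction.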

GEM5 is a cycle-accurate microarchitectural simulator, which should theoretically allow us to simulate almost the same microarchitecture (O3 CPU model) for different ISAs. The GEM5 simulated time should, however, definitely be taken with a large grain of salt. In practice, it's unlikely that the two GEM5 models reflect what the same design team with the same budget, but different ISA targets, would arrive at. Still, we can observe that in this particular test, the RISC-V binaries mostly ended up executing faster than the armv9 ones, the exception being the RVA23 clang build (0.243s). Interestingly, the RISC-V gcc binaries took less time to execute than the RISC-V clang binaries, even though clang had a smaller dynamic instruction and uop count; on armv9, the clang binary was the faster one. Also, for some reason, clang ended up with worse results for RVA23 than for RVA22.

The instruction counts differ between QEMU and GEM5; this is presumably an artifact of running with QEMU user-mode emulation. The uop counts also differ, and the ARM binaries end up with a larger uop count than the RISC-V ones, which might explain the performance difference. Sadly, I couldn't figure out which instructions are cracked by the ARM GEM5 O3 performance model.

So in conclusion, in this small experiment, ARM and RISC-V roughly match in the number of uops they needed to feed their backends. ARM needed to decode fewer instructions, but had to crack some of them into two uops. RISC-V, on the other hand, needed to fetch fewer bytes overall, due to its compressed instruction encoding, but needed to decode more total instructions directly into uops, with very little instruction cracking. The difference between RVA22 and RVA23 in this codebase seems to be negligible. This is presumably because the biggest difference between the profiles is the addition of RVV support, which was disabled for this test.

camel-cdr/rv-vs-arm-chibicc.md