@camel-cdr
Created March 4, 2025 21:43
RVA23 vs ARMv9 a Small Experiment

I was curious to see how RISC-V and ARM compare in terms of dynamic instruction count and code density, so I devised a small experiment to compare the ISAs.

As a test codebase, I chose the chibicc C compiler, because it's a medium-sized project and is quite easy to compile. To benchmark chibicc I just used it to compile itself, which should be a fairly realistic workload for simulating a complex, non-regular application. I merged all files into one and made some minor modifications; the code can be found at: https://godbolt.org/z/xr3nEW8Wf

You may notice that I added unoptimized scalar implementations of the mem* and str* functions from musl-libc. This is because I decided to not include SIMD code in this experiment, in an effort to remove more unknown variables and focus on comparing the base ISAs. Without these measures, the results seemed similar.

Without further ado, below is the table comparing the results:

| ISA | Compiler | Static bytes | QEMU bytes | QEMU insns | QEMU uops** | GEM5 sim-time | GEM5 insns | GEM5 uops |
|---|---|---|---|---|---|---|---|---|
| RVA22 | clang-19 | 772K | 1221M | 424M | +0M = 424M | 0.221s | 438M | +0M = 438M |
| RVA22 | gcc-15 | 772K | 1309M | 445M | +0M = 445M | 0.217s | 459M | +0M = 459M |
| RVA23* | clang-19 | 772K | 1185M | 423M | +0M = 423M | 0.243s | 438M | +0M = 438M |
| RVA23* | gcc-15 | 772K | 1265M | 441M | +0M = 441M | 0.217s | 456M | +0M = 456M |
| armv9* | clang-19 | 944K | 1543M | 386M | +39M = 424M | 0.225s | 399M | +70M = 469M |
| armv9* | gcc-15 | 936K | 1688M | 422M | +39M = 460M | 0.236s | 435M | +66M = 501M |

*excluding SIMD/vector instructions

**Derived from the "Apple Silicon CPU Optimization Guide": instructions matching the regex ((ld|st)\w.*#\w*(\]!)*$)|ldp|stp are counted as one extra uop each, so load/store pairs and pre-/post-index loads/stores

As mentioned before, I'm explicitly excluding SIMD from this experiment, so I used the following compiler arguments to achieve this:

  • RVA22: -O3 -march=rv64gcb -static
  • RVA23: -O3 -march=rv64gcb_zcb_zfa_zicond -static
  • armv9: -O3 -march=armv9-a+nosimd+nosve -static

Let's go through the results from left to right.

Firstly, the static sizes of the RISC-V binaries are about 18% smaller than those of the ARM ones. The sizes are extremely similar for both RVA22 and RVA23, regardless of compiler. For the dynamic instruction size, however, that is, the sum of the lengths of all executed instructions, there is a clear improvement going from RVA22 to RVA23. RVA22 needs to fetch 21% fewer bytes than armv9, and RVA23 goes down to 24% fewer bytes than armv9. AFAIK QEMU doesn't come with the ability to count the executed instruction lengths out of the box, so I needed to patch the tcg/plugins/insn.c plugin.
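To make the arithmetic behind these percentages concrete, here is a minimal Python sketch. It is not the patched QEMU plugin; the helper names `dynamic_bytes` and `pct_fewer` are made up for illustration, and the per-instruction lengths in the example trace are invented. Only the table figures come from the measurements above.

```python
def dynamic_bytes(insn_lengths):
    """Sum of the encoded lengths (in bytes) of every executed instruction.

    Hypothetical helper: insn_lengths is an iterable with one entry per
    executed instruction, e.g. 2 for a compressed RISC-V instruction and
    4 for an uncompressed one (AArch64 instructions are always 4 bytes).
    """
    return sum(insn_lengths)


def pct_fewer(a, b):
    """How many percent smaller a is than b."""
    return (1 - a / b) * 100


# Invented toy trace: two compressed and two uncompressed instructions.
print(dynamic_bytes([2, 4, 2, 4]))  # 12 bytes fetched

# Static size, clang-19 builds (figures from the table, in K bytes).
print(f"static:  {pct_fewer(772, 944):.1f}% smaller")

# Dynamic bytes under QEMU, clang-19 builds (figures from the table, in M bytes).
print(f"RVA22:   {pct_fewer(1221, 1543):.1f}% fewer bytes fetched")
print(f"RVA23:   {pct_fewer(1185, 1543):.1f}% fewer bytes fetched")
```

The per-compiler numbers differ slightly from the rounded figures quoted in the text, which summarize both compilers.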

When it comes to dynamic instruction count, ARM ends up the clear winner, with on average 6.5% fewer instructions that need to be decoded. But the instructions themselves aren't what ends up executing in the backend. Due to the more complex addressing modes present in ARM, but not in RISC-V, the CPU needs to crack some of them into multiple micro-operations (uops). The "Apple Silicon CPU Optimization Guide" tells us that Apple Silicon processors primarily crack load/store pairs and pre-/post-index loads/stores into two uops. To take this into account, I modified the tcg/plugins/insn.c plugin again, to allow me to count instructions matching a regex. Adjusted for uops, the field evens out again, and the clang codegen for each ISA ends up with roughly the same number of uops. With GCC, the RISC-V codegen even ends up with fewer uops.
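The uop adjustment can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the actual change to tcg/plugins/insn.c: the function name `count_uops` is hypothetical, and the sample disassembly strings are assumptions about how the instructions are printed. The regex itself is the one given in the table footnote.

```python
import re

# Regex from the table footnote: pre-/post-index loads/stores, plus ldp/stp.
CRACKED = re.compile(r"((ld|st)\w.*#\w*(\]!)*$)|ldp|stp")


def count_uops(disassembly):
    """Return (insns, extra_uops, total_uops) for a list of disassembled
    instructions, charging one extra uop per instruction matching CRACKED.

    Hypothetical helper; assumes one executed instruction per list entry.
    """
    insns = 0
    extra = 0
    for text in disassembly:
        insns += 1
        if CRACKED.search(text):
            extra += 1
    return insns, extra, insns + extra


# Assumed disassembly syntax, for illustration only.
trace = [
    "ldr x0, [x1, #8]",         # plain immediate offset: not cracked
    "ldr x0, [x1], #8",         # post-index: counted as two uops
    "ldr x0, [x1, #8]!",        # pre-index: counted as two uops
    "ldp x0, x1, [sp]",         # load pair: counted as two uops
    "add x0, x0, x1",           # ALU op: one uop
    "stp x29, x30, [sp, #16]",  # store pair: counted as two uops
]
print(count_uops(trace))
```

The "+39M" style entries in the uops columns of the table correspond to the `extra` value here, and the figure after the equals sign to the total.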

GEM5 is a cycle-accurate microarchitectural simulator, which should theoretically allow us to simulate almost the same microarchitecture (O3 CPU model) for different ISAs. The GEM5 simulated time should, however, definitely be taken with a large grain of salt. In practice, it's unlikely that the two GEM5 models reflect what the same design team with the same budget, but different ISA targets, would arrive at. Still, we can observe that in this particular test, the RISC-V binaries mostly ended up executing faster than the armv9 ones; the exception is the RVA23 clang build (0.243s), which was slower than both armv9 builds. Interestingly, on RISC-V the gcc binaries took less time to execute than the clang binaries, even though clang had a smaller dynamic instruction and uop count. Also, for some reason, clang ended up with worse results for RVA23 than for RVA22.

The instruction count differs between QEMU and GEM5; this is presumably an artifact of running with QEMU user-mode emulation. The uop count is also different, and the ARM binaries end up with a larger uop count than the RISC-V ones, which might explain the performance difference. I sadly couldn't figure out which instructions are cracked by the ARM GEM5 O3 performance model.

So in conclusion, in this small experiment, ARM and RISC-V roughly match in the number of uops they needed to feed their backends. ARM needed to decode fewer instructions, but had to crack some of them into two uops. RISC-V, on the other hand, needed to fetch fewer bytes overall, due to its compressed instruction encoding, but needed to decode more total instructions directly into uops, with very little instruction cracking. The difference between RVA22 and RVA23 in this codebase seems to be negligible. This is presumably because the biggest difference between the profiles is the addition of RVV support, which was disabled for this test.
