rvv shishua
.global shishua_rvv # void shishua_rvv (uint64_t state[4], void *dest, size_t n)
shishua_rvv:
    # load state (can easily be expanded to state[8] or state[16])
    vsetvli t6, x0, e64, m2, ta, ma
    ld a4, 0(a0)
    vmv.v.x v0, a4
    ld a4, 8(a0)
    vmv.v.x v4, a4
    ld a4, 16(a0)
    vmv.v.x v8, a4
    ld a4, 24(a0)
    vmv.v.x v12, a4

    li t0, 0xbf58476d1ce4e5b9 # splitmix mul 1
    li t1, 0x94d049bb133111eb # splitmix mul 2

.macro rvv_rand_splitmix64 v, t
    # mix in vid
    vsetvli t6, x0, e8, m8, ta, ma
    vid.v \t
    vsetvli t6, x0, e16, m8, ta, ma
    vadd.vv \v, \v, \t
    vmul.vx \v, \v, t0
    vid.v \t
    vsetvli t6, x0, e64, m8, ta, ma
    vadd.vv \v, \v, \t
    vmul.vx \v, \v, t1
    # warmup with splitmix64
    vsrl.vi \t, \v, 30
    vxor.vv \v, \v, \t
    vmul.vx \v, \v, t0
    vsrl.vi \t, \v, 27
    vxor.vv \v, \v, \t
    vmul.vx \v, \v, t1
    vsrl.vi \t, \v, 31
    vxor.vv \v, \v, \t
.endm

    rvv_rand_splitmix64 v0, v16
    rvv_rand_splitmix64 v8, v16

    # almost shishua
1:
    # shuffle
    vsetvli x0, a2, e32, m8, ta, ma
    vmv.x.s a4, v0
    vslide1down.vx v16, v0, a4
    vmv.x.s a4, v8
    vslide1down.vx v24, v8, a4
    # shift
    vsetvli x0, x0, e64, m8, ta, ma
    vsrl.vi v0, v0, 1
    vsrl.vi v8, v8, 3
    # add
    vadd.vv v8, v8, v24
    vxor.vv v24, v24, v0
    vadd.vv v0, v0, v16
    # store
    vsetvli a3, x0, e8, m8, ta, ma
    vse8.v v24, (a1)
    add a1, a1, a3
    sub a2, a2, a3
    bnez a2, 1b

    # reduce back to uint64_t state[4]
    vsetvli t6, x0, e64, m4, ta, ma
    vredsum.vs v20, v0, v0
    vredsum.vs v21, v4, v4
    vredxor.vs v22, v8, v8
    vredxor.vs v23, v12, v12
    vmv.x.s a4, v20
    sd a4, 0(a0)
    vmv.x.s a4, v21
    sd a4, 8(a0)
    vmv.x.s a4, v22
    sd a4, 16(a0)
    vmv.x.s a4, v23
    sd a4, 24(a0)
    ret
This is a proof of concept implementation of the shishua PRNG for RVV.
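For reference, a minimal C caller could look like the sketch below. The prototype is taken from the comment at the top of the assembly; the seed words and buffer size are arbitrary placeholders, and you'd link this against the assembled file above. Writing the output to stdout makes it easy to pipe into a statistical test suite such as PractRand.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* prototype from the comment at the top of the assembly */
    void shishua_rvv(uint64_t state[4], void *dest, size_t n);

    int main(void)
    {
        uint64_t state[4] = {1, 2, 3, 4};   /* arbitrary placeholder seed */
        static uint8_t buf[1 << 16];        /* 64 KiB per call */

        for (;;) {
            shishua_rvv(state, buf, sizeof buf);
            fwrite(buf, 1, sizeof buf, stdout);  /* e.g. pipe into PractRand */
        }
    }

Note that the store loop in the assembly advances in steps of the full vector register group, so n is presumably expected to be a multiple of that size; a power-of-two buffer like the 64 KiB one above satisfies this for any realistic VLEN.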
Overview
The difficult part here is that we don't have a fixed vector length, but still want to know the size of the state at compile time.
My solution to this is to use smaller LMULs to broadcast the state to the entire vector. This obviously doesn't work on its own, because many elements end up with the same value. To fix this, vid is mixed into the initial vector state, which assigns each vector element its index (0, 1, 2, 3, ...). That alone doesn't give us much, so the result is hashed with the splitmix64 finalizer to disperse the state as much as possible before entering the main loop.

The main loop is slightly different from the reference shishua implementation, as we use a rotate on 32 bit elements instead of a full shuffle; a rough scalar model of one iteration is sketched below.
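The following sketch is derived by reading the assembly above, not taken from any reference code; the lane count N and the little-endian overlap of 32 and 64 bit lanes are assumptions made purely for illustration.

    #include <stdint.h>
    #include <string.h>

    #define N 4  /* hypothetical number of 64-bit lanes per state half */

    /* Rotate the whole vector down by one 32-bit lane, i.e. what
     * vmv.x.s + vslide1down.vx do at e32 in the main loop. */
    static void rot32(uint64_t dst[N], const uint64_t src[N])
    {
        uint32_t lanes[2 * N];
        memcpy(lanes, src, sizeof lanes);
        uint32_t first = lanes[0];
        memmove(lanes, lanes + 1, (2 * N - 1) * sizeof lanes[0]);
        lanes[2 * N - 1] = first;
        memcpy(dst, lanes, sizeof lanes);
    }

    /* One iteration: a and b model the two state halves (v0 and v8),
     * out is the block that gets stored to dest (v24). */
    static void step(uint64_t a[N], uint64_t b[N], uint64_t out[N])
    {
        uint64_t ra[N], rb[N];
        rot32(ra, a);               /* v16 = rotate32(v0) */
        rot32(rb, b);               /* v24 = rotate32(v8) */
        for (int i = 0; i < N; i++) {
            a[i] >>= 1;             /* vsrl.vi v0, v0, 1  (64-bit lanes) */
            b[i] >>= 3;             /* vsrl.vi v8, v8, 3 */
            out[i] = rb[i] ^ a[i];  /* vxor.vv v24, v24, v0: output block */
            b[i] += rb[i];          /* vadd.vv v8, v8, v24 (pre-xor value) */
            a[i] += ra[i];          /* vadd.vv v0, v0, v16 */
        }
    }

The rotate stands in for shishua's shuffle step, while the shifts and adds run on 64 bit elements of the same register group.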
PractRand
Even when reducing the state from LMUL=8 to LMUL=1 (by 8 times), and using 16 bit shuffle and 32 bit add/shift, it managed to pass the first 64 GB of PractRand.
I wasn't able to execute this fast enough on my RISC-V board, and qemu is also quite slow, so I could only measure up to 64 GB.
Edit: I was able to let it run a bit further, and after 128 GB PractRand got suspicious, so I assume it will fail after a few more GB; but this applies to the massively scaled-down version. I also let a full scale version run up to 128 GB, and it didn't even get an unusual evaluation. I think it's fair to assume that the full scale version is pretty safe to use for non-cryptographic purposes, especially given that it is so close to the original shishua implementation.
Edit2: I was able to run a few more tests: LMUL=2 with 32 bit addition passes to at least 256 GB without any anomalies, and LMUL=1 with 64 bit addition passes to at least 64 GB without any anomalies. (I wasn't able to run the tests longer.)
Benchmark
My MangoPi MQ Pro, with a 1 GHz in-order C906 core, which has a 128 bit vector length but a 256 bit ALU, managed to generate data at 1.1 GiB/s using 16 bit shuffle and 32 bit add/shift (because it doesn't support 64 bit elements). The fastest other PRNG I could find for the platform was xoshiro256+, which got 0.5 GiB/s. I suspect that this implementation will be way faster on future CPUs, because the C906 seems to have a big overhead for vector instructions. (E.g. vadd, vmadd and vmulh take the same amount of time, even though vadd could be implemented way faster.)
Edit: I was now able to run this on a C920 1.868 GHz out-of-order core, and I got 5 GiB/s while xoshiro256+ got 2.4 GiB/s.
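For context, throughput numbers of this kind can be measured with a small harness along the following lines. This is not the benchmark used for the figures above; the buffer size, iteration count, and clock source are arbitrary choices.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <time.h>

    void shishua_rvv(uint64_t state[4], void *dest, size_t n);

    int main(void)
    {
        uint64_t state[4] = {1, 2, 3, 4};  /* arbitrary placeholder seed */
        static uint8_t buf[1 << 20];       /* 1 MiB buffer, refilled repeatedly */
        const size_t iters = 4096;         /* 4 GiB of output in total */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            shishua_rvv(state, buf, sizeof buf);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gib  = (double)iters * sizeof buf / (1024.0 * 1024.0 * 1024.0);
        printf("%.2f GiB in %.2f s = %.2f GiB/s\n", gib, secs, gib / secs);
        return 0;
    }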