Note: To verify my RVI membership and identity on this otherwise semi-anonymous account: I'm Olaf Bernstein; you should be able to view my sig-vector profile, if you are a member of the vector SIG.
The goal of this document is to explore gaps in the current RISC-V Vector extensions (standard V, Zvbb, Zvbc, Zvkg, Zvkn, Zvks), and suggest instructions to fill these gaps. My focus lies on application class processors, with the expectation that suggested instructions would be suitable to become mandatory or optional instructions in future profiles.
I'll assume you are already familiar with RVV, if not, here is a great introduction and here the latest RISC-V ISA manual.
To determine which applications aren't currently suitably covered by the RISC-V Vector extensions (RVV) I collected usage examples of the SIMD instructions present in AVX512 and SVE, that don't have an RVV equivalent. You can find my notes at the bottom of this article. I'll use it as a source of truth when arguing for or against the inclusion of certain instructions.
Please be aware that I'm not a hardware designer, so please take anything I say with a grain of salt and feel free to correct my mistakes/wrong assumptions.
- vmslide1up.m, vmslide1down.m: slide mask up/down by one
- vtrn1.vv, vtrn2.vv: 2x2 transpose instructions
- vwzip.vv: zip/interleave two vectors
- unzip.vv, or vnsrl defined for e64
- vbcompress, vbexpand: element-wise bit-compress/expand
- vbmatxor: 8x8 bit-matrix multiply, x86 calls it vgf2p8affineqb
- vscansum.v vd, vs1 / vscanmaxu.v vd, vs1: +-scan / max-scan
- vscansum.v vd, vs1, vm / vscanmaxu.v vd, vs1, vm: segmented +-scan / unsigned-max-scan
- viotar.m: segmented viota ("r" for resetting)
- vmsxff.m: xor-scan on mask bits ("set-xor-from-first mask bit")
I believe the following instructions should be added to RVV, as they can solve problems that currently don't have a good solution.
Sliding mask registers is a very common operation that is currently not well addressed by RVV, because dedicated instructions are missing.
In many cases, it's possible to use element slides instead of sliding just the mask register. This is however a fundamentally more complex operation and performs a lot worse in practice, especially when the relevant vectors use LMUL>1. It also often requires you to duplicate comparisons or other operations done before the element slide.
Mask slides can be emulated directly, but this is expensive to do portably:
# slide up mask in v0 by 1-bit
vsetvli t1, zero, e8, MF8, ta, ma # MF8 needs to be MX/8
vslide1up.vx v1, v0, a0 # slide in next
vsrl.vi v1, v1, 7 # v1 >>= 7, move each carried-in bit to bit 0
vadd.vv v2, v0, v0 # v2 = v0 << 1
vor.vv v0, v1, v2 # v0 = (v0 << 1) | carried-in bits
vsetvli zero, t0, e8, MX, ta, ma
When measured, these 6 instructions still perform better than sliding the elements and repeating the comparison: SpacemiT X60, XuanTie C908, Saturn
If you write vector length specific (VLS) code and know you are working with 64 elements or fewer, then you only need a single vsll.vi and a vtype change.
The current situation is a mess with no good solutions, especially for vector length agnostic (VLA) code.
Most use cases of mask slides could be addressed using just two one-source, one-destination instructions: vmslide1up.m and vmslide1down.m, where vmslide1up.m slides the mask register bits up by one and vmslide1down.m slides them down by one.
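To make the intended semantics concrete, here is a minimal scalar sketch of the two instructions, treating the mask register as a packed bit vector of vl bits. The function names and the choice of what gets shifted in are my own assumptions, not part of any spec.

```c
#include <stdint.h>

// Hypothetical reference semantics, with the mask held in a packed 64-bit word
// (real mask registers hold VLEN bits; this only pins down the bit movement).
// vmslide1up.m: bit i of the result is bit i-1 of the source, bit 0 becomes 0.
static uint64_t vmslide1up_m(uint64_t mask, unsigned vl) {
    uint64_t vl_mask = (vl >= 64) ? ~0ull : ((1ull << vl) - 1);
    return (mask << 1) & vl_mask;
}

// vmslide1down.m: bit i of the result is bit i+1 of the source; bit vl-1 is
// left as zero here (it could also come from a scalar, like vslide1down.vx).
static uint64_t vmslide1down_m(uint64_t mask, unsigned vl) {
    uint64_t vl_mask = (vl >= 64) ? ~0ull : ((1ull << vl) - 1);
    return (mask & vl_mask) >> 1;
}
```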
I don't think a GPR-based slide variant is particularly needed, and slides further than one or two bits are also quite rare. A dedicated slide-by-2-bit or 3-bit immediate instruction would, however, have the added benefit of allowing you to slide an entire LMUL=1 vector register by an immediate, which is sometimes useful for slides smaller than 8 bits. If this doesn't add much additional complexity, a full-register slide-bits-by-small-immediate instruction would also be an option.
RVV already has very powerful permutation instructions in vrgather and vcompress; however, some permutations are so common that dedicated instructions are required, as they can have lower latency and higher throughput.
The need for dedicated instructions for these very common permutations has already been discussed at length by RISE in the context of video codecs, see: "RISCV64 new vector instructions requirements for video/multimedia" and "Vector Operations Proposals - Draft Document".
Here I'd just like to highlight a few additional aspects:
- The implementation complexity of a vtrn1/vtrn2 is similar to bitwise operations, so huge performance gains are possible.
- vwzip can already be implemented with vwmaccu.vx(vwaddu.vv(a, b), -1, b), as suggested by Craig Topper (see the sketch after this list). These are two EMUL=2 instructions vs two EMUL=1 instructions if we had a dedicated instruction, which would give us performance parity with existing ISAs.
- I prefer a vwzip over the vzip2a/vzip2b proposed by RISE, because it works better with vl<VLMAX and fractional LMUL. Using vzip2a/vzip2b would require an additional vslideup when vl<VLMAX or for fractional LMUL, unless you want the result to be split up between two vector registers.
- unzip1/unzip2 is vnsrl.wi/wx, which already works for everything but 64-bit elements, which currently require a vcompress or vrgather to unzip. Instead of a dedicated unzip instruction, it may be preferable to define vnsrl at SEW=64 (EEW=128), which would result in a more versatile instruction.
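To see why the vwmaccu sequence above works (my own sketch, not from the linked discussion): for w-bit inputs a and b, vwaddu.vv produces a+b in 2w bits, and vwmaccu.vx with the scalar -1 (i.e. 2^w-1) then adds (2^w-1)·b, giving a + b·2^w, which is exactly the pair b:a interleaved. A scalar check for w=8:

```c
#include <stdint.h>
#include <assert.h>

// Scalar model of the vwzip emulation for one pair of 8-bit elements:
// acc = a + b (vwaddu.vv), then acc += 0xFF * b (vwmaccu.vx with x = -1),
// so acc = a + 256*b, i.e. the two bytes zipped into one 16-bit element.
static uint16_t wzip_via_wmaccu(uint8_t a, uint8_t b) {
    uint16_t acc = (uint16_t)a + (uint16_t)b; // vwaddu.vv
    acc += (uint16_t)0xFF * (uint16_t)b;      // vwmaccu.vx, scalar -1
    return acc;
}

int main(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(wzip_via_wmaccu((uint8_t)a, (uint8_t)b) ==
                   (uint16_t)(a | (b << 8)));
    return 0;
}
```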
With the Zvbb ("Vector Basic Bit-manipulation") extension, RVV gained a set of basic bit-manipulation instructions, as the name suggests. This notable includes byte reverse, bit-reverse, count leading/trailing zeros, popcount and bit-rotates.
RVV is however still missing some advanced bit permutation instructions.
Some of the most powerful bit manipulation instructions present in today's ISAs are the bit-compress and bit-expand instructions.
- bit-compress takes two n-bit arguments and moves the bit in the first argument corresponding to the nth set bit in the second argument to the nth position of the destination. (bcompress(0bHGFEDCBA,0b11110010)=0bHGFEB)
- bit-expand takes two n-bit arguments and moves the nth bit in the first argument to the nth set bit in the second one. (bdecompress(0b000HGFEB,0b11110010)=0bHGFE00B0)
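For reference, here is a minimal scalar model of the two operations (my own sketch, matching the bcompress/bdecompress examples above):

```c
#include <stdint.h>

// bit-compress: gather the bits of x selected by mask m and pack them
// contiguously at the bottom of the result (x86 pext, SVE bext).
static uint64_t ref_bcompress(uint64_t x, uint64_t m) {
    uint64_t res = 0;
    for (int i = 0, j = 0; i < 64; i++)
        if ((m >> i) & 1)
            res |= ((x >> i) & 1) << j++;
    return res;
}

// bit-expand: scatter the low bits of x to the positions of the set bits
// of mask m (x86 pdep, SVE bdep).
static uint64_t ref_bexpand(uint64_t x, uint64_t m) {
    uint64_t res = 0;
    for (int i = 0, j = 0; i < 64; i++)
        if ((m >> i) & 1)
            res |= ((x >> j++) & 1) << i;
    return res;
}
```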
Implementing bit-compress/expand with existing RVV instructions is really expensive.
You either have to use a large series of shifts and bit operations to shift the bits into place, or convert each bit into a byte and do the vcompress.vm
/vrgather
on 8-bit elements with an 8 times larger LMUL.
x86 implements these instructions as pext
/pdep
under BMI2, but they only operate on GPRs, while ARM SVE only defines the bext
/bdep
instructions which operate on elements of vector registers.
Among others, they can be used for: bit interleaving/packing; base64 de/encoding; LEB128 de/encoding, which is used in DWARF, Dalvik, protobuf and WebAssembly; regex matching; genomics; various board games; Space-filling curves; Quantum circuit simulation; ...
Here is an example to illustrate how these instructions could be used to de- and encode base64:
base64 decode: | base64 encode:
vbrev8 [00dddddd|00cccccc|00bbbbbb|00aaaaaa] | vrgather [........|ccdddddd|bbbbcccc|aaaaaabb]
vbcompress [00aaaaaa|00bbbbbb|00cccccc|00dddddd] | vbexpand [........|aaaaaabb|bbbbcccc|ccdddddd]
vrgather [........|aaaaaabb|bbbbcccc|ccdddddd] | vbrev8 [00aaaaaa|00bbbbbb|00cccccc|00dddddd]
[........|ccdddddd|bbbbcccc|aaaaaabb] | [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
Bit compress and expand were the two instructions with the most real-world usage of the advanced SIMD instructions surveyed. There is just one problem, historically some implementations really struggled to implement these instructions well:
| | all intel, haswell to alderlake | Zen1/Zen2 | Zen3/Zen4 | Zen5 | Cortex A510/A520 | Neoverse N2/V2/V3, Cortex A710/X2/X3/X4 | Cortex A715/A720 |
|---|---|---|---|---|---|---|---|
| Latency | 3 | 8 μops per set bit | 3 | 3 | 14-70 | 6 | 4 |
| RThroughput | 1 | 8 μops per set bit | 1 | 0.33 | 14-70 | 2 | 2 |
The first two generations of Zen used a microcoded implementation with 8 μops per set bit, and the Cortex-A510/A520 implementations are similarly bad. On the other hand, all Intel processors since Haswell managed to implement it with a three-cycle latency and a throughput of 1 instruction per cycle. The other ARM cores also have solid implementations, with a 4-6 cycle latency and a throughput of one instruction every two cycles.
Fortunately, we have multiple papers that give more details on the implementation challenges:
- "Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions" suggests that:
The hardest part of the implementation of the pdep instruction is the translation of the n-bit mask into the control bits for each stage of the butterfly network. Additionally, they present a comparison between an implementation with and without hardware decode. The one with hardware decode used about 3x more gates, 7.6K vs 22.1K NAND gates (Table 2).
- "Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications" from the same authors revisits the problem and has slightly different numbers. Their hardware decode implementation takes about 2.5x more gates, 6.6K vs 16.4K NAND gates (Table 3.2).
- Claire Wolf's bextdep repo contains various bit-compress/expand implementations, including area evaluations.
Of note are the up-to-64-cycles sequential implementation and the 3-stage pipeline implementation:
- up-to-64-cycles sequential: Gates: 3123, LUTs: 675
- 3-stage pipeline: Gates: 8846, LUTs: 309
- "RISC-V Extensions for Bit Manipulation Instructions" bit-compress/expand took about 3x more gates than a barrel shifter (3.3K vs 0.9K Gates)
This seems to suggest that the implementation complexity is reasonable, although the decode of the mask to control signals is disproportionately expensive.
These instructions should be added to RVV, since they are very widely applicable, can help accelerate working with very common data formats and the costs seem reasonable.
Several ways to reduce the implementation cost can be explored if feedback from implementors suggests that the implementation cost is too high:
- Since the mask operand is often a fixed constant, we could decide to only add vbcompress.vx/vbexpand.vx variants, such that only a single mask needs to be decoded.
- There could be separate extensions for SEW≤32 and SEW≤64 support, as a lot of the interleaving use cases can operate on fewer than 64 bits.
The Sheep-and-goats (SAG) operation, or bgrp as it's called in SVE, groups bits to the right or left as selected by a bitmask ((bcompress(x, ~m) << popc(m)) | bcompress(x, m)). It can perform an arbitrary bit permutation in ceil(log2(nbits)) chained SAG operations.
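A self-contained scalar sketch of the operation, equivalent to the bcompress-based formula above (again just my reference model):

```c
#include <stdint.h>

// Sheep-and-goats (SVE bgrp): bits of x selected by m are packed at the
// bottom of the result, the remaining bits are packed directly above them.
static uint64_t ref_sag(uint64_t x, uint64_t m) {
    uint64_t lo = 0, hi = 0;
    int nlo = 0, nhi = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t bit = (x >> i) & 1;
        if ((m >> i) & 1) lo |= bit << nlo++;
        else              hi |= bit << nhi++;
    }
    return nlo == 64 ? lo : ((hi << nlo) | lo);
}
```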
While this is quite powerful, it's very rarely used and can already be done by combining two vbcompress
instructions with a single vslideup
instruction.
I would only consider this worth adding if a hardware implementation of vbcompress
and vbexpand
gets this operation basically for free.
This doesn't seem to be the case; here, adding SAG increases the area by 50% over just implementing vbcompress/vbexpand.
Going by this 50% area increase, I'd rather hardware designers use these resources to make vbcompress/vbexpand faster than to add a dedicated SAG instruction.
This instruction is present in Knuth's MMIX as MXOR and modern x86 processors with support for GFNI as VGF2P8AFFINEQB. It was also considered for the RISC-V scalar bit manipulation extensions but didn't end up in the final specification.
The instruction interprets a 64-bit number as a 8x8 bit matrix, and multiplies two of such 8x8 bit-matrices together, replacing addition with XOR and multiplication with AND:
Assume that the 64 bits in $Y are numbered as follows (and similar for the bits in register $Z and $X): Y₀₀,Y₀₁...Y₀₇,Y₁₀,Y₁₁...Y₁₇,...,Y₇₀,Y₇₁...Y₇₇
Now bit Xᵢⱼ in register $X is computed as follows: MXOR: Xᵢⱼ = Y₀ⱼ & Zᵢ₀ ^ Y₁ⱼ & Zᵢ₁ ^ ... ^ Y₇ⱼ & Zᵢ₇
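As a reference for the semantics, here is my own scalar sketch. The row/column bit numbering below is my assumption for illustration; x86's vgf2p8affineqb uses a different bit ordering, so constants are not directly interchangeable.

```c
#include <stdint.h>

// Scalar model of an 8x8 bit-matrix xor-multiply (MMIX MXOR-style):
// bit (i,j) of the result is the XOR over k of Y(k,j) & Z(i,k).
// Here row i of a matrix is byte i of the 64-bit value, and bit j of that
// byte is column j.
static uint64_t ref_bmatxor(uint64_t z, uint64_t y) {
    uint64_t res = 0;
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            unsigned bit = 0;
            for (int k = 0; k < 8; k++) {
                unsigned ykj = (y >> (8 * k + j)) & 1;  // Y(k,j)
                unsigned zik = (z >> (8 * i + k)) & 1;  // Z(i,k)
                bit ^= ykj & zik;
            }
            res |= (uint64_t)bit << (8 * i + j);
        }
    }
    return res;
}
```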
This might not immediately sound useful, but it has a lot of unexpected uses.
Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of any bit from the source byte ("Why Ice Lake is Important (a bit-basher’s perspective)").
Although creating the constants needed for your permutation is non-trivial, the x86 vgf2p8affineqb
instruction still sees a lot of real-world use.
Prominent examples include: Arbitrary Modular GF(2ʷ) Multiplication, RAID error correction, PAR2, SM4, emulating missing 8-bit AVX instructions, sub-byte arithmetic, very fast SIMD byte histogram, abs(x) >> shift in av1 decode, 8x8 and 16x16 bit matrix transpose, indices to set bits
It can be used to emulate a bunch of 8-bit instructions that already exist in RVV, which may seem redundant.
However, if this instruction were added to RVV, it should operate on EEW=64 EMUL=LMUL, which would allow doing the 8-bit operations without a vtype change, removing up to two vsetvlis from certain code paths.
The hardware implementation cost seems manageable.
Claire Wolf shared the following on the isa-dev mailing list:
The HW cost for bmat[x]or is not as big as one might expect. My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.
"Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications" also covers this instruction and their single cycle implementation takes 2.4K NAND gates.
Existing vgf2p8affineqb
implementations corroborate this:
vgf2p8affineqb ymm, ymm, ymm, i8 | icelake | tigerlake | rocketlake | alderlake-E | alderlake-P | Zen4 |
---|---|---|---|---|---|---|
Latency | 5 | 5 | 5 | 5 | 3 | |
RThroughput | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
Due to the versatility and reasonable hardware cost, I suggest adding the vbmatxor.vv
and vbmatxor.vx
instructions.
The .vx
variant is included, because most uses of vgf2p8affineqb
broadcast a 64-bit constant to perform the same 8-bit permutation on all bytes.
Scan operations, also called prefix operations, are reductions that write back the intermediate results, effectively computing all partial reductions.
RVV already has a few "hidden" scan instructions that operate on mask registers:
- vmsbf(~m) is an and-scan on mask-bits
- vmsif(m) is an or-scan on mask-bits
- viota(m) is a +-scan on mask-bits that is widened to SEW
Scan operations on vector elements and their segmented variants are very important missing instructions, because they allow you to efficiently parallelize an entirely new class of problems. Segmented scans scan multiple segments of elements separately, as encoded by a mask: a set bit in the mask indicates that a new segment starts at this position. Implementing this using existing RVV instructions is very expensive.
Segmented scans allow you to efficiently process multiple variable-length sequences of input in parallel. This is best illustrated by an example.
Consider vectorizing run-length encoding, that is, turning an input of bytes into pairs of bytes encoding a repetition count and a byte value to repeat. With the classical SIMD/vector model, you can only process one "run" at a time, even though a vector register might contain multiple runs. While this isn't a big problem for smaller vector lengths, it can be a huge source of inefficiency when dealing with longer vectors. Using a max-scan to emulate a segmented viota allows you to process all the runs in a vector register independently:
RLE using max-scan:
0: 8 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 | v0 as input
1: 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 9 | v1 = slide1up(v0, v0[0]+1)
2: 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 | m0 = v0 != v1 __
3: 0 e 0 0 b 0 9 8 0 0 0 0 3 2 0 0 | v2 = m0 ? vid : 0 \
4: e e b b b 9 9 8 3 3 3 3 3 2 0 0 | v3 = max-scan(v2, 0) | does seg-viota(m0)
5: 1 0 2 1 0 1 0 0 4 3 2 1 0 0 1 0 | v4 = vid - v3 __/
6: 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 | m1 = m0 >> 1
7: 7 8 7 8 7 8 | val = compress(v0, m1)
8: 2 1 0 4 0 1 | repeats = compress(v4, m1)
Notice how the above code uses a max-scan to implement a segmented viota. This works for both seg-viota and seg-+-scan:
seg-+-scan:
6 5 4 3 2 1 | v0
1 0 1 0 0 1 | m0 | seg-viota:
0 1 0 1 0 0 | m1 = m0 >> 1 | 0 0 1 0 0 1 1 0 0 0 0 1 | m
21 15 10 6 3 1 | v1 = +-scan(v0) | 0 0 9 0 0 6 5 0 0 0 0 0 | v1 = m ? vid : 0
0 0 0 0 15 6 | v2 = vcompress(m0,v1) | 9 9 9 6 6 6 5 0 0 0 0 0 | v2 = +-scan(v1,0)
0 0 0 15 6 0 | v3 = vslide1up(v2,0) | 2 1 0 2 1 0 0 4 3 2 1 0 | seg-viota = vid - v2
2 1 1 0 0 0 | v4 = viota(m1)
15 6 6 0 0 0 | v5 = vrgather(v3,v4)
6 9 4 6 3 1 | seg-+-scan(v0,m0) = v1 - v5
AVX512 can also decode multiple runs in parallel using the more complex vpconflict
instruction from the AVX512CD extension:
0: v0 as input
1: cfss = _mm512_conflict_epi32(v0)
2: lzcnt = _mm512_lzcnt_epi32(cfss)
3: m = _mm512_movepi32_mask(cfss)
4: scfss = _mm512_sllv_epi32(cfss, lzcnt)
5: nscfss = _mm512_andnot_epi32(scfss, ~0)
6: rl = _mm512_lzcnt_epi32(nscfss)
7: repeats = _mm512_maskz_compress_epi32(m, rl)
8: val = _mm512_maskz_compress_epi32(m, v0)
This takes the same instruction count as the max-scan variant, but it only works for 32 and 64-bit elements.
Falvuy noted that a slightly different storage format can be vectorized to handle multiple runs at once using the element compress instruction, as described in this paper.
An example that AFAIK can't be efficiently implemented in AVX512 is parsing a series of integer numbers, in this case up to 16-bit, in parallel. The best x86 implementation I know of uses a permutation lookup table with 2¹⁶ entries to distribute elements of a 128-bit vector into power-of-two-sized chunks, such that they can be summed using multiple horizontal adjacent pair addition instructions.
Using seg-+-scan this becomes possible for arbitrary VLEN, here is a simplified implementation:
_ 5 6 7 8 _ 5 6 _ 5 6 7 _ | v0 = vsub(input, '0')
_ 7 6 5 _ 6 5 _ 8 7 6 5 _ | v1 = vrgather(v0, reverse)
1 0 0 0 1 0 0 1 0 0 0 0 1 | m0 = vmsgt(v1, 9)
0 3 2 1 0 2 1 0 4 3 2 1 0 | v2 = seg-viota(m0)
0 100 10 1 0 10 1 0 1000 100 10 1 0 | v3 = vrgather(pow10tbl, v1)
0 700 60 5 0 60 5 0 8000 700 60 5 0 | v4 = vwmul.wv(v3, v1)
0 765 65 5 0 65 5 0 8765 765 65 5 0 | v5 = seg-+-scan(m0, v4)
0 1 0 0 0 1 0 0 1 0 0 0 0 | m1 = m0 >> 1
765 65 8765 | v6 = vcompress(v5, m1)
This implementation directly widens the elements to 16-bit.
An optimized implementation should instead combine and widen adjacent elements (A*10+B) in one step, using a seg2 load and vwmul.
There are a lot of problems that benefit from scans and segmented scans: Line-of-Sight Problem, prefix sum, average word length, sparse matrix-vector multiplication (SpMV), sum of values from neighbors in a graph, Minimum Spanning Tree, Maximum Flow, ...
Note that the seg-copy-scan or segment-broadcast used in some of the linked problems is trivially just compress + viota + vrgather.
"Vector Models for Data-Parallel Computing" covers even more use cases, if one were to include segmented split (element-wise Sheep-and-goats), which is probably too expensive to implement in practice. Segmented split can be implemented in <15 instructions using segmented viota
and a lot more than that without scan instructions.
To reiterate, segmented scans allow you to efficiently vectorize new kinds of problems and are especially important to get good utilization on larger vector registers.
New instructions are only useful if they are implementable at reasonable speed and implementation cost. I'd like to argue that implementing scan counterparts to existing reduction instructions would not provide a significant hardware challenge.
The first half of the work-efficient parallel scan implementation already matches the reduction tree used in reductions. The second half is just a mirrored version of the up-sweep with offset inputs and outputs.
Segmented variants are conceptually a bit more complicated, but don't require much additional hardware. You need to propagate a single additional bit, from the segment starts, through the scan tree, which pins the value, such that subsequent steps don't change it. (see above)
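For readers who haven't seen it, this is the structure being referred to: a compact software model of the work-efficient up-sweep/down-sweep scan (a sketch assuming n is a power of two; the up-sweep is exactly a binary reduction tree):

```c
#include <stddef.h>
#include <stdint.h>

// Work-efficient (Blelloch) exclusive +-scan over n elements, n a power of two.
// The up-sweep is exactly the binary reduction tree; the down-sweep mirrors it.
static void exclusive_scan_add(uint32_t *a, size_t n) {
    // up-sweep: after this, a[n-1] holds the full reduction
    for (size_t d = 1; d < n; d <<= 1)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    // down-sweep: push partial sums back down the tree
    a[n - 1] = 0;
    for (size_t d = n >> 1; d >= 1; d >>= 1)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
            uint32_t t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
}
```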
I was not able to find any reports on hardware implementations; reports on how difficult it is to reuse reduction circuitry to implement scans would be especially insightful.
One software implementation reported that implementing segmented variants on top of existing scan instructions is quite straightforward:
On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the un-segmented versions. (source)
Assuming additional implementation complexity of segmented scan versions of existing reductions is reasonable, such instructions should be added to RVV, because they allow vectorization of new kinds of problems, while efficiently utilizing even very long vector registers.
While the reduction instructions take a second register source for the initial value, I'd recommend dropping this for the vector scans to conserve encoding space. Similarly, I wouldn't add separate instructions for segmented scans but rather define the masked variants of scans to work as segmented scans. AFAIK, masked scans and masked segmented scans are rare and can be implemented with additional instructions if needed.
As a minimal set of scan instructions, I would propose the following:
- +-scan / unsigned-max-scan: vscansum.v vd, vs1, vscanmaxu.v vd, vs1
- segmented +-scan / unsigned-max-scan: vscansum.v vd, vs1, vm, vscanmaxu.v vd, vs1, vm
- segmented viota: viotar.m, the "r" standing for resetting, which describes the behavior: it counts up until a mask bit is set, at which point it resets to zero and continues (see the reference sketch after this list)
A full set of scan instructions corresponding to existing reductions would include the following additions:
- unsigned min-scan: vscanminu.v vd, vs1
- signed min/max-scan: vscanmin.v vd, vs1, vscanmax.v vd, vs1
- floating-point +-scan: vfscanusum.v vd, vs1
- including segmented versions: vscanminu.v vd, vs1, vm, vscanmin.v vd, vs1, vm, vscanmax.v vd, vs1, vm, vfscanusum.v vd, vs1, vm
It could also include element-wise and/or/xor scans, but I don't see a need for these bitwise scans.
and-scan and or-scan already exist on mask registers, where they are actually useful:
- vmsbf(~m) is an and-scan on mask-bits
- vmsif(m) is an or-scan on mask-bits
Coincidentally, the next section will argue for a xor-scan on mask bits.
This operation is often used in SIMD parsing to match delimiters and is equivalent to carryless multiply with -1.
Usage: simdjson, simdzone, simdcsv
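The carryless-multiply equivalence is easy to check in software: bit i of clmul(x, -1) is the XOR of all bits of x at or below position i, which is exactly the inclusive xor-scan. A small self-contained check (0x24 is the 00100100 example used below):

```c
#include <stdint.h>
#include <assert.h>

// Inclusive prefix-XOR over the bits of a 64-bit word via shift-xor doubling.
static uint64_t prefix_xor64(uint64_t x) {
    x ^= x << 1;  x ^= x << 2;  x ^= x << 4;
    x ^= x << 8;  x ^= x << 16; x ^= x << 32;
    return x;
}

// Low 64 bits of the carryless product x * y (what vclmul computes).
static uint64_t clmul64_lo(uint64_t x, uint64_t y) {
    uint64_t res = 0;
    for (int i = 0; i < 64; i++)
        if ((y >> i) & 1) res ^= x << i;
    return res;
}

int main(void) {
    uint64_t samples[] = {0x24, 0x8000000000000001ull, 0x123456789abcdef0ull, 0};
    for (int i = 0; i < 4; i++)
        assert(prefix_xor64(samples[i]) == clmul64_lo(samples[i], ~0ull));
    return 0;
}
```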
It can be emulated using the following instruction sequence:
# computes xor_scan offset by one: (result >> 1) == xor_scan(00100100) == 00011100
vsetvli t0, x0, e8, m1, ta, mu # requires `EMUL=LMUL*(8/SEW)`
viota.m v1, v0
vand.vi v1, v1, 1
vmseq.vi v0, v1, 1
The above is three LMUL=1
instructions when LMUL*(8/SEW)≤1
, but you may need to change the surrounding code to accommodate the offset when porting existing code.
The following variant will likely be faster when LMUL*(8/SEW)>2
:
# xor_scan(00100100) == 00011100
li a1, -1
vsetvli t0, x0, e64, m1, ta, mu
vclmul.vx v1, v0, a1 # 64-bit parallel-prefix XOR
vmslt.vx v0, v1, x0 # top bit set
viota.m v2, v0 # parallel-prefix XOR of top bits
vand.vi v2, v2, 1
vmseq.vi v0, v2, 1
vnot.v v1, v1, v0.t # apply flip
Expected RThroughput for SEW=8 from SiFive P670 llvm-mca model:
method | m1 | m2 | m4 | m8 |
---|---|---|---|---|
dedicated instruction | 1 | 1 | 1 | 1 |
viota | 3 | 4 | 8 | 16 |
vclmul+viota | 6 | 6 | 6 | 6 |
Vector length-specific code could have a specialized VLEN≤512 path that does the prefix-xor at LMUL=1 with a single vclmul.vx instruction.
The current situation isn't bad; however, you get the implementation almost for free if your processor already implements vclmul.
Because it's so cheap to implement, uses minimal opcode space, and would encourage VLA code, it would be advantageous to add this as a dedicated instruction.
Notes on naming:
While vmxorscan would be a possible name, it doesn't quite fit with the existing mask instructions:
- or-scan is vmsif(m) "set-including-first mask bit"
- and-scan is vmsbf(~m) "set-before-first mask bit"
To harmonize the names, vmsxff for "set-xor-from-first mask bit" might be a better alternative.
Reversing elements and mask-bits isn't a common operation, and there are already performance-portable ways of implementing them for application processors:
element-reverse:
vrgather.vv v0, v0, v1 # v1 previously initialized to vl-vid
# To reverse LMUL>1, vrgather each LMUL=1 register separately and use moves
# to swap the LMUL=1 vector registers
mask-reverse:
vbrev.v v0, v0
vrgather.vv v0, v0, v1 # v1 previously initialized to vl-vid
However, there are a few caveats that make dedicated instructions attractive.
- Firstly, both of these reversals assume vl=vlmax and require an additional vslidedown when vl<vlmax.
- Secondly, both approaches don't work for SEW=8 with VL>256; they require vrgatherei16 and 16-bit indices instead, which often performs worse.
- Thirdly, using vrgather to reverse elements will perform well on sane application class processors, but on other targets a SEW=8 vrgather performs significantly worse than a SEW=64 vrgather. In that case, you should probably use a SEW=64 vrgather with an additional vbrev8 when reversing a SEW=8 vector.
While I don't think this is a very common operation, a dedicated reverse instruction would make code using it easier to write and potentially faster.
Both element- and mask-reversal instructions would not take up a lot of encoding space, as they have only one source and one destination register.
For string algorithms, a fast 16x8b LUT is non-negotiable, and a lot of other integer workloads also require it or generally fast shuffles within 128-bit lanes.
This is currently addressed by an LMUL=1
vrgather.vv
, however scaling it to larger vector length seems impractical as complexity grows quadratically.
If you have the silicon budget for a 256-bit lane vrgather.vv
primitive, but a vector length of 1024, then a naive hardware implementation would need to apply the primitive (1024/256)^2=16
times.
We see this impacting existing implementations.
The SiFive X280 has VLEN=512 and a native 256-bit data path, but it has vl cycles of occupancy for LMUL ≥ 1, and a special-cased ~1-cycle occupancy for LMUL ≤ 1/2, which matches the data path width. (See riscvarchive/riscv-v-spec#929 (comment).)
There is a silver lining however as an implementation could detect when all indices in a lane point to the same other lane, and only execute the primitive once for that lane. This would improve the above example to 4 applications of the primitive for a lot of common cases including the 16x8b LUT, any in-lane permutation, element reversals, ...
I'd imagine that high-performance out-of-order cores would have trouble implementing this optimization, but I don't think it's as relevant for those cores. They are unlikely to use very long vectors and will have the budget to implement a full LMUL=1
vrgather.vv
primitive. See existing x86 and ARM cores.
On the other hand, the high-performance out-of-order P470 and P670 cores from SiFive seem to implement this optimization for LMUL>1 vrgathers
, where this doesn't even matter as much because software can just unroll into multiple LMUL=1 vrgathers
if there is no lane crossing.
Nonetheless, we see in existing AVX512 implementations that the latency of an in-lane gather is smaller than that of a full lane-crossing gather:
latency for zmm | icelake | tigerlake | rocketlake | alderlake-P | Zen4 |
---|---|---|---|---|---|
vpshufb | 1 | 1 | 1 | 1 | 2 |
vpermb | 3 | 3 | 3 | 3 | 6 |
Additionally, while SVE originally didn't include 128-bit lane permute, SVE2.1 recently added it as TBLQ.
A separate instruction may simplify implementations, but it would also be an incentive to neglect the LMUL=1
vrgather.vv
implementation, which is also an important operation.
This should be mostly mitigated because the base V extension only has the full lane-crossing-gather, hence a lot of software will use lane-crossing operations and force vendors to implement it competitively.
So I'm not sure if a separate instruction is needed, but it seems like it would help simplify implementations and enable more performance portable RVV code.
A) gather within 128-bit lanes, indices are in the corresponding lane of the index vector register
- pros
- allows for different shuffles per lane; this seems to be a rare use case, however
- index lanes are next to data lanes, so there may be less data movement in certain architectures.
- cons
- requires an additional 128-bit element broadcast instruction to be more easily useful, such an instruction does have general utility though
- increases register pressure, as a LMUL>1 gather would require an LMUL>1 index register
B) gather within 128-bit lanes, indices are in the lowest 128-bit of the index register, potentially always as 8-bit
- pros:
- no 128-bit broadcast needed, thus easier programming model
- only uses a LMUL=1 index register
- with 8-bit indices, shuffles for e16 and above can be loaded from a single 64-bit GPR (.vx variant possible, although the semantics would be weird for e8)
- cons:
- Use case of different shuffles per lane not supported
- may require more data movement and increase latency compared to indices in the corresponding lanes
I don't think shuffles from immediate (like vpshufd) are needed to reduce vector register pressure since most other instructions have immediate and GPR variants. It would also require an 8-bit immediate, which would take a lot of encoding space. As such, this would only make sense as a destructive operation using the 5-bit immediate and additional 5-bits from the rs1 encoding. This would use two more bits than necessary, which could either be repurposed as additional opcode space or directly encode the EEW.
AVX512 introduced the very powerful vpternlogd
/vpternlogq
instructions, which use an 8-bit immediate to select an arbitrary ternary logic operation.
They can express the following bitwise operations in a single instruction:
- C xor A xor B
- (A orn B) and C
- C ? A : B, bitwise merge
- C ? A nor B : A xor B, exactly one bit set
- C ? A xor B : A and B, exactly two bits set
- Used in MD5: C ? A : B, A xor B xor C, A xor (B orn C)
- Used in SHA-1: C ? A : B, A xor B xor C, C ? A or B : A and B
- Used in SHA-2: C ? A : B, C ? A or B : A and B
RVV defines all possible binary operations for mask registers (and, andn, nand, nor, or, orn, xnor, xor
), but only and, or, xor, andn
for vector elements.
The binary operations on vector elements do have a GPR and immediate variant, which can be quite useful.
While a fully generic ternary logic instruction would be a great design point, I don't think it would fit into the current ISA very well, as the existing logic instructions could've been defined on top of it.
Using the existing RVV binary mask operations it takes up to four instructions to simulate what vpternlog
can express in a single instruction.
That isn't too bad, but adding a ternary conditional selection instruction (C ? A : B
) would bring it down to three or fewer instructions with a maximum dependency chain length of two instructions.
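To make the instruction-count claim concrete (my own sketch): by Shannon expansion, any ternary boolean function f(A,B,C) equals C ? f(A,B,1) : f(A,B,0), and each cofactor is a binary function, i.e. one existing or proposed instruction, so a bitwise select caps the cost at three instructions with a dependency depth of two. For example:

```c
#include <stdint.h>
#include <assert.h>

// The proposed bitwise C ? A : B select.
static uint64_t bitwise_select(uint64_t c, uint64_t a, uint64_t b) {
    return (a & c) | (b & ~c);
}

int main(void) {
    // a, b, c enumerate all 8 input combinations across the bit positions
    uint64_t a = 0xF0F0F0F0F0F0F0F0ull;
    uint64_t b = 0xCCCCCCCCCCCCCCCCull;
    uint64_t c = 0xAAAAAAAAAAAAAAAAull;
    // "exactly two of A, B, C set" == C ? (A xor B) : (A and B)
    uint64_t f1 = a ^ b;                  // cofactor for C=1 (vxor.vv)
    uint64_t f0 = a & b;                  // cofactor for C=0 (vand.vv)
    uint64_t two_set = bitwise_select(c, f1, f0);
    // check against an independent formula: at least two set, but not all three
    assert(two_set == (((a & b) | (a & c) | (b & c)) & ~(a & b & c)));
    return 0;
}
```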
I think it would be reasonable to add the missing binary operations for vector elements, as well as a ternary conditional selection (C ? A : B
), which should also have a variant for mask operations.
The missing binary operations don't really need an immediate variant, and I'd be happy with just defining vector-vector variants to preserve opcode space.
I've placed this under "not sure if needed", because I'm not sure how much real world difference additional instructions would make. I'm also unsure how to quantify the impact of adding a three source instruction to the mask unit, which previously only had two source operand instructions.
Obviously, this won't cover operations for which instructions are directly available. Instead, it will cover instructions present in other ISAs that I don't see as needed.
Both AVX512 and SVE provide instructions that do all-to-all element comparisons for finding and counting equal elements.
AVX512 has vp2intersect
and vpconflict
, which both do an all-to-all comparison, with 32 and 64-bit variants.
SVE has MATCH
and HISTSEG
for 8 and 16-bit elements, which do an all-to-all comparison within every 128-bit lane, and HISTCNT
which does a lane-crossing all-to-all comparison for 32 and 64-bit elements.
Historically vpconflictd
and vp2intersect
were famously quite slow on Intel:
Latency/RThroughput | all intel, icelake to alderlake | Zen4 | Zen5 |
---|---|---|---|
vpconflictd ymm, ymm | 17/9 | 6/0.67 | 6.9/1 |
vpconflictd zmm, zmm | 26/19.5 | 7/1.33 | 6.9/1 |
vp2intersectd ymm, ymm | 27/11 (tiger/alderlake only) | 6/0.67 | ?/1 |
vp2intersectd zmm, zmm | 42/23 (tiger/alderlake only) | --- | ?/1 |
The performance of vp2intersect
on Intel was so bad that a faster-than-native alternative software implementation was found.
Intel deprecated VP2Intersect from 2023 onward, while AMD re-introduced it on Zen5 with a very performant implementation.
The SVE instructions don't have this problem, presumably because the vector lengths are all just 128-bits.
Latency/RThroughput | Cortex A510 | Cortex A520 | Neoverse N2 | Neoverse V2/V3 | Cortex A710 | Cortex X2/X3/X4 | Cortex A715/A720 |
---|---|---|---|---|---|---|---|
MATCH | 7/4 | 7/4 | 2/1 | 2/1 | 2/1 | 2/1 | 2/0.5 |
HISTSEG | 8/5 | 8/2.5 | 2/0.5 | 2/0.25 | 2/0.5 | 2/0.25 | 2/0.5 |
I could find almost no usage of the SVE instructions, while vpconflict
and vp2intersect
were slightly more common.
If RVV were to add such instructions, it would need to follow SVE in restricting them to within lanes; otherwise, implementations beyond a certain VLEN are just not reasonably possible, due to the quadratic growth of the needed comparisons. This restriction makes the instructions a lot less nice to use, and, as I'll argue in the following, they are not needed to suitably address their most common use cases anyway.
By far the most common application of these instructions is histogram computation/data binning. This involves filling a vector with indices of buckets that need to be incremented, detecting conflicts, resolving them, and finally gathering, incrementing, and scattering the values to the buckets.
Intel advertised vpconflict
explicitly for allowing their compiler to auto-vectorize histogram computation.
Currently, only vendor compilers are capable of this and only when the arguments are marked as restrict
.
The biggest benefit of such vectorizations isn't the histogram increment itself, rather the vectorization of the rest of the calculation. Auto-vectorizing compilers could just vectorize the loop, up until the histogram increment, at which point it could iterate over the vector register entries and increment the histogram using scalar code.
Furthermore, using vpconflict doesn't seem beneficial for histograms when measured (Zen5 results, benchmark code); it is even outperformed by scalar code, extremely so on AMD. This was not the case on current RVV implementations, presumably because they have very weak in-order scalar execution compared to the tested x86 processors. The out-of-order execution resources of today's processors seem to already do a great job at executing loads/stores in parallel while avoiding conflicts.
There are many ways to efficiently vectorize histogram creation without a dedicated conflict detection instruction, and they cover almost all use cases:
- Small histograms can benefit greatly from hand-tuning for the specific size: hand-tuned byte histogram without vpconflict, see the "aptroot" line in the graph below.
- On x86 and, presumably, other highly aggressive out-of-order cores, using scalar code to increment the histogram values is faster than using memory gather/scatter (see the sketch after this list).
- The histogram is small enough to replicate it VL times, forgoing the need for conflict detection entirely.
- The histogram is too large to replicate; consequently, conflicts are rare. In this case, you can quickly check if a conflict happened by doing an additional gather after the scatter write-back of incremented values, to check if the difference from the original values is exactly equal to vl. If a conflict occurs, conflicting buckets will only be incremented once, and you'll be able to tell that there weren't vl increments as expected.
- Finally, conflict detection can always be implemented with N/2 iterations of existing instructions: https://riscv.godbolt.org/z/rEY6oM6ne
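As an illustration of the second strategy above, here is a minimal sketch of the hybrid structure. VEC is just a placeholder stripe width, and the first inner loop stands in for whatever vectorized binning logic precedes the increment.

```c
#include <stdint.h>
#include <stddef.h>

// Hybrid histogram: the bucket-index computation is the part worth
// vectorizing; the increments themselves are done with plain scalar code on
// the buffered indices, so no conflict detection is needed at all.
#define VEC 16
void histogram_hybrid(uint32_t *hist, const uint8_t *data, size_t n) {
    size_t i = 0;
    for (; i + VEC <= n; i += VEC) {
        uint16_t idx[VEC];
        for (size_t j = 0; j < VEC; j++)   // vectorizable: compute bucket
            idx[j] = data[i + j];          //   indices (binning logic here)
        for (size_t j = 0; j < VEC; j++)   // scalar increments; conflicts
            hist[idx[j]]++;                //   are handled naturally
    }
    for (; i < n; i++)                     // scalar tail
        hist[data[i]]++;
}
```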
Due to the above alternative implementation strategies covering the histogram use case almost entirely, I don't see it as a good motivation for a dedicated instruction.
The other slightly bigger use case for these all-to-all comparison instructions is computing set intersections and other set operations that follow from that.
I could find two libraries that use these instructions to implement set intersection: tetzank/SIMDSetOperations and ashvardanian/SimSIMD. However, I couldn't find any external code using these set intersection functions.
Let's look at some benchmarks comparing implementations using the dedicated instructions with ones that aren't using them:
- SimSIMD set intersection:
  - The SVE HISTCNT and MATCH implementations were slightly slower than the equivalent NEON variant on the Neoverse V2, which has a 128-bit SVE implementation.
  - The AVX512 implementation with vp2intersect was on average only 5% faster than the AVX512 shuffle variant on Zen5, although the results varied a lot. The biggest speedup of 1.60x occurred when intersecting 128 with 8292 16-bit elements, while the biggest regression was 1.25x slower when intersecting 1024 with 1024 16-bit elements.
- SIMDSetOperations set intersection:
  - On Skylake-X: the vpconflict implementation was 1.6x slower than the AVX2 shuffle implementation.
  - On Zen5: the vp2intersect implementation was 1.5x faster than the AVX2 shuffle implementation, while the vpconflict one was 1.2x slower.
  - Pretending we have a fast vp2intersect on other processors: Since I don't have access to a CPU with a fast vp2intersect, I modified the vp2intersect implementation to use AVX2 and replaced the vp2intersect instruction with a sequence of instructions that has the same latency as vp2intersect has on Zen5, so 7 cycles. Since AVX2 doesn't have a compressed store, I replaced it with a regular store. This will produce the wrong result, but should give an indication of the expected performance if a fast vp2intersect instruction were available.
    - Skylake-X: fake-vp2intersect was ~1.35x faster
    - Zen1: fake-vp2intersect was ~1.1x faster
    - Coffee Lake: fake-vp2intersect was ~1.3x faster
That's not a large improvement for an instruction of quadratic implementation complexity.
The SIMDSetOperations benchmark intersected two large, equal-sized sets with ~1/3 intersecting elements. Often this is not the case, and one set is a lot larger than the other, e.g. in keyword searches and graph analytics. (source)
For intersecting very large sets with smaller ones, there are alternative algorithms that don't benefit much from such instructions. This paper on fast SIMD set intersection proposes precomputing a bloom filter-like structure to represent the large set. It then uses specialized NxM intersection kernels to resolve matches more efficiently:
This approach could be expanded by batching multiple NxM intersections with the same dimensions and resolving them using a VL/K MxN intersection kernel. For instance, the following code could resolve VLEN/64 2x4 intersections of 32-bit elements at once:
# inputs: LMUL=1 v2 # ...BAbaBAba
# LMUL=2 v4 # ...DCBAdcbaDCBAdcba
# output: v0 # intersecting elements
vsetvli x0, a0, e32, m1
vtrn1.vv v6, v2, v2 # ...AAaaAAaa
vtrn2.vv v8, v2, v2 # ...BBbbBBbb
vwzip.vv v10, v6, v6 # ...AAAAaaaaAAAAaaaa
vwzip.vv v12, v8, v8 # ...BBBBbbbbBBBBbbbb
vsetvli x0, a1, e32, m2
vmseq.vv v0, v4, v10
vmseq.vv v1, v4, v12
vmor.mm v2, v0, v1
vcompress.vm v0, v4, v2
# Note: I'm using vtrn/vwzip as proposed earlier.
# It could also be implemented with two vrgathers instead,
# but for that you'd want to create the indices outside of the main loop.
This approach allows you to get almost full utilization of your vector length.
A dedicated full-register intersection instruction wouldn't be that beneficial, because the sub-intersections are usually small. What could be beneficial is a set of dedicated instructions for some of the larger or harder-to-implement kernels, e.g. for multiple 3x4, 3x5, 4x5, ... intersections.
The other uses I could identify were:
- sparse vector dot product, which is functionally equivalent to set intersection
- run-length encoding, which can be solved more efficiently with segmented scans, see above
- a handful of uses where vpconflict was directly followed by a memory gather/scatter, where I wasn't able to clearly tell whether it's a histogram in disguise; still, in these cases the memory accesses are likely going to be the bottleneck
The implementation cost of dedicated intersection/conflict detection instructions is very high, and the benefits aren't clear. Since the most common applications can be efficiently vectorized using other methods, I don't consider these good candidate instructions for general-purpose application class processors.
If vendors see a need for dedicated instructions similar to the ones discussed above for specific applications, e.g. database applications, then it's certainly worth considering. Such applications would probably benefit from more specialized instructions that match the exact usage of the application. In the intersection algorithm described above, that may be many instructions that do multiple small intersections of different sizes in parallel.
Feel free to share feedback and suggest additions and alternative implementations.
Expand me
- Arbitrary bit-permutation: bitshuffle_epi64_mask/vpshufbitqmb
- arbitrary bit-permute of GPR showcase
- bit-permute mask registers, dav1d
- used in finite state machine
- bit-interleaves on GPRs, can be done via 2-3 pdep/pext (bit compress/expand), see BMI implementation
- Additional notes: RVV can already do bit-permutations via vrgather on bytes converted from/to a mask register. Using a single LMUL=1 vrgather matches the bits you can bitshuffle on AVX512/VL on the same VLEN.
- Bit-matrix operations: gf2p8affine_epi64_epi8/vgf2p8affineqb/bmatxor
- RISC-V BitManip v0.93 wiki notes on bit-matrix operations (bmatxor, bmator, bitmatflip):
- bit permutation (within bytes), byte permutation, bit-duplication (within bytes), byte duplication
- many xor / many or (think "vector lite")
- full NxM bit matrix multiply (using many 8x8 ops)
- searching (finding zero bytes in 8-byte chunk)
- linear algebra in GF(2)
- arithmetic in GF(2k) with k ≤ 8
- "Unexpected Uses for the Galois Field Affine Transformation Instruction"
- Why Ice Lake is Important (a bit-basher’s perspective) "Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of any bit from the source byte"
- "Use AVX512 Galois field affine transformation for bit-shuffling" "Arbitrary reshuffle bits within a byte. We may reverse bits, rotate, set the order of bits, broadcast selected bit(s), etc." "Gather in a byte selected bit from a lane."
- used in RAID "GFNI support for amd64, for up to 3x faster processing.", only uses gf2p8affine_epi64_epi8 from GFNI
- used in PAR2, for GF16 arithmetic
- SM4 sBox, RISC-V has SM4 support in vector crypto
- very fast SIMD byte histogram
- abs(x) >> shift, in dav1d
- turn indices into set-bits, RVV could widen indices to 64-bit, do LMUL=8 1 << idx and do LMUL=8 vredxor
- broadcast selected bit in byte
- bit interleave in byte
- swap even/odd bits
- 8x8 bit matrix transpose
- 16x16 bit matrix transpose
- find first byte occurrence in 64-bit lane
- broadcast imm8
- Arbitrary Modular GF(2w) Multiplication
- Fixed 2-bit Packed Arithmetic
- reverse bits in byte, RVV has vbrev
- 8-bit shift/rotate, relevant article, exists in RVV
- full vreg prefix xor, RVV can use vclmul+viota
- additional use-cases
- 8-bit left shift + add, RVV has vfmacc/vfmadd
- bitwise NOT, RVV has vnot
- count trailing/leading zeros, RVV has vclz/vctz
- Additional notes:
- "Advanced SIMD instructions usage survey notes" covers this instruction
- The immediate argument in vgf2p8affineqb is almost always zero. It's just an additional XOR applied to every byte so shouldn't be included in a vbmatxor RVV instruction.
- isa-dev messag from Claire Wolf "The HW cost for bmat[x]or is not as big as one might expect. My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders."
- It can be used to emulate a bunch of 8-bit instructions that already exists in RVV, which may seem redundant. However, if this was added to RVV it should operate on EEW=64 EMUL=LMUL, which would allow for doing the 8-bit operations without a vtype change.
- All-to-all equality comparisons, detecting duplicates: conflict_epi16/32/vpconflict
- SIMDSetOperations
- set intersection
- set difference
- set union, code path not active, uses alternative implementation
- sparse vector dot product, functionally equivalent to set intersection
- run-length encoding multiple runs in parallel, can also be done via segmented viota or max-scan
- factorization: determine unique indices, followed by memory gather/scatter
- unique for histogram benchmark, followed by memory gather/scatter
- LAMMPS Molecular Dynamics Simulator, followed by memory gather/scatter
- flux calculation, followed by memory gather/scatter
- linear probing, followed by memory gather/scatter
- 2intersect_epi32/64/vp2intersect
- "Faster-Than-Native Alternatives for x86 VP2INTERSECT Instructions"
- set intersection, I don't have benchmark numbers yet, however emulated vp2intersect gave a good speedup for some input sizes
- set intersection
- sum of sparse arrays, uses emulated vp2intersect to find matching indices
- svmatch/svhistseg
- digit count (VLEN=128 only), just do vcpop(x-'0' < 10), it's used here due to a quirk of the Neoverse V2
- count bytes equal to 0,-2,-1,1,2 (VLEN=128 only), RVV could use input as indices to gather 32-bit values and sum them using SWAR, probably other things possible
- svhistcnt
- Additional notes:
- vpconflict doesn't seem beneficial for histograms when measured, zen 5 results, benchmark code
- Paper on fast SIMD set intersection using an alternative set format and specialized NxM intersection kernels, without using the dedicated conflict/intersection instructions
- Expand/Decompress vector elements: expand_epi16/32/64/vpexpand/vpdecompress
- decode packed varuint format, expand variable, determined by encoding, 1-4 bytes into 4 bytes
- Access Control List
- vertexcodec
- simd-sort swap
- selectively widen with default value (latin1 to utf8, yenc), could be widen+compress or viota+vrgather
- vslideup in simd-sort partition
- vslideup in utf8 conversion
- widen u32 to u64
- Intel QLP expand primitive, I wasn't able to find any third party code using this
- svsplice
- combine two vectors, vslideup, often you can just vcreate
- Additional notes:
- The spec shows how to synthesize a decompress operation via viota+vrgather
- Extract 8 unaligned bytes from u64: multishift_epi64/vpmultishiftqb
- base64 encode (WojciechMula/base64simd, php/php-src, aklomp/base64):
gather [........|ccdddddd|bbbbcccc|aaaaaabb]
->multishift [bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
->[00dddddd|00cccccc|00bbbbbb|00aaaaaa]
could also be done via a bit-compress
- decompress example from intel optimization guide, could also be done via a bit-compress
- v210 to rfc4175 (v210 without padding), uses multishift+shuffle+multishift+and+andnot+or+compress, could presumably also be done with bit-compress+widen+compress+narrow
- latin1 to utf8, fast alternative implementations available
- utf8 validation, fast alternative implementations available
- padding bit-matrix with ≤7 bit rows to 8-bit rows, could also be done via a bit-compress
- utf16 to utf8, fast alternative implementations available
- widen 4 to 8-bit, could be done with something like
vsrl_u8(vwmul_u8(x, 16), id&1 ? 0 : 4)
- bit compress and expand: pext/bext/bcompress
- base64 decode, base64simd
vbrev8 [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
->vbcompress [00aaaaaa|00bbbbbb|00cccccc|00dddddd]
->vrgather [........|aaaaaabb|bbbbcccc|ccdddddd]
->[........|ccdddddd|bbbbcccc|aaaaaabb]
- LEB128/varint decode, used in DWARF, protobuf, WebAssembly, ...
- regex matching: heavy usage in hyperscan/vectorscan
- genomics: claimed in "Introduction to SVE2", de/en-coding of nucleotides
- Board games:
- UTF-8 decode in GPR
- Picnic: Post-Quantum Signatures
- Adaptive Linearized Tensor Order format
- Space-filling curves: 2D Morton Z-order to Cartesian coordinates, Hilbert curve
- bit-deinterleave: CBQN
- hex to ASCII
- find indices of set bits
- narrow: 8/16 bit to packed bits,
- pack data structure
- align previously created mask with elements after compression
- convert compression mask to mask of compressed elements, probably not needed, since RVV has vl predication
- pdep/bdep/bexpand
- base64 encode, base64simd
vrgather [........|ccdddddd|bbbbcccc|aaaaaabb]
->vbexpand [........|aaaaaabb|bbbbcccc|ccdddddd]
->vbrev8 [00aaaaaa|00bbbbbb|00cccccc|00dddddd]
->[00dddddd|00cccccc|00bbbbbb|00aaaaaa]
- LEB128/varint encode, used in DWARF, protobuf, WebAssembly, ...
- regex matching: heavy usage in the hyperscan/vectorscan
- genomics: claimed in "Introduction to SVE2", de/en-coding of nucleotides
- Board games:
- UTF-8 encode in GPR
- bit striping rgb pixel data
- Quantum circuit simulator
- Space-filling curves: Cartesian to 2D Morton Z-order
- bit-interleave: CBQN
- index of nth set bit within element
- unpack sequence of packed 1/2/3/4/5-bit values
- unpack data structure
- rank select, mask n'th least-significant set bit
- limit mask to 32 set bits
- widen: CBQN (conditionally once or twice), widen vector mask,
- bgrp / Sheep-and-goats
- arbitrary bit permutation in ceil(log2(nbits)) Sheep-and-goats (SAG) operations
- Othello evaluation function
- Additional notes:
- 8 μops per set bit on Zen1/2
- "RISC-V Extensions for Bit Manipulation Instructions" paper:
- bcompress/bexpand took about 3x more gates than a barrel shifter (3.3K vs 0.9K Gates)