RISC-V Vector Extension for Integer Workloads: An Informal Gap Analysis

@camel-cdr, last active March 23, 2025
Note: To verify my RVI membership and identity on this otherwise semi-anonymous account: I'm Olaf Bernstein. If you are a member of the vector SIG, you should be able to view my sig-vector profile.

The goal of this document is to explore gaps in the current RISC-V Vector extensions (standard V, Zvbb, Zvbc, Zvkg, Zvkn, Zvks), and suggest instructions to fill these gaps. My focus lies on application class processors, with the expectation that suggested instructions would be suitable to become mandatory or optional instructions in future profiles.

I'll assume you are already familiar with RVV, if not, here is a great introduction and here the latest RISC-V ISA manual.

To determine which applications aren't currently suitably covered by the RISC-V Vector extensions (RVV), I collected usage examples of the SIMD instructions present in AVX512 and SVE that don't have an RVV equivalent. You can find my notes at the bottom of this article; I'll use them as a source of truth when arguing for or against the inclusion of certain instructions.

Please be aware that I'm not a hardware designer, so take anything I say with a grain of salt and feel free to correct my mistakes/wrong assumptions.

Proposal TLDR

  • vmslide1up.m, vmslide1down.m: slide mask up/down by one
  • vtrn1.vv, vtrn2.vv: 2x2 transpose instructions
  • vwzip.vv: zip/interleave two vectors
  • unzip.vv or vnsrl defined for e64
  • vbcompress, vbexpand: element-wise bit-compress/expand
  • vbmatxor: 8x8 bit-matrix multiply, x86 calls it vgf2p8affineqb
  • vscansum.v vd, vs1 / vscanmaxu.v vd, vs1: +-scan / max-scan
  • vscansum.v vd, vs1, vm / vscanmaxu.v vd, vs1, vm: segmented +-scan / unsigned-max-scan
  • viotar.m: segmented viota ("r" for resetting)
  • vmsxff.m: xor-scan on mask bits ("set-xor-from-first mask bit")

Should be added to RVV

I believe the following instructions should be added to RVV, as they can solve problems that currently don't have a good solution.

Slides for mask registers

Sliding mask registers is a very common operation that is currently not well addressed by RVV, because dedicated instructions are missing.

In many cases, it's possible to use element slides instead of sliding just the mask register. This is however a fundamentally more complex operation and performs a lot worse in practice, especially when the relevant vectors use LMUL>1. It also often requires you to duplicate comparisons or other operations done before the element slide.

Mask slides can be emulated directly, but this is expensive to do portably:

# slide up mask in v0 by 1-bit
vsetvli t1, zero, e8, MF8, ta, ma # MF8 needs to be MX/8
vslide1up.vx v1, v0, a0 # slide in next
vsrl.vi v1, v1, 7  # v1 = v1 >> 7, carry bit from the previous byte
vadd.vv v2, v0, v0 # v2 = v0 << 1
vor.vv  v0, v1, v2 # v0 = (v0[i-1] >> 7) | (v0[i] << 1)
vsetvli zero, t0, e8, MX, ta, ma

When measured, these 6 instructions still perform better than sliding the elements and repeating the comparison: SpacemiT X60, XuanTie C908, Saturn

If you write vector length specific (VLS) code and know you are working with 64 elements or fewer, then you only need a single vsll.vi, plus a vtype change.

The current situation is a mess with no good solutions, especially for vector length agnostic (VLA) code.

Most use-cases of mask slides could be addressed with just two one-source, one-destination instructions: vmslide1up.m and vmslide1down.m, where vmslide1up.m slides the mask register bits up by one and vmslide1down.m slides them down by one.

I don't think a GPR-based slide variant is particularly needed, and slides further than one or two are also quite rare. A dedicated slide-by-2-bit or slide-by-3-bit immediate instruction would, however, have the added benefit of allowing you to slide an entire LMUL=1 vector register by an immediate number of bits, which is sometimes useful for slides smaller than 8 bits. If this doesn't add much additional complexity, a full register-slide-bits-by-small-immediate instruction would also be an option.
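As a sketch of how the proposed instruction would be used, the following marks the start of every run of set bits in v0, the andn(m, m << 1) idiom that shows up in parsing (this assumes the proposed vmslide1up.m shifts in a zero bit; register numbers are arbitrary):

# mark run starts in mask v0 (sketch, vmslide1up.m is the proposed instruction)
vmslide1up.m v1, v0     # v1 = v0 slid up by one bit, assuming a zero bit is slid in
vmandn.mm    v2, v0, v1 # v2 = v0 & ~v1, set only where a run of ones begins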

Element transpose (vtrn1/vtrn2) and zip (vwzip) instructions

RVV already has a very powerful permutation instruction with vrgather and vcompress, however some permutations are so common that dedicated instructions are required, as they can have lower latency and higher throughput.

The need for dedicated instructions for these very common permutations has already been discussed at length by RISE in the context of video codecs, see: "RISCV64 new vector instructions requirements for video/multimedia" and "Vector Operations Proposals - Draft Document".

Here I'd just like to highlight a few additional aspects:

  • The implementation complexity of vtrn1/vtrn2 is similar to that of bitwise operations, so huge performance gains are possible
  • vwzip can already be implemented with vwmaccu.vx(vwaddu.vv(a, b), -1, b), as suggested by Craig Topper. That is two EMUL=2 instructions vs two EMUL=1 instructions if we had a dedicated instruction, which would give us performance parity with existing ISAs.
  • I prefer a vwzip over the vzip2a/vzip2b proposed by RISE, because it works better with vl<VLMAX and fractional LMUL. Using vzip2a/vzip2b would require an additional vslideup when vl<VLMAX or for fractional LMUL, unless you want the result to be split up between two vector registers.
  • unzip1/unzip2 is vnsrl.wi/wx, which already works for everything but 64-bit elements; those currently require a vcompress or vrgather to unzip. Instead of a dedicated unzip instruction, it may be preferable to define vnsrl at SEW=64 (EEW=128), which would result in a more versatile instruction. (Both emulations are sketched below.)
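For reference, here is a sketch of both emulations using existing instructions (register numbers and the use of a0 as AVL are arbitrary; a is in v8, b in v9, and v4 holds the interleaved pairs viewed as 64-bit elements for the unzip):

vsetvli    t0, a0, e32, m1, ta, ma
# zip via Craig Topper's trick:
vwaddu.vv  v12, v8, v9  # v12 = zext(a) + zext(b), EMUL=2 result
li         t1, -1
vwmaccu.vx v12, t1, v9  # v12 += (2^SEW - 1)*b, so each 64-bit element is (b << 32) | a
# unzip via the narrowing shift over 64-bit pairs (not possible for SEW=64):
li         t1, 32
vnsrl.wi   v8, v4, 0    # even elements (unzip1)
vnsrl.wx   v9, v4, t1   # odd elements  (unzip2)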

Missing bit permutations

With the Zvbb ("Vector Basic Bit-manipulation") extension, RVV gained a set of basic bit-manipulation instructions, as the name suggests. This notably includes byte-reverse, bit-reverse, count leading/trailing zeros, popcount and bit-rotates.

RVV is however still missing some advanced bit permutation instructions.

Bit compress and expand (vbcompress / vbexpand)

Some of the most powerful bit manipulation instructions present in today's ISAs are the bit-compress and bit-expand instructions.

  • bit-compress takes two n-bit arguments and moves the bit of the first argument corresponding to the nth set bit of the second argument to the nth position of the destination: bcompress(0bHGFEDCBA, 0b11110010) = 0b000HGFEB
  • bit-expand takes two n-bit arguments and moves the nth bit of the first argument to the position of the nth set bit of the second one: bdecompress(0b000HGFEB, 0b11110010) = 0bHGFE00B0

Implementing bit-compress/expand with existing RVV instructions is really expensive. You either have to use a large series of shifts and bit operations to shift the bits into place, or convert each bit into a byte and do the vcompress.vm/vrgather on 8-bit elements with an 8 times larger LMUL.

x86 implements these instructions as pext/pdep under BMI2, but they only operate on GPRs, while ARM SVE only defines the bext/bdep instructions which operate on elements of vector registers.

Among others, they can be used for: bit interleaving/packing; base64 de/encoding; LEB128 de/encoding, which is used in DWARF, Dalvik, protobuf and WebAssembly; regex matching; genomics; various board games; space-filling curves; quantum circuit simulation; ...

Here is an example to illustrate how these instructions could be used to decode and encode base64:

base64 decode:                                    | base64 encode:
vbrev8     [00dddddd|00cccccc|00bbbbbb|00aaaaaa]  | vrgather [........|ccdddddd|bbbbcccc|aaaaaabb]
vbcompress [00aaaaaa|00bbbbbb|00cccccc|00dddddd]  | vbexpand [........|aaaaaabb|bbbbcccc|ccdddddd]
vrgather   [........|aaaaaabb|bbbbcccc|ccdddddd]  | vbrev8   [00aaaaaa|00bbbbbb|00cccccc|00dddddd]
           [........|ccdddddd|bbbbcccc|aaaaaabb]  |          [00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bit compress and expand were the two instructions with the most real-world usage among the advanced SIMD instructions surveyed. There is just one problem: historically, some implementations really struggled to implement these instructions well:

|             | all intel, haswell to alderlake | Zen1/Zen2          | Zen3/Zen4 | Zen5 | Cortex A510/A520 | Neoverse N2/V2/V3, Cortex A710/X2/X3/X4 | Cortex A715/A720 |
|-------------|---------------------------------|--------------------|-----------|------|------------------|-----------------------------------------|------------------|
| Latency     | 3                               | 8 μops per set bit | 3         | 3    | 14-70            | 6                                       | 4                |
| RThroughput | 1                               | 8 μops per set bit | 1         | 0.33 | 14-70            | 2                                       | 2                |

The first two generations of Zen used a microcoded implementation with 8 μops per set bit; the Cortex-A510/A520 implementations are similarly bad. On the other hand, all Intel processors since Haswell have managed to implement it with a three-cycle latency and a throughput of 1 instruction per cycle. The other ARM cores also have solid implementations, with a 4-6 cycle latency and a throughput of one instruction every two cycles.

Fortunately, multiple papers give more details on the implementation challenges.

They seem to suggest that the implementation complexity is reasonable, although decoding the mask into control signals is disproportionately expensive.

These instructions should be added to RVV, since they are very widely applicable, can help accelerate working with very common data formats and the costs seem reasonable.

Several ways to reduce the implementation cost can be explored if feedback from implementors suggests that the implementation cost is too high:

  • Since the mask operand is often a fixed constant, we could decide to only add vbcompress.vx/vbexpand.vx variants, such that only a single mask needs to be decoded.
  • There could be separate extensions for SEW≤32 and SEW≤64 support, as a lot of the interleaving use cases can operate on fewer than 64 bits.

Aside: No need for Sheep-and-goats

The Sheep-and-goats (SAG) operation, or bgrp as it's called in SVE, groups bits to the right or left as selected by a bitmask ((bcompress(x, ~m) << popc(m)) | bcompress(x, m)). It can perform an arbitrary bit permutation in ceil(log2(nbits)) chained SAG operations.

While this is quite powerful, it's very rarely used and can already be done by combining two vbcompress instructions with a single vslideup instruction.

I would only consider this worth adding if a hardware implementation of vbcompress and vbexpand gets this operation basically for free. This doesn't seem to be the case; here, adding SAG increases the area by 50% over just implementing vbcompress/vbexpand.

Going with this 50% area increase, I'd rather hardware designers use these resources to make vbcompress/vbexpand faster than spend them on a dedicated SAG instruction.

Bit-matrix operations (vbmatxor / vgf2p8affineqb / MXOR)

This instruction is present in Knuth's MMIX as MXOR and modern x86 processors with support for GFNI as VGF2P8AFFINEQB. It was also considered for the RISC-V scalar bit manipulation extensions but didn't end up in the final specification.

The instruction interprets a 64-bit number as an 8x8 bit matrix and multiplies two such 8x8 bit-matrices together, replacing addition with XOR and multiplication with AND:

Assume that the 64 bits in $Y are numbered as follows (and similar for the bits in register $Z and $X): Y₀₀,Y₀₁...Y₀₇,Y₁₀,Y₁₁...Y₁₇,...,Y₇₀,Y₇₁...Y₇₇

Now bit Xᵢⱼ in register $X is computed as follows: MXOR: Xᵢⱼ = Y₀ⱼ & Zᵢ₀ ^ Y₁ⱼ & Zᵢ₁ ^ ... ^ Y₇ⱼ & Zᵢ₇

This might not immediately sound useful, but it has a lot of unexpected uses.

Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of bits from the source byte (Why Ice Lake is Important (a bit-basher's perspective)).

Although creating the constants needed for your permutation is non-trivial, the x86 vgf2p8affineqb instruction still sees a lot of real-world use.

Prominent examples include: Arbitrary Modular GF(2ʷ) Multiplication, RAID error correction, PAR2, SM4, emulating missing 8-bit AVX instructions, sub-byte arithmetic, very fast SIMD byte histogram, abs(x) >> shift in av1 decode, 8x8 and 16x16 bit matrix transpose, indices to set bits

It can be used to emulate a bunch of 8-bit instructions that already exist in RVV, which may seem redundant. However, if this instruction were added to RVV, it should operate on EEW=64, EMUL=LMUL, which would allow doing those 8-bit operations without a vtype change, removing up to two vsetvlis from certain code paths.
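A sketch of how the proposed .vx variant might be used; the exact operand order (which source supplies which matrix) is left open here, and the choice of t1/a0 is arbitrary:

# apply the same 8x8 bit-matrix (e.g. an arbitrary per-byte bit permutation)
# to every byte of v8; the 64-bit matrix constant is in t1
vsetvli     x0, a0, e64, m2, ta, ma # a0 counts 64-bit elements here
vbmatxor.vx v8, v8, t1              # proposed: per 64-bit element bit-matrix multiply with the broadcast constant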

The hardware implementation cost seems manageable.

Claire Wolf shared the following on the isa-dev mailing list:

The HW cost for bmat[x]or is not as big as one might expect. My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.

"Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications" also covers this instruction and their single cycle implementation takes 2.4K NAND gates.

Existing vgf2p8affineqb implementations corroborate this:

| vgf2p8affineqb ymm, ymm, ymm, i8 | icelake | tigerlake | rocketlake | alderlake-E | alderlake-P | Zen4 |
|----------------------------------|---------|-----------|------------|-------------|-------------|------|
| Latency                          | 5       | 5         | 5          | ?           | 5           | 3    |
| RThroughput                      | 0.5     | 0.5       | 0.5        | 0.5         | 0.5         | 0.5  |

Due to the versatility and reasonable hardware cost, I suggest adding the vbmatxor.vv and vbmatxor.vx instructions. The .vx variant is included, because most uses of vgf2p8affineqb broadcast a 64-bit constant to perform the same 8-bit permutation on all bytes.

Scan operations with segmented variants

Scan operations, also called prefix operations, are reductions that write back the intermediate results, effectively computing all partial reductions.

RVV already has a few "hidden" scan instructions that operate on mask registers:

  • vmsbf(~m) is an and-scan on mask-bits
  • vmsif(m) is an or-scan on mask-bits
  • viota(m) is a +-scan on mask-bits that is widened to SEW

The case for segmented scans

Scan operations on vector elements, and their segmented variants, are important missing instructions, because they allow you to efficiently parallelize an entirely new class of problems. Segmented scans scan multiple segments of elements separately, as encoded by a mask: a set bit in the mask indicates that a new segment starts at that position. Implementing this with existing RVV instructions is very expensive.

Segmented scans allow you to efficiently process multiple variable-length sequences of input in parallel. This is best illustrated by an example.

Consider vectorizing run-length encoding, that is, turning an input of bytes into pairs of bytes encoding a repetition count and a byte value to repeat. With the classical SIMD/vector model, you can only process one "run" at a time, even though a vector register might contain multiple runs. While this isn't a big problem for smaller vector lengths, it can be a huge source of inefficiency when dealing with longer vectors. Using a segmented max-scan allows you to process all the runs in a vector register independently:

RLE using max-scan:
0: 8 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 | v0 as input
1: 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 9 | v1 = slide1up(v0, v0[0]+1)
2: 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 | m0 = v0 != v1     __
3: 0 e 0 0 b 0 9 8 0 0 0 0 3 2 0 0 | v2 = m0 ? vid : 0   \
4: e e b b b 9 9 8 3 3 3 3 3 2 0 0 | v3 = max-scan(v2,0) | does seg-viota(m0)
5: 1 0 2 1 0 1 0 0 4 3 2 1 0 0 1 0 | v4 = vid - v3     __/
6: 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 | m1 = m0 >> 1
7: 7 8 7 8 7 8 | val     = compress(v0, m1)
8: 2 1 0 4 0 1 | repeats = compress(v4, m1)

Notice how the above code uses a max-scan to implement a segmented viota. This works for both seg-viota and seg-+-scan:

seg-+-scan:
 6  5  4  3  2  1 | v0
 1  0  1  0  0  1 | m0                      |   seg-viota:
 0  1  0  1  0  0 | m1 = m0 >> 1            |   0 0 1 0 0 1 1 0 0 0 0 1 | m
21 15 10  6  3  1 | v1 = +-scan(v0)         |   0 0 9 0 0 6 5 0 0 0 0 0 | v1 = m ? vid : 0
 0  0  0  0 15  6 | v2 = vcompress(m1,v1)   |   9 9 9 6 6 6 5 0 0 0 0 0 | v2 = max-scan(v1,0)
 0  0  0 15  6  0 | v3 = vslide1up(v2,0)    |   2 1 0 2 1 0 0 4 3 2 1 0 | seg-viota = vid - v2
 2  1  1  0  0  0 | v4 = viota(m1)
15  6  6  0  0  0 | v5 = vrgather(v3,v4)
 6  9  4  6  3  1 | seg-+-scan(v0,m0) = v1 - v5

AVX512 can also decode multiple runs in parallel using the more complex vpconflict instruction from the AVX512CD extension:

0: v0 as input
1: cfss    = _mm512_conflict_epi32(v0)
2: lzcnt   = _mm512_lzcnt_epi32(cfss)
3: m       = _mm512_movepi32_mask(cfss)
4: scfss   = _mm512_sllv_epi32(cfss, lzcnt)
5: nscfss  = _mm512_andnot_epi32(scfss, ~0)
6: rl      = _mm512_lzcnt_epi32(nscfss)
7: repeats = _mm512_maskz_compress_epi32(m, rl)
8: val     = _mm512_maskz_compress_epi32(m, v0)

This takes the same number of instructions as the max-scan variant above, but only works for 32 and 64-bit elements.

Falvuy noted that a slightly different storage format can be vectorized to handle multiple runs at once using the element compress instruction, as described in this paper.

An example that AFAIK can't be efficiently implemented in AVX512 is parsing a series of integer numbers, in this case up to 16-bit, in parallel. The best x86 implementation I know of uses a permutation lookup table with 2¹⁶ entries to distribute elements of a 128-bit vector into power-of-two-sized chunks, such that they can be summed using multiple horizontal adjacent pair addition instructions.

Using seg-+-scan this becomes possible for arbitrary VLEN, here is a simplified implementation:

   _    5    6    7    8    _    5    6    _    5    6    7    _  | v0 = vsub(input, '0')
   _    7    6    5    _    6    5    _    8    7    6    5    _  | v1 = vrgather(v0, reverse)
   1    0    0    0    1    0    0    1    0    0    0    0    1  | m0 = vmsgt(v1, 9)
   0    3    2    1    0    2    1    0    4    3    2    1    0  | v2 = seg-viota(m0)
   0  100   10    1    0   10    1    0 1000  100   10    1    0  | v3 = vrgather(pow10tbl, v2)
   0  700   60    5    0   60    5    0 8000  700   60    5    0  | v4 = vwmul.wv(v3, v1)
   0  765   65    5    0   65    5    0 8765  765   65    5    0  | v5 = seg-+-scan(m0, v4)
   0    1    0    0    0    1    0    0    1    0    0    0    0  | m1 = m0 >> 1
   765 65 8765 | v6 = vcompress(v5, m1)

This implementation directly widens the elements to 16 bits. An optimized implementation should instead combine and widen adjacent digit pairs (A*10+B) in one step, using a seg2 load and vwmul.

There are a lot of problems that benefit from scans and segmented scans: Line-of-Sight Problem, prefix sum, average word length, sparse matrix-vector multiplication (SpMV), sum of values from neighbors in a graph, Minimum Spanning Tree, Maximum Flow, ...

Note that seg-copy-scan or segment-broadcast used in some of the linked problems is just trivially compress + viota + vrgather.

"Vector Models for Data-Parallel Computing" covers even more use cases, if one were to include segmented split (element-wise Sheep-and-goats), which is probably too expensive to implement in practice. Segmented split can be implemented in <15 instructions using segmented viota and a lot more than that without scan instructions.

To reiterate, segmented scans allow you to efficiently vectorize new kinds of problems and are especially important to get good utilization on larger vector registers.

New instructions are only useful if they are implementable at reasonable speed and implementation cost. I'd like to argue that implementing scan counterparts to existing reduction instructions would not provide a significant hardware challenge.

The first half of the work-efficient parallel scan implementation already matches the reduction tree used in reductions. The second half is just a mirrored version of the up-sweep with offset inputs and outputs.

(figure: work-efficient parallel scan)

Segmented variants are conceptually a bit more complicated, but don't require much additional hardware. You need to propagate a single additional bit, from the segment starts, through the scan tree, which pins the value, such that subsequent steps don't change it. (see above)

I was not able to find any reports on hardware implementations; reports on how difficult it would be to reuse reduction circuitry to implement scans would be especially insightful.

One software implementation reported that implementing segmented variants on top of existing scan instructions is quite straightforward:

On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the un-segmented versions. (source)

Assuming additional implementation complexity of segmented scan versions of existing reductions is reasonable, such instructions should be added to RVV, because they allow vectorization of new kinds of problems, while efficiently utilizing even very long vector registers.

Possible instruction proposals

While the reduction instructions take a second register source for the initial value, I'd recommend dropping this for the vector scans to conserve encoding space. Similarly, I wouldn't add separate instructions for segmented scans but rather define the masked variants of scans to work as segmented scans. AFAIK, masked scans and masked segmented scans are rare and can be implemented with additional instructions if needed.

As a minimal set of scan instructions, I would propose the following:

  • +-scan/unsigned-max-scan: vscansum.v vd, vs1, vscanmaxu.v vd, vs1
  • segmented +-scan/unsigned-max-scan: vscansum.v vd, vs1, vm, vscanmaxu.v vd, vs1, vm
  • segmented viota: viotar.m, the "r" standing for resetting, which describes the behavior: it counts up until a mask bit is set, at which point it resets to zero and continues (see the usage sketch below)
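A minimal usage sketch of the proposed instructions (register choice, the use of a0/a1, and the mask already sitting in v0 are assumptions for illustration):

vsetvli    t0, a0, e32, m4, ta, ma
vle32.v    v8, (a1)
vscansum.v v12, v8        # proposed: inclusive prefix sum of the elements
vscansum.v v16, v8, v0.t  # proposed: segmented variant, the sum restarts at every set bit of v0
viotar.m   v20, v0        # proposed: like viota.m, but the count resets to zero at each set mask bit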

A full set of scan instructions corresponding to existing reductions would include the following additions:

  • unsigned min-scan: vscanminu.v vd, vs1
  • signed min/max-scan: vscanmin.v vd, vs1, vscanmax.v vd, vs1
  • floating-point +-scan: vfscanusum.v vd, vs1
  • including segmented versions: vscanminu.v vd, vs1, vm, vscanmin.v vd, vs1, vm, vscanmax.v vd, vs1, vm, vfscanusum.v vd, vs1, vm

It could also include element-wise and/or/xor scans, but I don't see a need for these bitwise scans.

and-scan and or-scan already exist on mask registers, where they are actually useful:

  • vmsbf(~m) is an and-scan on mask-bits
  • vmsif(m) is an or-scan on mask-bits

Coincidentally, the next section will argue for a xor-scan on mask bits.

xor-scan/prefix-xor for masks

This operation is often used in SIMD parsing to match delimiters and is equivalent to carryless multiply with -1.

Usage: simdjson, simdzone, simdcsv

It can be emulated using the following instruction sequence:

# (xor_scan(00100100)>>1) == 00011100, it's offset by one
vsetvli t0, x0, e8, m1, ta, mu # requires `EMUL=LMUL*(8/SEW)`
viota.m  v1, v0
vand.vi  v1, v1, 1
vmseq.vi v0, v1, 1

The above is three LMUL=1 instructions when LMUL*(8/SEW)≤1, but you may need to change the surrounding code to accommodate the offset when porting existing code.

The following variant will likely be faster when LMUL*(8/SEW)>2:

# xor_scan(00100100) == 00011100
li a1, -1
vsetvli t0, x0, e64, m1, ta, mu
vclmul.vx v1, v0, a1 # 64-bit parallel-prefix XOR within each element
vmslt.vx  v0, v1, zero # top bit set
viota.m   v2, v0 # parallel-prefix XOR of top bits
vand.vi   v2, v2, 1
vmseq.vi  v0, v2, 1
vnot.v    v1, v1, v0.t # apply flip

Expected RThroughput for SEW=8 from SiFive P670 llvm-mca model:

| method                | m1 | m2 | m4 | m8 |
|-----------------------|----|----|----|----|
| dedicated instruction | 1  | 1  | 1  | 1  |
| viota                 | 3  | 4  | 8  | 16 |
| vclmul+viota          | 6  | 6  | 6  | 6  |

Vector length-specific code could have a specialized VLEN≤512 path that uses a single vclmul.vx to do the prefix-xor at LMUL=1 in a single instruction.

The current situation isn't bad; however, a dedicated instruction comes almost for free if your processor already implements vclmul.

Because it's so cheap to implement, uses minimal opcode space, and would encourage VLA code, it would be advantageous to add this as a dedicated instruction.

Notes on naming:

While vmxorscan would be a possible name, it doesn't quite fit with the existing mask instructions:

  • or-scan is vmsif(m) "set-including-first mask bit"
  • and-scan is vmsbf(~m) "set-before-first mask bit"

To harmonize the names, vmsxff for "set-xor-from-first mask bit" might be a better alternative.

I'm not sure if they are needed, but they would be nice to have

Mask and element reversal

Reversing elements and mask-bits isn't a common operation, and there are already performance-portable ways of implementing them for application processors:

element-reverse:
vrgather.vv v0, v0, v1 # v1 previously initialized to vl-vid
# To reverse LMUL>1, vrgather at LMUL=1 separately and use moves to swap LMUL=1
# vector registers

mask-reverse:
vbrev.v v0, v0
vrgather.vv v0, v0, v1 # v1 previously initialized to vl-vid

However, there are a few caveats that make dedicated instructions attractive.

  • Firstly, both of these reversals assume vl=vlmax and require an additional vslidedown when vl<vlmax.
  • Secondly, neither approach works for SEW=8 with more than 256 elements; that requires vrgatherei16 with 16-bit indices instead, which often performs worse.
  • Thirdly, using vrgather to reverse elements will perform well on sane application class processors, but on other targets SEW=8 vrgather performs significantly worse than SEW=64 vrgather. In that case, you should probably combine a SEW=64 vrgather with an additional vbrev8 when reversing a SEW=8 vector.

While I don't think this is a very common operation, a dedicated reverse instruction would make code using it easier to write and potentially faster.

Both element and mask-reversal instructions would not take up a lot of encoding space, as they have only one source and one destination register.

Dedicated within 128-bit lane shuffle / 16x8b LUT

For string algorithms, a fast 16x8b LUT is non-negotiable, and a lot of other integer workloads also require it or, more generally, fast shuffles within 128-bit lanes. This is currently addressed by an LMUL=1 vrgather.vv; however, scaling that to larger vector lengths seems impractical, as complexity grows quadratically. If you have the silicon budget for a 256-bit-lane vrgather.vv primitive but a vector length of 1024, then a naive hardware implementation would need to apply the primitive (1024/256)^2 = 16 times.
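For context, this is the current 16x8b LUT idiom (a sketch; the table is assumed to sit in the first 16 bytes of v16, and the pointer/AVL registers are arbitrary):

vsetvli     t0, a0, e8, m1, ta, ma
vle8.v      v8, (a1)      # input bytes
vand.vi     v8, v8, 15    # keep the low nibble as the LUT index (0..15)
vrgather.vv v9, v16, v8   # indices 0..15 only ever read the first 16 bytes of v16
vse8.v      v9, (a2)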

We see this impacting existing implementations. The SiFive X280 has VLEN=512 and a native 256-bit data path, but vrgather.vv has vl-cycle occupancy for LMUL ≥ 1 and only a special-cased ~1-cycle occupancy for LMUL ≤ 1/2, i.e. when the register group fits the data path width. (See riscvarchive/riscv-v-spec#929.)

There is a silver lining, however: an implementation could detect when all indices in a lane point to the same source lane and only apply the primitive once for that lane. This would improve the above example to 4 applications of the primitive for a lot of common cases, including the 16x8b LUT, any in-lane permutation, element reversals, ...

I'd imagine that high-performance out-of-order cores would have trouble implementing this optimization, but I don't think it's as relevant for those cores. They are unlikely to use very long vectors and will have the budget to implement a full LMUL=1 vrgather.vv primitive. See existing x86 and ARM cores.

On the other hand, the high-performance out-of-order P470 and P670 cores from SiFive seem to implement this optimization for LMUL>1 vrgathers, where this doesn't even matter as much because software can just unroll into multiple LMUL=1 vrgathers if there is no lane crossing.

Nonetheless, we see in existing AVX512 implementations that the latency of an in-lane gather is smaller than that of a full lane-crossing gather:

| latency for zmm | icelake | tigerlake | rocketlake | alderlake-P | Zen4 |
|-----------------|---------|-----------|------------|-------------|------|
| vpshufb         | 1       | 1         | 1          | 1           | 2    |
| vpermb          | 3       | 3         | 3          | 3           | 6    |

Additionally, while SVE originally didn't include 128-bit lane permute, SVE2.1 recently added it as TBLQ.

A separate instruction may simplify implementations, but it would also be an incentive to neglect the LMUL=1 vrgather.vv implementation, which is also an important operation. This should be mostly mitigated because the base V extension only has the full lane-crossing-gather, hence a lot of software will use lane-crossing operations and force vendors to implement it competitively.

So I'm not sure if a separate instruction is needed, but it seems like it would help simplify implementations and enable more performance portable RVV code.

Possible instruction proposals

A) gather within 128-bit lanes, indices are in the corresponding lane of the index vector register

  • pros
    • allows for different shuffles per lane, this seems to be a rare use case however
    • index lanes are next to data lanes, so there may be less data movement in certain architectures.
  • cons
    • requires an additional 128-bit element broadcast instruction to be more easily useful, such an instruction does have general utility though
    • increases register pressure, as a LMUL>1 gather would require an LMUL>1 index register

B) gather within 128-bit lanes, indices are in the lowest 128-bit of the index register, potentially always as 8-bit

  • pros:
    • no 128-bit broadcast needed, thus easier programming model
    • only uses a LMUL=1 index register
    • with 8-bit indices, shuffles for e16 and above can be loaded from a single 64-bit GPR (.vx variant possible, although the semantics would be weird for e8)
  • cons:
    • Use case of different shuffles per lane not supported
    • may require more data movement and increase latency compared to indices in the corresponding lanes

I don't think shuffles from an immediate (like vpshufd) are needed to reduce vector register pressure, since most other instructions have immediate and GPR variants. It would also require an 8-bit immediate, which would take a lot of encoding space. As such, this would only make sense as a destructive operation using the 5-bit immediate plus an additional 5 bits from the rs1 encoding. That would use two more bits than necessary, which could either be repurposed as additional opcode space or directly encode the EEW.

Ternary logic

AVX512 introduced the very powerful vpternlogd/vpternlogq instructions, which use an 8-bit immediate to select an arbitrary ternary logic operation.

They can express the following bitwise operations in a single instruction:

  • C xor A xor B
  • (A orn B) and C
  • C ? A : B bitwise merge
  • C ? A nor B : A xor B exactly one bit set
  • C ? A xor B : A and B exactly two bits set
  • Used in MD5: C ? A : B, A xor B xor C, A xor (B orn C)
  • Used in SHA-1: C ? A : B, A xor B xor C, C ? A or B : A and B
  • Used in SHA-2: C ? A : B, C ? A or B : A and B

RVV defines all possible binary operations for mask registers (and, andn, nand, nor, or, orn, xnor, xor), but only and, or, xor, andn for vector elements. The binary operations on vector elements do have a GPR and immediate variant, which can be quite useful.

While a fully generic ternary logic instruction would be a great design point, I don't think it would fit into the current ISA very well, as the existing logic instructions could've been defined on top of it.

Using the existing RVV binary operations, it takes up to four instructions to simulate what vpternlog can express in a single instruction. That isn't too bad, but adding a ternary conditional select instruction (C ? A : B) would bring it down to three or fewer instructions, with a maximum dependency chain length of two instructions.
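For reference, the bitwise merge C ? A : B itself currently takes three instructions and a dependency chain of two (a sketch; A in v8, B in v9, C in v10, and vandn.vv from Zvbb computes vs2 & ~vs1):

vand.vv  v12, v8, v10   #  A &  C
vandn.vv v13, v9, v10   #  B & ~C
vor.vv   v12, v12, v13  # (A & C) | (B & ~C)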

I think it would be reasonable to add the missing binary operations for vector elements, as well as a ternary conditional selection (C ? A : B), which should also have a variant for mask operations. The missing binary operations don't really need an immediate variant, and I'd be happy with just defining vector-vector variants to preserve opcode space.

I've placed this under "not sure if needed", because I'm not sure how much real world difference additional instructions would make. I'm also unsure how to quantify the impact of adding a three source instruction to the mask unit, which previously only had two source operand instructions.

Probably not needed

Obviously, this won't cover operations for which instructions are already directly available. Instead, it covers instructions present in other ISAs that I don't see as needed.

all-to-all comparisons

Both AVX512 and SVE provide instructions that do all-to-all element comparisons for finding and counting equal elements.

AVX512 has vp2intersect and vpconflict, which both do an all-to-all comparison, with 32 and 64-bit variants. SVE has MATCH and HISTSEG for 8 and 16-bit elements, which do an all-to-all comparison within every 128-bit lane, and HISTCNT which does a lane-crossing all-to-all comparison for 32 and 64-bit elements.

Historically, vpconflictd and vp2intersect were famously quite slow on Intel:

| Latency/RThroughput    | all intel, icelake to alderlake | Zen4   | Zen5  |
|------------------------|---------------------------------|--------|-------|
| vpconflictd ymm, ymm   | 17/9                            | 6/0.67 | 6.9/1 |
| vpconflictd zmm, zmm   | 26/19.5                         | 7/1.33 | 6.9/1 |
| vp2intersectd ymm, ymm | 27/11 (tiger/alderlake only)    | 6/0.67 | ?/1   |
| vp2intersectd zmm, zmm | 42/23 (tiger/alderlake only)    | ---    | ?/1   |

The performance of vp2intersect on Intel was so bad that a faster-than-native alternative software implementation was found. Intel deprecated VP2INTERSECT from 2023 onward, while AMD re-introduced it on Zen5 with a very performant implementation.

The SVE instructions don't have this problem, presumably because the vector lengths are all just 128-bits.

| Latency/RThroughput | Cortex A510 | Cortex A520 | Neoverse N2 | Neoverse V2/V3 | Cortex A710 | Cortex X2/X3/X4 | Cortex A715/A720 |
|---------------------|-------------|-------------|-------------|----------------|-------------|-----------------|------------------|
| MATCH               | 7/4         | 7/4         | 2/1         | 2/1            | 2/1         | 2/1             | 2/0.5            |
| HISTSEG             | 8/5         | 8/2.5       | 2/0.5       | 2/0.25         | 2/0.5       | 2/0.25          | 2/0.5            |

I could find almost no usage of the SVE instructions, while vpconflict and vp2intersect were slightly more common.

If RVV were to add such an instruction, it would need to follow SVE in restricting it to within lanes; otherwise, implementations beyond a certain VLEN are just not reasonably possible, due to the quadratic growth of the needed comparisons. This restriction makes the instructions a lot less nice to use and, as I'll argue in the following, is not needed to suitably address their most common use cases.

Histogram

By far the most common application of these instructions is histogram computation/data binning. This involves filling a vector with indices of buckets that need to be incremented, detecting conflicts, resolving them, and finally gathering, incrementing, and scattering the values to the buckets.

Intel advertised vpconflict explicitly for allowing their compiler to auto-vectorize histogram computation. Currently, only vendor compilers are capable of this and only when the arguments are marked as restrict.

The biggest benefit of such vectorization isn't the histogram increment itself, but rather the vectorization of the rest of the calculation. Auto-vectorizing compilers could just vectorize the loop up until the histogram increment, at which point they could iterate over the vector register entries and increment the histogram using scalar code.

Furthermore, using vpconflict doesn't seem beneficial for histograms when measured (Zen5 results, benchmark code); it is even outperformed by scalar code, extremely so on AMD. This was not the case on current RVV implementations, presumably because they have very weak in-order scalar execution compared to the tested x86 processors. The out-of-order execution resources of today's processors seem to already do a great job at executing loads/stores in parallel while avoiding conflicts.

There are many ways to efficiently vectorize histogram creation without a dedicated conflict detection instruction, which cover almost all use cases:

  • Small histograms can benefit greatly by hand-tuning for the specific size: hand-tuned byte histogram without vpconflict, see the "aptroot" line in the graph below
  • On X86 and presumably, other highly aggressive out-of-order cores, using scalar code to increment the histogram values is faster than using memory gather/scatter.
  • The histogram is small enough to replicate it VL times, forgoing the need for conflict detection entirely.
  • The histogram is too large to replicate, consequently, conflicts are rare. In this case, you can quickly check if a conflict happened by doing an additional gather after the scatter write-back of incremented values to check if the difference between the original values is exactly equal to vl. If a conflict occurs, then conflicting buckets will only be incremented once, and you'll be able to tell that there weren't vl increments as expected.
  • Finally, conflict detection can always be implemented with N/2 iterations of existing instructions: https://riscv.godbolt.org/z/rEY6oM6ne

(graph: byte histogram benchmark, including the hand-tuned "aptroot" implementation)

Due to the above alternative implementation strategies covering the histogram use case almost entirely, I don't see it as a good motivation for a dedicated instruction.

Set intersection

The other slightly bigger use case for these all-to-all comparison instructions is computing set intersections and other set operations that follow from that.

I could find two libraries that use these instructions to implement set intersection: tetzank/SIMDSetOperations and ashvardanian/SimSIMD. However, I couldn't find any external code using their set intersection functions.

Let's look at some benchmarks comparing implementations using the dedicated instructions with ones that aren't using them:

  • SimSIMD set intersection:
    • The SVE HISTCNT and MATCH implementations were slightly slower than the equivalent NEON variant on Neoverse V2, which has a 128-bit SVE implementation.
    • The AVX512 implementation with vp2intersect was on average only 5% faster than the AVX512 shuffle variant on Zen5, although the results varied a lot. The biggest speedup of 1.60x occurred when intersecting 128 with 8292 16-bit elements, while the biggest regression was 1.25x slower when intersecting 1024 with 1024 16-bit elements.
  • SIMDSetOperations set intersection:
    • On Skylake-X: The vpconflict implementation was 1.6x slower than the AVX2 shuffle implementation.
    • Zen5: The vp2intersect implementation was 1.5x faster than the AVX2 shuffle implementation, while the vpconflict one was 1.2x slower.
    • Pretending we have a fast vp2intersect on other processors: Since I don't have access to a CPU with a fast vp2intersect, I modified the vp2intersect implementation to use AVX2 and replaced the vp2intersect instruction with a sequence of instructions that has the same latency as vp2intersect has on Zen5, i.e. 7 cycles. Since AVX2 doesn't have a compressed store, I replaced it with a regular store. This produces the wrong result, but should give an indication of the expected performance if a fast vp2intersect instruction were available.
      • Skylake-X: fake-vp2intersect was ~1.35x faster
      • Zen1: fake-vp2intersect was ~1.1x faster
      • Coffee Lake: fake-vp2intersect was ~1.3x faster

That's not a large improvement for an instruction of quadratic implementation complexity.

The SIMDSetOperations benchmark intersected two large, equal-sized sets with ~1/3 intersecting elements. Often this is not the case, and one set is a lot larger than the other, e.g. in keyword searches and graph analytics. (source)

For intersecting very large sets with smaller ones, there are alternative algorithms that don't benefit much from such instructions. This paper on fast SIMD set intersection proposes precomputing a bloom filter-like structure to represent the large set. It then uses specialized NxM intersection kernels to resolve matches more efficiently:

(figure: FESIA set intersection approach)

This approach could be expanded by batching multiple NxM intersections with the same dimensions and resolving them using a VL/K MxN intersection kernel. For instance, the following code could resolve VLEN/64 2x4 intersections of 32-bit elements at once:

# inputs: LMUL=1 v2 # ...BAbaBAba
#         LMUL=2 v4 # ...DCBAdcbaDCBAdcba
# output: v0 # intersecting elements
vsetvli x0, a0, e32, m1
vtrn1.vv v6, v2, v2 # ...AAaaAAaa
vtrn2.vv v8, v2, v2 # ...BBbbBBbb
vwzip.vv v10, v6, v6  # ...AAAAaaaaAAAAaaaa
vwzip.vv v12, v8, v8  # ...BBBBbbbbBBBBbbbb
vsetvli x0, a1, e32, m2
vmseq.vv v0, v4, v10
vmseq.vv v1, v4, v12
vmor.mm v2, v0, v1
vcompress.vm v0, v4, v2
# Note: I'm using vtrn/vwzip as proposed earlier.
#       It could also be implemented with two vrgathers instead,
#       but for that you'd want to create the indices outside of the main loop.

This approach allows you to get almost full utilization of your vector length.

A dedicated full-register intersection instruction wouldn't be that beneficial, because the sub-intersections are usually small. What could be beneficial is a set of dedicated instructions for some of the larger or harder-to-implement kernels, e.g. for multiple 3x4, 3x5, 4x5, ... intersections.

other uses

The other uses I could identify were:

  • sparse vector dot product, which is functionally equivalent to set intersection
  • run-length encoding, which can be solved more efficiently with segmented scans, see above
  • a handful of uses where vpconflict was directly followed by a memory gather/scatter, but I wasn't able to clearly tell whether these are histograms in disguise. Still, in these cases the memory accesses are likely going to be the bottleneck.

The implementation cost of dedicated intersection/conflict-detection instructions is very high, and the benefits aren't clear. Since the most common applications can be efficiently vectorized using other methods, I don't consider these good candidate instructions for general-purpose application class processors.

If vendors see a need for dedicated instructions similar to the ones discussed above for specific applications, e.g. database applications, then it's certainly worth considering. Such applications would probably benefit from more specialized instructions that match the exact usage of the application. In the intersection algorithm described above, that may be many instructions that do multiple small intersections of different sizes in parallel.

Please share feedback and contribute suggestions

Feel free to share feedback and suggest additions and alternative implementations.

Advanced SIMD instructions usage survey notes

@nibrunie

Shouldn't vwmaccu.vx(vwaddu.vi(a, b), -1, b) be vwmaccu.vx(vwaddu.vv(a, b), -1, b) ?

@nibrunie

for the bit-compress, I don't think a vrgather is actually required. what about

  • vadc.vvm to expand the bit mask to byte (with both vs2 and vs1 initialized to zero)
  • vcompress.vm to extract set bits (I agree this requires a way larger LMUL)
  • vmseq.vi, to convert back from byte to bit mask

But maybe this is what you hinted to in the "a large series of shifts and bit operations".

@camel-cdr (Author)

@nibrunie Yes it should've used the .vv variant, thanks.

For bit-compress/expand I was thinking about vmerge.vim + vrgather.vv + vmseq.vi. For bit-compress you could use a vcompress.vm instead of the vrgather.vv, but bit-expand would still need vrgather.vv

With "a large series of shifts and bit operations" I meant that you may be able to replace a specific compression/expansion mask with a series of bit operations.

I've adjusted the text a bit.

@drom commented Nov 11, 2024

@camel-cdr coming from a DSP workload (SDR-like) perspective. Would you like to discuss fixed-point implementations with rounding and saturation?

@camel-cdr (Author)

Hi @drom,
In theory yes, however I don't have any experience in that regard.
Somewhat related, Courmisch mentioned that "signed to unsigned narrowing clip" is currently a pain point.

@ubc-guy commented Dec 7, 2024

the mask slides are an interesting idea.

(1) left/right shifting doesn't make sense, because these are ordered by element number not powers of 2; instead, I recommend using the terms up/down to match the direction of regular vector element slides.

(2) wouldn't the most common usage be to align the mask shift with a regular vector element slide? maybe the regular slides need a variant that also shifts the mask at the same time by the same amount. the question remains: what bit value (0 or 1) to use for the newly introduced bits? this would depend upon how that mask was originally generated, and likely would need to align with the new values being shifted in. [vslide1up/vslide1down introduce a new value from rs1, either from X or F registers; vslideup leaves elements at the bottom unchanged; vslidedown copies values from the source vector as if it were a longer vector until it hits VLMAX, then it copies 0s.] for vslideup/vslidedown, the mask bits would likely follow the full element policy (and introduce 0s on slidedown when going past VLMAX). for vslide1up/1down, a new value is being introduced, so it matters how the mask itself was originally generated -- it is not clear that the originally proposed constant 0/1 is the right answer here either.

@camel-cdr (Author) commented Dec 7, 2024

@ubc-guy I've added a line at the top to verify my identity.

(1) left/right shifting doesn't make sense, because these are ordered by element number not powers of 2; instead, I recommend using the terms up/down to match the direction of regular vector element slides.

Sounds good, I've adjusted that.

(2) wouldn't the most common usage be to align the mask shift with a regular vector element slide?

I haven't seen that use case yet.

The most common case I've seen is, if you need to check if one or more values follow each other.
This comes up all the time in parsing, e.g. for handling escape characters, or to identify the starts of contiguous set bits (andn(m, m<<1)).
It could also be used to propagate carry within a vector register, for bigint calculations within a single vector register.

I don't think there is enough value in an additional source register to determine the bit to shift in, to justify the huge increase in used opcode space.
A fast way to insert and extract the first and last bits to and from GPRs might be a better alternative.

This currently requires a four instruction sequence + vtype change (see the image in my mastodon post).


As a side note: RVV needs good support for mask operations to be competitive with contemporary fixed length ISAs like AVX512 in VLA code.

Fixed length ISAs can move their mask registers to GPRs and benefit from the immensely optimized scalar instruction execution. On modern processors this means you effectively have 4-6 issue mask operations.
Contrast that with prominent RVV implementations (going with public information):

  • SiFive P470, P670 and P870 all have only a single mask unit, so presumably single issue mask instructions (confirmed for P470/P670 through llvm scheduler model), at most two issue on the P870 as it has two vector execution units
  • Tenstorrent Ascalon: has two vector execution units, so at most dual issue mask instructions
  • Ventana Veyron V2: one mask unit, so presumably single issue mask instructions

1-2 issue mask instructions on RVV implementations vs 4-6 on other ISAs... Why is RVV still competitive? Because of LMUL!

LMUL amortizes scalar and mask instructions, because masks are stored in a single LMUL=1 vector register and mask instructions usually execute in constant time regardless of LMUL.
Most problems can run at LMUL=4 or LMUL=2, so with just two issue for mask instructions, you get the equivalent of 4-8 issue on most workloads.

This has one big requirement, though: Namely that mask operations need to stay within mask registers.

This is why dedicated mask slide instructions are so important.
If it were just LMUL=1, then sliding the source register (or overlapping loads) and doing an additional comparison would work and be fast for most use cases. But then your mask operations wouldn't be competitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment