RISC-V Vector Extension for Integer Workloads: An Informal Gap Analysis

@camel-cdr, last active March 23, 2025
Note: To verify my RVI membership and identity on this otherwise semi-anonymous account: I'm Olaf Bernstein. If you are a member of the vector SIG, you should be able to view my sig-vector profile.

The goal of this document is to explore gaps in the current RISC-V Vector extensions (standard V, Zvbb, Zvbc, Zvkg, Zvkn, Zvks), and suggest instructions to fill these gaps. My focus lies on application class processors, with the expectation that suggested instructions would be suitable to become mandatory or optional instructions in future profiles.

I'll assume you are already familiar with RVV, if not, here is a great introduction and here the latest RISC-V ISA manual.

To determine which applications aren't currently suitably covered by the RISC-V Vector extensions (RVV), I collected usage examples of the SIMD instructions present in AVX512 and SVE that don't have an RVV equivalent. You can find my notes at the bottom of this article; I'll use them as a source of truth when arguing for or against the inclusion of certain instructions.

Please be aware that I'm not a hardware designer, so take anything I say with a grain of salt and feel free to correct my mistakes/wrong assumptions.

Proposal TLDR

  • vmslide1up.m, vmslide1down.m: slide mask up/down by one
  • vtrn1.vv, vtrn2.vv: 2x2 transpose instructions
  • vwzip.vv: zip/interleave two vectors
  • unzip.vv or vnsrl defined for e64
  • vbcompress, vbexpand: element-wise bit-compress/expand
  • vbmatxor: 8x8 bit-matrix multiply, x86 calls it vgf2p8affineqb
  • vscansum.v vd, vs1 / vscanmaxu.v vd, vs1: +-scan / max-scan
  • vscansum.v vd, vs1, vm / vscanmaxu.v vd, vs1, vm: segmented +-scan / unsigned-max-scan
  • viotar.m: segmented viota ("r" for resetting)
  • vmsxff.m: xor-scan on mask bits ("set-xor-from-first mask bit")

Should be added to RVV

I believe the following instructions should be added to RVV, as they can solve problems that currently don't have a good solution.

Slides for mask registers

Sliding mask registers is a very common operation that is currently not well addressed by RVV, because dedicated instructions are missing.

In many cases, it's possible to use element slides instead of sliding just the mask register. This is however a fundamentally more complex operation and performs a lot worse in practice, especially when the relevant vectors use LMUL>1. It also often requires you to duplicate comparisons or other operations done before the element slide.

Mask slides can be emulated directly, but this is expensive to do portably:

# slide up mask in v0 by 1-bit
vsetvli t1, zero, e8, MF8, ta, ma # MF8 needs to be MX/8
vslide1up.vx v1, v0, a0 # slide in next
vsrl.vi v1, v1, 7  # v1 = v1 >> 7, carry bit from the previous byte
vadd.vv v2, v0, v0 # v2 = v0 << 1
vor.vv  v0, v1, v2 # v0 = (v0[i-1] >> 7) | (v0[i] << 1)
vsetvli zero, t0, e8, MX, ta, ma

When measured, these 6 instructions still perform better than sliding the elements and repeating the comparison: SpacemiT X60, XuanTie C908, Saturn

If you write vector length specific (VLS) code and know you are working with 64 elements or fewer, then you only need a single vsll.vi, plus a vtype change.

The current situation is a mess with no good solutions, especially for vector length agnostic (VLA) code.

Most use-cases of mask slides could be addressed with just two one-source, one-destination instructions: vmslide1up.m and vmslide1down.m, where vmslide1up.m slides the mask register bits up by one and vmslide1down.m slides them down by one.

I don't think a GPR-based slide variant is particularly needed, and slides further than one or two are also quite rare. A dedicated slide-by-2-bit or slide-by-3-bit immediate instruction would, however, have the added benefit of allowing you to slide an entire LMUL=1 vector register by an immediate number of bits, which is sometimes useful for slides smaller than 8 bits. If this doesn't add much additional complexity, a full register-slide-bits-by-small-immediate instruction would also be an option.
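As a sketch of how the proposed instruction would be used, the following marks the start of every run of set bits in v0, the andn(m, m << 1) idiom that shows up in parsing (this assumes the proposed vmslide1up.m shifts in a zero bit; register numbers are arbitrary):

# mark run starts in mask v0 (sketch, vmslide1up.m is the proposed instruction)
vmslide1up.m v1, v0     # v1 = v0 slid up by one bit, assuming a zero bit is slid in
vmandn.mm    v2, v0, v1 # v2 = v0 & ~v1, set only where a run of ones begins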

Element transpose (vtrn1/vtrn2) and zip (vwzip) instructions

RVV already has a very powerful permutation instruction with vrgather and vcompress, however some permutations are so common that dedicated instructions are required, as they can have lower latency and higher throughput.

The need for dedicated instructions for these very common permutations has already been discussed at length by RISE in the context of video codecs, see: "RISCV64 new vector instructions requirements for video/multimedia" and "Vector Operations Proposals - Draft Document".

Here I'd just like to highlight a few additional aspects:

  • The implementation complexity of vtrn1/vtrn2 is similar to that of bitwise operations, so huge performance gains are possible
  • vwzip can already be implemented with vwmaccu.vx(vwaddu.vv(a, b), -1, b), as suggested by Craig Topper. That is two EMUL=2 instructions vs two EMUL=1 instructions if we had a dedicated instruction, which would give us performance parity with existing ISAs.
  • I prefer a vwzip over the vzip2a/vzip2b proposed by RISE, because it works better with vl<VLMAX and fractional LMUL. Using vzip2a/vzip2b would require an additional vslideup when vl<VLMAX or for fractional LMUL, unless you want the result to be split up between two vector registers.
  • unzip1/unzip2 is vnsrl.wi/wx, which already works for everything but 64-bit elements; those currently require a vcompress or vrgather to unzip. Instead of a dedicated unzip instruction, it may be preferable to define vnsrl at SEW=64 (EEW=128), which would result in a more versatile instruction. (Both emulations are sketched below.)
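For reference, here is a sketch of both emulations using existing instructions (register numbers and the use of a0 as AVL are arbitrary; a is in v8, b in v9, and v4 holds the interleaved pairs viewed as 64-bit elements for the unzip):

vsetvli    t0, a0, e32, m1, ta, ma
# zip via Craig Topper's trick:
vwaddu.vv  v12, v8, v9  # v12 = zext(a) + zext(b), EMUL=2 result
li         t1, -1
vwmaccu.vx v12, t1, v9  # v12 += (2^SEW - 1)*b, so each 64-bit element is (b << 32) | a
# unzip via the narrowing shift over 64-bit pairs (not possible for SEW=64):
li         t1, 32
vnsrl.wi   v8, v4, 0    # even elements (unzip1)
vnsrl.wx   v9, v4, t1   # odd elements  (unzip2)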

Missing bit permutations

With the Zvbb ("Vector Basic Bit-manipulation") extension, RVV gained a set of basic bit-manipulation instructions, as the name suggests. This notably includes byte-reverse, bit-reverse, count leading/trailing zeros, popcount and bit-rotates.

RVV is however still missing some advanced bit permutation instructions.

Bit compress and expand (vbcompress / vbexpand)

Some of the most powerful bit manipulation instructions present in today's ISAs are the bit-compress and bit-expand instructions.

  • bit-compress takes two n-bit arguments and moves the bit of the first argument corresponding to the nth set bit of the second argument to the nth position of the destination: bcompress(0bHGFEDCBA, 0b11110010) = 0b000HGFEB
  • bit-expand takes two n-bit arguments and moves the nth bit of the first argument to the position of the nth set bit of the second one: bdecompress(0b000HGFEB, 0b11110010) = 0bHGFE00B0

Implementing bit-compress/expand with existing RVV instructions is really expensive. You either have to use a large series of shifts and bit operations to shift the bits into place, or convert each bit into a byte and do the vcompress.vm/vrgather on 8-bit elements with an 8 times larger LMUL.

x86 implements these instructions as pext/pdep under BMI2, but they only operate on GPRs, while ARM SVE only defines the bext/bdep instructions which operate on elements of vector registers.

Among others, they can be used for: bit interleaving/packing; base64 de/encoding; LEB128 de/encoding, which is used in DWARF, Dalvik, protobuf and WebAssembly; regex matching; genomics; various board games; space-filling curves; quantum circuit simulation; ...

Here is an example to illustrate how these instructions could be used to decode and encode base64:

base64 decode:                                    | base64 encode:
vbrev8     [00dddddd|00cccccc|00bbbbbb|00aaaaaa]  | vrgather [........|ccdddddd|bbbbcccc|aaaaaabb]
vbcompress [00aaaaaa|00bbbbbb|00cccccc|00dddddd]  | vbexpand [........|aaaaaabb|bbbbcccc|ccdddddd]
vrgather   [........|aaaaaabb|bbbbcccc|ccdddddd]  | vbrev8   [00aaaaaa|00bbbbbb|00cccccc|00dddddd]
           [........|ccdddddd|bbbbcccc|aaaaaabb]  |          [00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bit compress and expand were the two instructions with the most real-world usage among the advanced SIMD instructions surveyed. There is just one problem: historically, some implementations really struggled to implement these instructions well:

|             | all intel, haswell to alderlake | Zen1/Zen2          | Zen3/Zen4 | Zen5 | Cortex A510/A520 | Neoverse N2/V2/V3, Cortex A710/X2/X3/X4 | Cortex A715/A720 |
|-------------|---------------------------------|--------------------|-----------|------|------------------|-----------------------------------------|------------------|
| Latency     | 3                               | 8 μops per set bit | 3         | 3    | 14-70            | 6                                       | 4                |
| RThroughput | 1                               | 8 μops per set bit | 1         | 0.33 | 14-70            | 2                                       | 2                |

The first two generations of Zen used a microcoded implementation with 8 μops per set bit; the Cortex-A510/A520 implementations are similarly bad. On the other hand, all Intel processors since Haswell have managed to implement it with a three-cycle latency and a throughput of 1 instruction per cycle. The other ARM cores also have solid implementations, with a 4-6 cycle latency and a throughput of one instruction every two cycles.

Fortunately, multiple papers give more details on the implementation challenges.

They seem to suggest that the implementation complexity is reasonable, although decoding the mask into control signals is disproportionately expensive.

These instructions should be added to RVV, since they are very widely applicable, can help accelerate working with very common data formats and the costs seem reasonable.

Several ways to reduce the implementation cost can be explored if feedback from implementors suggests that the implementation cost is too high:

  • Since the mask operand is often a fixed constant, we could decide to only add vbcompress.vx/vbexpand.vx variants, such that only a single mask needs to be decoded.
  • There could be separate extensions for SEW≤32 and SEW≤64 support, as a lot of the interleaving use cases can operate on fewer than 64 bits.

Aside: No need for Sheep-and-goats

The Sheep-and-goats (SAG) operation, or bgrp as it's called in SVE, groups bits to the right or left as selected by a bitmask ((bcompress(x, ~m) << popc(m)) | bcompress(x, m)). It can perform an arbitrary bit permutation in ceil(log2(nbits)) chained SAG operations.

While this is quite powerful, it's very rarely used and can already be done by combining two vbcompress instructions with a single vslideup instruction.

I would only consider this worth adding if a hardware implementation of vbcompress and vbexpand gets this operation basically for free. This doesn't seem to be the case; here, adding SAG increases the area by 50% over just implementing vbcompress/vbexpand.

Going with this 50% area increase, I'd rather hardware designers use these resources to make vbcompress/vbexpand faster than spend them on a dedicated SAG instruction.

Bit-matrix operations (vbmatxor / vgf2p8affineqb / MXOR)

This instruction is present in Knuth's MMIX as MXOR and modern x86 processors with support for GFNI as VGF2P8AFFINEQB. It was also considered for the RISC-V scalar bit manipulation extensions but didn't end up in the final specification.

The instruction interprets a 64-bit number as an 8x8 bit matrix and multiplies two such 8x8 bit-matrices together, replacing addition with XOR and multiplication with AND:

Assume that the 64 bits in $Y are numbered as follows (and similar for the bits in register $Z and $X): Y₀₀,Y₀₁...Y₀₇,Y₁₀,Y₁₁...Y₁₇,...,Y₇₀,Y₇₁...Y₇₇

Now bit Xᵢⱼ in register $X is computed as follows: MXOR: Xᵢⱼ = Y₀ⱼ & Zᵢ₀ ^ Y₁ⱼ & Zᵢ₁ ^ ... ^ Y₇ⱼ & Zᵢ₇

This might not immediately sound useful, but it has a lot of unexpected uses.

Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of bits from the source byte (Why Ice Lake is Important (a bit-basher's perspective)).

Although creating the constants needed for your permutation is non-trivial, the x86 vgf2p8affineqb instruction still sees a lot of real-world use.

Prominent examples include: Arbitrary Modular GF(2ʷ) Multiplication, RAID error correction, PAR2, SM4, emulating missing 8-bit AVX instructions, sub-byte arithmetic, very fast SIMD byte histogram, abs(x) >> shift in av1 decode, 8x8 and 16x16 bit matrix transpose, indices to set bits

It can be used to emulate a bunch of 8-bit instructions that already exist in RVV, which may seem redundant. However, if this instruction were added to RVV, it should operate on EEW=64, EMUL=LMUL, which would allow doing those 8-bit operations without a vtype change, removing up to two vsetvlis from certain code paths.
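A sketch of how the proposed .vx variant might be used; the exact operand order (which source supplies which matrix) is left open here, and the choice of t1/a0 is arbitrary:

# apply the same 8x8 bit-matrix (e.g. an arbitrary per-byte bit permutation)
# to every byte of v8; the 64-bit matrix constant is in t1
vsetvli     x0, a0, e64, m2, ta, ma # a0 counts 64-bit elements here
vbmatxor.vx v8, v8, t1              # proposed: per 64-bit element bit-matrix multiply with the broadcast constant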

The hardware implementation cost seems manageable.

Claire Wolf shared the following on the isa-dev mailing list:

The HW cost for bmat[x]or is not as big as one might expect. My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.

"Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications" also covers this instruction and their single cycle implementation takes 2.4K NAND gates.

Existing vgf2p8affineqb implementations corroborate this:

| vgf2p8affineqb ymm, ymm, ymm, i8 | icelake | tigerlake | rocketlake | alderlake-E | alderlake-P | Zen4 |
|----------------------------------|---------|-----------|------------|-------------|-------------|------|
| Latency                          | 5       | 5         | 5          | ?           | 5           | 3    |
| RThroughput                      | 0.5     | 0.5       | 0.5        | 0.5         | 0.5         | 0.5  |

Due to the versatility and reasonable hardware cost, I suggest adding the vbmatxor.vv and vbmatxor.vx instructions. The .vx variant is included, because most uses of vgf2p8affineqb broadcast a 64-bit constant to perform the same 8-bit permutation on all bytes.

Scan operations with segmented variants

Scan operations, also called prefix operations, are reductions that write back the intermediate results, effectively computing all partial reductions.

RVV already has a few "hidden" scan instructions that operate on mask registers:

  • vmsbf(~m) is an and-scan on mask-bits
  • vmsif(m) is an or-scan on mask-bits
  • viota(m) is a +-scan on mask-bits that is widened to SEW

The case for segmented scans

Scan operations on vector elements, and their segmented variants, are important missing instructions, because they allow you to efficiently parallelize an entirely new class of problems. Segmented scans scan multiple segments of elements separately, as encoded by a mask: a set bit in the mask indicates that a new segment starts at that position. Implementing this with existing RVV instructions is very expensive.

Segmented scans allow you to efficiently process multiple variable-length sequences of input in parallel. This is best illustrated by an example.

Consider vectorizing run-length encoding, that is, turning an input of bytes into pairs of bytes encoding a repetition count and a byte value to repeat. With the classical SIMD/vector model, you can only process one "run" at a time, even though a vector register might contain multiple runs. While this isn't a big problem for smaller vector lengths, it can be a huge source of inefficiency when dealing with longer vectors. Using a segmented max-scan allows you to process all the runs in a vector register independently:

RLE using max-scan:
0: 8 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 | v0 as input
1: 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 9 | v1 = slide1up(v0, v0[0]+1)
2: 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 | m0 = v0 != v1     __
3: 0 e 0 0 b 0 9 8 0 0 0 0 3 2 0 0 | v2 = m0 ? vid : 0   \
4: e e b b b 9 9 8 3 3 3 3 3 2 0 0 | v3 = max-scan(v2,0) | does seg-viota(m0)
5: 1 0 2 1 0 1 0 0 4 3 2 1 0 0 1 0 | v4 = vid - v3     __/
6: 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 | m1 = m0 >> 1
7: 7 8 7 8 7 8 | val     = compress(v0, m1)
8: 2 1 0 4 0 1 | repeats = compress(v4, m1)

Notice how the above code uses a max-scan to implement a segmented viota. This works for both seg-viota and seg-+-scan:

seg-+-scan:
 6  5  4  3  2  1 | v0
 1  0  1  0  0  1 | m0                      |   seg-viota:
 0  1  0  1  0  0 | m1 = m0 >> 1            |   0 0 1 0 0 1 1 0 0 0 0 1 | m
21 15 10  6  3  1 | v1 = +-scan(v0)         |   0 0 9 0 0 6 5 0 0 0 0 0 | v1 = m ? vid : 0
 0  0  0  0 15  6 | v2 = vcompress(m1,v1)   |   9 9 9 6 6 6 5 0 0 0 0 0 | v2 = max-scan(v1,0)
 0  0  0 15  6  0 | v3 = vslide1up(v2,0)    |   2 1 0 2 1 0 0 4 3 2 1 0 | seg-viota = vid - v2
 2  1  1  0  0  0 | v4 = viota(m1)
15  6  6  0  0  0 | v5 = vrgather(v3,v4)
 6  9  4  6  3  1 | seg-+-scan(v0,m0) = v1 - v5

AVX512 can also decode multiple runs in parallel using the more complex vpconflict instruction from the AVX512CD extension:

0: v0 as input
1: cfss    = _mm512_conflict_epi32(v0)
2: lzcnt   = _mm512_lzcnt_epi32(cfss)
3: m       = _mm512_movepi32_mask(cfss)
4: scfss   = _mm512_sllv_epi32(cfss, lzcnt)
5: nscfss  = _mm512_andnot_epi32(scfss, ~0)
6: rl      = _mm512_lzcnt_epi32(nscfss)
7: repeats = _mm512_maskz_compress_epi32(m, rl)
8: val     = _mm512_maskz_compress_epi32(m, v0)

This takes the same number of instructions as the max-scan variant above, but only works for 32 and 64-bit elements.

Falvuy noted that a slightly different storage format can be vectorized to handle multiple runs at once using the element compress instruction, as described in this paper.

An example that AFAIK can't be efficiently implemented in AVX512 is parsing a series of integer numbers, in this case up to 16-bit, in parallel. The best x86 implementation I know of uses a permutation lookup table with 2¹⁶ entries to distribute elements of a 128-bit vector into power-of-two-sized chunks, such that they can be summed using multiple horizontal adjacent pair addition instructions.

Using seg-+-scan this becomes possible for arbitrary VLEN, here is a simplified implementation:

   _    5    6    7    8    _    5    6    _    5    6    7    _  | v0 = vsub(input, '0')
   _    7    6    5    _    6    5    _    8    7    6    5    _  | v1 = vrgather(v0, reverse)
   1    0    0    0    1    0    0    1    0    0    0    0    1  | m0 = vmsgt(v1, 9)
   0    3    2    1    0    2    1    0    4    3    2    1    0  | v2 = seg-viota(m0)
   0  100   10    1    0   10    1    0 1000  100   10    1    0  | v3 = vrgather(pow10tbl, v2)
   0  700   60    5    0   60    5    0 8000  700   60    5    0  | v4 = vwmul.wv(v3, v1)
   0  765   65    5    0   65    5    0 8765  765   65    5    0  | v5 = seg-+-scan(m0, v4)
   0    1    0    0    0    1    0    0    1    0    0    0    0  | m1 = m0 >> 1
   765 65 8765 | v6 = vcompress(v5, m1)

This implementation directly widens the elements to 16 bits. An optimized implementation should instead combine and widen adjacent digit pairs (A*10+B) in one step, using a seg2 load and vwmul.

There are a lot of problems that benefit from scans and segmented scans: Line-of-Sight Problem, prefix sum, average word length, sparse matrix-vector multiplication (SpMV), sum of values from neighbors in a graph, Minimum Spanning Tree, Maximum Flow, ...

Note that seg-copy-scan or segment-broadcast used in some of the linked problems is just trivially compress + viota + vrgather.

"Vector Models for Data-Parallel Computing" covers even more use cases, if one were to include segmented split (element-wise Sheep-and-goats), which is probably too expensive to implement in practice. Segmented split can be implemented in <15 instructions using segmented viota and a lot more than that without scan instructions.

To reiterate, segmented scans allow you to efficiently vectorize new kinds of problems and are especially important to get good utilization on larger vector registers.

New instructions are only useful if they are implementable at reasonable speed and implementation cost. I'd like to argue that implementing scan counterparts to existing reduction instructions would not provide a significant hardware challenge.

The first half of the work-efficient parallel scan implementation already matches the reduction tree used in reductions. The second half is just a mirrored version of the up-sweep with offset inputs and outputs.

(figure: work-efficient parallel scan)

Segmented variants are conceptually a bit more complicated, but don't require much additional hardware. You need to propagate a single additional bit, from the segment starts, through the scan tree, which pins the value, such that subsequent steps don't change it. (see above)

I was not able to find any reports on hardware implementations; reports on how difficult it would be to reuse reduction circuitry to implement scans would be especially insightful.

One software implementation reported that implementing segmented variants on top of existing scan instructions is quite straightforward:

On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the un-segmented versions. (source)

Assuming additional implementation complexity of segmented scan versions of existing reductions is reasonable, such instructions should be added to RVV, because they allow vectorization of new kinds of problems, while efficiently utilizing even very long vector registers.

Possible instruction proposals

While the reduction instructions take a second register source for the initial value, I'd recommend dropping this for the vector scans to conserve encoding space. Similarly, I wouldn't add separate instructions for segmented scans but rather define the masked variants of scans to work as segmented scans. AFAIK, masked scans and masked segmented scans are rare and can be implemented with additional instructions if needed.

As a minimal set of scan instructions, I would propose the following:

  • +-scan/unsigned-max-scan: vscansum.v vd, vs1, vscanmaxu.v vd, vs1
  • segmented +-scan/unsigned-max-scan: vscansum.v vd, vs1, vm, vscanmaxu.v vd, vs1, vm
  • segmented viota: viotar.m, the "r" standing for resetting, which describes the behavior: it counts up until a mask bit is set, at which point it resets to zero and continues (see the usage sketch below)
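A minimal usage sketch of the proposed instructions (register choice, the use of a0/a1, and the mask already sitting in v0 are assumptions for illustration):

vsetvli    t0, a0, e32, m4, ta, ma
vle32.v    v8, (a1)
vscansum.v v12, v8        # proposed: inclusive prefix sum of the elements
vscansum.v v16, v8, v0.t  # proposed: segmented variant, the sum restarts at every set bit of v0
viotar.m   v20, v0        # proposed: like viota.m, but the count resets to zero at each set mask bit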

A full set of scan instructions corresponding to existing reductions would include the following additions:

  • unsigned min-scan: vscanminu.v vd, vs1
  • signed min/max-scan: vscanmin.v vd, vs1, vscanmax.v vd, vs1
  • floating-point +-scan: vfscanusum.v vd, vs1
  • including segmented versions: vscanminu.v vd, vs1, vm, vscanmin.v vd, vs1, vm, vscanmax.v vd, vs1, vm, vfscanusum.v vd, vs1, vm

It could also include element-wise and/or/xor scans, but I don't see a need for these bitwise scans.

and-scan and or-scan already exist on mask registers, where they are actually useful:

  • vmsbf(~m) is an and-scan on mask-bits
  • vmsif(m) is an or-scan on mask-bits

Coincidentally, the next section will argue for a xor-scan on mask bits.

xor-scan/prefix-xor for masks

This operation is often used in SIMD parsing to match delimiters and is equivalent to carryless multiply with -1.

Usage: simdjson, simdzone, simdcsv

It can be emulated using the following instruction sequence:

# (xor_scan(00100100)>>1) == 00011100, it's offset by one
vsetvli t0, x0, e8, m1, ta, mu # requires `EMUL=LMUL*(8/SEW)`
viota.m  v1, v0
vand.vi  v1, v1, 1
vmseq.vi v0, v1, 1

The above is three LMUL=1 instructions when LMUL*(8/SEW)≤1, but you may need to change the surrounding code to accommodate the offset when porting existing code.

The following variant will likely be faster when LMUL*(8/SEW)>2:

# xor_scan(00100100) == 00011100
li a1, -1
vsetvli t0, x0, e64, m1, ta, mu
vclmul.vx v1, v0, a1 # 64-bit parallel-prefix XOR within each element
vmslt.vx  v0, v1, zero # top bit set
viota.m   v2, v0 # parallel-prefix XOR of top bits
vand.vi   v2, v2, 1
vmseq.vi  v0, v2, 1
vnot.v    v1, v1, v0.t # apply flip

Expected RThroughput for SEW=8 from SiFive P670 llvm-mca model:

| method                | m1 | m2 | m4 | m8 |
|-----------------------|----|----|----|----|
| dedicated instruction | 1  | 1  | 1  | 1  |
| viota                 | 3  | 4  | 8  | 16 |
| vclmul+viota          | 6  | 6  | 6  | 6  |

Vector length-specific code could have a specialized VLEN≤512 path that uses a single vclmul.vx to do the prefix-xor at LMUL=1 in a single instruction.

The current situation isn't bad; however, a dedicated instruction comes almost for free if your processor already implements vclmul.

Because it's so cheap to implement, uses minimal opcode space, and would encourage VLA code, it would be advantageous to add this as a dedicated instruction.

Notes on naming:

While vmxorscan would be a possible name, it doesn't quite fit with the existing mask instructions:

  • or-scan is vmsif(m) "set-including-first mask bit"
  • and-scan is vmsbf(~m) "set-before-first mask bit"

To harmonize the names, vmsxff for "set-xor-from-first mask bit" might be a better alternative.

I'm not sure if they are needed, but they would be nice to have

Mask and element reversal

Reversing elements and mask-bits isn't a common operation, and there are already performance-portable ways of implementing them for application processors:

element-reverse:
vrgather.vv v0, v0, v1 # v1 previously initialized to vl-vid
# To reverse LMUL>1, vrgather at LMUL=1 separately and use moves to swap LMUL=1
# vector registers

mask-reverse:
vbrev.v v0, v0
vrgather.vv v0, v0, v1 # v1 previously initialized to vl-vid

However, there are a few caveats that make dedicated instructions attractive.

  • Firstly, both of these reversals assume vl=vlmax and require an additional vslidedown when vl<vlmax.
  • Secondly, neither approach works for SEW=8 with more than 256 elements; that requires vrgatherei16 with 16-bit indices instead, which often performs worse.
  • Thirdly, using vrgather to reverse elements will perform well on sane application class processors, but on other targets SEW=8 vrgather performs significantly worse than SEW=64 vrgather. In that case, you should probably combine a SEW=64 vrgather with an additional vbrev8 when reversing a SEW=8 vector.

While I don't think this is a very common operation, a dedicated reverse instruction would make code using it easier to write and potentially faster.

Both element and mask-reversal instructions would not take up a lot of encoding space, as they have only one source and one destination register.

Dedicated within 128-bit lane shuffle / 16x8b LUT

For string algorithms, a fast 16x8b LUT is non-negotiable, and a lot of other integer workloads also require it or, more generally, fast shuffles within 128-bit lanes. This is currently addressed by an LMUL=1 vrgather.vv; however, scaling that to larger vector lengths seems impractical, as complexity grows quadratically. If you have the silicon budget for a 256-bit-lane vrgather.vv primitive but a vector length of 1024, then a naive hardware implementation would need to apply the primitive (1024/256)^2 = 16 times.
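For context, this is the current 16x8b LUT idiom (a sketch; the table is assumed to sit in the first 16 bytes of v16, and the pointer/AVL registers are arbitrary):

vsetvli     t0, a0, e8, m1, ta, ma
vle8.v      v8, (a1)      # input bytes
vand.vi     v8, v8, 15    # keep the low nibble as the LUT index (0..15)
vrgather.vv v9, v16, v8   # indices 0..15 only ever read the first 16 bytes of v16
vse8.v      v9, (a2)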

We see this impacting existing implementations. The SiFive X280 has VLEN=512 and a native 256-bit data path, but vrgather.vv has vl-cycle occupancy for LMUL ≥ 1 and only a special-cased ~1-cycle occupancy for LMUL ≤ 1/2, i.e. when the register group fits the data path width. (See riscvarchive/riscv-v-spec#929.)

There is a silver lining, however: an implementation could detect when all indices in a lane point to the same source lane and only apply the primitive once for that lane. This would improve the above example to 4 applications of the primitive for a lot of common cases, including the 16x8b LUT, any in-lane permutation, element reversals, ...

I'd imagine that high-performance out-of-order cores would have trouble implementing this optimization, but I don't think it's as relevant for those cores. They are unlikely to use very long vectors and will have the budget to implement a full LMUL=1 vrgather.vv primitive. See existing x86 and ARM cores.

On the other hand, the high-performance out-of-order P470 and P670 cores from SiFive seem to implement this optimization for LMUL>1 vrgathers, where this doesn't even matter as much because software can just unroll into multiple LMUL=1 vrgathers if there is no lane crossing.

Nonetheless, we see in existing AVX512 implementations that the latency of an in-lane gather is smaller than that of a full lane-crossing gather:

| latency for zmm | icelake | tigerlake | rocketlake | alderlake-P | Zen4 |
|-----------------|---------|-----------|------------|-------------|------|
| vpshufb         | 1       | 1         | 1          | 1           | 2    |
| vpermb          | 3       | 3         | 3          | 3           | 6    |

Additionally, while SVE originally didn't include 128-bit lane permute, SVE2.1 recently added it as TBLQ.

A separate instruction may simplify implementations, but it would also be an incentive to neglect the LMUL=1 vrgather.vv implementation, which is also an important operation. This should be mostly mitigated because the base V extension only has the full lane-crossing-gather, hence a lot of software will use lane-crossing operations and force vendors to implement it competitively.

So I'm not sure if a separate instruction is needed, but it seems like it would help simplify implementations and enable more performance portable RVV code.

Possible instruction proposals

A) gather within 128-bit lanes, indices are in the corresponding lane of the index vector register

  • pros
    • allows for different shuffles per lane, this seems to be a rare use case however
    • index lanes are next to data lanes, so there may be less data movement in certain architectures.
  • cons
    • requires an additional 128-bit element broadcast instruction to be more easily useful, such an instruction does have general utility though
    • increases register pressure, as a LMUL>1 gather would require an LMUL>1 index register

B) gather within 128-bit lanes, indices are in the lowest 128-bit of the index register, potentially always as 8-bit

  • pros:
    • no 128-bit broadcast needed, thus easier programming model
    • only uses a LMUL=1 index register
    • with 8-bit indices, shuffles for e16 and above can be loaded from a single 64-bit GPR (.vx variant possible, although the semantics would be weird for e8)
  • cons:
    • Use case of different shuffles per lane not supported
    • may require more data movement and increase latency compared to indices in the corresponding lanes

I don't think shuffles from an immediate (like vpshufd) are needed to reduce vector register pressure, since most other instructions have immediate and GPR variants. It would also require an 8-bit immediate, which would take a lot of encoding space. As such, this would only make sense as a destructive operation using the 5-bit immediate plus an additional 5 bits from the rs1 encoding. That would use two more bits than necessary, which could either be repurposed as additional opcode space or directly encode the EEW.

Ternary logic

AVX512 introduced the very powerful vpternlogd/vpternlogq instructions, which use an 8-bit immediate to select an arbitrary ternary logic operation.

They can express the following bitwise operations in a single instruction:

  • C xor A xor B
  • (A orn B) and C
  • C ? A : B bitwise merge
  • C ? A nor B : A xor B exactly one bit set
  • C ? A xor B : A and B exactly two bits set
  • Used in MD5: C ? A : B, A xor B xor C, A xor (B orn C)
  • Used in SHA-1: C ? A : B, A xor B xor C, C ? A or B : A and B
  • Used in SHA-2: C ? A : B, C ? A or B : A and B

RVV defines all possible binary operations for mask registers (and, andn, nand, nor, or, orn, xnor, xor), but only and, or, xor, andn for vector elements. The binary operations on vector elements do have a GPR and immediate variant, which can be quite useful.

While a fully generic ternary logic instruction would be a great design point, I don't think it would fit into the current ISA very well, as the existing logic instructions could've been defined on top of it.

Using the existing RVV binary operations, it takes up to four instructions to simulate what vpternlog can express in a single instruction. That isn't too bad, but adding a ternary conditional select instruction (C ? A : B) would bring it down to three or fewer instructions, with a maximum dependency chain length of two instructions.
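For reference, the bitwise merge C ? A : B itself currently takes three instructions and a dependency chain of two (a sketch; A in v8, B in v9, C in v10, and vandn.vv from Zvbb computes vs2 & ~vs1):

vand.vv  v12, v8, v10   #  A &  C
vandn.vv v13, v9, v10   #  B & ~C
vor.vv   v12, v12, v13  # (A & C) | (B & ~C)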

I think it would be reasonable to add the missing binary operations for vector elements, as well as a ternary conditional selection (C ? A : B), which should also have a variant for mask operations. The missing binary operations don't really need an immediate variant, and I'd be happy with just defining vector-vector variants to preserve opcode space.

I've placed this under "not sure if needed", because I'm not sure how much real world difference additional instructions would make. I'm also unsure how to quantify the impact of adding a three source instruction to the mask unit, which previously only had two source operand instructions.

Probably not needed

Obviously, this won't cover operations for which instructions are already directly available. Instead, it covers instructions present in other ISAs that I don't see as needed.

all-to-all comparisons

Both AVX512 and SVE provide instructions that do all-to-all element comparisons for finding and counting equal elements.

AVX512 has vp2intersect and vpconflict, which both do an all-to-all comparison, with 32 and 64-bit variants. SVE has MATCH and HISTSEG for 8 and 16-bit elements, which do an all-to-all comparison within every 128-bit lane, and HISTCNT which does a lane-crossing all-to-all comparison for 32 and 64-bit elements.

Historically, vpconflictd and vp2intersect were famously quite slow on Intel:

| Latency/RThroughput    | all intel, icelake to alderlake | Zen4   | Zen5  |
|------------------------|---------------------------------|--------|-------|
| vpconflictd ymm, ymm   | 17/9                            | 6/0.67 | 6.9/1 |
| vpconflictd zmm, zmm   | 26/19.5                         | 7/1.33 | 6.9/1 |
| vp2intersectd ymm, ymm | 27/11 (tiger/alderlake only)    | 6/0.67 | ?/1   |
| vp2intersectd zmm, zmm | 42/23 (tiger/alderlake only)    | ---    | ?/1   |

The performance of vp2intersect on Intel was so bad that a faster-than-native alternative software implementation was found. Intel deprecated VP2INTERSECT from 2023 onward, while AMD re-introduced it on Zen5 with a very performant implementation.

The SVE instructions don't have this problem, presumably because the vector lengths are all just 128-bits.

| Latency/RThroughput | Cortex A510 | Cortex A520 | Neoverse N2 | Neoverse V2/V3 | Cortex A710 | Cortex X2/X3/X4 | Cortex A715/A720 |
|---------------------|-------------|-------------|-------------|----------------|-------------|-----------------|------------------|
| MATCH               | 7/4         | 7/4         | 2/1         | 2/1            | 2/1         | 2/1             | 2/0.5            |
| HISTSEG             | 8/5         | 8/2.5       | 2/0.5       | 2/0.25         | 2/0.5       | 2/0.25          | 2/0.5            |

I could find almost no usage of the SVE instructions, while vpconflict and vp2intersect were slightly more common.

If RVV were to add such an instruction, it would need to follow SVE in restricting it to within lanes; otherwise, implementations beyond a certain VLEN are just not reasonably possible, due to the quadratic growth of the needed comparisons. This restriction makes the instructions a lot less nice to use and, as I'll argue in the following, is not needed to suitably address their most common use cases.

Histogram

By far the most common application of these instructions is histogram computation/data binning. This involves filling a vector with indices of buckets that need to be incremented, detecting conflicts, resolving them, and finally gathering, incrementing, and scattering the values to the buckets.

Intel advertised vpconflict explicitly for allowing their compiler to auto-vectorize histogram computation. Currently, only vendor compilers are capable of this and only when the arguments are marked as restrict.

The biggest benefit of such vectorization isn't the histogram increment itself, but rather the vectorization of the rest of the calculation. Auto-vectorizing compilers could just vectorize the loop up until the histogram increment, at which point they could iterate over the vector register entries and increment the histogram using scalar code.

Furthermore, using vpconflict doesn't seem beneficial for histograms when measured (Zen5 results, benchmark code); it is even outperformed by scalar code, extremely so on AMD. This was not the case on current RVV implementations, presumably because they have very weak in-order scalar execution compared to the tested x86 processors. The out-of-order execution resources of today's processors seem to already do a great job at executing loads/stores in parallel while avoiding conflicts.

There are many ways to efficiently vectorize histogram creation without a dedicated conflict detection instruction, which cover almost all use cases:

  • Small histograms can benefit greatly by hand-tuning for the specific size: hand-tuned byte histogram without vpconflict, see the "aptroot" line in the graph below
  • On X86 and presumably, other highly aggressive out-of-order cores, using scalar code to increment the histogram values is faster than using memory gather/scatter.
  • The histogram is small enough to replicate it VL times, forgoing the need for conflict detection entirely.
  • The histogram is too large to replicate, consequently, conflicts are rare. In this case, you can quickly check if a conflict happened by doing an additional gather after the scatter write-back of incremented values to check if the difference between the original values is exactly equal to vl. If a conflict occurs, then conflicting buckets will only be incremented once, and you'll be able to tell that there weren't vl increments as expected.
  • Finally, conflict detection can always be implemented with N/2 iterations of existing instructions: https://riscv.godbolt.org/z/rEY6oM6ne

(graph: byte histogram benchmark, including the hand-tuned "aptroot" implementation)

Due to the above alternative implementation strategies covering the histogram use case almost entirely, I don't see it as a good motivation for a dedicated instruction.

Set intersection

The other slightly bigger use case for these all-to-all comparison instructions is computing set intersections and other set operations that follow from that.

I could find two libraries that use these instructions to implement set intersection: tetzank/SIMDSetOperations and ashvardanian/SimSIMD. However, I couldn't find any external code using their set intersection functions.

Let's look at some benchmarks comparing implementations using the dedicated instructions with ones that aren't using them:

  • SimSIMD set intersection:
    • The SVE HISTCNT and MATCH implementations were slightly slower than the equivalent NEON variant on Neoverse V2, which has a 128-bit SVE implementation.
    • The AVX512 implementation with vp2intersect was on average only 5% faster than the AVX512 shuffle variant on Zen5, although the results varied a lot. The biggest speedup of 1.60x occurred when intersecting 128 with 8292 16-bit elements, while the biggest regression was 1.25x slower when intersecting 1024 with 1024 16-bit elements.
  • SIMDSetOperations set intersection:
    • On Skylake-X: The vpconflict implementation was 1.6x slower than the AVX2 shuffle implementation.
    • Zen5: The vp2intersect implementation was 1.5x faster than the AVX2 shuffle implementation, while the vpconflict one was 1.2x slower.
    • Pretending we have a fast vp2intersect on other processors: Since I don't have access to a CPU with a fast vp2intersect, I modified the vp2intersect implementation to use AVX2 and replaced the vp2intersect instruction with a sequence of instructions that has the same latency as vp2intersect has on Zen5, i.e. 7 cycles. Since AVX2 doesn't have a compressed store, I replaced it with a regular store. This produces the wrong result, but should give an indication of the expected performance if a fast vp2intersect instruction were available.
      • Skylake-X: fake-vp2intersect was ~1.35x faster
      • Zen1: fake-vp2intersect was ~1.1x faster
      • Coffee Lake: fake-vp2intersect was ~1.3x faster

That's not a large improvement for an instruction of quadratic implementation complexity.

The SIMDSetOperations benchmark intersected two large, equal-sized sets with ~1/3 intersecting elements. Often this is not the case, and one set is a lot larger than the other, e.g. in keyword searches and graph analytics. (source)

For intersecting very large sets with smaller ones, there are alternative algorithms that don't benefit much from such instructions. This paper on fast SIMD set intersection proposes precomputing a bloom filter-like structure to represent the large set. It then uses specialized NxM intersection kernels to resolve matches more efficiently:

(figure: FESIA set intersection approach)

This approach could be expanded by batching multiple NxM intersections with the same dimensions and resolving them using a VL/K MxN intersection kernel. For instance, the following code could resolve VLEN/64 2x4 intersections of 32-bit elements at once:

# inputs: LMUL=1 v2 # ...BAbaBAba
#         LMUL=2 v4 # ...DCBAdcbaDCBAdcba
# output: v0 # intersecting elements
vsetvli x0, a0, e32, m1
vtrn1.vv v6, v2, v2 # ...AAaaAAaa
vtrn2.vv v8, v2, v2 # ...BBbbBBbb
vwzip.vv v10, v6, v6  # ...AAAAaaaaAAAAaaaa
vwzip.vv v12, v8, v8  # ...BBBBbbbbBBBBbbbb
vsetvli x0, a1, e32, m2
vmseq.vv v0, v4, v10
vmseq.vv v1, v4, v12
vmor.mm v2, v0, v1
vcompress.vm v0, v4, v2
# Note: I'm using vtrn/vwzip as proposed earlier.
#       It could also be implemented with two vrgathers instead,
#       but for that you'd want to create the indices outside of the main loop.

This approach allows you to get almost full utilization of your vector length.

A dedicated full-register intersection instruction wouldn't be that beneficial, because the sub-intersections are usually small. What could be beneficial is a set of dedicated instructions for some of the larger or harder-to-implement kernels, e.g. for multiple 3x4, 3x5, 4x5, ... intersections.

other uses

The other uses I could identify were:

  • sparse vector dot product, which is functionally equivalent to set intersection
  • run-length encoding, which can be solved more efficiently with segmented scans, see above
  • a handful of uses where vpconflict was directly followed by a memory gather/scatter, but I wasn't able to clearly tell whether these are histograms in disguise. Still, in these cases the memory accesses are likely going to be the bottleneck.

The implementation cost of dedicated intersection/conflict-detection instructions is very high, and the benefits aren't clear. Since the most common applications can be efficiently vectorized using other methods, I don't consider these good candidate instructions for general-purpose application class processors.

If vendors see a need for dedicated instructions similar to the ones discussed above for specific applications, e.g. database applications, then it's certainly worth considering. Such applications would probably benefit from more specialized instructions that match the exact usage of the application. In the intersection algorithm described above, that may be many instructions that do multiple small intersections of different sizes in parallel.

Please share feedback and contribute suggestions

Feel free to share feedback and suggest additions and alternative implementations.

Advanced SIMD instructions usage survey notes

@nibrunie

Shouldn't vwmaccu.vx(vwaddu.vi(a, b), -1, b) be vwmaccu.vx(vwaddu.vv(a, b), -1, b) ?

@nibrunie

for the bit-compress, I don't think a vrgather is actually required. what about

  • vadc.vvm to expand the bit mask to byte (with both vs2 and vs1 initialized to zero)
  • vcompress.vm to extract set bits (I agree this requires a way larger LMUL)
  • vmseq.vi, to convert back from byte to bit mask

But maybe this is what you hinted to in the "a large series of shifts and bit operations".

@camel-cdr (Author)

@nibrunie Yes it should've used the .vv variant, thanks.

For bit-compress/expand I was thinking about vmerge.vim + vrgather.vv + vmseq.vi. For bit-compress you could use a vcompress.vm instead of the vrgather.vv, but bit-expand would still need vrgather.vv

With "a large series of shifts and bit operations" I meant that you may be able to replace a specific compression/expansion mask with a series of bit operations.

I've adjusted the text a bit.

@drom commented Nov 11, 2024

@camel-cdr coming from a DSP workload (SDR-like) perspective. Would you like to discuss fixed-point implementations with rounding and saturation?

@camel-cdr (Author)

Hi @drom,
In theory yes, however I don't have any experience in that regard.
Somewhat related, Courmisch mentioned that "signed to unsigned narrowing clip" is currently a pain point.

@ubc-guy commented Dec 7, 2024

the mask slides are an interesting idea.

(1) left/right shifting doesn't make sense, because these are ordered by element number not powers of 2; instead, I recommend using the terms up/down to match the direction of regular vector element slides.

(2) wouldn't the most common usage be to align the mask shift with a regular vector element slide? maybe the regular slides need a variant that also shifts the mask at the same time by the same amount. the question remains: what bit value (0 or 1) to use for the newly introduced bits? this would depend upon how that mask was originally generated, and likely would need to align with the new values being shifted in. [vslide1up/vslide1down introduce a new value from rs1, either from X or F registers; vslideup leaves elements at the bottom unchanged; vslidedown copies values from the source vector as if it were a longer vector until it hits VLMAX, then it copies 0s.] for vslideup/vslidedown, the mask bits would likely follow the full element policy (and introduce 0s on slidedown when going past VLMAX). for vslide1up/1down, a new value is being introduced, so it matters how the mask itself was originally generated -- it is not clear that the originally proposed constant 0/1 is the right answer here either.

@camel-cdr (Author) commented Dec 7, 2024

@ubc-guy I've added a line at the top to verify my identity.

(1) left/right shifting doesn't make sense, because these are ordered by element number not powers of 2; instead, I recommend using the terms up/down to match the direction of regular vector element slides.

Sounds good, I've adjusted that.

(2) wouldn't the most common usage be to align the mask shift with a regular vector element slide?

I haven't seen that use case yet.

The most common case I've seen is, if you need to check if one or more values follow each other.
This comes up all the time in parsing, e.g. for handling escape characters, or to identify the starts of contiguous set bits (andn(m, m<<1)).
It could also be used to propagate carry within a vector register, for bigint calculations within a single vector register.

I don't think there is enough value in an additional source register to determine the bit to shift in, to justify the huge increase in used opcode space.
A fast way to insert and extract the first and last bits to and from GPRs might be a better alternative.

This currently requires a four instruction sequence + vtype change (see the image in my mastodon post).


As a side note: RVV needs good support for mask operations to be competitive with contemporary fixed length ISAs like AVX512 in VLA code.

Fixed length ISAs can move their mask registers to GPRs and benefit from the immensely optimized scalar instruction execution. On modern processors this means you effectively have 4-6 issue mask operations.
Contrast that with prominent RVV implementations (going with public information):

  • SiFive P470, P670 and P870 all have only a single mask unit, so presumably single issue mask instructions (confirmed for P470/P670 through llvm scheduler model), at most two issue on the P870 as it has two vector execution units
  • Tenstorrent Ascalon: has two vector execution units, so at most dual issue mask instructions
  • Ventana Veyron V2: one mask unit, so presumably single issue mask instructions

1-2 issue mask instructions on RVV implementations vs 4-6 on other ISAs... Why is RVV still competitive? Because of LMUL!

LMUL amortizes scalar and mask instructions, because masks are stored in a single LMUL=1 vector register and mask instructions usually execute in constant time regardless of LMUL.
Most problems can run at LMUL=4 or LMUL=2, so with just two issue for mask instructions, you get the equivalent of 4-8 issue on most workloads.

This has one big requirement, though: Namely that mask operations need to stay within mask registers.

This is why dedicated mask slide instructions are so important.
If it were just LMUL=1, then sliding the source register (or overlapping loads) and doing an additional comparison would work and be fast for most use cases. But then your mask operations wouldn't be competitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment