Here is an overview of which ISA's support these instructions I compiled for an upcoming presentation:
- BESM Soviet mainframes:
- x86-64:
- arm64 / aarch64:
- PowerPC:
- RISC-V:
- Zbe => BDEP & BEXT, but did not make it into ratified B instruction set
As @Eisenwave said, x86-64 performs these operations on general purpose registers, whereas ARM does these in vector registers. PowerPC also uses vector registers. The jury is still out on RISC-V, so I'm not sure what the future holds for them.
I learned from reading through the RISC-V B instruction set discussions that apparently the 'P' in 'PDEP' and 'PEXT', meaning parallel, actually refers to the internal implementation of those operations. Which, I think most people should agree, is not useful to a consumer. I heard people mention it could be implemented on a butterfly network or even more directly than that, it doesn't have to be a butterfly network operation. All this to say, I don't think PDEP & PEXT are good names. BDEP and BEXT sound better to me, although BEXT often refers to extracting a single bit I think. Some names like BitScatter/BitGather or BCOMPRESS/BEXPAND were floated, as analogous to the vector instructions that operate at the byte/word/dword/quad granularity (VPEXPAND{B,W,D,Q} & VPCOMPRESS{B,W,D,Q} on x86, XXGENPCVM + VPERM on PowerPC, VIOTA+VRGATHER & VCOMPRESS on RISC-V, lookup-table-or-BDEP-routine+TBL/TBX & COMPACT on ARM, SPLICE can sometimes be useful on ARM as well).
- succinct "select", i.e. find k'th set bit in a bitstring
- unset k 1 bits
- on x86, one day, I imagine that under certain circumstances a compiler would want to convert some vector logic into SWAR logic, to overcome the latency of moving between vector and general purpose registers. If that ever were to occur, PDEP could fill in code like this:
(I also think I could have used a per-element vector shift, but in context, I used
const byte_indices = comptime switch (builtin.cpu.arch.endian()) { .little => @as(@Vector(8, u8), @splat(1)) << std.simd.iota(u3, 8), .big => @as(@Vector(8, u8), @splat(0x80)) >> std.simd.iota(u3, 8), }; const splatted = @as(@Vector(8, u8), @splat(x)); const selector = (splatted & byte_indices) != byte_indices; const splatted_lsbs: u64 = @bitCast(@select(u8, selector, @splat(1), @splat(0))); // all of the above is equivalent to: pdep(x, 0x0101010101010101)
selector
again for something else, so it made sense in my real code)
- bit-reversing a byte
- concentrating bits to the least-significant bit position that are not contiguous
I also have a little bit of ongoing work on some pext-emulation for when the mask is known at compile time. It's still pretty rudimentary, but it's better than nothing.
https://gist.github.com/Validark/40d2df74b87692fe135bbeac14eed50d
I have a few emulation routines, and have plans to add at least one more. It checks the number of bitgroups, and if sufficiently low, it might lower to regular shifts. I also have one where we use multiplication to concentrate the bits into the upper 64-bits, and then move it back to the LSb. I plan on adding another multiplication routine that would use an extra addition to try to move bits into locations where they are far enough away from other bits that a multiplication would be sufficient (if it was not before). E.g. pext(x, 0x8040201008040201)
would become (((x & 0x8040201008040201) + (0x8080808080808080 - 0x8040201008040201)) * 0x0002040810204081) >> 56
More: