Those, frankly, are a very poor subset of shuffles.
The shuffle* primops exposed by GHC (for example shuffleFloatX4#) take runtime vector arguments that specify shuffle indices dynamically.
They correspond to register-controlled SIMD shuffle operations on x86.
Unfortunately, those are not efficient variants — they’re just the ones that cleanly map onto our current primop representation, and they often don't even exist on older CPUs.
Let’s take a dive through Intel’s “shuffle-like” operations.
Of these, the first group uses compile-time immediates to select shuffle positions, and thus are the ones we’d want to reach eventually, if we knew the constants involved at all.
| Instruction | Width | Control | Description |
|---|---|---|---|
| PSHUFD xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle 4 × 32-bit lanes arbitrarily |
| PSHUFLW xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle low 4 × 16-bit words |
| PSHUFHW xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle high 4 × 16-bit words |
| SHUFPS xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle packed single-precision floats |
| SHUFPD xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle packed doubles |
| UNPCKHPS, UNPCKLPS, UNPCKHPD, UNPCKLPD | 128 bit | fixed pattern | Fixed half-lane interleave (no control) |
| PALIGNR xmm, xmm/m128, imm8 | 128 bit | imm8 | Byte-align shift within concatenated registers |
| VPERMILPS / VPERMILPD (AVX/AVX2) | 128 / 256 bit | imm8 | Permute within 128-bit lanes (constant) |
| VPERM2F128 / VPERM2I128 | 256 bit | imm8 | Select 128-bit halves between two registers |
| VSHUFF32x4 / VSHUFF64x2 | 512 bit | imm8 | Choose 128-bit lanes within a 512-bit vector |
| VPERMPS / VPERMPD (immediate form) | AVX-512 only (EVEX.imm8) | imm8 | Fixed lane selection pattern |
| VPERMT2PS (immediate form) | 512 bit | imm8 | As above |
These all take an imm8 argument, which must be a compile-time constant.
They typically issue as a single µop, often execute on port 5 (and sometimes also port 1 on newer Intel cores), with a 1–3 cycle latency and 1 per-cycle throughput.
They can issue early because the control mask is known at compile time.
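To make the imm8 control concrete, here is a small pure-Haskell model of how PSHUFD interprets its immediate: four 2-bit selectors, one per destination lane, with lane 0 in the lowest bits. It is only a sketch for illustration. The names `encodeImm8` and `pshufdModel` are made up for this example and are not part of GHC or any library.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- | Pack four lane selectors (each 0..3) into a PSHUFD-style imm8.
-- The selector for destination lane i occupies bits 2i and 2i+1.
-- (Hypothetical helper, named for this example only.)
encodeImm8 :: (Int, Int, Int, Int) -> Word8
encodeImm8 (s0, s1, s2, s3) =
  fromIntegral (f s0 0 .|. f s1 2 .|. f s2 4 .|. f s3 6)
  where
    f s k = (s .&. 3) `shiftL` k

-- | Reference model of the shuffle: destination lane i receives the
-- source lane named by the i-th 2-bit field of the immediate.
pshufdModel :: Word8 -> (a, a, a, a) -> (a, a, a, a)
pshufdModel imm (x0, x1, x2, x3) = (pick 0, pick 1, pick 2, pick 3)
  where
    pick i = case (fromIntegral imm `shiftR` (2 * i)) .&. (3 :: Int) of
      0 -> x0
      1 -> x1
      2 -> x2
      _ -> x3

main :: IO ()
main = do
  let imm = encodeImm8 (3, 2, 1, 0)          -- selectors that reverse the lanes
  print imm                                  -- 27, i.e. 0x1B
  print (pshufdModel imm "abcd-tuple-demo")  -- not a tuple: see note below
    `seq` return ()
  print (pshufdModel imm ('a', 'b', 'c', 'd'))
```

The whole permutation fits in one byte that is baked into the instruction encoding, which is why the indices have to be known when the code is generated.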
Unfortunately, GHC’s current shuffle* primops cannot represent this class of instruction directly.
On the other hand, the runtime-controlled shuffles — the ones GHC exposes today — look like this:
| Instruction | Width | Control | Description |
|---|---|---|---|
| PSHUFB xmm, xmm/m128 | 128 bit | vector | Byte-wise permute (SSSE3) |
| VPSHUFB (AVX2) | 256 bit | vector | Same for 256-bit, per 128-bit lane |
| VPERMILPS ymm, ymm, ymm | 256 bit | vector | Permute 32-bit floats within 128-bit lanes, runtime indices |
| VPERMPS ymm, ymm, ymm | 256 bit | vector | General 32-bit permute with runtime indices |
| VPERMD ymm, ymm, ymm | 256 bit | vector | Same as above for integers |
| VPERMQ, VPERMPD (register-controlled) | 256 bit | vector | Permute 64-bit lanes, runtime-controlled |
| VPERMT2PS / VPERMT2PD / VPERMT2Q / VPERMT2D | 512 bit | vector | Ternary permute (mask + control) |
| VPERMB, VPERMW | 512 bit | vector | AVX-512 byte/word permutes |
| VPSHUFBITQMB | 512 bit | vector | AVX-512 bitwise mask permute |
These take vector register operands as control masks — the shuffle pattern is decided at runtime.
They are far more flexible but significantly heavier in execution:
- Usually 2–4 µops (vs 1 for imm8 forms).
- 3–6 cycle latency, depending on width and core generation (per uops.info).
- Execute on ports 0 + 5 (and occasionally port 1) for most AVX2/AVX-512 designs.
- Some are restricted to operate within 128-bit lanes (e.g. VPSHUFB) and need multiple µops for cross-lane permutations.
The most general (and most recent and slowest) of these are the forms that GHC’s shuffle* primops are currently able to describe.
To efficiently target the fast imm8-based shuffles, we’d need a way to pass compile-time constants through to the primop layer so that the backend could emit the corresponding immediate form.
We don’t have that capability in GHC today.
Sources:
- Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2024-10 Edition.
- Agner Fog, “Instruction Tables” v2024.07.15
- uops.info latency/throughput data.
- GHC.Prim documentation.
As I mentioned on reddit, this is not correct: the indices for the shuffle operations currently available in GHC are expected to be compile-time constants.
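As a minimal sketch of what that looks like in source, the indices are written as Int# literals in an unboxed tuple. This assumes a GHC recent enough to ship the shuffle primops (9.12, as far as I know) and an x86-64 target; depending on the setup, extra flags or the LLVM backend may be needed. The wrapper `reverseLanes` is a made-up name for this example.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts

-- | Reverse the four lanes of a vector built from ordinary Floats.
-- The shuffle indices are literals, so they are known at compile time.
-- Indices 0..3 pick from the first vector argument and 4..7 from the
-- second; both arguments are the same vector here.
-- (Hypothetical wrapper, named for this example only.)
reverseLanes :: (Float, Float, Float, Float) -> (Float, Float, Float, Float)
reverseLanes (F# a, F# b, F# c, F# d) =
  case packFloatX4# (# a, b, c, d #) of
    v -> case shuffleFloatX4# v v (# 3#, 2#, 1#, 0# #) of
      r -> case unpackFloatX4# r of
        (# w, x, y, z #) -> (F# w, F# x, F# y, F# z)

main :: IO ()
main = print (reverseLanes (1, 2, 3, 4))
```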
With -ddump-asm we see:
This indeed results in a PSHUFD instruction with an immediate to control the permutation.
Trying to use run-time values fails: