shuffle comment

Those, frankly, are a very poor subset of shuffles.
The shuffle* primops exposed by GHC (for example shuffleFloatX4#) take runtime vector arguments that specify shuffle indices dynamically.
They correspond to register-controlled SIMD shuffle operations on x86.
Unfortunately, those are not efficient variants — they’re just the ones that cleanly map onto our current primop representation, and they often don't even exist on older CPUs.

Let’s take a tour through Intel’s “shuffle-like” operations.
The first group uses compile-time immediates to select shuffle positions; these are the ones we’d want to reach eventually, provided the constants involved are known at compile time.

| Instruction | Width | Control | Description |
| --- | --- | --- | --- |
| `PSHUFD xmm, xmm/m128, imm8` | 128 bit | imm8 | Shuffle 4 × 32-bit lanes arbitrarily |
| `PSHUFLW xmm, xmm/m128, imm8` | 128 bit | imm8 | Shuffle low 4 × 16-bit words |
| `PSHUFHW xmm, xmm/m128, imm8` | 128 bit | imm8 | Shuffle high 4 × 16-bit words |
| `SHUFPS xmm, xmm/m128, imm8` | 128 bit | imm8 | Shuffle packed single-precision floats |
| `SHUFPD xmm, xmm/m128, imm8` | 128 bit | imm8 | Shuffle packed doubles |
| `UNPCKHPS`, `UNPCKLPS`, `UNPCKHPD`, `UNPCKLPD` | 128 bit | fixed pattern | Fixed half-lane interleave (no control) |
| `PALIGNR xmm, xmm/m128, imm8` | 128 bit | imm8 | Byte-wise right shift of two concatenated registers |
| `VPERMILPS` / `VPERMILPD` (AVX) | 128 / 256 bit | imm8 | Permute within 128-bit lanes (constant) |
| `VPERM2F128` (AVX) / `VPERM2I128` (AVX2) | 256 bit | imm8 | Select 128-bit halves from two registers |
| `VSHUFF32x4` / `VSHUFF64x2` | 512 bit | imm8 | Choose 128-bit lanes from two 512-bit sources |
| `VPERMPD` / `VPERMQ` (immediate form) | 256 bit (AVX2), 512 bit (AVX-512) | imm8 | Permute 64-bit elements within each 256-bit half, fixed pattern |

Apart from the fixed-pattern unpacks, these all take an imm8 argument, which must be a compile-time constant.
They typically decode to a single µop that executes on port 5 (and sometimes also port 1 on newer Intel cores), with 1–3 cycles of latency and a throughput of one per cycle.
Because the control is baked into the instruction encoding, they have no dependency on a control register and can issue as soon as their data operands are ready.
Unfortunately, GHC’s current shuffle* primops cannot represent this class of instruction directly.
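
To make the imm8 encoding concrete, here is a minimal scalar sketch in C of how `PSHUFD` interprets its immediate. The names `pshufd_model` and `pack_imm8` are made up for illustration; this models the documented semantics in plain C and is not how any compiler actually emits the instruction.

```c
#include <stdint.h>

/* Scalar model of PSHUFD: each 2-bit field of imm8 picks the source lane
   for one destination lane (bits 1:0 -> lane 0, ..., bits 7:6 -> lane 3). */
void pshufd_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t tmp[4]; /* temporary so that dst may alias src */
    for (int lane = 0; lane < 4; ++lane) {
        tmp[lane] = src[(imm8 >> (2 * lane)) & 3u];
    }
    for (int lane = 0; lane < 4; ++lane) {
        dst[lane] = tmp[lane];
    }
}

/* Hypothetical helper: pack four lane indices (each 0..3) into an imm8. */
uint8_t pack_imm8(unsigned i0, unsigned i1, unsigned i2, unsigned i3)
{
    return (uint8_t)((i0 & 3u) | ((i1 & 3u) << 2) |
                     ((i2 & 3u) << 4) | ((i3 & 3u) << 6));
}
```

For instance, `pack_imm8(3, 2, 1, 0)` yields `0x1B`, the familiar "reverse the four lanes" immediate. The point is that all four indices have to be known when the instruction is encoded.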

On the other hand, the runtime-controlled shuffles — the ones GHC exposes today — look like this:

| Instruction | Width | Control | Description |
| --- | --- | --- | --- |
| `PSHUFB xmm, xmm/m128` | 128 bit | vector | Byte-wise permute (SSSE3) |
| `VPSHUFB` (AVX2) | 256 bit | vector | Same for 256-bit, per 128-bit lane |
| `VPERMILPS ymm, ymm, ymm` | 256 bit | vector | Permute 32-bit floats within 128-bit lanes, runtime indices |
| `VPERMPS ymm, ymm, ymm` | 256 bit | vector | General 32-bit permute with runtime indices |
| `VPERMD ymm, ymm, ymm` | 256 bit | vector | Same as above for integers |
| `VPERMQ`, `VPERMPD` (register-controlled, AVX-512) | 256 / 512 bit | vector | Permute 64-bit lanes, runtime-controlled |
| `VPERMT2PS` / `VPERMT2PD` / `VPERMT2Q` / `VPERMT2D` | 512 bit | vector | Two-source permute: an index vector selects from two table registers |
| `VPERMB`, `VPERMW` | 512 bit | vector | AVX-512 byte/word permutes |
| `VPSHUFBITQMB` | 512 bit | vector | AVX-512 bit gather from each 64-bit lane into a mask register |

These take vector register operands as control masks — the shuffle pattern is decided at runtime.
They are far more flexible but significantly heavier in execution:

- Usually 2–4 µops (vs 1 for imm8 forms).
- 3–6 cycle latency, depending on width and core generation (per uops.info).
- Execute on ports 0 + 5 (and occasionally port 1) for most AVX2/AVX-512 designs.
- Some are restricted to operate within 128-bit lanes (e.g. VPSHUFB) and need multiple µops for cross-lane permutations.
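
The in-lane restriction is easiest to see in a scalar model. The sketch below, in plain C with made-up function names, mimics the documented behaviour of `PSHUFB` and of the 256-bit `VPSHUFB`; it illustrates the semantics only and says nothing about performance.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the 128-bit PSHUFB: each control byte selects one source
   byte via its low 4 bits, or zeroes the result byte if bit 7 is set. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctrl[16])
{
    uint8_t tmp[16]; /* temporary so that dst may alias src */
    for (size_t i = 0; i < 16; ++i) {
        tmp[i] = (ctrl[i] & 0x80u) ? 0u : src[ctrl[i] & 0x0Fu];
    }
    for (size_t i = 0; i < 16; ++i) {
        dst[i] = tmp[i];
    }
}

/* The 256-bit VPSHUFB acts like two independent 128-bit PSHUFBs:
   an index can only reach bytes inside its own 128-bit lane. */
void vpshufb256_model(uint8_t dst[32], const uint8_t src[32], const uint8_t ctrl[32])
{
    pshufb_model(dst, src, ctrl);
    pshufb_model(dst + 16, src + 16, ctrl + 16);
}
```

Since only the low 4 bits of each control byte act as an index, a byte in the upper lane has no way to name a byte in the lower lane. That is why a true cross-lane byte permute needs extra instructions on AVX2.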

The most general (and most recent and slowest) of these are the forms that GHC’s shuffle* primops are currently able to describe.

To efficiently target the fast imm8-based shuffles,
we’d need a way to pass compile-time constants through to the primop layer so that the backend could emit the corresponding immediate form.
We don’t have that capability in GHC today.
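
As a rough sketch of what such a lowering step might look like, assume a two-source shuffle whose indices address the concatenation of both operands (0..3 for the first, 4..7 for the second). The helper `try_encode_shufps` below is hypothetical, not part of GHC or of any existing backend; it only checks whether a single `SHUFPS` can realise the pattern and, if so, computes the immediate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: idx[k] indexes the concatenation of two 4-lane
   vectors (0..3 = first operand, 4..7 = second operand). Returns true and
   writes the SHUFPS immediate when one SHUFPS can realise the shuffle. */
bool try_encode_shufps(const int idx[4], uint8_t *imm8)
{
    /* SHUFPS takes result lanes 0-1 from the first operand and lanes 2-3
       from the second operand, so only such patterns qualify. */
    bool low_ok  = idx[0] >= 0 && idx[0] < 4 && idx[1] >= 0 && idx[1] < 4;
    bool high_ok = idx[2] >= 4 && idx[2] < 8 && idx[3] >= 4 && idx[3] < 8;
    if (!low_ok || !high_ok) {
        return false; /* caller would fall back to another instruction */
    }
    *imm8 = (uint8_t)(idx[0] | (idx[1] << 2) |
                      ((idx[2] - 4) << 4) | ((idx[3] - 4) << 6));
    return true;
}
```

A check like this can only run if the four indices are literals at the point where instructions are selected, which is exactly the capability discussed above.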

Sources:

- Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2024-10 Edition.
- Agner Fog, “Instruction Tables”, v2024.07.15.
- uops.info latency/throughput data.
- GHC.Prim documentation.

ekmett/shuffles.md

sheaf commented Oct 27, 2025
