@ekmett
Created October 25, 2025 17:34
shuffle comment

Those, frankly, are a very poor subset of shuffles.
The shuffle* primops exposed by GHC (for example shuffleFloatX4#) take runtime vector arguments that specify shuffle indices dynamically.
They correspond to register-controlled SIMD shuffle operations on x86.
Unfortunately, those are not efficient variants — they’re just the ones that cleanly map onto our current primop representation, and they often don't even exist on older CPUs.

Let’s take a dive through Intel’s “shuffle-like” operations.
The first group uses compile-time immediates to select shuffle positions, so these are the ones we’d want to reach eventually, provided the constants involved are known at compile time.

| Instruction | Width | Control | Description |
|---|---|---|---|
| PSHUFD xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle 4 × 32-bit lanes arbitrarily |
| PSHUFLW xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle low 4 × 16-bit words |
| PSHUFHW xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle high 4 × 16-bit words |
| SHUFPS xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle packed single-precision floats |
| SHUFPD xmm, xmm/m128, imm8 | 128 bit | imm8 | Shuffle packed doubles |
| UNPCKHPS, UNPCKLPS, UNPCKHPD, UNPCKLPD | 128 bit | fixed pattern | Fixed half-lane interleave (no control) |
| PALIGNR xmm, xmm/m128, imm8 | 128 bit | imm8 | Byte-align shift within concatenated registers |
| VPERMILPS / VPERMILPD (AVX/AVX2) | 128 / 256 bit | imm8 | Permute within 128-bit lanes (constant) |
| VPERM2F128 / VPERM2I128 | 256 bit | imm8 | Select 128-bit halves between two registers |
| VSHUFF32x4 / VSHUFF64x2 | 512 bit | imm8 | Choose 128-bit lanes within a 512-bit vector |
| VPERMPS / VPERMPD (immediate form) | AVX-512 only (EVEX.imm8) | imm8 | Fixed lane selection pattern |
| VPERMT2PS (immediate form) | 512 bit | imm8 | As above |

Apart from the UNPCK* family, which uses a fixed pattern, these all take an imm8 argument, which must be a compile-time constant.
They typically issue as a single µop, often on port 5 (and sometimes also port 1 on newer Intel cores), with 1–3 cycles of latency and a throughput of one per cycle.
They can issue early because the control mask is known at compile time.
Unfortunately, GHC’s current shuffle* primops cannot represent this class of instruction directly.

On the other hand, the runtime-controlled shuffles — the ones GHC exposes today — look like this:

| Instruction | Width | Control | Description |
|---|---|---|---|
| PSHUFB xmm, xmm/m128 | 128 bit | vector | Byte-wise permute (SSSE3) |
| VPSHUFB (AVX2) | 256 bit | vector | Same for 256-bit, per 128-bit lane |
| VPERMILPS ymm, ymm, ymm | 256 bit | vector | Permute 32-bit floats within 128-bit lanes, runtime indices |
| VPERMPS ymm, ymm, ymm | 256 bit | vector | General 32-bit permute with runtime indices |
| VPERMD ymm, ymm, ymm | 256 bit | vector | Same as above for integers |
| VPERMQ, VPERMPD (register-controlled) | 256 bit | vector | Permute 64-bit lanes, runtime-controlled |
| VPERMT2PS / VPERMT2PD / VPERMT2Q / VPERMT2D | 512 bit | vector | Ternary permute (mask + control) |
| VPERMB, VPERMW | 512 bit | vector | AVX-512 byte/word permutes |
| VPSHUFBITQMB | 512 bit | vector | AVX-512 bitwise mask permute |

These take vector register operands as control masks — the shuffle pattern is decided at runtime.
They are far more flexible but significantly heavier in execution:

  • Usually 2–4 µops (vs 1 for imm8 forms).
  • 3–6 cycle latency, depending on width and core generation (per uops.info).
  • Execute on ports 0 + 5 (and occasionally port 1) for most AVX2/AVX-512 designs.
  • Some are restricted to operate within 128-bit lanes (e.g. VPSHUFB) and need multiple µops for cross-lane permutations.
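
To make the "within 128-bit lanes" restriction concrete, here is a rough reference model of the 256-bit VPSHUFB on plain lists of bytes. It is only a sketch: the names `chunksOf`, `pshufbLane` and `vpshufbModel` are made up for illustration, the inputs are assumed to be exactly 32 bytes each, and it models the result only, not the cost.

```haskell
import Data.Bits (testBit, (.&.))
import Data.Word (Word8)

-- | Split a list into consecutive chunks of n elements (hypothetical helper).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- | PSHUFB on a single 16-byte lane: each control byte either zeroes the
-- destination byte (bit 7 set) or selects a source byte using its low 4 bits.
pshufbLane :: [Word8] -> [Word8] -> [Word8]
pshufbLane src ctrl = map pick ctrl
  where
    pick c
      | testBit c 7 = 0
      | otherwise   = src !! fromIntegral (c .&. 15)

-- | 256-bit VPSHUFB, modelled as two independent 16-byte lanes.
-- Assumes both inputs are exactly 32 bytes long.
vpshufbModel :: [Word8] -> [Word8] -> [Word8]
vpshufbModel src ctrl =
  concat (zipWith pshufbLane (chunksOf 16 src) (chunksOf 16 ctrl))
```

Because only the low 4 bits of each control byte are used, a control byte in the upper lane can never name a byte from the lower lane, which is why a full cross-lane byte permute needs more than one instruction on AVX2.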

The most general (and most recent and slowest) of these are the forms that GHC’s shuffle* primops are currently able to describe.

To efficiently target the fast imm8-based shuffles, we’d need a way to pass compile-time constants through to the primop layer so that the backend could emit the corresponding immediate form.
We don’t have that capability in GHC today.

@sheaf

sheaf commented Oct 27, 2025

As I mentioned on reddit, this is not correct: the indices for the shuffle operations currently available in GHC are expected to be compile-time constants.

myShuf :: Int32X4# -> Int32X4#
myShuf u = shuffleInt32X4# u u (# 3#, 2#, 1#, 0# #)

With -ddump-asm we see:

myShuf_info:
	pshufd $27,%xmm1,%xmm0
	movdqu %xmm0,%xmm1
	jmp *(%rbp)

This indeed results in a PSHUFD instruction with an immediate to control the permutation.
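
For readers wondering where the `$27` comes from, the following sketch packs four lane selectors into a PSHUFD-style immediate. The function name `pshufdImm8` is hypothetical, and the mapping from the unboxed tuple to lanes (first component selects destination lane 0) is an assumption, though it agrees with the dump above.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- | Pack four 2-bit lane selectors into a PSHUFD-style imm8.
-- The selector for destination lane k occupies bits 2k and 2k+1.
-- Assumption: the first tuple component is the selector for lane 0.
pshufdImm8 :: (Int, Int, Int, Int) -> Word8
pshufdImm8 (i0, i1, i2, i3) =
  fromIntegral
    (   sel i0
    .|. (sel i1 `shiftL` 2)
    .|. (sel i2 `shiftL` 4)
    .|. (sel i3 `shiftL` 6)
    )
  where
    -- Keep only the low two bits of each selector.
    sel i = i .&. 3
```

With the indices from `myShuf`, `pshufdImm8 (3, 2, 1, 0)` evaluates to 27 (3 + 2·4 + 1·16 + 0·64), matching the immediate in the assembly.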

Trying to use run-time values fails:

rejected :: Int# -> Int32X4# -> Int32X4#
rejected i u = shuffleInt32X4# u u (# i, 2#, 1#, 0# #)
error:
    Vector shuffle: shuffle indices must be literals, 0 <= i < 8
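
The `0 <= i < 8` bound in that message suggests that each index selects from the concatenation of the two input vectors (4 + 4 elements). A minimal list-based model of that reading, with the hypothetical name `shuffleModel` and no claim to mirror GHC's actual implementation, might look like this:

```haskell
-- | Reference model of a two-source shuffle on plain lists.
-- Each index picks an element from the concatenation of both inputs;
-- any out-of-range index makes the whole shuffle fail with Nothing.
-- Assumption: this mirrors the intended primop semantics.
shuffleModel :: [a] -> [a] -> [Int] -> Maybe [a]
shuffleModel u v = traverse pick
  where
    both = u ++ v
    pick i
      | i >= 0 && i < length both = Just (both !! i)
      | otherwise                 = Nothing
```

Under this model, `shuffleModel [10, 20, 30, 40] [10, 20, 30, 40] [3, 2, 1, 0]` gives `Just [40, 30, 20, 10]`, the lane reversal that `myShuf` performs, and an index of 8 is rejected.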

@sheaf

sheaf commented Oct 27, 2025

> To efficiently target the fast imm8-based shuffles, we’d need a way to pass compile-time constants through to the primop layer so that the backend could emit the corresponding immediate form.
> We don’t have that capability in GHC today.

... but that is exactly what we do today.
