Very much half-written draft of prefix sum post
---
layout: post
title: "Prefix sum on Vulkan"
date: 2020-04-21 11:29:42 -0800
categories: gpu
---

In this blog post are some initial explorations into implementing prefix sum on recent versions of Vulkan. I have a rough first draft implementation which suggests that Vulkan is a viable platform for this work, but considerably more performance tuning and evaluation would be needed before I would be prepared to claim it is competitive with CUDA. Even so, I'm posting this now, as the rough explorations may be interesting to some, and I'm not sure I'll have the time and energy to do that followup work any time soon.

Why prefix sum?

As HN user fluffything points out in this HN thread on my Taste of GPU compute talk, prefix sum is an excellent benchmark for evaluating GPU compute languages and runtimes.

For one, it is useful in and of itself. I use it in font-rs to integrate fragments of exact-area computations to arrive at the total area coverage for font rendering. It is also used as a primitive in many more operations, including GPU-side dynamic allocation (each element of the input array computes the number of storage cells needed, then a prefix sum lays that out sequentially) and compaction.
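
To make the allocation use concrete, here is a minimal Python sketch (the function name and the numbers are mine, purely for illustration): an exclusive prefix sum over per-element cell counts gives each element its starting offset, and the running total gives the overall allocation size.

def exclusive_prefix_sum(counts):
    """Exclusive scan: offsets[i] is the sum of counts[0..i-1]."""
    offsets = []
    s = 0
    for c in counts:
        offsets.append(s)  # this element starts where the previous ones end
        s += c
    return offsets, s  # per-element offsets, plus total cells needed

# Each element requests a variable number of storage cells:
counts = [3, 0, 5, 2]
offsets, total = exclusive_prefix_sum(counts)
assert offsets == [0, 3, 3, 8] and total == 10

Compaction works the same way: scan a 0/1 keep-flag array, and each kept element writes itself to its scanned offset.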

For two, it is simple. The sequential version can be expressed in just a handful of lines of code:

def prefix_sum(a):
    # Inclusive scan: result[i] is the sum of a[0] through a[i].
    s = 0
    result = []
    for x in a:
        s += x
        result.append(s)
    return result

For three, it is challenging but possible to implement efficiently on GPU. The above code has a strictly sequential dependency, but because addition is associative, it is possible to exploit a great deal of parallelism, and there is literature on that going back decades. Even so, actually exploiting that parallelism requires communication between invocations ("threads" in more common GPU lingo) and careful attention to the memory hierarchy.

The generalization of prefix sum is called "scan," and it works with any associative operation, not just addition. The operation doesn't even have to be commutative; regular expression matching and IIR filtering are examples of non-commutative scans.
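
As a small illustration (my own sketch, not code from the implementation), here is the sequential scan generalized to an arbitrary associative operation. The 2x2 matrix product is a handy example of an operation that is associative but not commutative; composing such matrices is essentially how a linear recurrence like an IIR filter can be cast as a scan.

def scan(op, xs):
    """Inclusive scan with an arbitrary associative binary operation."""
    result = []
    acc = None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        result.append(acc)
    return result

def matmul2(a, b):
    """2x2 matrix product: associative, but not commutative."""
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

import operator
assert scan(operator.add, [1, 2, 3]) == [1, 3, 6]  # ordinary prefix sum
a = [[1, 1], [0, 1]]
b = [[1, 0], [1, 1]]
assert matmul2(a, b) != matmul2(b, a)  # order matters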

Implementation on GPU

The state of the art is decoupled look-back. I'm not going to try to summarize it here, but recommend reading the paper. The results are impressive: for large data sets they report reaching memcpy speeds, meaning the kernel is limited only by memory bandwidth, so no further speedup is possible.

That work is a refinement of Parallel Prefix Sum (Scan) with CUDA, from NVIDIA's GPU Gems 3 book.
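
To give a flavor of the algorithm, here is a sequential Python simulation of the protocol (all names are mine, and this deliberately ignores the concurrency that makes the real thing hard; see the paper for the actual details). Each workgroup publishes a status word alongside its partial results: first its local aggregate, then, once known, its inclusive prefix; a workgroup obtains its exclusive prefix by walking back over its predecessors' published entries.

# Per-partition status values published to global memory.
INVALID, AGGREGATE, PREFIX = 0, 1, 2

def decoupled_lookback_scan(tiles):
    """Sequential simulation of decoupled look-back.

    Each tile is the slice of input assigned to one workgroup. On a GPU the
    tiles are processed concurrently; here they run in order, but the
    publish/look-back protocol is the same.
    """
    n = len(tiles)
    status = [INVALID] * n
    aggregate = [0] * n   # sum of the tile alone
    inclusive = [0] * n   # sum of the tile and everything before it
    out = []
    for i, tile in enumerate(tiles):
        # Phase 1: local reduction, publish the aggregate.
        aggregate[i] = sum(tile)
        status[i] = AGGREGATE
        # Phase 2: look back until a predecessor with a full prefix is found.
        exclusive = 0
        j = i - 1
        while j >= 0:
            if status[j] == PREFIX:
                exclusive += inclusive[j]
                break
            # A real implementation spins while status[j] == INVALID.
            exclusive += aggregate[j]
            j -= 1
        # Phase 3: publish the inclusive prefix, then scan the tile locally.
        inclusive[i] = exclusive + aggregate[i]
        status[i] = PREFIX
        s = exclusive
        for x in tile:
            s += x
            out.append(s)
    return out

assert decoupled_lookback_scan([[1, 2], [3, 4], [5]]) == [1, 3, 6, 10, 15]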


Do we need a memory model?

The sample code is written in terms of the new Vulkan memory model, in large part because I wanted to experiment with it. I've done a fair amount of lock-free code using the C++ memory model, from which Vulkan's derives. And lock-free code, while fairly rare on the CPU (my main motivation there is avoiding priority inversion in real-time audio), is more or less required on the GPU, because mutexes are not available in kernel code. Even if they were, they would create a lot of problems, as blocking would stall the entire subgroup, not just a single thread (one of the features of the Vulkan memory model is a much weaker forward progress guarantee than for threads running on a CPU).

Is a memory model absolutely required to run this code? If you replace the atomic loads and stores with plain array accesses, it fails. However, at least on my hardware, correct operation can be recovered by adding the volatile qualifier to the WorkBuf array. This disables compiler optimizations that would otherwise assume the buffer contents don't change between accesses, when in fact they can be modified by invocations in other workgroups.


I'm feeling some motivation to write a blog post on scan, but doing it properly requires some fairly time-intensive work, especially a performance comparison against CUB. Much of the context is this HN thread.

The outline of the blog post would be:

  • Introduce prefix sum / scan. Motivate its use as a primitive for allocation and compaction. Point to interesting applications with other associative (but not necessarily commutative) operations, including regular expressions and IIR filters.

  • Point to the decoupled look-back paper.

  • Introduce Vulkan / GLSL code. (Note: this would need more work, especially for meaningful performance comparison)

    • Determining subgroup size. Broken gl_SubgroupSize on Intel. Use of VK_EXT_subgroup_size_control to manage it.

    • Use of the Vulkan memory model. Does this work at all without a memory model (i.e. using memory barriers instead of atomics with acquire/release semantics)? In a bit of testing, it doesn't, but I might be doing it wrong.

      • If the use of memory semantics turns out to be essential for correctness, or important for performance, then the entire project of transpiling to HLSL / Metal is unworkable for this kind of workload. Vk is the only viable path forward.
  • What does subgroupExclusiveAdd compile to? AMD has an additional level of hierarchy between subgroup and invocation (subgroup = wave = 64 invocations, the additional level = row = 16). This matters if you're supplying your own binary operator, because then you have to use subgroup shuffle operations and a tree reduction (see the sketch after this list).

  • A challenge: prefix sum should be a fundamental benchmark for anyone proposing platforms for GPU compute. The challenge goes both up and down the stack.

    • Down the stack: can the platform (WebGPU, I'm looking at you) run this workload at all? Efficiently? Challenges include transpiling the SPIR-V, runtime detection of capabilities, and of course the subgroup size debacle.

    • Up the stack: can your higher level shader language (emu, I'm looking at you) express the decoupled look-back algorithm without loss of performance? Can you compose pieces of code, for example specify your own binary operator? Will it automatically specialize to use subgroupExclusiveAdd if the operator is +? Can it generate a kernel that fuses this scan operation with a map on input and/or output? The NV ecosystem uses C++ templates very extensively for all this.

  • Main conclusion: if you want to run serious compute workloads on diverse hardware, and have patience to work with very primitive and low-level tools, Vk is viable. There is a massive opportunity to build an ecosystem on top of Vk/SPV to compete with CUDA.
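
On the custom binary operator point above, here is a Python simulation of the shuffle-based tree scan (my own sketch: the subgroup's lanes are modeled as a list, and the shuffle corresponds to GLSL's subgroupShuffleUp; real code would of course be written in the shader):

def subgroup_inclusive_scan(op, lane_values):
    """Simulate a subgroup-wide inclusive scan built from shuffle-up steps.

    lane_values[i] plays the role of lane i's register. In each round every
    lane reads the value `offset` lanes below itself and combines it with
    its own, so log2(subgroup_size) rounds complete the scan.
    """
    n = len(lane_values)  # subgroup size, assumed to be a power of two
    vals = list(lane_values)
    offset = 1
    while offset < n:
        # All lanes shuffle "simultaneously": build the new values from a
        # snapshot of the old ones.
        vals = [op(vals[i - offset], vals[i]) if i >= offset else vals[i]
                for i in range(n)]
        offset *= 2
    return vals

import operator
assert subgroup_inclusive_scan(operator.add, [1] * 8) == [1, 2, 3, 4, 5, 6, 7, 8]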


FWIW, the performance of that code is about 4x slower than memcpy, while CUB claims memcpy parity (1x). I believe I understand how to get there; it's just a nontrivial amount of work:

  • Decrease the amount of redundant reduction work at the subgroup level, retaining the aggregates across the phases.

  • Parallelize the look-back as suggested in the paper.

  • Use a smaller workgroup (while retaining large tile sizes; iterate) so many fewer cores are kept idle while doing the look-back. Explore workgroup = subgroup and use subgroup broadcasts instead of shared memory.
