One feature that is clearly out of scope for WebGPU 1.0 but desired for the near future is subgroups. Subgroup operations move data between threads within a workgroup with less overhead and latency than workgroup shared memory. Almost all modern GPU hardware supports them, but the feature poses significant portability challenges. In particular, while the workgroup size is chosen by the programmer within generous ranges (WebGPU requires a minimum maximum of 256, that is, every implementation must allow workgroups of at least 256 threads), the subgroup size varies by hardware and also by compiler heuristics. Shaders need to be written in a way that adapts to a wide range of subgroup sizes, which is quite challenging.
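To make the portability concern concrete, here is a minimal CPU-side sketch in Rust that simulates a workgroup-wide inclusive prefix sum built from subgroup-level scans. It is not shader code and does not use any WebGPU or WGSL API. The function name `workgroup_inclusive_scan` and the `subgroup_size` parameter are hypothetical, chosen only for illustration. The point is that the algorithm has to take the subgroup size as a runtime input and stay correct for every value it might see.

```rust
/// CPU simulation (an illustrative assumption, not real shader code) of a
/// workgroup-wide inclusive prefix sum built from subgroup-level scans.
/// `subgroup_size` is a runtime parameter because, on real hardware, the
/// shader author does not get to choose it.
fn workgroup_inclusive_scan(input: &[u32], subgroup_size: usize) -> Vec<u32> {
    assert!(subgroup_size > 0, "subgroup size must be nonzero");
    let mut out = input.to_vec();

    // Step 1: each subgroup scans its own lanes independently. On a GPU this
    // would be a single subgroup operation; here it is a plain loop.
    let mut subgroup_totals = Vec::new();
    for chunk in out.chunks_mut(subgroup_size) {
        let mut running = 0u32;
        for lane in chunk.iter_mut() {
            running = running.wrapping_add(*lane);
            *lane = running;
        }
        subgroup_totals.push(running);
    }

    // Step 2: exclusive scan over the per-subgroup totals. The number of
    // totals is workgroup_size / subgroup_size, so it changes with the
    // hardware, which is where much of the portability burden comes from.
    let mut offsets = Vec::with_capacity(subgroup_totals.len());
    let mut running = 0u32;
    for total in &subgroup_totals {
        offsets.push(running);
        running = running.wrapping_add(*total);
    }

    // Step 3: add each subgroup's offset to every lane in that subgroup.
    for (chunk, offset) in out.chunks_mut(subgroup_size).zip(offsets) {
        for lane in chunk.iter_mut() {
            *lane = lane.wrapping_add(offset);
        }
    }
    out
}
```

Calling this on a 256-element input with `subgroup_size` set to 4, 8, 16, 32, 64, or 128 should produce the same output each time. A real shader has to provide the same guarantee, but with the second-level scan done through shared memory or another subgroup operation whose shape depends on how many subgroups the workgroup contains.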
This issue will be written largely from the perspective of accelerating prefix sum operations (an important primitive within Vello), but there are many potential applications. One relatively recent development is cooperative matrix operations, which are supported in most newer GPU hardware a