-
Writing code for the GPU involves breaking problems down into primitives.
-
Some primitives can naturally run in parallel; this is the easy part.
-
Others are used to coordinate work between different threads.
-
An example of this is the transpose of a square bit matrix (a bitmap). It is used in piet-gpu to assign work to tiles.
-
This post will examine the performance of that transpose task in detail.
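To pin the problem down, here is a plain scalar reference for the 32x32 case (an illustration only, not the kernel that gets benchmarked; the function name is made up for this sketch). The matrix is stored as 32 words of 32 bits, one word per row, and the transpose moves element (r, c), i.e. bit c of row r, to bit r of row c.

```cuda
#include <stdint.h>

// Reference transpose of a 32x32 bit matrix.
// in[r] holds row r, with bit c of in[r] being element (r, c).
// out[r] receives column r of the input: bit c of out[r] is element (c, r).
void transpose32_ref(const uint32_t in[32], uint32_t out[32]) {
    for (int r = 0; r < 32; r++) {
        uint32_t word = 0;
        for (int c = 0; c < 32; c++) {
            word |= ((in[c] >> r) & 1u) << c;
        }
        out[r] = word;
    }
}
```

On the GPU, the natural mapping is one thread per row, which is exactly where the coordination question comes in.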
-
-
The bitmap transpose problem requires coordination between threads. GPU compute has two mechanisms for doing this: threadgroup shared memory and subgroups.
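As a concrete sketch of the first mechanism, here is what a threadgroup-shared-memory version could look like. This is written in CUDA purely for illustration (so "threadgroup shared memory" is `__shared__` and the barrier is `__syncthreads()`); the shaders actually benchmarked for the post are not necessarily structured this way. One 32-thread group handles one matrix, each thread holds one row in a register, and each of the lg(32) = 5 rounds swaps half of each thread's bits with a partner thread, using shared memory as the exchange medium.

```cuda
// Illustrative CUDA sketch (names are hypothetical, not piet-gpu code).
// MASKS[k] selects the bit positions whose index has bit k set.
__constant__ unsigned MASKS[5] = {0xAAAAAAAAu, 0xCCCCCCCCu, 0xF0F0F0F0u,
                                  0xFF00FF00u, 0xFFFF0000u};

// One block of 32 threads per 32x32 bit matrix; thread i holds row i.
__global__ void transpose32_threadgroup(const unsigned *in, unsigned *out) {
    __shared__ unsigned sh[32];
    unsigned lane = threadIdx.x;                 // 0..31
    unsigned row = in[blockIdx.x * 32 + lane];
    for (int k = 0; k < 5; k++) {
        unsigned d = 1u << k;
        unsigned m = MASKS[k];
        sh[lane] = row;
        __syncthreads();                         // publish rows to the group
        unsigned peer = sh[lane ^ d];            // partner thread's current row
        __syncthreads();                         // all reads done before next write
        row = (lane & d) ? (row & m) | ((peer >> d) & ~m)
                         : (row & ~m) | ((peer << d) & m);
    }
    out[blockIdx.x * 32 + lane] = row;           // thread i now holds column i
}
```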
-
The fundamental tradeoff: subgroups are faster, but the programmer has more control when using threadgroup shared memory.
-
There is also compatibility. Not all GPUs and APIs fully support subgroup operations (shuffle is missing in DX12 / HLSL).
-
The main theme of this post is to explore the tradeoff. How much faster are subgroups? In what circumstances do the problems with subgroups surface?
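To make the comparison concrete, here is the corresponding subgroup sketch, again using CUDA as a stand-in (the warp as a 32-wide subgroup, `__shfl_xor_sync` in place of GLSL's `subgroupShuffleXor`; the actual piet-gpu shaders may differ). The exchange rounds are the same butterfly as above, but the partner's row arrives through a register shuffle, with no shared memory traffic and no barriers.

```cuda
// Illustrative CUDA sketch of the subgroup (warp) transpose.
// Each lane passes in its row and gets back the corresponding column.
__device__ unsigned transpose32_subgroup(unsigned row) {
    const unsigned masks[5] = {0xAAAAAAAAu, 0xCCCCCCCCu, 0xF0F0F0F0u,
                               0xFF00FF00u, 0xFFFF0000u};
    unsigned lane = threadIdx.x & 31;
    for (int k = 0; k < 5; k++) {
        unsigned d = 1u << k;
        unsigned m = masks[k];
        // Fetch the partner lane's row directly from its register.
        unsigned peer = __shfl_xor_sync(0xffffffffu, row, d);
        row = (lane & d) ? (row & m) | ((peer >> d) & ~m)
                         : (row & ~m) | ((peer << d) & m);
    }
    return row;   // lane i now holds column i
}
```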
-
-
Introduce the hybrid shuffle approach in this context. Note our surprise at finding that hybrid shuffle has poor performance (even worse than threadgroup shared memory on discrete GPUs). Why is this? We don't know.
-
-
Threadgroup vs subgroup speed
-
This is where we show the graphs
-
Point out salient features; explain that the knee of the graph reveals how much parallelism is available
- Less parallelism available on Intel for threadgroup shared memory, as expected
-
-
The speedup from subgroups is quite hardware dependent: minimal on AMD, significant on Nvidia, dramatic on Intel
-
But Intel has smaller (and not dependable) subgroup sizes
-
To fit the smaller subgroup size, we have to cut the problem down to 8x8. But then Intel does very well.
-
A risk: development and optimization done on one platform might not transfer well to others. If you develop on AMD, you won't see a significant benefit from subgroups.
-
This might be a good point to discuss the Vulkan subgroup size control extension, and the fact that the Vulkan physical device query doesn't accurately report the subgroup size. [In writing this outline, I find it difficult to believe that GL_SUBGROUP_SIZE doesn't report accurately; that feels like a bug]
-
-
-
-
Other approaches tried
-
Hybrid threadgroup + shuffle (a sketch of the idea follows below)
- Poor performance: why? A mystery
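We don't know exactly how the hybrid kernel was structured, so the following is only one plausible reading of "hybrid threadgroup + shuffle," emulated in CUDA by restricting `__shfl_xor_sync` to width-8 segments as a stand-in for 8-wide subgroups (the constant and function names are invented for this sketch): rounds whose exchange distance stays inside a subgroup use shuffle, and only the coarser rounds go through shared memory.

```cuda
// Hypothetical hybrid: shuffle for exchanges inside a subgroup, shared memory
// for exchanges that cross subgroups. An 8-wide subgroup is emulated by
// passing a width of 8 to __shfl_xor_sync.
#define SUBGROUP_SIZE 8

__global__ void transpose32_hybrid(const unsigned *in, unsigned *out) {
    __shared__ unsigned sh[32];
    const unsigned masks[5] = {0xAAAAAAAAu, 0xCCCCCCCCu, 0xF0F0F0F0u,
                               0xFF00FF00u, 0xFFFF0000u};
    unsigned lane = threadIdx.x;                  // 32 threads per block
    unsigned row = in[blockIdx.x * 32 + lane];
    for (int k = 0; k < 5; k++) {
        unsigned d = 1u << k;
        unsigned m = masks[k];
        unsigned peer;
        if (d < SUBGROUP_SIZE) {
            // Partner lane is in the same (emulated) subgroup: register shuffle.
            peer = __shfl_xor_sync(0xffffffffu, row, d, SUBGROUP_SIZE);
        } else {
            // Partner lane is in another subgroup: round trip through shared memory.
            sh[lane] = row;
            __syncthreads();
            peer = sh[lane ^ d];
            __syncthreads();
        }
        row = (lane & d) ? (row & m) | ((peer >> d) & ~m)
                         : (row & ~m) | ((peer << d) & m);
    }
    out[blockIdx.x * 32 + lane] = row;
}
```

On paper this combines the cheap intra-subgroup exchanges with only a couple of barriers for the cross-subgroup rounds, which makes the measured slowdown all the more puzzling.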
-
Ballot (sketch below)
- Here we know why the performance is poor: you iterate over n bit positions, while the other algorithms take lg(n) operations. Also, the branch diverges, which is slow.
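For contrast, a sketch of what the ballot approach looks like, again in CUDA terms (`__ballot_sync` as the analog of GLSL's `subgroupBallot`; the function name here is invented): one ballot per bit position yields one row of the transpose, so the loop runs n = 32 times rather than lg(n) = 5, and only one lane keeps each result, which is the divergent branch.

```cuda
// Illustrative CUDA sketch of the ballot-based transpose.
// The ballot over bit c of every lane's row is exactly column c of the
// matrix, i.e. row c of the transpose: n iterations instead of lg(n).
__device__ unsigned transpose32_ballot(unsigned row) {
    unsigned lane = threadIdx.x & 31;
    unsigned result = 0;
    for (int c = 0; c < 32; c++) {
        // Bit j of col is lane j's bit c, i.e. element (j, c) of the matrix.
        unsigned col = __ballot_sync(0xffffffffu, (row >> c) & 1u);
        if (lane == c) {
            result = col;   // divergent: only one lane keeps each ballot
        }
    }
    return result;          // lane i ends up holding column i
}
```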
-
-
Conclusions
-
Using threadgroup shared memory is the most consistent, portable, and reliable approach.
-
Subgroups can potentially be a win, but the extent of the win is hardware dependent.
- On Intel, they're only a win if the size of the problem can adapt to the subgroup size.
-
Simpler approaches win; our attempts at a fancier hybrid approach were not successful.
-