@raphlinus
Last active April 2, 2020 01:33
Matrix transpose blog post outline
  • Writing for the GPU involves breaking problems into primitives.

    • Some primitives can naturally run in parallel; this is the easy part.

    • Others are used to coordinate work between different threads.

    • An example of this is the transpose of a square matrix bitmap. It is used in piet-gpu to assign work to tiles.

    • This post will examine the performance of that transpose task in detail.

  • The bitmap transpose problem requires coordination between threads. GPU compute has two mechanisms for doing this: threadgroup shared memory and subgroups.

    • Fundamental tradeoff: subgroups are faster, but the programmer has more control using threadgroup shared memory (see the sketches of both approaches after the outline).

      • Also compatibility. Not all GPUs fully support subgroups (shuffle is missing in DX12 / HLSL).

      • The main theme of this post is to explore the tradeoff. How much faster are subgroups? In what circumstances do the problems with subgroups surface?

    • Introduce hybrid shuffle in this context. Note our surprise in finding that hybrid shuffle has poor performance (even worse than threadgroups on discrete GPUs). Why is this? We don't know.

  • Threadgroup vs subgroup speed

    • This is where we show the graphs

      • Point out salient features; explain that the knee of the graph reveals how much parallelism is available

        • Less parallelism available on Intel for threadgroup shared memory, as expected

    • The subgroup speedup is quite hardware dependent: minimal on AMD, significant on Nvidia, dramatic on Intel

      • But Intel has smaller (and not dependable) subgroup sizes

        • We have to cut the problem down to 8x8, but then Intel does very well.

        • A risk: development and optimization done on one platform might not transfer well to others. If you develop on AMD, you won't see a significant benefit from subgroups.

        • Might be a good point to discuss the Vulkan subgroup size control extension, and that the Vulkan physical device query doesn't accurately report the subgroup size. [In writing this outline, I find it difficult to believe that GL_SUBGROUP_SIZE doesn't report accurately; that feels like a bug.]

  • Other approaches tried

    • Hybrid threadgroup + shuffle

      • Poor performance: why? A mystery.

    • Ballot

      • Here we know why the performance is poor: you iterate over n bit positions, while the other algorithms take lg(n) operations. Also, the branch with divergence is slow. (See the ballot sketch after the outline.)

  • Conclusions

    • Using threadgroup shared memory is the most consistent, portable, and reliable approach.

    • Subgroups can potentially be a win, but the extent of the win is hardware dependent.

      • On Intel, they're only a win if the size of the problem can adapt to the subgroup size.

    • Simpler approaches win; our attempts to do a fancier hybrid approach were not successful.
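
Below are a few code sketches referenced from the outline. The post's actual kernels are GLSL/Vulkan compute; these sketches are written in CUDA for concreteness, and the function names, the one-group-per-matrix layout, and the fixed 32x32 problem size are assumptions of mine rather than code from the post.

First, the threadgroup shared memory approach: each thread holds one row of the bit matrix as a 32-bit word (bit j = column j), and the lg(n) butterfly exchanges partner words through shared memory, paying barriers on every round (`__syncthreads()` standing in for GLSL `barrier()`).

```cuda
#include <cstdint>

// Threadgroup (shared memory) variant, assuming one 32-thread group per
// 32x32 bit matrix: thread `lane` holds row `lane`, bit j = column j.
// Five butterfly rounds; each round exchanges partner words through shared
// memory, which costs two barriers.
__device__ uint32_t transpose_threadgroup(uint32_t x, uint32_t lane) {
    __shared__ uint32_t rows[32];
    uint32_t m = 0x0000FFFFu;  // columns kept by the "low" lane this round
    for (uint32_t j = 16; j != 0; j >>= 1) {
        rows[lane] = x;
        __syncthreads();
        uint32_t y = rows[lane ^ j];  // partner row for this round
        __syncthreads();              // don't let the next store race the read
        x = ((lane & j) == 0) ? ((x & m) | ((y << j) & ~m))
                              : ((x & ~m) | ((y >> j) & m));
        m ^= m << (j >> 1);  // 0x0000FFFF, 0x00FF00FF, 0x0F0F0F0F, 0x33333333, 0x55555555
    }
    return x;
}
```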
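
The same butterfly with subgroup operations: the exchange becomes a single shuffle per round and the barriers disappear, which is where the speedup over threadgroup shared memory comes from. This sketch uses `__shfl_xor_sync` as a stand-in for GLSL `subgroupShuffleXor`, and it bakes in a subgroup size of 32; that assumption is exactly what fails on Intel and forces cutting the problem down, as noted above.

```cuda
#include <cstdint>

// Subgroup (warp) shuffle variant: same butterfly, but the partner exchange
// is one shuffle per round and no barriers are needed. Assumes a subgroup
// size of 32, with thread `lane` holding row `lane`, bit j = column j.
__device__ uint32_t transpose_shuffle(uint32_t x, uint32_t lane) {
    uint32_t m = 0x0000FFFFu;
    for (uint32_t j = 16; j != 0; j >>= 1) {
        uint32_t y = __shfl_xor_sync(0xFFFFFFFFu, x, j);  // partner row
        x = ((lane & j) == 0) ? ((x & m) | ((y << j) & ~m))
                              : ((x & ~m) | ((y >> j) & m));
        m ^= m << (j >> 1);
    }
    return x;
}
```

A driving kernel would assign one 32-thread group per matrix and call either variant once per lane (for example, `out[lane] = transpose_shuffle(in[lane], lane)`); the dispatch shape used in the post's benchmarks may differ.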
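
A sketch of the ballot approach, to make the n vs lg(n) point concrete: each ballot gathers one column of the matrix, which is one row of the transpose, so the loop runs over all 32 bit positions instead of 5 butterfly rounds, and the per-lane select adds divergence on top. Same one-warp, 32x32 assumptions as above, with `__ballot_sync` standing in for GLSL `subgroupBallot`.

```cuda
#include <cstdint>

// Ballot variant: O(n) iterations over bit positions rather than the lg(n)
// rounds of the butterfly versions above.
__device__ uint32_t transpose_ballot(uint32_t x, uint32_t lane) {
    uint32_t out = 0;
    for (uint32_t j = 0; j < 32; ++j) {
        // Bit i of `col` is bit j of lane i's row, i.e. matrix element (i, j).
        uint32_t col = __ballot_sync(0xFFFFFFFFu, (x >> j) & 1u);
        if (lane == j) {
            out = col;  // row j of the transpose is column j of the input
        }
    }
    return out;
}
```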
