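Benchmarking global reductions with varying numbers of SMs and warps per SM. One interesting finding is that, for a fixed total number of warps, spreading them across more SMs gives higher throughput than packing them onto a single SM: in the first table below, 32 SMs with one warp each reach 180.08 GiB/s, while one SM with 32 warps reaches 98.77 GiB/s.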
These numbers were measured on an NVIDIA Titan X GPU.
Code for the benchmark is here and the kernel is here.
SMs | 1 warp | 2 warps | 4 warps | 8 warps | 16 warps | 32 warps |
1 SM | 5.74 GiB/s | 12.20 GiB/s | 24.26 GiB/s | 47.81 GiB/s | 82.55 GiB/s | 98.77 GiB/s |
2 SMs | 12.11 GiB/s | 24.39 GiB/s | 48.48 GiB/s | 92.71 GiB/s | 153.96 GiB/s | 189.01 GiB/s |
4 SMs | 24.21 GiB/s | 48.55 GiB/s | 96.13 GiB/s | 176.56 GiB/s | 257.26 GiB/s | 303.50 GiB/s |
8 SMs | 48.36 GiB/s | 96.50 GiB/s | 181.74 GiB/s | 294.65 GiB/s | 335.24 GiB/s | 337.33 GiB/s |
16 SMs | 96.92 GiB/s | 182.65 GiB/s | 309.08 GiB/s | 334.78 GiB/s | 336.86 GiB/s | 338.26 GiB/s |
32 SMs | 180.08 GiB/s | 304.42 GiB/s | 329.64 GiB/s | 336.86 GiB/s | 333.21 GiB/s | 331.60 GiB/s |
64 SMs | 179.45 GiB/s | 257.40 GiB/s | 304.99 GiB/s | 330.89 GiB/s | 334.68 GiB/s | 329.70 GiB/s |
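To compare placements that use the same total number of warps, one can pull out the cells of the table above where SMs times warps per SM is constant. A minimal Python sketch, with the Titan X numbers copied from the table; the names `WARPS`, `TITAN_X`, and `same_total_warps` are hypothetical and used only for illustration:

```python
# Titan X throughput in GiB/s, copied from the table above.
# Keys: number of SMs; list entries: one value per warps-per-SM column.
WARPS = [1, 2, 4, 8, 16, 32]
TITAN_X = {
    1:  [5.74, 12.20, 24.26, 47.81, 82.55, 98.77],
    2:  [12.11, 24.39, 48.48, 92.71, 153.96, 189.01],
    4:  [24.21, 48.55, 96.13, 176.56, 257.26, 303.50],
    8:  [48.36, 96.50, 181.74, 294.65, 335.24, 337.33],
    16: [96.92, 182.65, 309.08, 334.78, 336.86, 338.26],
    32: [180.08, 304.42, 329.64, 336.86, 333.21, 331.60],
    64: [179.45, 257.40, 304.99, 330.89, 334.68, 329.70],
}


def same_total_warps(table, total):
    """Return {(sms, warps_per_sm): GiB/s} for every cell with sms * warps == total."""
    return {
        (sms, w): row[i]
        for sms, row in table.items()
        for i, w in enumerate(WARPS)
        if sms * w == total
    }


if __name__ == "__main__":
    # List every placement of 32 warps in total, from packed to spread out.
    for (sms, w), bw in sorted(same_total_warps(TITAN_X, 32).items()):
        print(f"{sms:2d} SMs x {w:2d} warps: {bw:7.2f} GiB/s")
```

For 32 warps in total, the throughput ranges from 98.77 GiB/s with all warps on one SM to roughly 180 GiB/s once they are spread over 8 or more SMs.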
We can take a reduction over sin(x) instead of x to throw more compute instructions into the mix. I've tried that here. Here's the resulting table:
SMs | 1 warp | 2 warps | 4 warps | 8 warps | 16 warps | 32 warps |
1 SM | 5.24 GiB/s | 11.76 GiB/s | 24.22 GiB/s | 45.89 GiB/s | 78.95 GiB/s | 100.73 GiB/s |
2 SMs | 11.21 GiB/s | 24.19 GiB/s | 47.97 GiB/s | 89.09 GiB/s | 145.61 GiB/s | 191.27 GiB/s |
4 SMs | 23.05 GiB/s | 48.07 GiB/s | 93.10 GiB/s | 170.31 GiB/s | 245.83 GiB/s | 303.49 GiB/s |
8 SMs | 45.72 GiB/s | 93.73 GiB/s | 174.85 GiB/s | 290.67 GiB/s | 331.86 GiB/s | 338.39 GiB/s |
16 SMs | 88.97 GiB/s | 175.29 GiB/s | 303.97 GiB/s | 334.67 GiB/s | 337.68 GiB/s | 338.43 GiB/s |
32 SMs | 166.40 GiB/s | 300.94 GiB/s | 328.26 GiB/s | 335.70 GiB/s | 333.65 GiB/s | 331.95 GiB/s |
64 SMs | 168.66 GiB/s | 254.25 GiB/s | 302.75 GiB/s | 329.47 GiB/s | 334.93 GiB/s | 332.62 GiB/s |
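To see how much the extra sin(x) work costs, the two Titan X tables can be compared cell by cell. The sketch below does this for the 1-warp and 32-warp columns only, with values copied from the tables above; the names `PLAIN`, `SIN`, and `relative_change` are hypothetical:

```python
# Titan X throughput in GiB/s for 1, 2, 4, 8, 16, 32, 64 SMs (top to bottom),
# copied from the 1-warp and 32-warp columns of the two tables above.
PLAIN = {
    1:  [5.74, 12.11, 24.21, 48.36, 96.92, 180.08, 179.45],
    32: [98.77, 189.01, 303.50, 337.33, 338.26, 331.60, 329.70],
}
SIN = {
    1:  [5.24, 11.21, 23.05, 45.72, 88.97, 166.40, 168.66],
    32: [100.73, 191.27, 303.49, 338.39, 338.43, 331.95, 332.62],
}


def relative_change(plain, with_sin):
    """Fractional throughput change per row when reducing sin(x) instead of x."""
    return [s / p - 1.0 for p, s in zip(plain, with_sin)]


if __name__ == "__main__":
    for warps in (1, 32):
        changes = relative_change(PLAIN[warps], SIN[warps])
        formatted = ", ".join(f"{c:+.1%}" for c in changes)
        print(f"{warps:2d} warps per SM: {formatted}")
```

On these numbers, sin(x) lowers throughput by roughly 5 to 9% with one warp per SM, while with 32 warps per SM the two tables agree to within about 2%.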
Here's the same code as above, but on an H100:
SMs | 1 warp | 2 warps | 4 warps | 8 warps | 16 warps | 32 warps |
1 SM | 2.94 GiB/s | 5.57 GiB/s | 11.02 GiB/s | 21.26 GiB/s | 41.59 GiB/s | 81.47 GiB/s |
2 SMs | 5.85 GiB/s | 11.04 GiB/s | 21.90 GiB/s | 42.12 GiB/s | 82.52 GiB/s | 162.32 GiB/s |
4 SMs | 11.71 GiB/s | 22.07 GiB/s | 43.71 GiB/s | 84.23 GiB/s | 165.47 GiB/s | 323.75 GiB/s |
8 SMs | 23.40 GiB/s | 44.03 GiB/s | 87.35 GiB/s | 168.77 GiB/s | 330.24 GiB/s | 634.83 GiB/s |
16 SMs | 46.70 GiB/s | 88.15 GiB/s | 174.68 GiB/s | 336.76 GiB/s | 646.65 GiB/s | 1202.34 GiB/s |
32 SMs | 91.74 GiB/s | 173.64 GiB/s | 344.12 GiB/s | 651.42 GiB/s | 1200.77 GiB/s | 2073.66 GiB/s |
64 SMs | 183.51 GiB/s | 347.69 GiB/s | 676.22 GiB/s | 1228.53 GiB/s | 2076.83 GiB/s | 2713.38 GiB/s |
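One way to read the H100 table is to ask how close each cell comes to perfect linear scaling from the single-warp, single-SM figure. The sketch below computes that ratio for a few cells copied from the table above; the names `H100` and `scaling_efficiency` are hypothetical, and "efficiency" here is just measured throughput divided by (SMs × warps × the 1 SM, 1 warp throughput):

```python
# H100 throughput in GiB/s for selected (SMs, warps per SM) cells,
# copied from the table above.
H100 = {
    (1, 1): 2.94,
    (1, 32): 81.47,
    (8, 8): 168.77,
    (64, 1): 183.51,
    (32, 32): 2073.66,
    (64, 32): 2713.38,
}


def scaling_efficiency(table, sms, warps):
    """Measured throughput relative to ideal linear scaling from the (1, 1) cell."""
    ideal = sms * warps * table[(1, 1)]
    return table[(sms, warps)] / ideal


if __name__ == "__main__":
    for sms, warps in sorted(H100):
        eff = scaling_efficiency(H100, sms, warps)
        print(f"{sms:2d} SMs x {warps:2d} warps: {eff:.2f} of linear scaling")
```

On these numbers, 64 SMs with one warp each retain about 98% of linear scaling, while 64 SMs with 32 warps each retain about 45%, which is consistent with the largest configurations starting to hit a limit other than the number of warps in flight.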