Last active
October 11, 2024 06:34
-
-
Save sebbbi/ba4415339b535d22fb18e2d824564ec4 to your computer and use it in GitHub Desktop.
Fast uniform load with wave ops (up to 64x speedup)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader | |
group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile. | |
Simplified HLSL code looks like this: | |
Buffer<float4> lightDatas; | |
Texture2D<uint2> lightStartCounts; | |
RWTexture2D<float4> output; | |
[numthreads(8, 8, 1)] | |
void CSMain(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID) | |
{ | |
uint2 lightStartCount = lightStartCounts[gid.xy]; | |
float4 lightAccumulator = 0; | |
for (uint i = 0; i < lightStartCount.y; ++i) | |
{ | |
uint lightIndex = lightStartCount.x + i; | |
// In real impl light data would be position, radius, color, etc. | |
float4 lightData = lightDatas[lightIndex]; // 64-wide mem load. All lanes load from same address. | |
lightAccumulator += lightData; | |
} | |
output[tid.xy] = lightAccumulator; | |
} | |
The problem with this code is that we load the same light data for each lane. If you would have used constant buffer instead of | |
Buffer<float4>, the compiler would have generated special code utilizing GPU specific uniform load facility like constant buffer | |
hardware (Nvidia) or scalar unit (AMD). But this can be problematic, since constant buffers have 64KB size limit, and constant buffer | |
array elements must be float4/int4 (16 byte alignment). Not even enough space to hold 1080p tile light lists. | |
Fortunately with SM 6.0 wave intrinsics we can do better. We can load 32 (Nvidia) or 64 (AMD) ligths at once using a single load | |
instruction and then use WaveReadLaneAt to broadcast light data from one lane to all lanes, one lane at a time. This reduces the number | |
of load instructions by 32x / 64x. Also there's no register bloat, since each loaded lane has now unique data, instead of replicated | |
value. Register file usage is as good as with AMD scalar loads. | |
Optimized code looks like this: | |
[numthreads(8, 8, 1)] | |
void CSMainWave(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint gix : SV_GroupIndex) | |
{ | |
const uint WAVE_LANES = WaveGetLaneCount(); // Usually 64 or 32. Compile time constant. | |
uint2 lightStartCount = lightStartCounts[gid.xy]; | |
float4 lightAccumulator = 0; | |
for (uint i = 0; i < divRoundUp(lightStartCount.y, WAVE_LANES); ++i) | |
{ | |
// Load 32/64 unique items at once in outer loop. 32x/64x reduction in mem loads. | |
float4 lightData64 = lightDatas[lightStartCount.x + gix + i * WAVE_LANES]; | |
uint loopIters = min(WAVE_LANES, lightStartCount.y - WAVE_LANES * i); | |
for (uint i2 = 0; i2 < loopIters; ++i2) | |
{ | |
// In real impl light data would be position, radius, color, etc. | |
float4 lightData = WaveReadLaneAt(lightData64, i2); // Broadcast current light to all lanes | |
lightAccumulator += lightData; | |
} | |
} | |
output[tid.xy] = lightAccumulator; | |
} | |
This code actually will beat scalar optimized code on AMD, since one vector load is faster than 64 scalar loads. Another advantage is | |
that this optimization works for all buffer and texture types, not just for constant/raw/structured buffers. | |
According to my GPU buffer benchmark (https://github.com/sebbbi/perftest), new Nvidia Maxwell/Pascal/Volta/Turing drivers employ | |
uniform address optimization for all buffer and texture types. I have seen 20x-27x performance increases for cases where all threads | |
inside a loop load from uniform address. My guess is this Nvidia optimization employed in their new compiler is similar to the | |
optimization I described above. I don't have access to their PIX plugin to validate my claim. Maybe someone with access to that plugin | |
can validate my assumption (if NDA is not a problem). | |
Recent AMD drivers employ a similar optimization for global atomic add. This can be validated with publicly available tools such as | |
RenderDoc. If all lanes add C or 0, the compiler first does WavePrefixCountBits (MBCNT) and then a single lane global atomic, reducing | |
serialization to atomic counter by 64x. | |
I hope that AMD also implements the optimization above to their shader compiler. In AMDs case, this optimization is actually super | |
good, since in most cases AMD gets also full memory coalescing gain from this optimizationg, resulting in 4x load issue rate. | |
Shader playground link: | |
http://shader-playground.timjones.io/04cb6af828fe0e1a48d651e25d06cffb |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Cool!