Skip to content

Instantly share code, notes, and snippets.

@sebbbi
Last active October 11, 2024 06:34
Show Gist options
  • Save sebbbi/ba4415339b535d22fb18e2d824564ec4 to your computer and use it in GitHub Desktop.
Save sebbbi/ba4415339b535d22fb18e2d824564ec4 to your computer and use it in GitHub Desktop.
Fast uniform load with wave ops (up to 64x speedup)
In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader
group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile.
Simplified HLSL code looks like this:
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;
[numthreads(8, 8, 1)]
void CSMain(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
uint2 lightStartCount = lightStartCounts[gid.xy];
float4 lightAccumulator = 0;
for (uint i = 0; i < lightStartCount.y; ++i)
{
uint lightIndex = lightStartCount.x + i;
// In real impl light data would be position, radius, color, etc.
float4 lightData = lightDatas[lightIndex]; // 64-wide mem load. All lanes load from same address.
lightAccumulator += lightData;
}
output[tid.xy] = lightAccumulator;
}
The problem with this code is that we load the same light data for each lane. If you would have used constant buffer instead of
Buffer<float4>, the compiler would have generated special code utilizing GPU specific uniform load facility like constant buffer
hardware (Nvidia) or scalar unit (AMD). But this can be problematic, since constant buffers have 64KB size limit, and constant buffer
array elements must be float4/int4 (16 byte alignment). Not even enough space to hold 1080p tile light lists.
Fortunately with SM 6.0 wave intrinsics we can do better. We can load 32 (Nvidia) or 64 (AMD) ligths at once using a single load
instruction and then use WaveReadLaneAt to broadcast light data from one lane to all lanes, one lane at a time. This reduces the number
of load instructions by 32x / 64x. Also there's no register bloat, since each loaded lane has now unique data, instead of replicated
value. Register file usage is as good as with AMD scalar loads.
Optimized code looks like this:
[numthreads(8, 8, 1)]
void CSMainWave(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint gix : SV_GroupIndex)
{
const uint WAVE_LANES = WaveGetLaneCount(); // Usually 64 or 32. Compile time constant.
uint2 lightStartCount = lightStartCounts[gid.xy];
float4 lightAccumulator = 0;
for (uint i = 0; i < divRoundUp(lightStartCount.y, WAVE_LANES); ++i)
{
// Load 32/64 unique items at once in outer loop. 32x/64x reduction in mem loads.
float4 lightData64 = lightDatas[lightStartCount.x + gix + i * WAVE_LANES];
uint loopIters = min(WAVE_LANES, lightStartCount.y - WAVE_LANES * i);
for (uint i2 = 0; i2 < loopIters; ++i2)
{
// In real impl light data would be position, radius, color, etc.
float4 lightData = WaveReadLaneAt(lightData64, i2); // Broadcast current light to all lanes
lightAccumulator += lightData;
}
}
output[tid.xy] = lightAccumulator;
}
This code actually will beat scalar optimized code on AMD, since one vector load is faster than 64 scalar loads. Another advantage is
that this optimization works for all buffer and texture types, not just for constant/raw/structured buffers.
According to my GPU buffer benchmark (https://github.com/sebbbi/perftest), new Nvidia Maxwell/Pascal/Volta/Turing drivers employ
uniform address optimization for all buffer and texture types. I have seen 20x-27x performance increases for cases where all threads
inside a loop load from uniform address. My guess is this Nvidia optimization employed in their new compiler is similar to the
optimization I described above. I don't have access to their PIX plugin to validate my claim. Maybe someone with access to that plugin
can validate my assumption (if NDA is not a problem).
Recent AMD drivers employ a similar optimization for global atomic add. This can be validated with publicly available tools such as
RenderDoc. If all lanes add C or 0, the compiler first does WavePrefixCountBits (MBCNT) and then a single lane global atomic, reducing
serialization to atomic counter by 64x.
I hope that AMD also implements the optimization above to their shader compiler. In AMDs case, this optimization is actually super
good, since in most cases AMD gets also full memory coalescing gain from this optimizationg, resulting in 4x load issue rate.
Shader playground link:
http://shader-playground.timjones.io/04cb6af828fe0e1a48d651e25d06cffb
@SungJJinKang
Copy link

Awsome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment