Last active
May 7, 2025 16:00
Single pass globallycoherent mip pyramid generation
// NOTE: Must bind 8x single mip RWTexture views, because HLSL doesn't have a .mips member for RWTexture2D. (Only SRVs have a .mips member.)
// NOTE: The globallycoherent attribute is needed. Without it, writes aren't guaranteed to be seen by other groups.
globallycoherent RWTexture2D<float> MipTextures[8];
RWTexture2D<uint> Counters[8];
groupshared uint CounterReturnLDS;

[numthreads(16, 16, 1)]
void GenerateMipPyramid(uint3 Tid : SV_DispatchThreadID, uint3 Group : SV_GroupID, uint Gix : SV_GroupIndex)
{
    [unroll]
    for (int Mip = 0; Mip < 8 - 1; ++Mip)
    {
        // 2x2 downsample
        float Sum =
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 1)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 1)];
        MipTextures[Mip + 1][Tid.xy] = Sum * 0.25;

        // Four groups in a 2x2 tile of groups increment the same counter.
        // The first thread of each group stores the pre-increment value in LDS.
        if (Gix == 0)
        {
            InterlockedAdd(Counters[Mip][Group.xy / 2], 1, CounterReturnLDS);
        }

        // We do a full memory barrier here. In the next mip the surviving thread group will read data generated by 3 other thread groups. That data needs to be visible.
        AllMemoryBarrierWithGroupSync();

        // Kill all groups except the last one to finish in the 2x2 tile. This branch is allowed because CounterReturnLDS is group invariant.
        if (CounterReturnLDS < 3)
        {
            return;
        }

        // Needed to ensure that all threads in the group read CounterReturnLDS before it is modified in the next loop iteration.
        GroupMemoryBarrierWithGroupSync();

        Tid.xy /= 2;
        Group.xy /= 2;
    }
}
Ah, of course it's necessary, because there's no other way to know when a 2x2 tile of groups has finished. But if those groups finish in random order, how does that `Tid.xy /= 2;` produce correct coordinates for the next iteration?
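To make the coordinate question concrete, here is a small CPU-side model in C++ of one loop iteration. It is only a sketch of the index arithmetic, not shader code. It assumes 16x16 groups as in the gist and compares the texels a surviving group would write if every thread halves its dispatch thread ID with a hypothetical alternative that rebuilds the ID from the halved group ID plus the thread's local ID. The function names are made up for illustration, and the alternative is not the gist author's fix.

```cpp
#include <cassert>
#include <set>
#include <utility>

constexpr unsigned kGroupSize = 16;  // matches [numthreads(16, 16, 1)]

using Texel = std::pair<unsigned, unsigned>;

// Texels addressed in the next iteration when each thread halves its
// dispatch thread ID, as `Tid.xy /= 2` does in the gist.
std::set<Texel> texelsByHalving(unsigned groupX, unsigned groupY) {
    std::set<Texel> out;
    for (unsigned ly = 0; ly < kGroupSize; ++ly) {
        for (unsigned lx = 0; lx < kGroupSize; ++lx) {
            unsigned tx = groupX * kGroupSize + lx;  // SV_DispatchThreadID.x
            unsigned ty = groupY * kGroupSize + ly;  // SV_DispatchThreadID.y
            out.insert({tx / 2, ty / 2});
        }
    }
    return out;
}

// Hypothetical alternative (an assumption, not from the gist): rebuild the
// thread ID from the halved group ID and the thread's position in the group.
std::set<Texel> texelsByRebuilding(unsigned groupX, unsigned groupY) {
    std::set<Texel> out;
    for (unsigned ly = 0; ly < kGroupSize; ++ly) {
        for (unsigned lx = 0; lx < kGroupSize; ++lx) {
            out.insert({(groupX / 2) * kGroupSize + lx,
                        (groupY / 2) * kGroupSize + ly});
        }
    }
    return out;
}
```

In this model, halving makes the 256 threads of the surviving group land on only 64 distinct texels (one 8x8 quarter of the 16x16 tile), and which quarter depends on which of the four groups happened to finish last. Rebuilding the ID yields the same 256 texels regardless of which group survives.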
Any updates on getting a fixed version of this? :)
Hey, we have tried this version of the gist and it is definitely slower on
a Radeon RX 580 and an NVIDIA 2060 than the version in MiniEngine for DirectX.
The MiniEngine version uses LDS for now; if it used wave intrinsics it would be faster.
Can you comment?
At 4096x4096 on the RX 580:
Gist: 1806240 ns
MiniEngine: 875680 ns
Are we doing something wrong?
Measured with PIX 1908.02.
There's probably a lot about compute shaders I don't know, but I wonder why this can't be done with just:
Removing that return takes it from 6 ms to 23 ms, so something happens to the thread groups. Why is the LDS necessary?
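One way to reason about the counter and the LDS value: `InterlockedAdd` hands the pre-increment value only to the thread that issued it, so the gist stores that value in groupshared memory where, after the barrier, every thread in the group can read it and take the same branch. The C++ sketch below models only the counter logic on the CPU, with `std::atomic::fetch_add` standing in for `InterlockedAdd`. It assumes the counter starts at zero, and the function names are hypothetical. It checks that for every possible finish order of the four groups in a 2x2 tile, exactly one group sees the value 3, and that group is the last one to finish.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>

// Returns the ID of the single group that observes a pre-increment value
// of 3, or -1 if zero or several groups observe it.
int survivorForOrder(const std::array<int, 4>& finishOrder) {
    std::atomic<unsigned> counter{0};  // stands in for Counters[Mip][Group.xy / 2]
    int survivor = -1;
    int survivorCount = 0;
    for (int group : finishOrder) {
        // fetch_add returns the value before the increment, like the
        // original_value output of InterlockedAdd.
        unsigned previous = counter.fetch_add(1);
        if (!(previous < 3)) {  // the gist returns early when the value is < 3
            survivor = group;
            ++survivorCount;
        }
    }
    return survivorCount == 1 ? survivor : -1;
}

// Tries all 24 finish orders of the four groups in a 2x2 tile.
bool lastFinisherAlwaysSurvives() {
    std::array<int, 4> order{0, 1, 2, 3};
    do {
        if (survivorForOrder(order) != order.back()) {
            return false;
        }
    } while (std::next_permutation(order.begin(), order.end()));
    return true;
}
```

This says nothing about performance. It only illustrates why the early `return` and the counter belong together: the three groups that exit are the ones whose output the surviving group goes on to consume.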