// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae

// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    // We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
    // down the sample location to get the exact center of our "starting" texel. The starting texel will be at
    // location [1, 1] in the grid, where [0, 0] is the top left corner.
    float2 samplePos = uv * texSize;
    float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;

    // Compute the fractional offset from our starting texel to our original sample location, which we'll
    // feed into the Catmull-Rom spline function to get our filter weights.
    float2 f = samplePos - texPos1;

    // Compute the Catmull-Rom weights using the fractional offset that we calculated earlier.
    // These equations are pre-expanded based on our knowledge of where the texels will be located,
    // which lets us avoid having to evaluate a piece-wise function.
    float2 w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float2 w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float2 w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float2 w3 = f * f * (-0.5f + 0.5f * f);

    // Work out weighting factors and sampling offsets that will let us use bilinear filtering to
    // simultaneously evaluate the middle 2 samples from the 4x4 grid.
    float2 w12 = w1 + w2;
    float2 offset12 = w2 / (w1 + w2);

    // Compute the final UV coordinates we'll use for sampling the texture
    float2 texPos0 = texPos1 - 1;
    float2 texPos3 = texPos1 + 2;
    float2 texPos12 = texPos1 + offset12;

    texPos0 /= texSize;
    texPos3 /= texSize;
    texPos12 /= texSize;

    float4 result = 0.0f;
    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos0.y), 0.0f) * w0.x * w0.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos0.y), 0.0f) * w12.x * w0.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos0.y), 0.0f) * w3.x * w0.y;

    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos12.y), 0.0f) * w0.x * w12.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos12.y), 0.0f) * w12.x * w12.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos12.y), 0.0f) * w3.x * w12.y;

    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos3.y), 0.0f) * w0.x * w3.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos3.y), 0.0f) * w12.x * w3.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos3.y), 0.0f) * w3.x * w3.y;

    return result;
}
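As a side note on why `w12` and `offset12` let one bilinear fetch stand in for the two middle texels, here is a short one-dimensional derivation. The symbols $T_1$, $T_2$ (the texel values at `texPos1` and `texPos1 + 1`) and $t$ are introduced here only for illustration; they do not appear in the code.

```latex
% A bilinear fetch at texPos1 + t, with 0 <= t <= 1, returns
\[
  B(t) = (1 - t)\,T_1 + t\,T_2 .
\]
% Choose t = offset12 = w_2 / (w_1 + w_2) and scale by w_{12} = w_1 + w_2:
\[
  (w_1 + w_2)\,B\!\left(\frac{w_2}{w_1 + w_2}\right)
  = (w_1 + w_2)\left(\frac{w_1}{w_1 + w_2}\,T_1 + \frac{w_2}{w_1 + w_2}\,T_2\right)
  = w_1 T_1 + w_2 T_2 .
\]
% The offset is a valid bilinear position because, for 0 <= f <= 1,
% w_1 >= 0 and w_2 >= 0, and
\[
  w_1 + w_2 = 1 + \tfrac{1}{2}\,f\,(1 - f) \;\ge\; 1 ,
\]
% so the division is safe and 0 <= t <= 1.
```

Applying this on both axes reduces the 4 taps per axis to 3 (`texPos0`, `texPos12`, `texPos3`), which is where the 3 x 3 = 9 fetches come from. The outer weights `w0` and `w3` are not positive on this interval, so they cannot be merged with their neighbours in the same way.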
btw a coworker suggested this small optimization:
// get rid of f3, and:
float2 w0 = (1.0f / 2.0f) * f * (-1.0f + f * (2.0f - f));
float2 w1 = (1.0f / 6.0f) * f2 * (-15.0f + 9.0f * f) + 1.0f;
float2 w2 = (1.0f / 6.0f) * f * (3.0f + f * (12.0f - f * 9.0f));
float2 w3 = (1.0f / 2.0f) * f2 * (f - 1.0f);
Checking with Pyramid using AMDDXX for Bonaire target:
VGPRs: 51 -> 49
VALU: 147 -> 146
Alternatively putting the polynomials straight in horner-form:
float2 w0 = f * ( -0.5 + f * (1.0 - 0.5*f));
float2 w1 = 1.0 + f * f * (-2.5 + 1.5*f );
float2 w2 = f * ( 0.5 + f * (2.0 - 1.5*f) );
float2 w3 = f * f * (-0.5 + 0.5 * f);
Pyramid, AMDDXX, Bonaire ( http://pastebin.com/12ccE9Lk )
VGPRs: 55 -> 47
VALU: 146 -> 135
Thanks guys! I updated the code with the optimizations.
Wouldn't this be more optimal with use of Gather()?
https://docs.microsoft.com/en-us/windows/desktop/direct3dhlsl/dx-graphics-hlsl-to-gather
If you are doing the filtering yourself and you want to use a linear buffer, you can use rawBuffer0.Load4().
Coherency might or might not be worse, it depends. Dynamic updates are usually easier.
For the 5-tap version, should we renormalize the weights?
float weight = w12.x * w0.y + w0.x * w12.y + w12.x * w12.y + w3.x * w12.y + w12.x * w3.y;
result /= weight;
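To make the question concrete, here is a minimal sketch of what a 5-tap variant with that renormalization might look like. It is not code from the gist: the function name `SampleTextureCatmullRom5Tap` is hypothetical, and it assumes the 5-tap version is simply the 9-tap function with the four corner taps removed and the result divided by the sum of the remaining weights.

```hlsl
// Hypothetical 5-tap variant (a sketch, not the gist author's code): same setup as
// SampleTextureCatmullRom, but the four corner taps are dropped and the result is
// divided by the sum of the remaining five weights.
float4 SampleTextureCatmullRom5Tap(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    float2 samplePos = uv * texSize;
    float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;
    float2 f = samplePos - texPos1;

    // Same Catmull-Rom weights as the 9-tap version.
    float2 w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float2 w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float2 w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float2 w3 = f * f * (-0.5f + 0.5f * f);

    float2 w12 = w1 + w2;
    float2 offset12 = w2 / w12;

    float2 texPos0 = (texPos1 - 1.0f) / texSize;
    float2 texPos3 = (texPos1 + 2.0f) / texSize;
    float2 texPos12 = (texPos1 + offset12) / texSize;

    // Weights of the five taps that remain once the corners are removed.
    float wTop    = w12.x * w0.y;
    float wLeft   = w0.x  * w12.y;
    float wCenter = w12.x * w12.y;
    float wRight  = w3.x  * w12.y;
    float wBottom = w12.x * w3.y;

    float4 result = 0.0f;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos0.y),  0.0f) * wTop;
    result += tex.SampleLevel(linearSampler, float2(texPos0.x,  texPos12.y), 0.0f) * wLeft;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos12.y), 0.0f) * wCenter;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x,  texPos12.y), 0.0f) * wRight;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos3.y),  0.0f) * wBottom;

    // The dropped corner weights are products of w0 and w3 terms, so the remaining
    // weights no longer sum exactly to 1; dividing by their sum restores that.
    return result / (wTop + wLeft + wCenter + wRight + wBottom);
}
```

The corner weights are products of two small values (`w0` and `w3` stay close to zero over the whole range of `f`), so the sum of the five remaining weights stays close to 1 and the division only makes a small correction.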
Some quick benchmark results with an R9 380:
All of these numbers were gathered by using the above code for reprojecting the previous frame's result for the purpose of TAA, using the following shader: https://github.com/TheRealMJP/MSAAFilter/blob/master/MSAAFilter/Resolve.hlsl.
The above code was used as-is for testing the 9-tap version. The 1-tap version just uses bilinear filtering, and is there for reference. The 16-tap version used a modified version of the above function that performs 16 texture loads, with no sampling or filtering (I didn't use the filtering code that's checked in for that file, which has several branches for choosing the filter kernel and a few other options). The 5-tap version is the same as the above except that it omits the corner taps, as suggested by Jorge Jimenez in his SIGGRAPH 2016 presentation about Filmic SMAA: http://advances.realtimerendering.com/s2016/Filmic%20SMAA%20v7.pptx
Here are some more timings captured with a GTX 980: