The NEON rasterizer in neon_overlay.cpp is already the most taxing single pass in the pipeline. It runs only over the screen-space footprint of the dice, roughly 60,000–80,000 pixels per frame when several dice are in flight. For each batch of four pixels, the NEON path handles perspective-correct UV interpolation with two vector instructions, then falls back to scalar code for the texture fetch, tangent-space lighting, and framebuffer blend. That comes to about 20 floating-point operations per pixel, processed four at a time.
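To make the "two vector instructions" step concrete, here is a minimal sketch of perspective-correct UV recovery for one batch of four pixels. It is not the code from neon_overlay.cpp: the names `UV4` and `perspective_uv4` are hypothetical, and it assumes that u/w, v/w, and 1/w have already been interpolated linearly in screen space, so that the two vector operations are the two divides. On AArch64 it uses `vdivq_f32`; elsewhere it falls back to an equivalent scalar loop.

```cpp
#include <array>
#include <cstddef>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical container for one batch of four pixels (not from neon_overlay.cpp).
struct UV4 {
    std::array<float, 4> u;
    std::array<float, 4> v;
};

// Recover perspective-correct (u, v) for four adjacent pixels.
// Assumption: u_over_w, v_over_w and inv_w each point to four floats that
// were interpolated linearly in screen space (u/w, v/w and 1/w).
inline UV4 perspective_uv4(const float* u_over_w,
                           const float* v_over_w,
                           const float* inv_w) {
    UV4 out{};
#if defined(__aarch64__)
    // Loads and stores aside, the arithmetic is two vector divides.
    const float32x4_t q = vld1q_f32(inv_w);
    vst1q_f32(out.u.data(), vdivq_f32(vld1q_f32(u_over_w), q));
    vst1q_f32(out.v.data(), vdivq_f32(vld1q_f32(v_over_w), q));
#else
    // Portable fallback: the same arithmetic, one lane at a time.
    for (std::size_t i = 0; i < 4; ++i) {
        out.u[i] = u_over_w[i] / inv_w[i];
        out.v[i] = v_over_w[i] / inv_w[i];
    }
#endif
    return out;
}
```

Calling `perspective_uv4` once per four-pixel batch leaves the texture fetch, lighting, and blend to per-lane scalar code that reads `out.u[i]` and `out.v[i]`, which mirrors the split described above.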
A separable Gaussian blur — the standard cheap blur — cannot skip empty pixels. It must iterate across the full framebuffer (750 × 560 = 420,000 pixels) in two passes (horizontal + vertical), reading
$$\frac{420{,}000 \times 2(2r+1)}{80{,}000 \times 5}