Wgpu degenerification performance report

TL;DR:

  • Compile times are 13-16% shorter if you have several backends (always the case on Windows), with slight but consistent improvements if you use only a single one (the default on Mac)
  • Runtime performance of compute passes gets worse by 10-30% (with a 40% worse outlier for a lot of dispatches in a single pass)
  • Runtime performance of render passes is sometimes 17-34% better and sometimes 10% worse 🤷
  • Stuck doc-gen issue is fixed!
  • wgpu-info Windows binary size goes down; the Mac binary size goes up unless you add more backends

All timings were taken with rustc 1.76.0 (07dca489a 2024-02-04) and cargo 1.76.0 (c84b36747 2024-01-18)

Test setups:

  • Windows: Windows 11 @ 7950X3D (16-core Ryzen 9), RTX 4070, Vulkan for benchmarks
  • Mac: macOS 14.4.1, M1 Max (64 GB)
    • Mac results were initially rather unstable; I waited for the machine to be fully charged and rebooted it. This made a huge difference in the measured times & the stability of results.

Some text editing occurred while running the tests, but I ensured no language servers etc. were running.

Debug compile times

cargo clean && cargo build --timings -p wgpu

Codegen times are shown after the hyphen, e.g. "6.2s - 4.5s (72%)" means 6.2s total for the crate, of which 4.5s (72%) was codegen.

trunk (34b0df277), Windows

Run 1:

total:       20.8s
- wgpu:       6.2s - 4.5s (72%)
- wgpu-core:  5.5s - 2.5s (45%)
- wgpu-hal:   4.6s - 1.9s (41%)

Run 2:

total:       20.6s
- wgpu:       6.2s - 4.5s (73%)
- wgpu-core:  5.4s - 2.4s (44%) 	
- wgpu-hal:   4.5s - 1.9s (41%)

Now (2024-08-10), Windows

Run 1:

total:       17.4s
- wgpu:       1.7s - 0.7s (40%)
- wgpu-core:  5.8s - 2.8s (48%)
- wgpu-hal:   4.8s - 1.9s (40%)

Run 2:

total:       17.2s
- wgpu:       1.7s - 0.7s (40%)
- wgpu-core:  5.7s - 2.6s (47%)
- wgpu-hal:   4.7s - 1.9s (40%)

trunk (34b0df277), Mac

Run 1:

total:       11.8s
- wgpu:       2.6s - 1.4s (56%)
- wgpu-core:  2.6s - 1.4s (56%)
- wgpu-hal:   3.4s - 0.9s (26%) 	

Run 2:

total:       12.0s
- wgpu:       2.7s - 1.4s (53%)
- wgpu-core:  3.4s - 1.0s (30%)
- wgpu-hal:   1.4s - 0.5s (34%)

with --all-features

total:       24.3s
- wgpu:       5.7s - 2.5s (60%)
- wgpu-core:  5.8s - 1.8s (30%)
- wgpu-hal:   3.7s - 1.3s (36%)

Now (2024-08-10), Mac

Run 1:

total:       11.3s
- wgpu:       1.4s - 0.5s (38%)
- wgpu-core:  4.3s - 1.7s (40%)
- wgpu-hal:   1.5s - 0.5s (29%)

Run 2:

total:       11.4s
- wgpu:       1.3s - 0.5s (38%)
- wgpu-core:  4.3s - 1.7s (40%)
- wgpu-hal:   1.6s - 0.5s (28%)

with --all-features

total:       21.0s
- wgpu:       1.5s - 0.6s (38%)
- wgpu-core:  6.4s - 2.3s (36%)
- wgpu-hal:   3.9s - 1.3s (34%)

Conclusions

  • Windows:
    • reduced compile times by 16% with default features
    • calculated as 1 - (17.4 + 17.2) / (20.8 + 20.6) ≈ 16%
  • Mac:
    • Little difference with default features, just about 4.6% shorter compile times
      • expected since there are fewer backends
    • --all-features shows a very similar overall picture to Windows, shaving off 13.6% of compile time in this run
  • wgpu now compiles significantly faster since it no longer has to monomorphize wgpu-core
  • wgpu-core is a bit slower now, likely because it previously passed some wgpu-hal monomorphization cost on to downstream crates (?)
  • As expected, wgpu-hal compiles slightly slower now due to the added Dyn types & forwarding (see the sketch below)
  • The amount of CPU time needed is actually lower than the speed-up numbers indicate, since wgpu now finishes before wgpu-core!
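
For context, here is a minimal, hypothetical sketch of the trade-off behind these numbers; the trait and types are made up and are not wgpu's actual API. A generic API gets monomorphized again in every downstream crate that uses it, while a type-erased ("degenerified") dyn API is compiled once in its defining crate, at the cost of a virtual call per invocation.

trait Api {
    fn dispatch(&self, workgroups: u32);
}

struct VulkanLikeBackend;

impl Api for VulkanLikeBackend {
    fn dispatch(&self, workgroups: u32) {
        // A real backend would record a GPU command here; this stand-in just prints.
        println!("dispatching {workgroups} workgroups");
    }
}

// Before: generic over the backend. Every downstream crate calling this with a
// concrete `A` gets its own monomorphized copy, shifting codegen cost onto it.
fn run_generic<A: Api>(api: &A, workgroups: u32) {
    api.dispatch(workgroups);
}

// After: type-erased. Compiled exactly once where it is defined, at the cost of
// dynamic dispatch ("Dyn types & forwarding") on every call.
fn run_dyn(api: &dyn Api, workgroups: u32) {
    api.dispatch(workgroups);
}

fn main() {
    let backend = VulkanLikeBackend;
    run_generic(&backend, 64);
    run_dyn(&backend, 64);
}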

Runtime performance

cargo bench -p wgpu-benchmark, first on trunk (34b0df277), then on the new version (2024-08-10)

Progressions & Regressions

Filtered to show only the results flagged by criterion with time changes over 5%. Numbers are changes in run time (minus is good, plus is bad).
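
The benchmark names separate the time spent recording a pass ("Computepass Time" / "Renderpass Time") from the time spent submitting the finished command buffer ("Submit Time"). As a rough, hypothetical sketch of what that split corresponds to in wgpu's public API (this is not the actual benchmark code; the helper name and the pre-created device, queue and pipeline are assumptions):

use std::time::Instant;

// Hypothetical helper, not part of the real benchmark suite: records
// `dispatches` dispatches into one compute pass, then submits it, timing the
// two phases separately.
fn time_one_pass(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    pipeline: &wgpu::ComputePipeline,
    dispatches: u32,
) {
    let record_start = Instant::now();
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
        pass.set_pipeline(pipeline);
        for _ in 0..dispatches {
            pass.dispatch_workgroups(1, 1, 1);
        }
    }
    // "Computepass Time" roughly corresponds to the command recording above.
    let record_time = record_start.elapsed();

    let submit_start = Instant::now();
    queue.submit(Some(encoder.finish()));
    // "Submit Time" roughly corresponds to handing the finished buffer to the queue.
    let submit_time = submit_start.elapsed();

    println!("record: {record_time:?}, submit: {submit_time:?}");
}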

Windows

  • -16.9% Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
  • -34.9% Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
  • +11.7% Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
  • +8.2% Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
  • +15.3% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
  • +15.6% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
  • +19.6% Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
  • +17.8% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
  • +27.2% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
  • +13.7% Computepass: Multi Threaded/2 threads x 5000 dispatch
  • +6.6% Computepass: Multi Threaded/8 threads x 1250 dispatch
  • +9.1% Computepass: Empty Submit with 60000 Resources

Full report Windows: TODO LINK

Mac

⚠️ Something is broken with the render pass benchmarks: they take excruciatingly long to finish on Mac, so I instead ran two sessions, one with a compute filter and one with a resource filter. Our benchmarks generally run really poorly on Mac; the whole run took XX minutes to finish!

  • +13.403% Computepass: Single Threaded/1 computepasses x 10000 dispatches (Computepass Time)
  • +15.714% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
  • +13.439% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
  • +10.347% Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
  • +41.242% Computepass: Single Threaded/1 computepasses x 10000 dispatches (Submit Time)
  • +26.259% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
  • +28.777% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
  • +22.554% Computepass: Single Threaded/8 computepasses x 1250 dispatches (Submit Time)

Full report Mac: TODO LINK

Conclusions

Regressions were expected, but there were also some progressions:

  • Render pass times got better, render submit times got worse.
  • Compute pass times got worse overall
    • consistently so on Mac
  • Resource creation benchmarks are unaffected

Docs

cargo clean && cargo doc --timings -p wgpu

trunk (34b0df277), Windows

Using RUSTFLAGS="--cfg wgpu_core_doc" on trunk didn't work for enabling the docs for me; instead I simply removed the block in lib.rs that disables the docs for wgpu-core (the kind of cfg gate sketched below).

total: N/A, it got stuck as expected (cancelled after 4:30 min)
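
For illustration of the mechanism only (this is hypothetical and not wgpu's actual lib.rs): a custom --cfg flag passed via RUSTFLAGS can gate whether an item is compiled or shown in the docs at all, roughly like this.

// Only compiled (and therefore documented) when the build runs with
// RUSTFLAGS="--cfg wgpu_core_doc".
#[cfg(wgpu_core_doc)]
pub mod wgpu_core_docs {
    //! Documentation for the wgpu-core layer would be pulled in behind this gate.
}

// Alternative: always compile the item, but hide it from rustdoc unless the
// flag is set.
#[cfg_attr(not(wgpu_core_doc), doc(hidden))]
pub mod always_compiled_docs {}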

Now (2024-08-10), Windows

Run 1:

total:                 18.2s
- wgpu lib (doc):       1.7s
- wgpu-core lib (doc):  3.1s
- wgpu-hal lib (doc):   2.2s

Run 2:

total:                 19.1s
- wgpu lib (doc):       1.6s
- wgpu-core lib (doc):  3.3s
- wgpu-hal lib (doc):   2.2s

Conclusion

Fixes #4905

Release binary size

cargo clean && cargo build -p wgpu-info --release --timings

With --timings just for good measure ;-)

trunk (34b0df277), Windows

  • Time: 37.2s
  • Size: 7.14MiB

Now (2024-08-10), Windows

  • Time: 31.2s
  • Size: 6.01MiB

trunk (34b0df277), Mac

  • Time: 20.7s
  • Size: 3.9MiB

Now (2024-08-10), Mac

  • Time: 20.1s
  • Size: 4.9MiB

Conclusions

  • Windows: reduced binary size by 16%
  • Mac: binary size increased by 27% 😱
    • Why? A small increase makes sense because of the added Dyn types in a single-backend environment, but this is a lot.
    • Experiment: Using wgpu = { workspace = true, features = ["serde", "angle", "vulkan-portability"] }
      • Before: 9.0 MiB
      • After: 7.8 MiB