Wgpu degenerification performance report

TL;DR:

  • Compile times are 13-16% shorter if you have several backends (always the case on Windows), with slight but consistent improvements if you use only a single one (the default on Mac)
  • Runtime performance of compute passes gets worse by 10-30% (with a 40% worse outlier for a lot of dispatches in a single pass)
  • Runtime performance of render passes is sometimes 17-34% better and sometimes 10% worse 🤷
  • Stuck doc-gen issue is fixed!
  • wgpu-info Windows binary size goes down; the Mac binary size goes up unless you add more backends

All timings were taken with rustc 1.76.0 (07dca489a 2024-02-04) and cargo 1.76.0 (c84b36747 2024-01-18)

Test setups:

  • Windows: Windows 11 @ 7950X3D (16-core Ryzen 9), RTX 4070, Vulkan for benchmarks
  • Mac: macOS 14.4.1, M1 Max (64 GB)
    • Mac results were initially rather unstable; I waited for the machine to be fully charged and rebooted it. This made a huge difference in the measured times & the stability of results.

Some text editing occurred while running the tests, but I ensured no language servers etc. were running.

Debug compile times

cargo clean && cargo build --timings -p wgpu

Codegen times are shown after the hyphen, e.g. "6.2s - 4.5s (72%)" means 6.2s total for the crate, of which 4.5s (72%) was codegen.

trunk (34b0df277), Windows

Run 1:

total:       20.8s
- wgpu:       6.2s - 4.5s (72%)
- wgpu-core:  5.5s - 2.5s (45%)
- wgpu-hal:   4.6s - 1.9s (41%)

Run 2:

total:       20.6s
- wgpu:       6.2s - 4.5s (73%)
- wgpu-core:  5.4s - 2.4s (44%) 	
- wgpu-hal:   4.5s - 1.9s (41%)

Now (2024-08-10), Windows

Run 1:

total:       17.4s
- wgpu:       1.7s - 0.7s (40%)
- wgpu-core:  5.8s - 2.8s (48%)
- wgpu-hal:   4.8s - 1.9s (40%)

Run 2:

total:       17.2s
- wgpu:       1.7s - 0.7s (40%)
- wgpu-core:  5.7s - 2.6s (47%)
- wgpu-hal:   4.7s - 1.9s (40%)

trunk (34b0df277), Mac

Run 1:

total:       11.8s
- wgpu:       2.6s - 1.4s (56%)
- wgpu-core:  2.6s - 1.4s (56%)
- wgpu-hal:   3.4s - 0.9s (26%) 	

Run 2:

total:       12.0s
- wgpu:       2.7s - 1.4s (53%)
- wgpu-core:  3.4s - 1.0s (30%)
- wgpu-hal:   1.4s - 0.5s (34%)

with --all-features

total:       24.3s
- wgpu:       5.7s - 2.5s (60%)
- wgpu-core:  5.8s - 1.8s (30%)
- wgpu-hal:   3.7s - 1.3s (36%)

Now (2024-08-10), Mac

Run 1:

total:       11.3s
- wgpu:       1.4s - 0.5s (38%)
- wgpu-core:  4.3s - 1.7s (40%)
- wgpu-hal:   1.5s - 0.5s (29%)

Run 2:

total:       11.4s
- wgpu:       1.3s - 0.5s (38%)
- wgpu-core:  4.3s - 1.7s (40%)
- wgpu-hal:   1.6s - 0.5s (28%)

with --all-features

total:       21.0s
- wgpu:       1.5s - 0.6s (38%)
- wgpu-core:  6.4s - 2.3s (36%)
- wgpu-hal:   3.9s - 1.3s (34%)

Conclusions

  • Windows:
    • reduced compile times by 16% with default features
    • calculated as 1 - (17.4 + 17.2) / (20.8 + 20.6) ≈ 16%
  • Mac:
    • Little difference with default features, just about 4.6% shorter compile times
      • expected since there are fewer backends
    • --all-features shows a very similar overall picture to Windows, shaving off 13.6% of compile time in this run
  • wgpu now compiles significantly faster since it no longer has to monomorphize wgpu-core
  • wgpu-core is a bit slower now, likely because it previously passed some wgpu-hal monomorphization cost on to downstream crates (?)
  • As expected, wgpu-hal compiles slightly slower now due to the added Dyn types & forwarding (see the sketch below)
  • The amount of CPU time needed is actually lower than the speed-up numbers indicate, since wgpu now finishes before wgpu-core!
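
For context, here is a minimal, hypothetical sketch of the trade-off behind these numbers; the trait and types are made up and are not wgpu's actual API. A generic API gets monomorphized again in every downstream crate that uses it, while a type-erased ("degenerified") dyn API is compiled once in its defining crate, at the cost of a virtual call per invocation.

trait Api {
    fn dispatch(&self, workgroups: u32);
}

struct VulkanLikeBackend;

impl Api for VulkanLikeBackend {
    fn dispatch(&self, workgroups: u32) {
        // A real backend would record a GPU command here; this stand-in just prints.
        println!("dispatching {workgroups} workgroups");
    }
}

// Before: generic over the backend. Every downstream crate calling this with a
// concrete `A` gets its own monomorphized copy, shifting codegen cost onto it.
fn run_generic<A: Api>(api: &A, workgroups: u32) {
    api.dispatch(workgroups);
}

// After: type-erased. Compiled exactly once where it is defined, at the cost of
// dynamic dispatch ("Dyn types & forwarding") on every call.
fn run_dyn(api: &dyn Api, workgroups: u32) {
    api.dispatch(workgroups);
}

fn main() {
    let backend = VulkanLikeBackend;
    run_generic(&backend, 64);
    run_dyn(&backend, 64);
}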

Runtime performance

cargo bench -p wgpu-benchmark, first on trunk (34b0df277), then on the new version (2024-08-10)

Progressions & Regressions

Filtered to show only the results flagged by criterion with time changes over 5%. Numbers are changes in run time (minus is good, plus is bad).
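
The benchmark names separate the time spent recording a pass ("Computepass Time" / "Renderpass Time") from the time spent submitting the finished command buffer ("Submit Time"). As a rough, hypothetical sketch of what that split corresponds to in wgpu's public API (this is not the actual benchmark code; the helper name and the pre-created device, queue and pipeline are assumptions):

use std::time::Instant;

// Hypothetical helper, not part of the real benchmark suite: records
// `dispatches` dispatches into one compute pass, then submits it, timing the
// two phases separately.
fn time_one_pass(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    pipeline: &wgpu::ComputePipeline,
    dispatches: u32,
) {
    let record_start = Instant::now();
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
        pass.set_pipeline(pipeline);
        for _ in 0..dispatches {
            pass.dispatch_workgroups(1, 1, 1);
        }
    }
    // "Computepass Time" roughly corresponds to the command recording above.
    let record_time = record_start.elapsed();

    let submit_start = Instant::now();
    queue.submit(Some(encoder.finish()));
    // "Submit Time" roughly corresponds to handing the finished buffer to the queue.
    let submit_time = submit_start.elapsed();

    println!("record: {record_time:?}, submit: {submit_time:?}");
}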

Windows

  • -16.9% Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
  • -34.9% Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
  • +11.7% Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
  • +8.2% Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
  • +15.3% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
  • +15.6% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
  • +19.6% Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
  • +17.8% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
  • +27.2% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
  • +13.7% Computepass: Multi Threaded/2 threads x 5000 dispatch
  • +6.6% Computepass: Multi Threaded/8 threads x 1250 dispatch
  • +9.1% Computepass: Empty Submit with 60000 Resources

Full report Windows: TODO LINK

Mac

⚠️ Something is broken with the render pass benchmarks: they take excruciatingly long to finish on Mac, so I instead ran two sessions, one with a compute filter and one with a resource filter. Our benchmarks generally run really poorly on Mac; the whole run took XX minutes to finish!

  • +13.403% Computepass: Single Threaded/1 computepasses x 10000 dispatches (Computepass Time)
  • +15.714% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
  • +13.439% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
  • +10.347% Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
  • +41.242% Computepass: Single Threaded/1 computepasses x 10000 dispatches (Submit Time)
  • +26.259% Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
  • +28.777% Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
  • +22.554% Computepass: Single Threaded/8 computepasses x 1250 dispatches (Submit Time)

Full report Mac: TODO LINK

Conclusions

Regressions were expected, but there were also some progressions:

  • Render pass times got better, render submit times got worse.
  • Compute pass times got worse overall
    • consistently so on Mac
  • Resource creation benchmarks are unaffected

Docs

cargo clean && cargo doc --timings -p wgpu

trunk (34b0df277), Windows

Using RUSTFLAGS="--cfg wgpu_core_doc" on trunk didn't work for enabling the docs for me; instead I simply removed the block in lib.rs that disables the docs for wgpu-core (the kind of cfg gate sketched below).

total: N/A, it got stuck as expected (cancelled after 4:30 min)
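
For illustration of the mechanism only (this is hypothetical and not wgpu's actual lib.rs): a custom --cfg flag passed via RUSTFLAGS can gate whether an item is compiled or shown in the docs at all, roughly like this.

// Only compiled (and therefore documented) when the build runs with
// RUSTFLAGS="--cfg wgpu_core_doc".
#[cfg(wgpu_core_doc)]
pub mod wgpu_core_docs {
    //! Documentation for the wgpu-core layer would be pulled in behind this gate.
}

// Alternative: always compile the item, but hide it from rustdoc unless the
// flag is set.
#[cfg_attr(not(wgpu_core_doc), doc(hidden))]
pub mod always_compiled_docs {}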

Now (2024-08-10), Windows

Run 1:

total:                 18.2s
- wgpu lib (doc):       1.7s
- wgpu-core lib (doc):  3.1s
- wgpu-hal lib (doc):   2.2s

Run 2:

total:                 19.1s
- wgpu lib (doc):       1.6s
- wgpu-core lib (doc):  3.3s
- wgpu-hal lib (doc):   2.2s

Conclusion

Fixes #4905

Release binary size

cargo clean && cargo build -p wgpu-info --release --timings

With --timings just for good measure ;-)

trunk (34b0df277), Windows

  • Time: 37.2s
  • Size: 7.14MiB

Now (2024-08-10), Windows

  • Time: 31.2s
  • Size: 6.01MiB

trunk (34b0df277), Mac

  • Time: 20.7s
  • Size: 3.9MiB

Now (2024-08-10), Mac

  • Time: 20.1s
  • Size: 4.9MiB

Conclusions

  • Windows: reduced binary size by 16%
  • Mac: binary size increased by 27% 😱
    • Why? A small increase makes sense because of the added Dyn types in a single-backend environment, but this is a lot.
    • Experiment: Using wgpu = { workspace = true, features = ["serde", "angle", "vulkan-portability"] }
      • Before: 9.0 MiB
      • After: 7.8 MiB