TL;DR:
- Compile times are 13-16% shorter if you have several backends (always on Windows), slight but consistent improvements if you use only a single one (default on Mac)
- Runtime performance of compute passes gets worse by 10-30% (with a 40% worse outlier for a lot of dispatches in a single pass)
- Runtime performance of render passes sometimes 17-34% better and sometimes 10% worse 🤷
- Stuck doc-gen issue is fixed!
- `wgpu-info`: Windows binary size goes down, Mac binary size goes up unless you add more backends
All timings using rustc 1.76.0 (07dca489a 2024-02-04), cargo 1.76.0 (c84b36747 2024-01-18)
Test setups:
- Windows: Windows 11 @ 7950X3D (16cores Ryzen9), RTX 4070, Vulkan for benchmarks
- Mac: Mac OS 14.4.1, M1 Max (64gb)
- Mac results were initially rather unstable; waiting for the machine to be fully charged and rebooting made a huge difference in timings & stability of results.
Some text editing occurred while running tests, but I ensured no language servers etc. were running.
`cargo clean && cargo build --timings -p wgpu`
Codegen times after the hyphen.
Windows, trunk, Run 1:
total: 20.8s
- wgpu: 6.2s - 4.5s (72%)
- wgpu-core: 5.5s - 2.5s (45%)
- wgpu-hal: 4.6s - 1.9s (41%)
Run 2:
total: 20.6s
- wgpu: 6.2s - 4.5s (73%)
- wgpu-core: 5.4s - 2.4s (44%)
- wgpu-hal: 4.5s - 1.9s (41%)
Windows, this PR, Run 1:
total: 17.4s
- wgpu: 1.7s - 0.7s (40%)
- wgpu-core: 5.8s - 2.8s (48%)
- wgpu-hal: 4.8s - 1.9s (40%)
Run 2:
total: 17.2s
- wgpu: 1.7s - 0.7s (40%)
- wgpu-core: 5.7s - 2.6s (47%)
- wgpu-hal: 4.7s - 1.9s (40%)
Mac, trunk, Run 1:
total: 11.8s
- wgpu: 2.6s - 1.4s (56%)
- wgpu-core: 2.6s - 1.4s (56%)
- wgpu-hal: 3.4s - 0.9s (26%)
Run 2:
total: 12.0s
- wgpu: 2.7s - 1.4s (53%)
- wgpu-core: 3.4s - 1.0s (30%)
- wgpu-hal: 1.4s - 0.5s (34%)
with --all-features
total: 24.3s
- wgpu: 5.7s - 2.5s (60%)
- wgpu-core: 5.8s - 1.8s (30%)
- wgpu-hal: 3.7s - 1.3s (36%)
Mac, this PR, Run 1:
total: 11.3s
- wgpu: 1.4s - 0.5s (38%)
- wgpu-core: 4.3s - 1.7s (40%)
- wgpu-hal: 1.5s - 0.5s (29%)
Run 2:
total: 11.4s
- wgpu: 1.3s - 0.5s (38%)
- wgpu-core: 4.3s - 1.7s (40%)
- wgpu-hal: 1.6s - 0.5s (28%)
with --all-features
total: 21.0s
- wgpu: 1.5s - 0.6s (38%)
- wgpu-core: 6.4s - 2.3s (36%)
- wgpu-hal: 3.9s - 1.3s (34%)
- Windows:
- reduced compile times by 16% on default features
- calculated as (1 - (17.4 + 17.2) / (20.8 + 20.6))
- Mac:
- Little difference with default features, just about 4.6% reduced compile times
- expected since there are fewer backends
- `--all-features`: very similar overall picture to Windows, shaving off 13.6% compile times in this run
- `wgpu` now compiles significantly faster since it no longer has to monomorphize `wgpu-core`
- `wgpu-core` is a bit slower now, likely because previously it passed some `wgpu-hal` monomorphization cost on (?)
- As expected, `wgpu-hal` is slightly slower now due to the added `Dyn` types & forwarding.
- The amount of CPU needed is actually lower than the speed-up numbers indicate, since `wgpu` now finishes before `wgpu-core`!
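The compile-time/run-time trade-off above can be illustrated with a minimal sketch (a hypothetical `Backend` trait, not wgpu's actual types): with generics, every downstream crate compiles its own copy of the generic code per backend type, while with `dyn` a single type-erased copy is compiled once at the price of a vtable indirection per call.

```rust
// Hypothetical backend trait, standing in for wgpu-hal's API traits.
trait Backend {
    fn name(&self) -> &'static str;
}

struct Vulkan;
impl Backend for Vulkan {
    fn name(&self) -> &'static str { "vulkan" }
}

struct Metal;
impl Backend for Metal {
    fn name(&self) -> &'static str { "metal" }
}

// Generic version: the calling crate monomorphizes one copy of this
// function per backend type it uses (codegen cost moves downstream).
fn describe_generic<B: Backend>(b: &B) -> String {
    format!("backend: {}", b.name())
}

// Dyn version: a single type-erased copy is compiled once, and each
// call goes through a vtable instead.
fn describe_dyn(b: &dyn Backend) -> String {
    format!("backend: {}", b.name())
}

fn main() {
    assert_eq!(describe_generic(&Vulkan), "backend: vulkan");
    assert_eq!(describe_dyn(&Metal), "backend: metal");
    println!("ok");
}
```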
## Runtime performance
Ran `cargo bench -p wgpu-benchmark`, first on trunk (34b0df277), then on (2023-08-10)
Filtered out everything flagged by criterion with time changes over 5%. Numbers are changes in run time (minus good, plus bad).
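For reference, the sign convention can be sketched as a tiny helper (my own illustration, not criterion's code): the percentage is the relative change in mean run time between the two runs.

```rust
// Relative change in run time, in percent.
// Negative = faster after the change (good), positive = slower (bad),
// matching the sign convention of the numbers below.
fn change_pct(before_ms: f64, after_ms: f64) -> f64 {
    (after_ms - before_ms) / before_ms * 100.0
}

fn main() {
    // A pass that took 10.0 ms before and 8.31 ms after improved by 16.9%.
    assert!((change_pct(10.0, 8.31) - (-16.9)).abs() < 0.01);
    // One that went from 10.0 ms to 11.17 ms regressed by 11.7%.
    assert!((change_pct(10.0, 11.17) - 11.7).abs() < 0.01);
    println!("ok");
}
```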
- `-16.9%` Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
- `-34.9%` Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
- `+11.7%` Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
- `+8.2%` Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
- `+15.3%` Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
- `+15.6%` Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
- `+19.6%` Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
- `+17.8%` Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
- `+27.2%` Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
- `+13.7%` Computepass: Multi Threaded/2 threads x 5000 dispatch
- `+6.6%` Computepass: Multi Threaded/8 threads x 1250 dispatch
- `+9.1%` Computepass: Empty Submit with 60000 Resources
Full report Windows: TODO LINK
(one with a `compute` filter and one with a `resource` filter)
Our benchmarks run really poorly on Mac; the whole run took XX minutes to finish!
- `+13.403%` Computepass: Single Threaded/1 computepasses x 10000 dispatches (Computepass Time)
- `+15.714%` Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
- `+13.439%` Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
- `+10.347%` Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
- `+41.242%` Computepass: Single Threaded/1 computepasses x 10000 dispatches (Submit Time)
- `+26.259%` Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
- `+28.777%` Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
- `+22.554%` Computepass: Single Threaded/8 computepasses x 1250 dispatches (Submit Time)
Full report Mac: TODO LINK
Expected regressions, but also got some progressions:
- Render pass times got better, render submit times got worse.
- Compute pass times got worse overall
- consistently so on Mac
- Resource creation benchmarks are unaffected
`cargo clean && cargo doc --timings -p wgpu`
Using `RUSTFLAGS="--cfg wgpu_core_doc"` on trunk didn't work for enabling docs for me; instead I simply removed the block in `lib.rs` that disables the docs for `wgpu-core`.
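For context on the mechanism (a generic sketch, unrelated to wgpu's actual `lib.rs` contents): a `--cfg` flag passed via `RUSTFLAGS` becomes a compile-time condition that `#[cfg(...)]` and `cfg!(...)` can test, which is how a doc-disabling block can be gated on a flag like `wgpu_core_doc`.

```rust
// In a normal build `wgpu_core_doc` is not set, so this is false;
// building with RUSTFLAGS="--cfg wgpu_core_doc" flips it to true.
fn core_docs_enabled() -> bool {
    cfg!(wgpu_core_doc)
}

// Items can also be removed from the build (and from rustdoc output)
// entirely when the flag is absent:
#[cfg(wgpu_core_doc)]
fn only_exists_with_the_flag() {}

fn main() {
    // The flag was not passed for this build.
    assert!(!core_docs_enabled());
    println!("ok");
}
```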
Trunk: total: N/A, stuck as expected (cancelled after 4:30 min)
This PR, Run 1:
total: 18.2s
- wgpu lib (doc): 1.7s
- wgpu-core lib (doc): 3.1s
- wgpu-hal lib (doc): 2.2s
Run 2:
total: 19.1s
- wgpu lib (doc): 1.6s
- wgpu-core lib (doc): 3.3s
- wgpu-hal lib (doc): 2.2s
Fixes #4905
`cargo clean && cargo build -p wgpu-info --release --timings`
With timings just for good measure ;-)
Windows, trunk:
- Time: 37.2s
- Size: 7.14 MiB
Windows, this PR:
- Time: 31.2s
- Size: 6.01 MiB
Mac, trunk:
- Time: 20.7s
- Size: 3.9 MiB
Mac, this PR:
- Time: 20.1s
- Size: 4.9 MiB
- Windows: reduced binary size by 16%
- Mac: binary size increased by 27% 😱
- Why? A small increase makes sense because of the added `Dyn` types in a single-backend environment, but this is a lot.
- Experiment: Using `wgpu = { workspace = true, features = ["serde", "angle", "vulkan-portability"] }`
- Before: 9.0 MiB
- After: 7.8 MiB
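The headline percentages can be sanity-checked from the sizes above (small gaps to the quoted numbers presumably come from less-rounded size figures):

```rust
// Relative binary-size change in percent: negative = smaller binary.
fn change_pct(before_mib: f64, after_mib: f64) -> f64 {
    (after_mib - before_mib) / before_mib * 100.0
}

fn main() {
    // Windows: 7.14 MiB -> 6.01 MiB is roughly -16%.
    assert!((change_pct(7.14, 6.01) + 15.8).abs() < 0.1);
    // Mac: 3.9 MiB -> 4.9 MiB is roughly +26% (quoted as ~27% above).
    assert!((change_pct(3.9, 4.9) - 25.6).abs() < 0.1);
    println!("ok");
}
```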