@alexcrichton
Last active September 24, 2017 14:42

Conclusions

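  • We should enable multiple codegen units by default in debug mode. This roughly halves the compile time of the `cargo` crate (about 26 s with one codegen unit vs. about 14 s with 8 or more).
  • We should enable ThinLTO by default in release mode with multiple codegen units. Even with ThinLTO, 16 codegen units compile `cargo` in about 46 s instead of about 126 s.
  • Overall, this may halve Rust compile times across the board.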

tl;dr: let's use multiple codegen units to enable parallel compilation, and use ThinLTO so that release builds don't lose runtime performance.
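As a toy model of why more codegen units allow parallel compilation, the Rust sketch below deals hypothetical per-function costs round-robin into N units and processes each unit on its own thread. The names `partition`, `compile_parallel` and `critical_path` are made up for illustration, and rustc's real partitioning works differently; the sketch only shows that the total work stays the same while the wall-clock bound (the most expensive unit) shrinks.

```rust
use std::thread;

/// Toy model, not rustc's real partitioning: each entry is the cost of a
/// hypothetical "function", dealt round-robin into `cgus` units (cgus >= 1).
fn partition(costs: &[u64], cgus: usize) -> Vec<Vec<u64>> {
    let mut units = vec![Vec::new(); cgus];
    for (i, &cost) in costs.iter().enumerate() {
        units[i % cgus].push(cost);
    }
    units
}

/// "Compiles" every unit on its own thread and returns the total work done.
/// Summing the costs stands in for real optimization and code generation.
fn compile_parallel(costs: &[u64], cgus: usize) -> u64 {
    let units = partition(costs, cgus);
    thread::scope(|s| {
        let handles: Vec<_> = units
            .iter()
            .map(|unit| s.spawn(move || unit.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

/// With enough idle cores, wall-clock time is bounded by the largest unit.
fn critical_path(costs: &[u64], cgus: usize) -> u64 {
    partition(costs, cgus)
        .iter()
        .map(|unit| unit.iter().sum::<u64>())
        .max()
        .unwrap_or(0)
}

fn main() {
    let costs: Vec<u64> = (1..=16).collect();
    for cgus in [1, 2, 4, 8] {
        println!(
            "cgus={cgus}: total work {}, critical path {}",
            compile_parallel(&costs, cgus),
            critical_path(&costs, cgus)
        );
    }
}
```

With costs 1 to 16, the critical path drops from 136 with one unit to 40 with four, assuming at least four idle cores. The model ignores serial work outside code generation, so real builds will not scale this well.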

These are the times it takes to compile the `cargo` crate with optimizations enabled, for each combination of codegen units (`cgus`), `opt_level` and LTO setting. Here `lto=true` means ThinLTO is enabled for just the Cargo crate itself.
Compile times in seconds, rounded from the raw `Duration` output:
      opt_level=2           opt_level=3
cgus  lto=false   lto=true  lto=false   lto=true
   1     126.41     126.30     120.74     124.14
   2      70.06      93.03      70.29      96.46
   3      46.87      63.78      47.54      63.95
   4      45.43      61.11      45.97      62.46
   8      39.35      55.30      39.81      56.58
  16      31.28      45.83      32.70      47.50
  32      28.08      42.30      29.22      43.67
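To make the release-mode numbers easier to compare, the sketch below computes, for `opt_level=2`, how much faster each configuration builds than the current default (1 codegen unit, no ThinLTO), and how much time ThinLTO adds on top of a build with the same number of codegen units. The times are copied from the measurements above, rounded to 0.01 s; `speedup` and `thinlto_overhead` are illustrative helpers, not part of any tool.

```rust
/// Release-mode compile times for the `cargo` crate at opt_level=2, copied
/// (rounded to 0.01 s) from the measurements above:
/// (codegen units, seconds without ThinLTO, seconds with ThinLTO).
const OPT2: [(u32, f64, f64); 7] = [
    (1, 126.41, 126.30),
    (2, 70.06, 93.03),
    (3, 46.87, 63.78),
    (4, 45.43, 61.11),
    (8, 39.35, 55.30),
    (16, 31.28, 45.83),
    (32, 28.08, 42.30),
];

/// How many times faster a build is than the baseline: baseline / time.
fn speedup(baseline: f64, time: f64) -> f64 {
    baseline / time
}

/// Extra build time ThinLTO adds, as a fraction of the non-LTO build time.
fn thinlto_overhead(no_lto: f64, thinlto: f64) -> f64 {
    (thinlto - no_lto) / no_lto
}

fn main() {
    // Baseline: 1 codegen unit, no ThinLTO.
    let baseline = OPT2[0].1;
    for &(cgus, no_lto, thin) in OPT2.iter() {
        println!(
            "cgus={:2}  speedup (no LTO) x{:.2}  speedup (ThinLTO) x{:.2}  ThinLTO overhead {:+.0}%",
            cgus,
            speedup(baseline, no_lto),
            speedup(baseline, thin),
            100.0 * thinlto_overhead(no_lto, thin),
        );
    }
}
```

For example, 16 codegen units with ThinLTO still build about 2.8 times faster than the baseline, even though ThinLTO adds about 46.5% over the 16-unit build without it.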
This is a comparison of how long it takes to compile the `cargo` crate
in *debug* mode, using the specified number of codegen units.
cgus  time (s)
   1     25.99
   2     19.63
   4     16.76
   8     14.58
  16     14.15
  32     14.50
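One rough way to read the plateau around 14 s is Amdahl's law, S(n) = 1 / ((1 - p) + p/n), where p is the fraction of the build that splits perfectly across n codegen units. Modelling the build this way is an assumption: it ignores per-unit overhead, which the slight slowdown at 32 units suggests is not zero. The sketch below solves for the p implied by each debug-mode measurement; `parallel_fraction` is a hypothetical helper name.

```rust
/// Debug-mode compile times for the `cargo` crate, copied (rounded to
/// 0.01 s) from the table above: (codegen units, seconds).
const DEBUG: [(u32, f64); 6] = [
    (1, 25.99),
    (2, 19.63),
    (4, 16.76),
    (8, 14.58),
    (16, 14.15),
    (32, 14.50),
];

/// Amdahl's law: S(n) = 1 / ((1 - p) + p / n). Solving for the parallel
/// fraction p given an observed speedup S gives p = (1 - 1/S) / (1 - 1/n).
fn parallel_fraction(speedup: f64, n: f64) -> f64 {
    (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)
}

fn main() {
    let t1 = DEBUG[0].1;
    // Skip n = 1, where the formula divides by zero.
    for &(n, t) in DEBUG.iter().skip(1) {
        let s = t1 / t;
        let p = parallel_fraction(s, n as f64);
        println!("cgus={:2}  speedup x{:.2}  implied parallel fraction {:.2}", n, s, p);
    }
}
```

The implied fraction stays between about 0.46 and 0.50 for every unit count, which is consistent with the build time leveling off at a little over half of the single-unit time.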
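This is the result of `cargo benchcmp` on the `regex` crate's benchmark suite, comparing one codegen unit without ThinLTO (the default today) against 8 codegen units with ThinLTO enabled. A positive `diff %` means the 8-CGU ThinLTO build is slower; `speedup` is the old time divided by the new time, so values below 1 are regressions.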
name                                    cgu-1-lto-false ns/iter  cgu-8-lto-true ns/iter  diff ns/iter   diff %  speedup
misc::anchored_literal_long_non_match   19 (20526 MB/s)          26 (15000 MB/s)                    7   36.84%  x 0.73
misc::anchored_literal_short_match      22 (1181 MB/s)           24 (1083 MB/s)                     2    9.09%  x 0.92
misc::anchored_literal_short_non_match  18 (1444 MB/s)           26 (1000 MB/s)                     8   44.44%  x 0.69
misc::easy0_1K                          15 (70066 MB/s)          16 (65687 MB/s)                    1    6.67%  x 0.94
misc::easy0_1MB                         18 (58255722 MB/s)       20 (52430150 MB/s)                 2   11.11%  x 0.90
misc::easy0_32                          15 (3933 MB/s)           17 (3470 MB/s)                     2   13.33%  x 0.88
misc::easy0_32K                         15 (2186333 MB/s)        17 (1929117 MB/s)                  2   13.33%  x 0.88
misc::literal                           15 (3400 MB/s)           14 (3642 MB/s)                    -1   -6.67%  x 1.07
misc::match_class                       62 (1306 MB/s)           66 (1227 MB/s)                     4    6.45%  x 0.94
misc::medium_1K                         15 (70133 MB/s)          17 (61882 MB/s)                    2   13.33%  x 0.88
misc::medium_1MB                        18 (58255777 MB/s)       21 (49933523 MB/s)                 3   16.67%  x 0.86
misc::medium_32                         16 (3750 MB/s)           17 (3529 MB/s)                     1    6.25%  x 0.94
misc::medium_32K                        15 (2186400 MB/s)        17 (1929176 MB/s)                  2   13.33%  x 0.88
misc::replace_all                       163                      175                               12    7.36%  x 0.93
misc::reverse_suffix_no_quadratic       5,230 (1529 MB/s)        4,198 (1905 MB/s)             -1,032  -19.73%  x 1.25
regexdna::subst1                        895,716 (5675 MB/s)      825,029 (6161 MB/s)          -70,687   -7.89%  x 1.09
sherlock::name_alt2                     112,522 (5287 MB/s)      119,522 (4977 MB/s)            7,000    6.22%  x 0.94
sherlock::name_alt3                     123,715 (4808 MB/s)      130,415 (4561 MB/s)            6,700    5.42%  x 0.95
sherlock::name_alt5                     116,989 (5085 MB/s)      123,665 (4810 MB/s)            6,676    5.71%  x 0.95
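The derived columns in these tables follow from the two ns/iter columns. The sketch below recomputes them for the first row of the 8-CGU table; `compare` and `Comparison` are hypothetical names, and this reconstructs the arithmetic rather than reproducing code from `cargo benchcmp`.

```rust
/// The derived columns of one benchcmp row, recomputed from the two
/// ns/iter values (old = 1 CGU without ThinLTO, new = candidate build).
struct Comparison {
    diff_ns: i64,
    diff_pct: f64,
    speedup: f64,
}

fn compare(old_ns: u64, new_ns: u64) -> Comparison {
    let diff_ns = new_ns as i64 - old_ns as i64;
    Comparison {
        diff_ns,
        // Positive means the new build is slower.
        diff_pct: 100.0 * diff_ns as f64 / old_ns as f64,
        // Above 1.0 means the new build is faster.
        speedup: old_ns as f64 / new_ns as f64,
    }
}

fn main() {
    // misc::anchored_literal_long_non_match: 19 ns/iter vs 26 ns/iter
    let c = compare(19, 26);
    // prints: diff 7 ns, 36.84%, x 0.73
    println!("diff {} ns, {:.2}%, x {:.2}", c.diff_ns, c.diff_pct, c.speedup);
}
```

Running `compare(5230, 4198)` on the `misc::reverse_suffix_no_quadratic` row likewise reproduces the -1,032 ns, -19.73% and x 1.25 shown in the table.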
This is the result of `cargo benchcmp` on the `regex` crate's benchmark suite, comparing one codegen unit without ThinLTO (the default today) against 16 codegen units with ThinLTO enabled.
name                                    cgu-1-lto-false ns/iter  cgu-16-lto-true ns/iter  diff ns/iter   diff %  speedup
misc::anchored_literal_long_non_match   19 (20526 MB/s)          26 (15000 MB/s)                     7   36.84%  x 0.73
misc::anchored_literal_short_non_match  18 (1444 MB/s)           27 (962 MB/s)                       9   50.00%  x 0.67
misc::easy0_1K                          15 (70066 MB/s)          17 (61823 MB/s)                     2   13.33%  x 0.88
misc::easy0_1MB                         18 (58255722 MB/s)       19 (55189631 MB/s)                  1    5.56%  x 0.95
misc::easy0_32                          15 (3933 MB/s)           16 (3687 MB/s)                      1    6.67%  x 0.94
misc::easy0_32K                         15 (2186333 MB/s)        16 (2049687 MB/s)                   1    6.67%  x 0.94
misc::hard_1K                           60 (17516 MB/s)          64 (16421 MB/s)                     4    6.67%  x 0.94
misc::hard_32                           60 (983 MB/s)            64 (921 MB/s)                       4    6.67%  x 0.94
misc::hard_32K                          60 (546583 MB/s)         64 (512421 MB/s)                    4    6.67%  x 0.94
misc::literal                           15 (3400 MB/s)           14 (3642 MB/s)                     -1   -6.67%  x 1.07
misc::medium_1K                         15 (70133 MB/s)          16 (65750 MB/s)                     1    6.67%  x 0.94
misc::medium_1MB                        18 (58255777 MB/s)       20 (52430200 MB/s)                  2   11.11%  x 0.90
misc::medium_32K                        15 (2186400 MB/s)        16 (2049750 MB/s)                   1    6.67%  x 0.94
misc::replace_all                       163                      181                                18   11.04%  x 0.90
misc::reverse_suffix_no_quadratic       5,230 (1529 MB/s)        4,197 (1906 MB/s)              -1,033  -19.75%  x 1.25
regexdna::subst1                        895,716 (5675 MB/s)      815,747 (6231 MB/s)           -79,969   -8.93%  x 1.10
sherlock::name_alt2                     112,522 (5287 MB/s)      119,102 (4995 MB/s)             6,580    5.85%  x 0.94
sherlock::name_alt3                     123,715 (4808 MB/s)      129,966 (4577 MB/s)             6,251    5.05%  x 0.95
sherlock::name_alt5                     116,989 (5085 MB/s)      123,255 (4826 MB/s)             6,266    5.36%  x 0.95
sherlock::repeated_class_negation       79,450,010 (7 MB/s)      85,060,104 (6 MB/s)         5,610,094    7.06%  x 0.93
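Because most rows differ by only a few nanoseconds, a single summary number can help. The sketch below takes the geometric mean of the old/new ratios (the usual average for ratios) for the rows listed in the 16-CGU table. It covers only these rows, which may not be the whole suite, so treat it as a summary of this table rather than of `regex` overall; `ROWS` and `geomean_speedup` are illustrative names.

```rust
/// (old ns/iter, new ns/iter) for each row of the 16-CGU + ThinLTO
/// comparison above, in table order.
const ROWS: [(f64, f64); 20] = [
    (19.0, 26.0),
    (18.0, 27.0),
    (15.0, 17.0),
    (18.0, 19.0),
    (15.0, 16.0),
    (15.0, 16.0),
    (60.0, 64.0),
    (60.0, 64.0),
    (60.0, 64.0),
    (15.0, 14.0),
    (15.0, 16.0),
    (18.0, 20.0),
    (15.0, 16.0),
    (163.0, 181.0),
    (5_230.0, 4_197.0),
    (895_716.0, 815_747.0),
    (112_522.0, 119_102.0),
    (123_715.0, 129_966.0),
    (116_989.0, 123_255.0),
    (79_450_010.0, 85_060_104.0),
];

/// Geometric mean of old/new ratios: below 1.0 means the candidate build
/// is slower on average across these rows.
fn geomean_speedup(rows: &[(f64, f64)]) -> f64 {
    let log_sum: f64 = rows.iter().map(|&(old, new)| (old / new).ln()).sum();
    (log_sum / rows.len() as f64).exp()
}

fn main() {
    let faster = ROWS.iter().filter(|&&(old, new)| new < old).count();
    println!(
        "{} of {} rows faster, geometric-mean speedup x {:.3}",
        faster,
        ROWS.len(),
        geomean_speedup(&ROWS)
    );
}
```

On these 20 rows, 3 are faster with 16 codegen units and ThinLTO, and the geometric-mean speedup comes out at about x 0.93.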