This summer, I worked on adding more attributes to ThinLTO's index to allow for more optimizations.
LTO (Link-time optimization) combines the bitcode of every module in a project into a monolithic bitcode file during linking, so IPO (interprocedural optimizations) can be applied throughout a whole program. This improves the quality and the speed of the executables generated. However, LTO consumes a large amount of memory, so it isn't viable on many larger C/C++ codebases.
ThinLTO addresses this problem by introducing a light-weight in-memory index, that tracks attributes for relevant global values in a program. The index can be used to propagate information across a whole program, import and internalize functions, and apply whole program optimizations without having to store all the bitcode in memory.
ThinLTO is faster than LTO, consumes less resources, and lends itself to distributed compilation. However, there are still optimizations that ThinLTO can't apply because the required attributes aren't yet in the index. I worked to add more of these attributes for GSoC.
First, I had to find which test cases I could use for comparison between ThinLTO and LTO generated executables (using the Benchmarks directory of the test-suite):
(sorted by largest difference between lto and thinlto runtime speeds)
test lto thin
./SciMark2-C/scimark2.test 34.173120000000004 37.7289
./ASC_Sequoia/IRSmk/IRSmk.test 4.29938 6.10558
./ASC_Sequoia/CrystalMk/CrystalMk.test 4.377740000000001 5.4902
./mafft/pairlocalalign.test 17.1773 17.73384
./VersaBench/8b10b/8b10b.test 3.29366 3.5631600000000008
./ASC_Sequoia/AMGmk/AMGmk.test 9.06768 9.32828
./Olden/em3d/em3d.test 3.8204599999999997 4.03294
./VersaBench/ecbdes/ecbdes.test 2.10274 2.2632199999999996
./TSVC/LinearDependence-flt/LinearDependence-flt.test 2.1580199999999996 2.25726
./TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl.test 6.81688 6.9129000000000005
./TSVC/StatementReordering-dbl/StatementReordering-dbl.test 2.87866 2.97036
./Fhourstones/fhourstones.test 0.77992 0.8680399999999999
./TSVC/Expansion-dbl/Expansion-dbl.test 2.40276 2.48862
./TSVC/Symbolics-flt/Symbolics-flt.test 0.97872 1.01974
./Bullet/bullet.test 4.60032 4.63818
./TSVC/Reductions-flt/Reductions-flt.test 4.92826 4.963900000000001
./TSVC/LoopRerolling-flt/LoopRerolling-flt.test 2.774 2.8075200000000002
./Prolangs-C++/life/life.test 1.08738 1.11896
./TSVC/ControlFlow-dbl/ControlFlow-dbl.test 4.1200600000000005 4.1431000000000004
./Olden/bh/bh.test 1.2872599999999998 1.30836
./Trimaran/netbench-url/netbench-url.test 2.1750999999999996 2.19426
./MallocBench/espresso/espresso.test 0.39128 0.40953999999999996
./NPB-serial/is/is.test 8.116200000000001 8.132380000000001
./Trimaran/enc-rc4/enc-rc4.test 0.5858599999999999 0.5984999999999999
./VersaBench/bmm/bmm.test 2.35784 2.3693599999999995
./sim/sim.test 2.9933 3.00376
./TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.test 3.0367 3.0470800000000002
./Olden/perimeter/perimeter.test 0.18706 0.19741999999999998
./TSVC/ControlLoops-flt/ControlLoops-flt.test 2.13158 2.1418399999999997
./TSVC/Equivalencing-dbl/Equivalencing-dbl.test 1.80118 1.8109000000000002
./VersaBench/beamformer/beamformer.test 0.77484 0.7838200000000001
./nbench/nbench.test 1.7093400000000003 1.71828
./TSVC/Recurrences-dbl/Recurrences-dbl.test 3.6061799999999997 3.6148199999999995
./Olden/tsp/tsp.test 0.7094 0.71542
./TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl.test 3.3644399999999997 3.36972
./FreeBench/distray/distray.test 0.10466 0.10918000000000001
./TSVC/Searching-flt/Searching-flt.test 2.2469 2.2511
./Olden/bisort/bisort.test 0.47929999999999995 0.48312
./MallocBench/cfrac/cfrac.test 0.9850200000000001 0.9884600000000001
./MiBench/consumer-typeset/consumer-typeset.test 0.10983999999999998 0.11300000000000002
./Trimaran/enc-3des/enc-3des.test 1.28424 1.28718
./MiBench/telecomm-CRC32/telecomm-CRC32.test 0.2762 0.2786
./llubenchmark/llu.test 6.75924 6.76164
./Olden/power/power.test 0.71944 0.7218
./mediabench/g721/g721encode/encode.test 0.02988 0.031979999999999995
./Olden/voronoi/voronoi.test 0.23626 0.23801999999999998
./MiBench/network-patricia/network-patricia.test 0.07289999999999999 0.07458000000000001
./FreeBench/fourinarow/fourinarow.test 0.22271999999999997 0.22432
./TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test 3.6113399999999998 3.61276
./MiBench/telecomm-FFT/telecomm-fft.test 0.03662 0.038040000000000004
./MiBench/consumer-lame/consumer-lame.test 0.1679 0.1692
./Olden/health/health.test 0.28254 0.28384
./McCat/17-bintr/bintr.test 0.0843 0.0854
./McCat/18-imp/imp.test 0.06532000000000002 0.06642
./MallocBench/gs/gs.test 0.03152 0.032600000000000004
./Trimaran/netbench-crc/netbench-crc.test 0.5355599999999999 0.53654
./FreeBench/analyzer/analyzer.test 0.06734 0.06832
./McCat/04-bisect/bisect.test 0.09903999999999999 0.09992000000000001
./Olden/mst/mst.test 0.07338 0.07405999999999999
./Trimaran/enc-md5/enc-md5.test 1.26304 1.2636999999999998
./mediabench/mpeg2/mpeg2dec/mpeg2decode.test 0.0073999999999999995 0.00804
./TSVC/InductionVariable-dbl/InductionVariable-dbl.test 4.00996 4.0106
./Prolangs-C/gnugo/gnugo.test 0.044199999999999996 0.044700000000000004
./mediabench/adpcm/rawcaudio/rawcaudio.test 0.00112 0.0015799999999999998
./Prolangs-C/bison/mybison.test 0.00152 0.00194
./MiBench/consumer-jpeg/consumer-jpeg.test 0.0031 0.00346
./Olden/treeadd/treeadd.test 0.19097999999999998 0.19129999999999997
./McCat/05-eks/eks.test 0.0027999999999999995 0.0030800000000000003
./FreeBench/pifft/pifft.test 0.10668 0.10695999999999999
./MiBench/network-dijkstra/network-dijkstra.test 0.03308 0.03332
./Prolangs-C/agrep/agrep.test 0.00262 0.0028200000000000005
./Prolangs-C++/city/city.test 0.00494 0.0051400000000000005
./MiBench/security-sha/security-sha.test 0.012680000000000002 0.012879999999999999
./mediabench/adpcm/rawdaudio/rawdaudio.test 0.0010999999999999998 0.0012799999999999999
./mediabench/gsm/toast/toast.test 0.0254 0.02556
./MiBench/telecomm-gsm/telecomm-gsm.test 0.21002 0.21017999999999998
./BitBench/drop3/drop3.test 0.24165999999999999 0.24180000000000001
./Prolangs-C++/employ/employ.test 0.00582 0.00592
./BitBench/uuencode/uuencode.test 0.0157 0.0157
./TSVC/Packing-flt/Packing-flt.test 2.59196 2.59192
./McCat/08-main/main.test 0.01992 0.019840000000000003
./MiBench/automotive-susan/automotive-susan.test 0.042859999999999995 0.04274000000000001
./BitBench/uudecode/uudecode.test 0.028559999999999995 0.02838
./Prolangs-C++/simul/simul.test 0.0061600000000000005 0.005980000000000001
./FreeBench/neural/neural.test 0.1026 0.10238
./McCat/03-testtrie/testtrie.test 0.0077399999999999995 0.007379999999999999
./mediabench/jpeg/jpeg-6a/cjpeg.test 0.00264 0.00224
./McCat/12-IOtest/iotest.test 0.12944000000000003 0.12883999999999998
./FreeBench/mason/mason.test 0.15511999999999998 0.15431999999999998
./Ptrdist/bc/bc.test 0.5263199999999999 0.52546
./McCat/01-qbsort/qbsort.test 0.053559999999999997 0.0526
./PAQ8p/paq8p.test 47.26171999999999 47.260099999999994
./Ptrdist/ft/ft.test 1.04444 1.0424399999999998
./Prolangs-C++/ocean/ocean.test 0.05891999999999999 0.05674
./FreeBench/pcompress2/pcompress2.test 0.11726 0.11438
./TSVC/Reductions-dbl/Reductions-dbl.test 2.2299200000000003 2.2268600000000003
./Fhourstones-3.1/fhourstones3.1.test 1.0269 1.02342
./McCat/09-vor/vor.test 0.09698 0.09332
./TSVC/Expansion-flt/Expansion-flt.test 1.5393000000000001 1.5355
./Ptrdist/ks/ks.test 1.14378 1.1399599999999999
./tramp3d-v4/tramp3d-v4.test 0.26248 0.25861999999999996
./TSVC/Packing-dbl/Packing-dbl.test 2.8963200000000002 2.89234
./Prolangs-C++/primes/primes.test 0.3233 0.31919999999999993
./TSVC/ControlLoops-dbl/ControlLoops-dbl.test 2.1750999999999996 2.17072
./MiBench/automotive-bitcount/automotive-bitcount.test 0.06721999999999999 0.062439999999999996
./Ptrdist/anagram/anagram.test 0.9008800000000001 0.89566
./TSVC/Recurrences-flt/Recurrences-flt.test 3.6074399999999995 3.6017
./BitBench/five11/five11.test 2.01152 2.00556
./MiBench/automotive-basicmath/automotive-basicmath.test 0.25332000000000005 0.24714
./TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test 6.46164 6.454420000000001
./TSVC/Searching-dbl/Searching-dbl.test 2.23234 2.2247999999999997
./TSVC/StatementReordering-flt/StatementReordering-flt.test 2.28326 2.27566
./TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 2.85818 2.8503200000000004
./Ptrdist/yacr2/yacr2.test 0.72272 0.7142000000000002
./TSVC/ControlFlow-flt/ControlFlow-flt.test 3.46012 3.45094
./7zip/7zip-benchmark.test 7.051 7.03934
./MiBench/security-rijndael/security-rijndael.test 0.0409 0.029180000000000005
./TSVC/Equivalencing-flt/Equivalencing-flt.test 1.1240999999999999 1.1116
./TSVC/LinearDependence-dbl/LinearDependence-dbl.test 2.8451999999999997 2.82726
./TSVC/IndirectAddressing-flt/IndirectAddressing-flt.test 3.0285 3.00964
./ASCI_Purple/SMG2000/smg2000.test 2.45564 2.43576
./TSVC/Symbolics-dbl/Symbolics-dbl.test 2.04544 2.02226
./TSVC/NodeSplitting-flt/NodeSplitting-flt.test 2.3962200000000005 2.3631800000000003
./TSVC/InductionVariable-flt/InductionVariable-flt.test 3.0465600000000004 3.0042
./VersaBench/dbms/dbms.test 1.0751 1.02504
./DOE-ProxyApps-C/XSBench/XSBench.test 3.0763399999999996 3.025
./TSVC/LoopRerolling-dbl/LoopRerolling-dbl.test 3.47882 3.39386
./TSVC/NodeSplitting-dbl/NodeSplitting-dbl.test 3.1038 3.01452
./TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt.test 1.7141000000000002 1.6207
./Trimaran/enc-pc1/enc-pc1.test 0.5199999999999999 0.21567999999999996
During the first few weeks, I spent most of my time trying to figure out what optimizations were being applied to LTO, but not to ThinLTO. After a few failed attempts (i.e. trying to compare the LTO and ThinLTO generated binaries directly, trying to diff the output of every single optimization pass), I found that the following method worked best:
- Pick the benchmark with the largest disparity between the LTO and ThinLTO runtime speed (where LTO was faster)
- Diff the results of
-stats(which shows statistics about which passes were able to apply optimizations) - Pick an IPO pass to focus on adding to ThinLTO
- Come up with a simple test case that shows how the pass isn't being fully applied in ThinLTO
Using this method, I found the following potential optimizations that could be added:
For the rest of GSoC I worked on implementing the FunctionAttrs pass which propagates information about functions throughout the callgraph (i.e. whether a function recurses, whether it reads or writes memory). This required adding the necessary attributes to the index, so that relevant attributes could be propagated during the thin-link step (which is the point in the process before code is generated, when just the index is available without the bitcode).
This required a bit of preliminary work. Since the index didn't have support for strongly connected component (SCC) navigation, I had to implement GraphTraits for the index. See patch: https://reviews.llvm.org/D36311
The problem with navigating the callgraph directly (without reducing it to a graph of SCCs), is that cycles (that are caused by recursion) are difficult to deal with. An SCC will reduce a cycle into a set of the connected nodes so that the graph can be traversed without worrying about direct or indirect recursion. It simplifies navigation of the whole program's reduced callgraph that's contained in the index to a simple post-order (bottom up) navigation of a DAG (directed acyclic graph).
See below image taken from from http://www.imm.dtu.dk/~hrni/PPA/slides6.pdf
Next, I implemented NoRecurse propagation and finalized another patch for modref information to propagate function attributes.
I plan to finish adding the FunctionAttrs pass then add the other optimizations listed above. They will require additional attributes to be added to the index including:
- Function arguments
- Return types
-
Add YAML information to Index and Add dump-summary command to llvm-lto2 tool - This patch will help simplify debugging and testing the ThinLTO index, since the current method is to dump the bitcode using
llvm-bcanalyzerwhich is brittle and difficult to use. The YAML format was partially implemented, so I finished up the implementation and added a subcommand to thellvm-lto2tool to give developers access to YAML dumping. -
You can see all my patches on my Phabricator profile
