Last active
October 27, 2022 20:40
-
-
Save AndreVallestero/ccea42e83011e9cd45452107090e2867 to your computer and use it in GitHub Desktop.
PROPELLER: Profile Guided Optimizing Large Scale LLVM-based Relinker
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f8f9fa;font:13px arial,sans-serif;text-align:left;color:#202124;border:0;margin:0;border-bottom:1px solid #dadce0;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#681da8}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style></head> | |
<body class="vsc-initialized" vlink="blue" link="blue" bgcolor="#ffffff"><div id="bN015htcoyT__google-cache-hdr"> | |
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> | |
<meta name="Producer" content="Skia/PDF m79"> | |
<title>PROPELLER: Profile Guided Optimizing Large Scale LLVM-based Relinker</title> | |
<table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="1"><b>Page 1</b></a></font></td></tr></tbody></table><font size="5" face="Times"><span style="font-size:36px;font-family:Times"> | |
<div style="position:absolute;top:285;left:336"><nobr>PROPELLER:</nobr></div> | |
<div style="position:absolute;top:343;left:125"><nobr><b>Pr</b>ofile Guided <b>Op</b>timizing <b>L</b>arge Scale</nobr></div> | |
<div style="position:absolute;top:397;left:270"><nobr><b>L</b>LVM-based <b>R</b>elinker</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:508;left:108"><nobr>Background</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:580;left:108"><nobr>We recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large google</nobr></div> | |
</span></font> | |
<font size="2" face="Times"><span style="font-size:7px;font-family:Times"> | |
<div style="position:absolute;top:579;left:514"><nobr>1</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:602;left:108"><nobr>benchmarks and noticed that it improves key performance metrics of these benchmarks by 2%</nobr></div> | |
<div style="position:absolute;top:625;left:108"><nobr>to 6%, which is pretty impressive as this is over and above a baseline binary already heavily</nobr></div> | |
<div style="position:absolute;top:647;left:108"><nobr>optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of</nobr></div> | |
<div style="position:absolute;top:670;left:108"><nobr>binaries optimized via <font color="#1155cc"><a href="https://reviews.llvm.org/D54175">Context-Sensitive PGO</a></font>.</nobr></div> | |
<div style="position:absolute;top:670;left:481"><nobr>While ThinLTO + PGO is also profile guided</nobr></div> | |
<div style="position:absolute;top:692;left:108"><nobr>and does very aggressive performance optimizations, there is more room for performance</nobr></div> | |
<div style="position:absolute;top:715;left:108"><nobr>improvements due to profile approximations while applying the transformations. BOLT uses</nobr></div> | |
<div style="position:absolute;top:737;left:108"><nobr>exact profiles from the final binary and is able to fill the gaps left by ThinLTO + PGO. The</nobr></div> | |
<div style="position:absolute;top:760;left:108"><nobr>performance improvements due to BOLT come from basic block layout, function reordering and</nobr></div> | |
<div style="position:absolute;top:782;left:108"><nobr>function splitting.</nobr></div> | |
<div style="position:absolute;top:827;left:108"><nobr>While BOLT does an excellent job of squeezing extra performance from highly optimized</nobr></div> | |
<div style="position:absolute;top:850;left:108"><nobr>binaries with optimizations such as code layout, it has these major issues:</nobr></div> | |
<div style="position:absolute;top:895;left:135"><nobr>1. It does not take advantage of distributed build systems.</nobr></div> | |
<div style="position:absolute;top:917;left:135"><nobr>2. It has scalability issues and to rewrite a binary with a ~300M text segment size:</nobr></div> | |
<div style="position:absolute;top:940;left:189"><nobr>a. Memory foot-print is 70G.</nobr></div> | |
<div style="position:absolute;top:962;left:189"><nobr>b. It takes more than 10 minutes to rewrite the binary.</nobr></div> | |
<div style="position:absolute;top:1007;left:108"><nobr>Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes</nobr></div> | |
<div style="position:absolute;top:1030;left:108"><nobr>and rewrites the final binary in one process. This limits the scalability of BOLT and the memory</nobr></div> | |
<div style="position:absolute;top:1052;left:108"><nobr>and time overhead shoots up quickly for large binaries.</nobr></div> | |
<div style="position:absolute;top:1097;left:108"><nobr>Inspired by the performance gains and to address the scalability issue of BOLT, we went about</nobr></div> | |
<div style="position:absolute;top:1120;left:108"><nobr>designing a scalable infrastructure that can perform BOLT-like post-link optimizations. In this</nobr></div> | |
<div style="position:absolute;top:1142;left:108"><nobr>RFC, we discuss our system, “Propeller”, which can perform profile guided link time binary</nobr></div> | |
</span></font> | |
<font size="2" face="Times"><span style="font-size:7px;font-family:Times"> | |
<div style="position:absolute;top:1202;left:108"><nobr>1 <font style="font-size:12px">A Post Link optimizer optimizes a binary or a shared object directly based on its run-time profile without</font></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:12px;font-family:Times"> | |
<div style="position:absolute;top:1220;left:108"><nobr>re-invoking the compiler. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:1363;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="2"><b>Page 2</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:1472;left:108"><nobr>optimizations in a scalable way and is friendly to distributed build systems. Our system</nobr></div> | |
<div style="position:absolute;top:1494;left:108"><nobr>leverages the existing capabilities of the compiler tool-chain and is not a stand alone tool. Like</nobr></div> | |
<div style="position:absolute;top:1517;left:108"><nobr>BOLT, our system boosts the performance of optimized binaries via link-time optimizations</nobr></div> | |
<div style="position:absolute;top:1539;left:108"><nobr>using accurate profiles of the binary. We discuss the Propeller system and show how to do the</nobr></div> | |
<div style="position:absolute;top:1562;left:108"><nobr>whole program basic block layout using Propeller.</nobr></div> | |
<div style="position:absolute;top:1607;left:108"><nobr>Propeller does whole program basic block layout at link time via basic block sections. We have</nobr></div> | |
<div style="position:absolute;top:1629;left:108"><nobr>added support for having each basic block in its own section which allows the linker to do</nobr></div> | |
<div style="position:absolute;top:1652;left:108"><nobr>arbitrary reorderings of basic blocks to achieve any desired fine-grain code layout which</nobr></div> | |
<div style="position:absolute;top:1674;left:108"><nobr>includes block layout, function splitting and function reordering.</nobr></div> | |
<div style="position:absolute;top:1719;left:108"><nobr>Our experiments on large real-world applications and SPEC with code layout show that</nobr></div> | |
<div style="position:absolute;top:1742;left:108"><nobr>Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time</nobr></div> | |
<div style="position:absolute;top:1764;left:108"><nobr>overhead.</nobr></div> | |
<div style="position:absolute;top:1809;left:108"><nobr>An LLVM branch with propeller patches is available in the git repository here:</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc"> | |
<div style="position:absolute;top:1832;left:108"><nobr><a href="https://github.com/google/llvm-propeller/">https://github.com/google/llvm-propeller/ </a><font color="#000000">We will upload patches for review for the various</font></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:1854;left:108"><nobr>elements.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:1907;left:108"><nobr>Quick Start - How to optimize further with Propeller?</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:1978;left:108"><nobr><b># git clone and build repo </b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:2000;left:108"><nobr><b>$</b> <b><font style="font-size:14px">cd $LLVM_DIR && </font></b><font style="font-size:14px"></font><b><font style="font-size:12px">git clone </font></b><font style="font-size:12px"></font><b><font style="font-size:12px"><a href="https://github.com/google/llvm-propeller.git">https://github.com/google/llvm-propeller.git</a></font><font style="font-size:14px"><a href="https://github.com/google/llvm-propeller.git"></a> </font></b></nobr></div> | |
<div style="position:absolute;top:2030;left:108"><nobr><b>$</b> <b><font style="font-size:14px">mkdir $BUILD_DIR && cd $BUILD_DIR </font></b></nobr></div> | |
<div style="position:absolute;top:2059;left:108"><nobr><b>$</b> <b><font style="font-size:14px">cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:2088;left:128"><nobr><b>$LLVM_DIR/llvm-propeller/llvm </b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:2111;left:108"><nobr><b>$</b> <b><font style="font-size:14px">ninja </font></b></nobr></div> | |
<div style="position:absolute;top:2140;left:108"><nobr><b>$</b> <b><font style="font-size:14px">export PATH=$BUILD_DIR/bin:$PATH </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:2169;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:2192;left:108"><nobr><b># Let’s Propeller-optimize the following program: </b></nobr></div> | |
</span></font> | |
<div style="position:absolute;top:2551;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="3"><b>Page 3</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:2921;left:815"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:2943;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:2966;left:108"><nobr><b># Step 1: Build the peak optimized binary with an additional flag. </b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:2988;left:108"><nobr><b>$</b> <b><font style="font-size:14px">clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:3017;left:128"><nobr><b>-fuse-ld=lld </b></nobr></div> | |
<div style="position:absolute;top:3040;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:3062;left:108"><nobr><b># Step 2: Profile the binary, only one side of the branch is executed. </b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:3085;left:108"><nobr><b>$</b> <b><font style="font-size:14px">perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:3114;left:118"><nobr><b>/dev/null </b></nobr></div> | |
<div style="position:absolute;top:3137;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:3159;left:108"><nobr><b># Step 3: Convert the profiles using the tool provided </b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:3182;left:108"><nobr><b>$</b> <b><font style="font-size:14px">$LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:3211;left:128"><nobr><b>--binary=./a.out.labels --profile=perf.data --out=perf.propeller </b></nobr></div> | |
<div style="position:absolute;top:3233;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:3256;left:108"><nobr><b># Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller </b></nobr></div> | |
<div style="position:absolute;top:3278;left:108"><nobr><b># flag changed. </b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:3301;left:108"><nobr><b>$</b> <b><font style="font-size:14px">clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:3330;left:118"><nobr><b>-fuse-ld=lld </b></nobr></div> | |
<div style="position:absolute;top:3353;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:3376;left:108"><nobr>In Step 4, the optimized bit code can be used if it is saved in Step1 as Propeller is active only</nobr></div> | |
<div style="position:absolute;top:3399;left:108"><nobr>during compile backend and link.</nobr></div> | |
<div style="position:absolute;top:3444;left:108"><nobr>The optimized binary has a different layout of the basic blocks in main to keep the executed</nobr></div> | |
<div style="position:absolute;top:3466;left:108"><nobr>blocks together and split the cold blocks. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:3739;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="4"><b>Page 4</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:3875;left:108"><nobr>Details of the various Steps to optimize with Propeller</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:3931;left:108"><nobr>Step1 : Build the peak-optimized binary with an additional flag</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:3987;left:108"><nobr><b>$</b> <b><font style="font-size:14px">clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:4016;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:4040;left:108"><nobr>This step builds an -O2 version, kept simple for illustration but could be using PGO and to</nobr></div> | |
<div style="position:absolute;top:4063;left:108"><nobr>obtain a highly optimized binary.</nobr></div> | |
<div style="position:absolute;top:4108;left:108"><nobr>The flag -fpropeller-label invokes the following flags:</nobr></div> | |
<div style="position:absolute;top:4151;left:135"><nobr><b>1. -fbasicblock-sections=labels </b></nobr></div> | |
<div style="position:absolute;top:4174;left:135"><nobr><b>2. -funique-internal-funcnames </b></nobr></div> | |
<div style="position:absolute;top:4196;left:135"><nobr><b>3. --Wl,--lto-basicblock-sections=labels (with LTO) </b></nobr></div> | |
<div style="position:absolute;top:4243;left:108"><nobr>These flags build the binary with labels for every basic block and unique names for all internal</nobr></div> | |
<div style="position:absolute;top:4265;left:108"><nobr>linkage functions.</nobr></div> | |
<div style="position:absolute;top:4333;left:108"><nobr>A quick inspection of the symbol table for the binary shows the following basic blocks sorted by</nobr></div> | |
<div style="position:absolute;top:4355;left:108"><nobr>virtual address for main and callee:</nobr></div> | |
<div style="position:absolute;top:4399;left:108"><nobr><b>00000000002010f0 T main </b></nobr></div> | |
<div style="position:absolute;top:4421;left:108"><nobr><b>0000000000201110 t a.BB.main </b></nobr></div> | |
<div style="position:absolute;top:4444;left:108"><nobr><b>0000000000201124 t aa.BB.main </b></nobr></div> | |
<div style="position:absolute;top:4466;left:108"><nobr><b>0000000000201150 T _Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:4489;left:108"><nobr><b>0000000000201156 t a.BB._Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:4511;left:108"><nobr><b>0000000000201167 t aa.BB._Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:4534;left:108"><nobr><b>0000000000201176 t aaa.BB._Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:4580;left:108"><nobr>The basic block names (unary) are chosen to allow tail merging of strings and keep strtab bloats</nobr></div> | |
<div style="position:absolute;top:4603;left:108"><nobr>minimal while retaining the name of the original function.</nobr></div> | |
<div style="position:absolute;top:4648;left:108"><nobr>Further, it is more efficient to split this step into two and save the optimized high level LLVM IR</nobr></div> | |
<div style="position:absolute;top:4670;left:108"><nobr>as it can be directly used in Step 4.</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:4717;left:108"><nobr>Step 2: Obtain PMU profiles with LBR sampling</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:4751;left:108"><nobr>Notice that the binary is created to run the loop an arbitrary number of times. Sample the binary</nobr></div> | |
<div style="position:absolute;top:4774;left:108"><nobr>with a huge trip count to get a meaningful number of profiles, with the <font color="#1155cc"><a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf</a></font></nobr></div> | |
<div style="position:absolute;top:4796;left:108"><nobr>(<font color="#1155cc"><a href="https://perf.wiki.kernel.org/index.php/Tutorial">https://perf.wiki.kernel.org/index.php/Tutorial</a></font>) utility: </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:4927;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="5"><b>Page 5</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:5057;left:108"><nobr><b>$</b> <b><font style="font-size:14px">perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 >& \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:5086;left:118"><nobr><b>/dev/null </b></nobr></div> | |
<div style="position:absolute;top:5133;left:108"><nobr>The profiles only exercise one side of the branch in the loop, the true branch, with even number</nobr></div> | |
<div style="position:absolute;top:5155;left:108"><nobr>of arguments.</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:5202;left:108"><nobr>Step 3: Convert the profiles with the tool provided</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:5257;left:108"><nobr><b>$</b> <b><font style="font-size:14px">$LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:5287;left:128"><nobr><b>--binary=./a.out.labels --profile=perf.data --out=perf.propeller </b></nobr></div> | |
<div style="position:absolute;top:5333;left:108"><nobr>The create_llvm_prof tool is the same tool used by AutoFDO to convert profiles and we have</nobr></div> | |
<div style="position:absolute;top:5355;left:108"><nobr>extended it for Propeller. The patched tool is part of the git repo and we use it to convert the</nobr></div> | |
<div style="position:absolute;top:5378;left:108"><nobr>raw PMU profiles into a format used by Propeller.</nobr></div> | |
<div style="position:absolute;top:5423;left:108"><nobr>The conversion does the following. The symbol table from the optimized binary is used to</nobr></div> | |
<div style="position:absolute;top:5445;left:108"><nobr>associate profile samples (virtual addresses) to basic blocks. The textual representation shows</nobr></div> | |
<div style="position:absolute;top:5468;left:108"><nobr>the execution profiles for all the basic blocks using their respective labels which makes it easy</nobr></div> | |
<div style="position:absolute;top:5490;left:108"><nobr>for the propeller optimizations to annotate CFGs.</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:5537;left:108"><nobr>Step 4 : Re-optimize with Propeller</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:18px;font-family:Times"> | |
<div style="position:absolute;top:5593;left:108"><nobr><b>$</b> <b><font style="font-size:14px">clang++ -O2 sample.cc -fpropeller-optimize=perf.propeller \ </font></b></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:5622;left:128"><nobr><b>-fuse-ld=lld </b></nobr></div> | |
<div style="position:absolute;top:5668;left:108"><nobr>Here the .cc file can be replaced with optimized .ll file from Step 1 if cached. We are working on</nobr></div> | |
<div style="position:absolute;top:5691;left:108"><nobr>a simple patch to easily bypass the LLVM optimization passes when the IR is detected to be</nobr></div> | |
<div style="position:absolute;top:5713;left:108"><nobr>optimized. The optimization options are exactly the same as in Step 1 except the Propeller</nobr></div> | |
<div style="position:absolute;top:5736;left:108"><nobr>flags.</nobr></div> | |
<div style="position:absolute;top:5781;left:108"><nobr>The flag -fpropeller-optimze=perf.propeller invokes the following flags:</nobr></div> | |
<div style="position:absolute;top:5824;left:135"><nobr><b>1. -fbasicblock-sections=perf.propeller </b></nobr></div> | |
<div style="position:absolute;top:5847;left:135"><nobr><b>2. -funique-internal-funcnames </b></nobr></div> | |
<div style="position:absolute;top:5869;left:135"><nobr><b>3. -Wl,--propeller=perf.propeller </b></nobr></div> | |
<div style="position:absolute;top:5892;left:135"><nobr><b>4. -Wl,--optimize-bb-jumps </b></nobr></div> | |
<div style="position:absolute;top:5914;left:135"><nobr><b>5. -Wl,--lto-basicblock-sections=perf.propeller (with LTO) </b></nobr></div> | |
<div style="position:absolute;top:5961;left:108"><nobr>The <i>“-fbasicblock-sections=” </i><font style="font-size:15px"></font>flag generates basic block sections for functions which have</nobr></div> | |
<div style="position:absolute;top:5983;left:108"><nobr>samples as indicated in perf.propeller. Basic block sections are explained in detail later. The</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:6115;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="6"><b>Page 6</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:6224;left:108"><nobr><i>“-funique-internal-funcnames” </i>flag generates unique names for functions with internal-linkage.</nobr></div> | |
<div style="position:absolute;top:6246;left:108"><nobr>The <i>“-Wl,--propeller=” </i>and <i>“-Wl,--optimize-bb-jumps” </i>flags trigger propeller optimizations and</nobr></div> | |
<div style="position:absolute;top:6269;left:108"><nobr>linker relaxation at link time.</nobr></div> | |
<div style="position:absolute;top:6314;left:108"><nobr>The propeller interface in the linker constructs the CFG and annotates it with profiles. The block</nobr></div> | |
<div style="position:absolute;top:6336;left:108"><nobr>reordering algorithm discovers the optimal layout of basic blocks and the linker reorders it to</nobr></div> | |
<div style="position:absolute;top:6359;left:108"><nobr>form the final binary. Debug Information and CFI for basic block sections are created with</nobr></div> | |
<div style="position:absolute;top:6381;left:108"><nobr>appropriate relocations so that the linker can create correct debuginfo and CFI for reordered</nobr></div> | |
<div style="position:absolute;top:6404;left:108"><nobr>functions.</nobr></div> | |
<div style="position:absolute;top:6426;left:108"><nobr>Add the option <i>“-Wl,--propeller-keep-named-symbols” </i>to the above build and notice how the</nobr></div> | |
<div style="position:absolute;top:6449;left:108"><nobr>final binary has a different ordering of basic blocks where the executed ones are kept close</nobr></div> | |
<div style="position:absolute;top:6471;left:108"><nobr>together:</nobr></div> | |
<div style="position:absolute;top:6538;left:108"><nobr><b>0000000000201000 T main </b></nobr></div> | |
<div style="position:absolute;top:6560;left:108"><nobr><b>0000000000201018 t a.BB.main </b></nobr></div> | |
<div style="position:absolute;top:6583;left:108"><nobr><b>0000000000201030 T _Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:6605;left:108"><nobr><b>0000000000201036 t a.BB._Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:6628;left:108"><nobr><b>0000000000201045 t aaa.BB._Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:6650;left:108"><nobr><b>000000000020104e t aa.BB.main </b></nobr></div> | |
<div style="position:absolute;top:6673;left:108"><nobr><b>0000000000201057 t aa.BB._Z6calleeb </b></nobr></div> | |
<div style="position:absolute;top:6695;left:108"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:6719;left:108"><nobr>By default, without option <b>-Wl,--propeller-keep-named-symbols, </b>all hot basic block</nobr></div> | |
<div style="position:absolute;top:6741;left:108"><nobr>section symbols are deleted as they are part of the main function and the first basic block of</nobr></div> | |
<div style="position:absolute;top:6764;left:108"><nobr>every function partition (cold partition) is retained.</nobr></div> | |
<div style="position:absolute;top:6809;left:108"><nobr>Function <i>main </i>has been split and the basic blocks <i>aa.BB.main </i>and <i>aa.BB._Z6calleeb </i>which are</nobr></div> | |
<div style="position:absolute;top:6831;left:108"><nobr>cold have been moved out and the executed basic blocks are kept together.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:6884;left:108"><nobr>BOLT versus Propeller</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:6949;left:108"><nobr>Design</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:7006;left:135"><nobr>1. The input to BOLT is an optimized binary and its hardware PMU profiles. BOLT</nobr></div> | |
<div style="position:absolute;top:7028;left:162"><nobr>disassembles the binary, optimizes, and writes out a new optimized binary. Propeller</nobr></div> | |
<div style="position:absolute;top:7051;left:162"><nobr>takes a different approach and re-links the binary from cached IR objects to achieve</nobr></div> | |
<div style="position:absolute;top:7073;left:162"><nobr>scalability. The input to Propeller is the optimized LLVM bit-code object files and the</nobr></div> | |
<div style="position:absolute;top:7096;left:162"><nobr>hardware profiles, and Propeller re-generates the native object files by invoking the</nobr></div> | |
<div style="position:absolute;top:7118;left:162"><nobr>compiler backends, in parallel or distributed, and re-links the object files to form the new</nobr></div> | |
<div style="position:absolute;top:7141;left:162"><nobr>optimized binary. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:7303;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="7"><b>Page 7</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:7412;left:135"><nobr>2. BOLT is a stand-alone LLVM based tool whereas Propeller is implemented as part of the</nobr></div> | |
<div style="position:absolute;top:7434;left:162"><nobr>existing LLVM tool-chain. Propeller leverages the tool chain’s existing capabilities for</nobr></div> | |
<div style="position:absolute;top:7457;left:162"><nobr>backend compilations and linking whereas BOLT uses LLVM APIs for disassembly,</nobr></div> | |
<div style="position:absolute;top:7479;left:162"><nobr>parsing debug information, and linking (e.g., JIT API to do the final link) to re-build the</nobr></div> | |
<div style="position:absolute;top:7502;left:162"><nobr>binary post optimizations. </nobr></div> | |
<div style="position:absolute;top:7524;left:135"><nobr>3. BOLT requires that all static relocations be retained in the final binary to aid</nobr></div> | |
<div style="position:absolute;top:7547;left:162"><nobr>dis-assembly whereas Propeller requires basic block labels to be emitted in the symbol</nobr></div> | |
<div style="position:absolute;top:7569;left:162"><nobr>table. Both static relocations and basic block labels increase the size of the final binary</nobr></div> | |
<div style="position:absolute;top:7592;left:162"><nobr>but do not affect performance as they are not loaded onto memory during execution.</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:7661;left:108"><nobr>Scalability</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:7718;left:135"><nobr>1. BOLT does not work well with distributed systems as it directly works on the input binary</nobr></div> | |
<div style="position:absolute;top:7740;left:162"><nobr>in a single process. Propeller is designed to play well with distributed systems and</nobr></div> | |
<div style="position:absolute;top:7763;left:162"><nobr>achieves this by splitting the optimization process into two parts, a distributed part and a</nobr></div> | |
<div style="position:absolute;top:7785;left:162"><nobr>whole-program part. </nobr></div> | |
<div style="position:absolute;top:7808;left:135"><nobr>2. Propeller handles incremental builds much better than BOLT. Small source changes,</nobr></div> | |
<div style="position:absolute;top:7830;left:162"><nobr>which do not affect the profile information significantly, only require re-compiling the</nobr></div> | |
<div style="position:absolute;top:7853;left:162"><nobr>affected IR objects, and then re-linking with the existing profiles to form the Propeller</nobr></div> | |
<div style="position:absolute;top:7875;left:162"><nobr>optimized binary. With BOLT, even small source changes invalidate the profile</nobr></div> | |
<div style="position:absolute;top:7898;left:162"><nobr>information, an error if the binary’s build id does not match the profile. Hence, the binary</nobr></div> | |
<div style="position:absolute;top:7920;left:162"><nobr>corresponding to the source change must be re-built and re-profiled. More importantly,</nobr></div> | |
<div style="position:absolute;top:7943;left:162"><nobr>the heavyweight process of disassembly, analysis, and rewrite by BOLT must be</nobr></div> | |
<div style="position:absolute;top:7965;left:162"><nobr>repeated.</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:8034;left:108"><nobr>Debug Information Handling</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:8091;left:108"><nobr>BOLT must parse DWARF debug information (DebugInfo) and Call Frame Information (CFI) and</nobr></div> | |
<div style="position:absolute;top:8114;left:108"><nobr>rewrite them according to the optimizations applied. Since BOLT parses CFI, it is able to rewrite</nobr></div> | |
<div style="position:absolute;top:8136;left:108"><nobr>the CFI descriptors compactly after the optimizations are applied. With Propeller, the changes</nobr></div> | |
<div style="position:absolute;top:8159;left:108"><nobr>needed for DebugInfo and CFI are made by the compiler with relocations created for the linker</nobr></div> | |
<div style="position:absolute;top:8181;left:108"><nobr>to patch. The linker does not have to parse DebugInfo and CFI and this is consistent with how</nobr></div> | |
<div style="position:absolute;top:8204;left:108"><nobr>the LLVM tool-chain fixes DebugInfo and CFI with function sections. The CFI information is</nobr></div> | |
<div style="position:absolute;top:8226;left:108"><nobr>created by the compiler with a very conservative assumption that no two basic blocks from the</nobr></div> | |
<div style="position:absolute;top:8249;left:108"><nobr>same function will be adjacent. This makes CFI unnecessarily bloated in the final binary. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:8491;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="8"><b>Page 8</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:8701;left:108"><nobr>Can this be done with PGO itself in the compiler?</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:8751;left:108"><nobr><i><b>TLDR: No, because of lack of precise context and path sensitive profiles.</b></i></nobr></div> | |
<div style="position:absolute;top:8796;left:108"><nobr>This question can be rephrased as how does BOLT/Propeller squeeze more performance from</nobr></div> | |
<div style="position:absolute;top:8818;left:108"><nobr>an already highly optimized PGO+LTO binary?</nobr></div> | |
<div style="position:absolute;top:8863;left:108"><nobr>Binaries built for peak performance use either instrumented PGO or AutoFDO, along with</nobr></div> | |
<div style="position:absolute;top:8886;left:108"><nobr>LTO/ThinLTO for cross-module optimizations. It has been noticed that there are lots of</nobr></div> | |
<div style="position:absolute;top:8908;left:108"><nobr>instances where profile information is either inaccurate or absent due to the various</nobr></div> | |
<div style="position:absolute;top:8931;left:108"><nobr>transformations applied. While the profile counts are precise at the moment they are read, the</nobr></div> | |
<div style="position:absolute;top:8953;left:108"><nobr>compiler cannot maintain the precision of the profiles after some transformations. The profiles</nobr></div> | |
<div style="position:absolute;top:8976;left:108"><nobr>are inaccurate due to lack of context sensitive and path sensitive information and also due to</nobr></div> | |
<div style="position:absolute;top:8998;left:108"><nobr>cross-module interactions. Examples of such profile inconsistencies include but not limited to</nobr></div> | |
<div style="position:absolute;top:9021;left:108"><nobr>the following:</nobr></div> | |
<div style="position:absolute;top:9066;left:135"><nobr>1. A hot function foo not inlined in the instrumented binary does not contain the context</nobr></div> | |
<div style="position:absolute;top:9088;left:162"><nobr>sensitive profile of each of its call sites and the basic-block ordering in every PGO inlined</nobr></div> | |
<div style="position:absolute;top:9111;left:162"><nobr>call-site will end up being similar.</nobr></div> | |
<div style="position:absolute;top:9133;left:135"><nobr>2. A function foo defined in module A will not have its profile updated and corrected when it</nobr></div> | |
<div style="position:absolute;top:9156;left:162"><nobr>is inlined in module B.</nobr></div> | |
<div style="position:absolute;top:9178;left:135"><nobr>3. With instrumentation based profiling, the instrumentation pass that introduces counters</nobr></div> | |
<div style="position:absolute;top:9201;left:162"><nobr>that accumulate edge profiles happens early and a later pass that introduces new</nobr></div> | |
<div style="position:absolute;top:9223;left:162"><nobr>branches or additional control-flow is also introducing potential loss of profile information</nobr></div> | |
<div style="position:absolute;top:9246;left:162"><nobr>due to path sensitivity. </nobr></div> | |
<div style="position:absolute;top:9268;left:135"><nobr>4. Also, with instrumentation based profiling, the instrumentation pass happens before the</nobr></div> | |
<div style="position:absolute;top:9291;left:162"><nobr>second inlining pass and any function inlined in the second pass will not have context</nobr></div> | |
<div style="position:absolute;top:9313;left:162"><nobr>sensitive profile information collected.</nobr></div> | |
<div style="position:absolute;top:9336;left:135"><nobr>5. Loop optimization passes like loop peeling, loop rotation and loop unrolling introduce</nobr></div> | |
<div style="position:absolute;top:9358;left:162"><nobr>new branches whose probabilities cannot be accurately determined and are guessed</nobr></div> | |
<div style="position:absolute;top:9381;left:162"><nobr>using aggressive heuristics.</nobr></div> | |
<div style="position:absolute;top:9403;left:135"><nobr>6. Profile updates after applying jump threading optimizations are made using heuristics.</nobr></div> | |
<div style="position:absolute;top:9426;left:135"><nobr>7. Profile updates after branch lowering of logical “OR (||)” and “AND (&&)” instructions are</nobr></div> | |
<div style="position:absolute;top:9448;left:162"><nobr>also made using heuristics.</nobr></div> | |
<div style="position:absolute;top:9493;left:108"><nobr>Recently, Context-Sensitive PGO (CSPGO) was invented to augment regular PGO with context</nobr></div> | |
<div style="position:absolute;top:9516;left:108"><nobr>sensitive profiles and study the effects on performance. CSPGO adds another round of profiling</nobr></div> | |
<div style="position:absolute;top:9538;left:108"><nobr>to PGO by instrumenting the optimized PGO binary built with the first set of profiles. Just like in</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:9679;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="9"><b>Page 9</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:9788;left:108"><nobr>PGO, the first set of profiles still drives the inlining decisions which makes the second set of</nobr></div> | |
<div style="position:absolute;top:9810;left:108"><nobr>profiles much more precise with respect to context sensitive information. Experiments show that</nobr></div> | |
<div style="position:absolute;top:9833;left:108"><nobr>CSPGO can further improve the performance of PGO optimized binaries by a few percent which</nobr></div> | |
<div style="position:absolute;top:9855;left:108"><nobr>is strong evidence that PGO drops performance on the table due to the lack of precise profiles.</nobr></div> | |
<div style="position:absolute;top:9878;left:108"><nobr>Note that while CSPGO addresses the problem of lack of context sensitive profiles, it still is</nobr></div> | |
<div style="position:absolute;top:9900;left:108"><nobr>subjected to imprecise path sensitive profile information like PGO.</nobr></div> | |
<div style="position:absolute;top:9945;left:108"><nobr>Both BOLT and Propeller re-profile the peak optimized binary and use the profiles to directly fix</nobr></div> | |
<div style="position:absolute;top:9968;left:108"><nobr>the binary. Since the compiler is not invoked, BOLT/Propeller are always operating with</nobr></div> | |
<div style="position:absolute;top:9990;left:108"><nobr>accurate profiles and are able to fix some of the inaccuracies in some of the optimizations</nobr></div> | |
<div style="position:absolute;top:10013;left:108"><nobr>performed by PGO like basic block layout. Context-Sensitive PGO (CSPGO) aims to address</nobr></div> | |
<div style="position:absolute;top:10035;left:108"><nobr>the exact same problem and does well in improving the performance of PGO binaries further but</nobr></div> | |
<div style="position:absolute;top:10058;left:108"><nobr>our experiments show that BOLT and Propeller are even able to further optimize CSPGO</nobr></div> | |
<div style="position:absolute;top:10080;left:108"><nobr>binaries.</nobr></div> | |
<div style="position:absolute;top:10125;left:108"><nobr>Notice that Propeller does not need to re-invoke the high level IR optimizations as it tries to</nobr></div> | |
<div style="position:absolute;top:10148;left:108"><nobr>re-layout the final code. Propeller can also bypass MIR but complete support for serializing MIR</nobr></div> | |
<div style="position:absolute;top:10170;left:108"><nobr>is not available yet. We are looking at adding this support to improve the efficiency of Propeller.</nobr></div> | |
<div style="position:absolute;top:10193;left:108"><nobr>Propeller needs to re-run CFI instruction insertion and code generation in the backend as the</nobr></div> | |
<div style="position:absolute;top:10215;left:108"><nobr>optimization pass uses basic block sections which needs better CFI handling.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:10268;left:108"><nobr>Basic Block Sections</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:10340;left:108"><nobr>Modern compilers support compiling with function and data sections via options</nobr></div> | |
<div style="position:absolute;top:10363;left:108"><nobr>-ffunction-sections and -fdata-sections, respectively, which places each function and data item</nobr></div> | |
<div style="position:absolute;top:10385;left:108"><nobr>in a separate section in the native object file. This allows the linker to do many link time</nobr></div> | |
<div style="position:absolute;top:10408;left:108"><nobr>optimizations like dead data and code elimination, identical code folding, and function</nobr></div> | |
<div style="position:absolute;top:10430;left:108"><nobr>reordering. Without function sections, performing these optimizations at link time is significantly</nobr></div> | |
<div style="position:absolute;top:10453;left:108"><nobr>harder. Linkers operate at the granularity of sections, and this paper introduces basic block</nobr></div> | |
<div style="position:absolute;top:10475;left:108"><nobr>sections which compiles every basic block into its own section. With this support, arbitrary</nobr></div> | |
<div style="position:absolute;top:10498;left:108"><nobr>reordering of basic blocks at link time is feasible. We introduce a new compiler option,</nobr></div> | |
<div style="position:absolute;top:10520;left:108"><nobr>-fbasicblock-sections, which places every basic block in a unique ELF text section in the object</nobr></div> | |
<div style="position:absolute;top:10543;left:108"><nobr>file along with a symbol labelling the basic block. The linker can then order the basic block</nobr></div> | |
<div style="position:absolute;top:10565;left:108"><nobr>sections in any arbitrary sequence which when done correctly can encapsulate block layout,</nobr></div> | |
<div style="position:absolute;top:10588;left:108"><nobr>function layout and function splitting optimizations. However, there are a couple of challenges to</nobr></div> | |
<div style="position:absolute;top:10610;left:108"><nobr>be addressed for this to be feasible:</nobr></div> | |
<div style="position:absolute;top:10655;left:135"><nobr>1. The compiler must not allow any implicit fall-through between any two adjacent basic</nobr></div> | |
<div style="position:absolute;top:10678;left:162"><nobr>blocks as they could be reordered at link time to be non-adjacent. In other words, the</nobr></div> | |
<div style="position:absolute;top:10700;left:162"><nobr>compiler must make a fall-through between adjacent basic blocks explicit by retaining</nobr></div> | |
<div style="position:absolute;top:10723;left:162"><nobr>the direct jump instruction that jumps to the next basic block. These branches can only</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:10867;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="10"><b>Page 10</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:10976;left:162"><nobr>be removed later in the linking phase after the final ordering is performed as determined</nobr></div> | |
<div style="position:absolute;top:10998;left:162"><nobr>by Propeller. </nobr></div> | |
<div style="position:absolute;top:11021;left:135"><nobr>2. Each additional section added to an object file bloats its size by tens of bytes. The</nobr></div> | |
<div style="position:absolute;top:11043;left:162"><nobr>number of basic blocks can be potentially very large compared to the size of functions</nobr></div> | |
<div style="position:absolute;top:11066;left:162"><nobr>and can bloat object sizes significantly. For instance, the clang binary contains 1.5M</nobr></div> | |
<div style="position:absolute;top:11088;left:162"><nobr>basic blocks from approximately 700K functions. </nobr></div> | |
<div style="position:absolute;top:11111;left:135"><nobr>3. All inter-basic block branch targets would now need to be resolved by the linker as they</nobr></div> | |
<div style="position:absolute;top:11133;left:162"><nobr>cannot be calculated during compile time. This is done using static relocations which</nobr></div> | |
<div style="position:absolute;top:11156;left:162"><nobr>bloats the size of the object files. Further, the compiler tries to use short branch</nobr></div> | |
<div style="position:absolute;top:11178;left:162"><nobr>instructions on some ISAs for branch offsets that can be accommodated in one byte.</nobr></div> | |
<div style="position:absolute;top:11201;left:162"><nobr>This is not possible with basic block sections as the offset is not determined at compile</nobr></div> | |
<div style="position:absolute;top:11223;left:162"><nobr>time, and long branch instructions have to be used everywhere. </nobr></div> | |
<div style="position:absolute;top:11246;left:135"><nobr>4. Debug Information (DebugInfo) and Call Frame Information (CFI) emission needs</nobr></div> | |
<div style="position:absolute;top:11268;left:162"><nobr>special handling with basic block sections. DebugInfo needs to be emitted with more</nobr></div> | |
<div style="position:absolute;top:11291;left:162"><nobr>relocations as basic block sections can break a function into potentially several disjoint</nobr></div> | |
<div style="position:absolute;top:11313;left:162"><nobr>pieces, and CFI needs to be emitted per basic block. This also bloats the object file and</nobr></div> | |
<div style="position:absolute;top:11336;left:162"><nobr>binary sizes significantly.</nobr></div> | |
<div style="position:absolute;top:11358;left:135"><nobr>5. The linker needs to perform a relaxation pass on all the branch instructions after laying</nobr></div> | |
<div style="position:absolute;top:11381;left:162"><nobr>out basic blocks. This relaxation removes explicit fall-through branches between</nobr></div> | |
<div style="position:absolute;top:11403;left:162"><nobr>adjacent basic blocks and shrinks jump instructions whose offsets can be</nobr></div> | |
<div style="position:absolute;top:11426;left:162"><nobr>accommodated in smaller equivalent instructions.</nobr></div> | |
<div style="position:absolute;top:11426;left:570"><nobr>Side note: RISC <font style="font-size:13px">ISAs generally</font></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:13px;font-family:Times"> | |
<div style="position:absolute;top:11426;left:697"><nobr> </nobr></div> | |
<div style="position:absolute;top:11426;left:737"><nobr> </nobr></div> | |
<div style="position:absolute;top:11426;left:809"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:162"><nobr>support only small offsets, and in order to jump to a distant place, you have to construct an</nobr></div> | |
<div style="position:absolute;top:11448;left:216"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:250"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:293"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:351"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:382"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:441"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:501"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:587"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:633"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:663"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:700"><nobr> </nobr></div> | |
<div style="position:absolute;top:11448;left:788"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:162"><nobr>address in a register (using a few instructions) and then jump to that location. The proposed</nobr></div> | |
<div style="position:absolute;top:11469;left:218"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:305"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:353"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:395"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:487"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:517"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:552"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:591"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:640"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:704"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:739"><nobr> </nobr></div> | |
<div style="position:absolute;top:11469;left:809"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:162"><nobr>scheme (emitting most generic instruction sequence for a branch and let the linker to relax)</nobr></div> | |
<div style="position:absolute;top:11490;left:218"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:286"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:327"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:383"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:463"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:536"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:560"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:627"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:658"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:706"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:748"><nobr> </nobr></div> | |
<div style="position:absolute;top:11490;left:809"><nobr> </nobr></div> | |
<div style="position:absolute;top:11512;left:162"><nobr>would work for RISC without having to create thunks.</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:11556;left:108"><nobr>Propeller overcomes the size bloats incurred by creating basic block sections on demand.</nobr></div> | |
<div style="position:absolute;top:11579;left:108"><nobr>Propeller does not create basic block sections for the set of functions that did not correspond to</nobr></div> | |
<div style="position:absolute;top:11601;left:108"><nobr>any sample in the profile. This helps in greatly limiting the number of basic blocks for which</nobr></div> | |
<div style="position:absolute;top:11624;left:108"><nobr>sections are created, and our experiments show that for our large benchmarks more than 90 %</nobr></div> | |
<div style="position:absolute;top:11646;left:108"><nobr>of basic blocks were excluded. Call Frame Information (CFI), which is very useful for stack</nobr></div> | |
<div style="position:absolute;top:11669;left:108"><nobr>unwinding, poses a significant challenge with basic block sections. Unlike debug information,</nobr></div> | |
<div style="position:absolute;top:11691;left:108"><nobr>CFI does not support non-contiguous ranges and a lot of the information needs to be duplicated,</nobr></div> | |
<div style="position:absolute;top:11714;left:108"><nobr>leading to size bloats in object files and the final binary. Our experiments show that CFI is</nobr></div> | |
<div style="position:absolute;top:11736;left:108"><nobr>responsible for the largest overall size bloats (10% on average). We later describe some</nobr></div> | |
<div style="position:absolute;top:11759;left:108"><nobr>techniques used to keep this minimal and propose changes to the DWARF format to reduce the</nobr></div> | |
<div style="position:absolute;top:11781;left:108"><nobr>CFI related size bloat even further. In spite of the challenges above, basic block sections allow</nobr></div> | |
<div style="position:absolute;top:11804;left:108"><nobr>Propeller to do code layout optimizations in a scalable manner with much less memory and time</nobr></div> | |
<div style="position:absolute;top:11826;left:108"><nobr>overheads when compared to BOLT. The cost of creating basic block sections can be</nobr></div> | |
<div style="position:absolute;top:11849;left:108"><nobr>distributed/parallelized as this happens in the backends when each IR object file is assembled</nobr></div> | |
<div style="position:absolute;top:11871;left:108"><nobr>into the native object file. With this feature, the analysis for determining the right basic block</nobr></div> | |
<div style="position:absolute;top:11894;left:108"><nobr>sequence and reordering the blocks can be done more efficiently by the linker. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:12055;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="11"><b>Page 11</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:12191;left:108"><nobr>Labeling Basic Blocks</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:12255;left:108"><nobr>Propeller labels every basic block with a unique symbol as this allows easy mapping of virtual</nobr></div> | |
<div style="position:absolute;top:12277;left:108"><nobr>addresses from PMU profiles back to the corresponding basic blocks. Since the number of basic</nobr></div> | |
<div style="position:absolute;top:12300;left:108"><nobr>blocks is large, the labeling bloats the symbol table sizes and the string table sizes significantly.</nobr></div> | |
<div style="position:absolute;top:12322;left:108"><nobr>While the binary size does increase this does not affect performance as the symbol table is not</nobr></div> | |
<div style="position:absolute;top:12345;left:108"><nobr>loaded in memory during run-time. The string table size bloat is however kept very minimal</nobr></div> | |
<div style="position:absolute;top:12367;left:108"><nobr>using a unary naming scheme that uses string suffix compression. The basic blocks for function</nobr></div> | |
<div style="position:absolute;top:12390;left:108"><nobr>foo are named "a.bb.foo", "aa.bb.foo", . . . This turns out to be very good for string table sizes</nobr></div> | |
<div style="position:absolute;top:12412;left:108"><nobr>and the bloat in the string table size for a very large binary is only 8 %.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:12484;left:108"><nobr>Linker Relaxation after reordering basic blocks</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:12526;left:108"><nobr>After the linker has reordered the basic block sections according to the desired sequence, it</nobr></div> | |
<div style="position:absolute;top:12549;left:108"><nobr>runs a relaxation pass to optimize jump instructions. Currently, the compiler emits the long form</nobr></div> | |
<div style="position:absolute;top:12571;left:108"><nobr>of all jump instructions . AMD64 ISA supports variants of jump instructions with one byte offset</nobr></div> | |
<div style="position:absolute;top:12594;left:108"><nobr>or a four byte offset. The compiler generates jump instructions with R_X86_64 32-bit PC</nobr></div> | |
<div style="position:absolute;top:12616;left:108"><nobr>relative relocations. We would like to use a new relocation type for these jump instructions as</nobr></div> | |
<div style="position:absolute;top:12639;left:108"><nobr>it makes it easy and accurate while relaxing these instructions.</nobr></div> | |
<div style="position:absolute;top:12684;left:108"><nobr>The relaxation pass does three things: </nobr></div> | |
<div style="position:absolute;top:12729;left:108"><nobr>First, it deletes all explicit fall-through direct jump instructions between adjacent basic blocks.</nobr></div> | |
<div style="position:absolute;top:12751;left:108"><nobr>This is done by discarding the tail of the basic block section. </nobr></div> | |
<div style="position:absolute;top:12796;left:108"><nobr>Second, it shrinks jump instructions whose offsets can fit in a smaller equivalent jump</nobr></div> | |
<div style="position:absolute;top:12819;left:108"><nobr>instruction. The AMD64 ISA supports variants of jump instructions with either a one byte offset</nobr></div> | |
<div style="position:absolute;top:12841;left:108"><nobr>or a four byte offset. Shorter jump instructions are preferred where possible to reduce code</nobr></div> | |
<div style="position:absolute;top:12864;left:108"><nobr>size. The jump instructions are shrunk by using jump relocations. Jump relocations are used to</nobr></div> | |
<div style="position:absolute;top:12886;left:108"><nobr>modify the opcode of the jump instruction. Jump relocations contain three values, instruction</nobr></div> | |
<div style="position:absolute;top:12909;left:108"><nobr>offset, jump type and size. There is a one to one correspondence between jump relocations and</nobr></div> | |
<div style="position:absolute;top:12931;left:108"><nobr>jump instructions. While writing this jump instruction out to the final binary, the linker uses the</nobr></div> | |
<div style="position:absolute;top:12954;left:108"><nobr>jump relocation to determine the opcode and the size of the modified jump instruction. These</nobr></div> | |
<div style="position:absolute;top:12976;left:108"><nobr>new relocations are required because the input object files are memory-mapped without write</nobr></div> | |
<div style="position:absolute;top:12999;left:108"><nobr>permissions and directly modifying the object files requires copying these sections. Copying a</nobr></div> | |
<div style="position:absolute;top:13021;left:108"><nobr>large number of basic block sections significantly bloats memory and we have invented jump</nobr></div> | |
<div style="position:absolute;top:13044;left:108"><nobr>relocations as a simple solution to avoid this bloat. </nobr></div> | |
<div style="position:absolute;top:13089;left:108"><nobr>Third, the relaxation pass replaces a two jump instruction sequence, JX and JY , with just one</nobr></div> | |
<div style="position:absolute;top:13111;left:108"><nobr>jump instruction if JX is a conditional branch whose target is the adjacent basic block and JY is</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:13243;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="12"><b>Page 12</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:13352;left:108"><nobr>a direct branch. This can be done by replacing the two jump instructions with a single jump</nobr></div> | |
<div style="position:absolute;top:13374;left:108"><nobr>instruction which has as the inverse op-code of JX as its op-code, and the target of JY as its</nobr></div> | |
<div style="position:absolute;top:13397;left:108"><nobr>branch target. This change is also made using a jump relocation at the appropriate code offset</nobr></div> | |
<div style="position:absolute;top:13419;left:108"><nobr>in that section</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:13469;left:108"><nobr>Handling Exceptions</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:13533;left:108"><nobr>Basic blocks that are involved in exception handling, that is, they either throw or catch an</nobr></div> | |
<div style="position:absolute;top:13555;left:108"><nobr>exception, are grouped together and placed in a single section. Unique sections are not created</nobr></div> | |
<div style="position:absolute;top:13578;left:108"><nobr>for such basic blocks. This is because the exception table is constructed using basic block</nobr></div> | |
<div style="position:absolute;top:13600;left:108"><nobr>offsets and creating unique sections for such basic blocks would require more static relocations.</nobr></div> | |
<div style="position:absolute;top:13623;left:108"><nobr>Creating sections for exception handling basic blocks will be part of future work. Benchmarks</nobr></div> | |
<div style="position:absolute;top:13645;left:108"><nobr>which spend a significant amount of time handling exceptions might not get the optimization</nobr></div> | |
<div style="position:absolute;top:13668;left:108"><nobr>benefits of Propeller.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:13717;left:108"><nobr>Updating DebugInfo and CFI</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:13782;left:108"><nobr>Generating correct debug information (DebugInfo) and Call Frame Information (CFI) with basic</nobr></div> | |
<div style="position:absolute;top:13804;left:108"><nobr>block sections is challenging. Since basic blocks coming from different functions can be</nobr></div> | |
<div style="position:absolute;top:13827;left:108"><nobr>arbitrarily reordered and mixed together, we must appropriately update the DebugInfo and CFI.</nobr></div> | |
<div style="position:absolute;top:13872;left:108"><nobr>DebugInfo is easier compared to CFI as we can leverage the DW_AT_ranges tag which allows</nobr></div> | |
<div style="position:absolute;top:13894;left:108"><nobr>description of a possibly non-contiguous range of addresses occupied by an entity. Thus, every</nobr></div> | |
<div style="position:absolute;top:13917;left:108"><nobr>basic block section forces a separate entry in DW_AT_ranges, plus two relocations pointing to</nobr></div> | |
<div style="position:absolute;top:13939;left:108"><nobr>symbols at the start and end of the basic block, respectively. DebugInfo will bloat object file</nobr></div> | |
<div style="position:absolute;top:13962;left:108"><nobr>sizes further with basic block sections. </nobr></div> | |
<div style="position:absolute;top:14007;left:108"><nobr>On the other hand, CFI doesn’t provide any easy way to specify non-contiguous range of</nobr></div> | |
<div style="position:absolute;top:14029;left:108"><nobr>addresses occupied by a function – the DWARF standard explicitly requires emitting separate</nobr></div> | |
<div style="position:absolute;top:14052;left:108"><nobr>CFI Frame Descriptor Entries for each contiguous fragment of a function. Thus, the CFI</nobr></div> | |
<div style="position:absolute;top:14074;left:108"><nobr>information for all callee-saved registers (possibly including the frame pointer, if necessary)</nobr></div> | |
<div style="position:absolute;top:14097;left:108"><nobr>have to be emitted along with redefining the Call Frame Address (CFA), viz. where the current</nobr></div> | |
<div style="position:absolute;top:14119;left:108"><nobr>frame starts. </nobr></div> | |
<div style="position:absolute;top:14164;left:108"><nobr>This causes a significant bloat of the .eh_frame sections, which is partially mitigated by</nobr></div> | |
<div style="position:absolute;top:14187;left:108"><nobr>de-duplicating common CFI instructions to the CFI Common Information Entry. We only</nobr></div> | |
<div style="position:absolute;top:14209;left:108"><nobr>de-duplicate CFI instructions with offset 0 from the beginning of the CFI frame, i.e. those that</nobr></div> | |
<div style="position:absolute;top:14232;left:108"><nobr>describe the CFI state before entering the frame. </nobr></div> | |
<div style="position:absolute;top:14277;left:108"><nobr>Having support for non-contiguous ranges in CFI would significantly minimize the size</nobr></div> | |
<div style="position:absolute;top:14299;left:108"><nobr>overheads and complexity of supporting basic block sections. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:14431;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="13"><b>Page 13</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:14562;left:108"><nobr>To allow easy basic block rewriting in the linker (e.g. removing unnecessary fall-through jumps),</nobr></div> | |
<div style="position:absolute;top:14585;left:108"><nobr>we force relocations against symbols and not sections. Moreover, in cases where the range is</nobr></div> | |
<div style="position:absolute;top:14607;left:108"><nobr>represented in DWARF as start and length, we defer the length calculation to the link stage, by</nobr></div> | |
<div style="position:absolute;top:14630;left:108"><nobr>emitting an appropriate SIZE relocation instead of hardcoding the length directly in the object file</nobr></div> | |
<div style="position:absolute;top:14652;left:108"><nobr>by the compiler.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:14724;left:108"><nobr>Example</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:14796;left:116"><nobr>bbsection_example.cc </nobr></div> | |
<div style="position:absolute;top:14819;left:116"><nobr>-------------------- </nobr></div> | |
<div style="position:absolute;top:14841;left:116"><nobr> </nobr></div> | |
<div style="position:absolute;top:14864;left:116"><nobr>__attribute__((noinline)) </nobr></div> | |
<div style="position:absolute;top:14886;left:116"><nobr>int baz(int count) { </nobr></div> | |
<div style="position:absolute;top:14909;left:136"><nobr>if (count % 2 == 0) </nobr></div> | |
<div style="position:absolute;top:14931;left:155"><nobr>return bar(); </nobr></div> | |
<div style="position:absolute;top:14954;left:136"><nobr>else </nobr></div> | |
<div style="position:absolute;top:14976;left:155"><nobr>return foo(); </nobr></div> | |
<div style="position:absolute;top:14999;left:116"><nobr>} </nobr></div> | |
<div style="position:absolute;top:15021;left:116"><nobr> </nobr></div> | |
<div style="position:absolute;top:15044;left:116"><nobr>int main(int argc, char *argv[]) { </nobr></div> | |
<div style="position:absolute;top:15066;left:136"><nobr>return baz(argc); </nobr></div> | |
<div style="position:absolute;top:15089;left:116"><nobr>}</nobr></div> | |
<div style="position:absolute;top:15166;left:108"><nobr>In the above example, function baz will have a couple of basic blocks. Generate basic block</nobr></div> | |
<div style="position:absolute;top:15189;left:108"><nobr>sections using:</nobr></div> | |
<div style="position:absolute;top:15264;left:116"><nobr>$ clang -fbasicblock-sections=all bbsection_example.cc -S</nobr></div> | |
<div style="position:absolute;top:15342;left:108"><nobr>and inspecting the assembly file:</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000"> | |
<div style="position:absolute;top:15418;left:170"><nobr>.section</nobr></div> | |
<div style="position:absolute;top:15418;left:278"><nobr>.text._Z3bazi,"ax",@progbits</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:15441;left:170"><nobr><b>.globl _Z3bazi </b></nobr></div> | |
<div style="position:absolute;top:15441;left:362"><nobr><b># -- Begin function _Z3bazi</b></nobr></div> | |
<div style="position:absolute;top:15463;left:170"><nobr><b>.p2align</b></nobr></div> | |
<div style="position:absolute;top:15463;left:278"><nobr><b>4, 0x90 </b></nobr></div> | |
</span></font> | |
<div style="position:absolute;top:15619;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="14"><b>Page 14</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:15737;left:170"><nobr><b>.type _Z3bazi,@function</b></nobr></div> | |
<div style="position:absolute;top:15759;left:116"><nobr><b>_Z3bazi: </b></nobr></div> | |
<div style="position:absolute;top:15759;left:328"><nobr><b># @_Z3bazi</b></nobr></div> | |
<div style="position:absolute;top:15782;left:170"><nobr>.cfi_startproc</nobr></div> | |
<div style="position:absolute;top:15804;left:170"><nobr>pushq %rbp</nobr></div> | |
<div style="position:absolute;top:15827;left:170"><nobr>.cfi_def_cfa_offset 16</nobr></div> | |
<div style="position:absolute;top:15849;left:170"><nobr>.cfi_offset %rbp, -16</nobr></div> | |
<div style="position:absolute;top:15872;left:170"><nobr>movq %rsp, %rbp</nobr></div> | |
<div style="position:absolute;top:15894;left:170"><nobr>.cfi_def_cfa_register %rbp</nobr></div> | |
<div style="position:absolute;top:15917;left:170"><nobr>subq $16, %rsp</nobr></div> | |
<div style="position:absolute;top:15939;left:171"><nobr>....</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff"> | |
<div style="position:absolute;top:15962;left:170"><nobr>jne</nobr></div> | |
<div style="position:absolute;top:15962;left:224"><nobr>aa.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:15984;left:170"><nobr>jmp</nobr></div> | |
<div style="position:absolute;top:15984;left:224"><nobr>a.BB._Z3bazi</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:16007;left:170"><nobr>.cfi_endproc</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000"> | |
<div style="position:absolute;top:16029;left:170"><nobr>.section</nobr></div> | |
<div style="position:absolute;top:16029;left:278"><nobr>.text,"ax",@progbits,unique,1</nobr></div> | |
<div style="position:absolute;top:16052;left:116"><nobr>a.BB._Z3bazi: </nobr></div> | |
<div style="position:absolute;top:16052;left:343"><nobr># %if.then</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:16074;left:170"><nobr>.cfi_startproc</nobr></div> | |
<div style="position:absolute;top:16097;left:170"><nobr>.cfi_def_cfa %rbp, 16</nobr></div> | |
<div style="position:absolute;top:16119;left:170"><nobr>.cfi_offset %rbp, -16</nobr></div> | |
<div style="position:absolute;top:16142;left:171"><nobr>...</nobr></div> | |
<div style="position:absolute;top:16164;left:170"><nobr>jmp</nobr></div> | |
<div style="position:absolute;top:16164;left:224"><nobr>aaa.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:16187;left:170"><nobr>.size a.BB._Z3bazi, .Ltmp0-a.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:16209;left:170"><nobr>.cfi_endproc</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000"> | |
<div style="position:absolute;top:16232;left:170"><nobr>.section</nobr></div> | |
<div style="position:absolute;top:16232;left:278"><nobr>.text,"ax",@progbits,unique,2</nobr></div> | |
<div style="position:absolute;top:16254;left:116"><nobr>aa.BB._Z3bazi: </nobr></div> | |
<div style="position:absolute;top:16254;left:352"><nobr># %if.else</nobr></div> | |
<div style="position:absolute;top:16277;left:157"><nobr> <font color="#000000">.cfi_startproc</font></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:16299;left:166"><nobr>...</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff"> | |
<div style="position:absolute;top:16322;left:170"><nobr>jmp</nobr></div> | |
<div style="position:absolute;top:16322;left:224"><nobr>aaa.BB._Z3bazi</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:16344;left:166"><nobr>.cfi_endproc</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000"> | |
<div style="position:absolute;top:16367;left:170"><nobr>.section</nobr></div> | |
<div style="position:absolute;top:16367;left:278"><nobr>.text,"ax",@progbits,unique,3</nobr></div> | |
<div style="position:absolute;top:16389;left:116"><nobr>aaa.BB._Z3bazi: </nobr></div> | |
<div style="position:absolute;top:16389;left:361"><nobr># %return</nobr></div> | |
<div style="position:absolute;top:16412;left:166"><nobr><font color="#000000">.cfi_startproc</font></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:16434;left:171"><nobr>…</nobr></div> | |
<div style="position:absolute;top:16457;left:171"><nobr>ret</nobr></div> | |
<div style="position:absolute;top:16479;left:171"><nobr>.cfi_endproc</nobr></div> | |
<div style="position:absolute;top:16502;left:170"><nobr>.size _Z3bazi, .Lfunc_end2-_Z3bazi</nobr></div> | |
<div style="position:absolute;top:16578;left:135"><nobr>1. Four basic blocks are present for function baz (_Z3bazi) and four section directives</nobr></div> | |
<div style="position:absolute;top:16601;left:162"><nobr>separate them into different “.text.” sections.</nobr></div> | |
<div style="position:absolute;top:16623;left:135"><nobr>2. The entry basic block is used to represent the function and hence is part of the function</nobr></div> | |
<div style="position:absolute;top:16646;left:162"><nobr>section “.text._Z3bazi”. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:16807;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="15"><b>Page 15</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:16916;left:135"><nobr>3. The other three basic blocks “a.BB._Z3bazi”, “aa.BB._Z3bazi” and “aaa.BB._Z3bazi” are</nobr></div> | |
<div style="position:absolute;top:16938;left:162"><nobr>in unnamed ELF sections. The sections are unnamed to reduce the object size bloat</nobr></div> | |
<div style="position:absolute;top:16961;left:162"><nobr>from naming these sections. A separate option called -funique-bb-section-names</nobr></div> | |
<div style="position:absolute;top:16983;left:162"><nobr>generates the long names for the sections.</nobr></div> | |
<div style="position:absolute;top:17006;left:135"><nobr>4. The sections however would be prefixed as “.text”, “.text.hot” or “.text.unlikely” if profile</nobr></div> | |
<div style="position:absolute;top:17028;left:162"><nobr>information for the basic blocks is available. This is to allow grouping of hot basic blocks</nobr></div> | |
<div style="position:absolute;top:17051;left:162"><nobr>by the linker if needed.</nobr></div> | |
<div style="position:absolute;top:17073;left:135"><nobr>5. A fall-through branch between basic blocks “_Z3bazi” and “a.BB._Z3bazi” is converted</nobr></div> | |
<div style="position:absolute;top:17096;left:162"><nobr>into an explicit unconditional direct jump instruction, marked in blue. This is needed to</nobr></div> | |
<div style="position:absolute;top:17118;left:162"><nobr>allow these two basic blocks to be floated independently in the address space.</nobr></div> | |
<div style="position:absolute;top:17141;left:135"><nobr>6. All the basic blocks with sections created are labelled as “[a]+.BB._Z3bazi”. This does</nobr></div> | |
<div style="position:absolute;top:17163;left:162"><nobr>bloat the symbol table but keeps the strtab bloat very small due to suffix compression.</nobr></div> | |
<div style="position:absolute;top:17186;left:135"><nobr>7. Each basic block section is placed in its own unique CFI FDE. We repeat all CFI</nobr></div> | |
<div style="position:absolute;top:17208;left:162"><nobr>instructions needed to restore states of registers and frame address and have a</nobr></div> | |
<div style="position:absolute;top:17231;left:162"><nobr>de-duplication pass to reduce the overhead of CFI information. Since CFI does not</nobr></div> | |
<div style="position:absolute;top:17253;left:162"><nobr>support address ranges like DWARF debug info does, we have to pay a significant</nobr></div> | |
<div style="position:absolute;top:17276;left:162"><nobr>penalty for size bloat.</nobr></div> | |
<div style="position:absolute;top:17321;left:108"><nobr>Now, let us look the object file for this function:</nobr></div> | |
<div style="position:absolute;top:17396;left:116"><nobr>$ clang -fbasicblock-sections=all bbsection_example.cc -S</nobr></div> | |
<div style="position:absolute;top:17419;left:116"><nobr>$ objdump -dr bbsection_example.o </nobr></div> | |
<div style="position:absolute;top:17474;left:108"><nobr>Inspecting the object file contents for function baz (_Z3bazi) :</nobr></div> | |
<div style="position:absolute;top:17527;left:116"><nobr><b>Disassembly of section .text._Z3bazi: </b></nobr></div> | |
<div style="position:absolute;top:17549;left:116"><nobr><b> </b></nobr></div> | |
<div style="position:absolute;top:17572;left:116"><nobr><b>0000000000000000 <_Z3bazi>: </b></nobr></div> | |
<div style="position:absolute;top:17594;left:146"><nobr>0: 55 </nobr></div> | |
<div style="position:absolute;top:17594;left:432"><nobr>push %rbp </nobr></div> | |
<div style="position:absolute;top:17617;left:146"><nobr>1: 48 89 e5 </nobr></div> | |
<div style="position:absolute;top:17617;left:432"><nobr>mov %rsp,%rbp </nobr></div> | |
<div style="position:absolute;top:17639;left:146"><nobr>4: 48 83 ec 10 </nobr></div> | |
<div style="position:absolute;top:17639;left:432"><nobr>sub $0x10,%rsp </nobr></div> | |
<div style="position:absolute;top:17662;left:146"><nobr>8: 89 7d f8 </nobr></div> | |
<div style="position:absolute;top:17662;left:432"><nobr>mov %edi,-0x8(%rbp) </nobr></div> | |
<div style="position:absolute;top:17684;left:146"><nobr>b: 8b 45 f8 </nobr></div> | |
<div style="position:absolute;top:17684;left:432"><nobr>mov -0x8(%rbp),%eax </nobr></div> | |
<div style="position:absolute;top:17707;left:146"><nobr>e: 99 </nobr></div> | |
<div style="position:absolute;top:17707;left:432"><nobr>cltd </nobr></div> | |
<div style="position:absolute;top:17729;left:146"><nobr>f: b9 02 00 00 00 </nobr></div> | |
<div style="position:absolute;top:17729;left:432"><nobr>mov $0x2,%ecx </nobr></div> | |
<div style="position:absolute;top:17752;left:136"><nobr>14: f7 f9 </nobr></div> | |
<div style="position:absolute;top:17752;left:432"><nobr>idiv %ecx </nobr></div> | |
<div style="position:absolute;top:17774;left:136"><nobr>16: 83 fa 00 </nobr></div> | |
<div style="position:absolute;top:17774;left:432"><nobr>cmp $0x0,%edx </nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff"> | |
<div style="position:absolute;top:17797;left:136"><nobr>19: 0f 85 00 00 00 00 jne 1f <_Z3bazi+0x1f> </nobr></div> | |
<div style="position:absolute;top:17819;left:353"><nobr>1b: R_X86_64_PC32 aa.BB._Z3bazi-0x4 </nobr></div> | |
<div style="position:absolute;top:17842;left:136"><nobr>1f: e9 00 00 00 00 </nobr></div> | |
<div style="position:absolute;top:17842;left:432"><nobr>jmpq 24 <_Z3bazi+0x24> </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:17995;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="16"><b>Page 16</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff"> | |
<div style="position:absolute;top:18112;left:353"><nobr>20: R_X86_64_PLT32 a.BB._Z3bazi-0x4 </nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:18189;left:135"><nobr>● At offset 19, a relocation has now been created for the conditional jump instruction and</nobr></div> | |
<div style="position:absolute;top:18212;left:162"><nobr>is a 6 byte instruction. These additional relocations bloat up object file sizes.</nobr></div> | |
<div style="position:absolute;top:18234;left:135"><nobr>● The linker will relax and shrink branch instructions if the jump is closer.</nobr></div> | |
<div style="position:absolute;top:18279;left:108"><nobr>In the symbol table below, note that the basic block symbols are all local.</nobr></div> | |
<div style="position:absolute;top:18333;left:116"><nobr>Symbol table '.symtab' contains 12 entries:</nobr></div> | |
<div style="position:absolute;top:18356;left:130"><nobr>Num: Value </nobr></div> | |
<div style="position:absolute;top:18356;left:275"><nobr>Size Type Bind Vis Ndx Name</nobr></div> | |
<div style="position:absolute;top:18378;left:139"><nobr>0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND </nobr></div> | |
<div style="position:absolute;top:18401;left:139"><nobr>1: 0000000000000000 0 FILE LOCAL DEFAULT ABS section_example.cc</nobr></div> | |
<div style="position:absolute;top:18423;left:139"><nobr>7: 0000000000000000 13 NOTYPE LOCAL DEFAULT 5 a.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:18446;left:139"><nobr>8: 0000000000000000 13 NOTYPE LOCAL DEFAULT 7 aa.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:18468;left:139"><nobr>9: 0000000000000000 9 NOTYPE LOCAL DEFAULT 9 aaa.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:18491;left:134"><nobr>10: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _Z3barv</nobr></div> | |
<div style="position:absolute;top:18513;left:134"><nobr>11: 0000000000000000 36 FUNC GLOBAL DEFAULT 3 _Z3bazi</nobr></div> | |
<div style="position:absolute;top:18536;left:134"><nobr>12: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _Z3foov</nobr></div> | |
<div style="position:absolute;top:18558;left:134"><nobr>13: 0000000000000000 36 FUNC GLOBAL DEFAULT 10 main</nobr></div> | |
<div style="position:absolute;top:18612;left:108"><nobr>Finally, let us look at the binary where we use lld and a symbol ordering file to place each basic</nobr></div> | |
<div style="position:absolute;top:18635;left:108"><nobr>block of baz at non-contiguous locations.</nobr></div> | |
<div style="position:absolute;top:18710;left:117"><nobr>$ cat > ./symbol_file.txt </nobr></div> | |
<div style="position:absolute;top:18733;left:117"><nobr>aa.BB._Z3bazi </nobr></div> | |
<div style="position:absolute;top:18755;left:117"><nobr>main </nobr></div> | |
<div style="position:absolute;top:18778;left:117"><nobr>_Z3bazi </nobr></div> | |
<div style="position:absolute;top:18800;left:117"><nobr>_Z3barv </nobr></div> | |
<div style="position:absolute;top:18823;left:117"><nobr>a.BB._Z3bazi </nobr></div> | |
<div style="position:absolute;top:18845;left:117"><nobr>_Z3foov </nobr></div> | |
<div style="position:absolute;top:18868;left:117"><nobr>aaa.BB._Z3bazi </nobr></div> | |
<div style="position:absolute;top:18890;left:117"><nobr>$ clang -fbasicblock-sections=all bbsection_example.cc \ </nobr></div> | |
<div style="position:absolute;top:18913;left:147"><nobr>-Wl,--symbol-ordering-file,./symbol_file.txt</nobr></div> | |
<div style="position:absolute;top:18935;left:117"><nobr>$ nm -n a.out </nobr></div> | |
<div style="position:absolute;top:18990;left:108"><nobr>The symbols sorted by address are as follows:</nobr></div> | |
<div style="position:absolute;top:19044;left:116"><nobr>0000000000201000 t aa.BB._Z3bazi </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:19183;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="17"><b>Page 17</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:19301;left:116"><nobr>0000000000201010 T main</nobr></div> | |
<div style="position:absolute;top:19323;left:116"><nobr>0000000000201040 T _Z3bazi</nobr></div> | |
<div style="position:absolute;top:19346;left:116"><nobr>0000000000201060 T _Z3barv</nobr></div> | |
<div style="position:absolute;top:19368;left:116"><nobr>0000000000201068 t a.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:19391;left:116"><nobr>0000000000201080 T _Z3foov</nobr></div> | |
<div style="position:absolute;top:19413;left:116"><nobr>0000000000201088 t aaa.BB._Z3bazi</nobr></div> | |
<div style="position:absolute;top:19490;left:108"><nobr>Notice how each basic block is now located as specified in the symbol ordering file. The basic</nobr></div> | |
<div style="position:absolute;top:19512;left:108"><nobr>block section symbols are useful while debugging if the basic block is not contiguous with the</nobr></div> | |
<div style="position:absolute;top:19535;left:108"><nobr>original section. The linker will delete a basic block label if the previous basic block for that</nobr></div> | |
<div style="position:absolute;top:19557;left:108"><nobr>label also corresponds to the same function. It is expected that a function would be partitioned</nobr></div> | |
<div style="position:absolute;top:19580;left:108"><nobr>in 2 ways at most and hence the remaining labels will be deleted by the linker in the interests of</nobr></div> | |
<div style="position:absolute;top:19602;left:108"><nobr>binary size.</nobr></div> | |
<div style="position:absolute;top:19647;left:108"><nobr>Also, another problem exists with static functions which have local linkage. Static function name</nobr></div> | |
<div style="position:absolute;top:19670;left:108"><nobr>symbols could be duplicated in the final binary when definitions from more than one module are</nobr></div> | |
<div style="position:absolute;top:19692;left:108"><nobr>linked. This causes problems with disambiguating the symbol name to the instance. We solved</nobr></div> | |
<div style="position:absolute;top:19715;left:108"><nobr>this problem by renaming static functions to include the module names too so that collisions</nobr></div> | |
<div style="position:absolute;top:19737;left:108"><nobr>among these instances are rare.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:19813;left:108"><nobr>The Layout Algorithm</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:19862;left:108"><nobr>Propeller performs intra-function basic block reordering based on the extended TSP model</nobr></div> | |
<div style="position:absolute;top:19885;left:108"><nobr>(<font color="#1155cc"><a href="https://arxiv.org/abs/1809.04676">https://arxiv.org/abs/1809.04676</a></font>)</nobr></div> | |
<div style="position:absolute;top:19885;left:385"><nobr>and</nobr></div> | |
<div style="position:absolute;top:19885;left:443"><nobr>function</nobr></div> | |
<div style="position:absolute;top:19885;left:531"><nobr>reordering</nobr></div> | |
<div style="position:absolute;top:19885;left:636"><nobr>based</nobr></div> | |
<div style="position:absolute;top:19885;left:710"><nobr>on</nobr></div> | |
<div style="position:absolute;top:19885;left:758"><nobr>HFSort</nobr></div> | |
<div style="position:absolute;top:19907;left:108"><nobr>(<font color="#1155cc"><a href="https://dl.acm.org/citation.cfm?id=3049858">https://dl.acm.org/citation.cfm?id=3049858</a></font><a href="https://dl.acm.org/citation.cfm?id=3049858"></a>). Propeller is also capable of splitting functions.</nobr></div> | |
<div style="position:absolute;top:19952;left:108"><nobr>The high-level description of these algorithms are as follows.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:20002;left:108"><nobr>ExtTSP Basic Block Reordering Algorithm</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:20043;left:108"><nobr>The ExtTSP (extended TSP) metric provides a score for every ordering of basic blocks in a</nobr></div> | |
<div style="position:absolute;top:20066;left:108"><nobr>function, by combining the gains from fall-throughs and short jumps.</nobr></div> | |
<div style="position:absolute;top:20111;left:108"><nobr>Given an ordering of the basic blocks, for a function f, the ExtTSP score is computed as follows. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:20371;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="18"><b>Page 18</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:20644;left:108"><nobr>In short, it computes a weighted sum of frequencies of all edges in the control flow graph. Each</nobr></div> | |
<div style="position:absolute;top:20667;left:108"><nobr>edge gets its weight depending on whether the given ordering makes the edge a fallthrough, a</nobr></div> | |
<div style="position:absolute;top:20689;left:108"><nobr>short forward jump, or a short backward jump.</nobr></div> | |
<div style="position:absolute;top:20734;left:108"><nobr>Although this problem is NP-hard like the regular TSP, an iterative greedy basic-block-chaining</nobr></div> | |
<div style="position:absolute;top:20757;left:108"><nobr>algorithm is used to find a close to optimal solution. This algorithm is described as follows.</nobr></div> | |
<div style="position:absolute;top:20802;left:108"><nobr>Starting with one basic block sequence (BB chain) for every basic block, the algorithm iteratively</nobr></div> | |
<div style="position:absolute;top:20824;left:108"><nobr>joins BB chains together in order to maximize the extended TSP score of all the chains.</nobr></div> | |
<div style="position:absolute;top:20847;left:108"><nobr>Initially, it finds all <i>mutually-forced </i>edges in the profiled cfg. These are all the edges which are --</nobr></div> | |
<div style="position:absolute;top:20869;left:108"><nobr>based on the profile -- the only (executed) outgoing edge from their source node and the only</nobr></div> | |
<div style="position:absolute;top:20892;left:108"><nobr>(executed) incoming edges to their sink nodes. Next, the source and sink of all mutually-forced</nobr></div> | |
<div style="position:absolute;top:20914;left:108"><nobr>edges are attached together as fallthrough edges.</nobr></div> | |
<div style="position:absolute;top:20959;left:108"><nobr>Then, at every iteration, the algorithm tries to merge a pair of BB chains which leads to the</nobr></div> | |
<div style="position:absolute;top:20982;left:108"><nobr>highest gain in the ExtTSP score. The algorithm extends the search space by considering</nobr></div> | |
<div style="position:absolute;top:21004;left:108"><nobr>splitting short (less than 128 bytes in binary size) BB chains into two chains and then merging</nobr></div> | |
<div style="position:absolute;top:21027;left:108"><nobr>these two chains with the other chain in four (additional) ways. After every merge, the new</nobr></div> | |
<div style="position:absolute;top:21049;left:108"><nobr>merge gains are updated. The algorithm repeats joining BB chains until no additional can be</nobr></div> | |
<div style="position:absolute;top:21072;left:108"><nobr>achieved. At this step, it sorts all the existing chains in decreasing order of their execution</nobr></div> | |
<div style="position:absolute;top:21094;left:108"><nobr>density, i.e., the total profiled frequency of the chain divided by its binary size.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:21166;left:108"><nobr>HFSort Function Reordering Algorithm</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:21208;left:108"><nobr>The HFSort function reordering algorithm computes an optimal ordering. It goes through all</nobr></div> | |
<div style="position:absolute;top:21230;left:108"><nobr>functions in decreasing order of their execution density and for each one, finds its <i>most likely</i></nobr></div> | |
<div style="position:absolute;top:21253;left:108"><nobr><i>caller </i><font style="font-size:15px"></font>(the function which calls it the most) and places the caller’s cluster right before the callee’s</nobr></div> | |
<div style="position:absolute;top:21275;left:108"><nobr>cluster. After all functions are considered, the HFSort algorithm orders all function clusters in</nobr></div> | |
<div style="position:absolute;top:21298;left:108"><nobr>decreasing order of their total execution density.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:21347;left:108"><nobr>Function Splitting</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:21389;left:108"><nobr>After BB reordering and function reordering, for every function, Propeller goes through the BB</nobr></div> | |
<div style="position:absolute;top:21411;left:108"><nobr>chains in the computed BB order for that function, and for each chain, places the basic blocks of</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:21559;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="19"><b>Page 19</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:21668;left:108"><nobr>that chain at the hot or cold part of the layout depending on whether the chain’s total frequency</nobr></div> | |
<div style="position:absolute;top:21690;left:108"><nobr>is zero or not.</nobr></div> | |
<div style="position:absolute;top:21735;left:108"><nobr>The ExtTSP BB reordering algorithm orders the BB chains decreasing order of their execution</nobr></div> | |
<div style="position:absolute;top:21758;left:108"><nobr>density. As a result, cold basic blocks are placed at the end of the computed BB layout of every</nobr></div> | |
<div style="position:absolute;top:21780;left:108"><nobr>function.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:21852;left:108"><nobr>Ordering overhead</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:21894;left:108"><nobr>The total time complexity of BB reordering is O(V<font style="font-size:7px">3</font>E), where V is the number of (hot) basic</nobr></div> | |
<div style="position:absolute;top:21916;left:108"><nobr>blocks in a function and E is the total number of (profiled) edges in the function. This time</nobr></div> | |
<div style="position:absolute;top:21939;left:108"><nobr>complexity is derived as follows:</nobr></div> | |
<div style="position:absolute;top:21984;left:108"><nobr>Merging of BB chains can happen at most V times. Every time two BB chains are merged</nobr></div> | |
<div style="position:absolute;top:22006;left:108"><nobr>together, we must recompute the gains from merging the newly created chain with every other</nobr></div> | |
<div style="position:absolute;top:22029;left:108"><nobr>existing BB chain. A crude analysis shows that the time complexity of the gain recomputation</nobr></div> | |
<div style="position:absolute;top:22051;left:108"><nobr>phase is at most O(V<font style="font-size:7px">2</font>E). Multiplying this by V gives the claimed running time.</nobr></div> | |
<div style="position:absolute;top:22096;left:108"><nobr>The time complexity of the HFSort function reordering stage is O(N lg N + C), where N is the</nobr></div> | |
<div style="position:absolute;top:22119;left:108"><nobr>total number of profiled functions and C is the number of profiled call edges between functions.</nobr></div> | |
<div style="position:absolute;top:22164;left:108"><nobr>In practice the total ordering overhead is about 6 extra seconds of linker time for building clang</nobr></div> | |
<div style="position:absolute;top:22186;left:108"><nobr>and about 2 seconds for Search1.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:22239;left:108"><nobr>Experiments</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:22311;left:108"><nobr>The baseline benchmark is built with peak optimizations, PGO and ThinLTO, enabled. Also,</nobr></div> | |
<div style="position:absolute;top:22334;left:108"><nobr>Propeller has been used in two configurations. The first configuration, called "List", only creates</nobr></div> | |
<div style="position:absolute;top:22356;left:108"><nobr>basic block sections for functions with samples, this is the default configuration used with</nobr></div> | |
<div style="position:absolute;top:22379;left:108"><nobr>-fpropeller-optimize flag. The second configuration, called "all", creates basic blocks for all</nobr></div> | |
<div style="position:absolute;top:22401;left:108"><nobr>functions, enabled with -fbasicblock-sections=all. All the benchmarks were built for the X86_64</nobr></div> | |
<div style="position:absolute;top:22424;left:108"><nobr>platform on a 18 core Intel machine with 192 G of RAM. The SPEC benchmarks were also run</nobr></div> | |
<div style="position:absolute;top:22446;left:108"><nobr>on this machine and the large benchmarks have their own harnesses for performance</nobr></div> | |
<div style="position:absolute;top:22469;left:108"><nobr>measurement.</nobr></div> | |
<div style="position:absolute;top:22514;left:108"><nobr>BOLT built from upstream sources failed to optimize our large benchmarks mainly because it</nobr></div> | |
<div style="position:absolute;top:22536;left:108"><nobr>could not disassemble binaries when read-only data was mixed with code, or when multiple text</nobr></div> | |
<div style="position:absolute;top:22559;left:108"><nobr>sections were used. We have implemented a patch with BOLT to exclude optimizing functions it</nobr></div> | |
<div style="position:absolute;top:22581;left:108"><nobr>could not understand, this is done by using trampolines to jump between optimized and original</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:22747;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="20"><b>Page 20</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:22856;left:108"><nobr>code. These problematic functions were not hot and did not directly affect performance. With</nobr></div> | |
<div style="position:absolute;top:22878;left:108"><nobr>this fix, BOLT is able to optimize all of our benchmarks.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:22928;left:108"><nobr>Memory Overhead</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:22969;left:108"><nobr>The memory overhead for the monolithic (non-parallel) action is compared here. BOLT does the</nobr></div> | |
<div style="position:absolute;top:22992;left:108"><nobr>disassembly, analysis and rewrite in one process and it is this overhead which is reported here.</nobr></div> | |
<div style="position:absolute;top:23014;left:108"><nobr>For Propeller and baseline, this overhead corresponds to the final link step. For both Propeller</nobr></div> | |
<div style="position:absolute;top:23037;left:108"><nobr>and BOLT, the memory needed to do profile conversion is not shown as this depends on</nobr></div> | |
<div style="position:absolute;top:23059;left:108"><nobr>several factors like the duration of profiling and the size of the profile. BOLT and Propeller’s</nobr></div> | |
<div style="position:absolute;top:23082;left:108"><nobr>profile conversion process is independent and has not been considered for comparison</nobr></div> | |
<div style="position:absolute;top:23136;left:132"><nobr>Benchmarks</nobr></div> | |
<div style="position:absolute;top:23136;left:285"><nobr>Baseline</nobr></div> | |
<div style="position:absolute;top:23136;left:443"><nobr>List</nobr></div> | |
<div style="position:absolute;top:23136;left:586"><nobr>All</nobr></div> | |
<div style="position:absolute;top:23136;left:713"><nobr>BOLT</nobr></div> | |
<div style="position:absolute;top:23173;left:146"><nobr>Search1</nobr></div> | |
<div style="position:absolute;top:23173;left:297"><nobr>6.8 G</nobr></div> | |
<div style="position:absolute;top:23173;left:432"><nobr>14.6 G</nobr></div> | |
<div style="position:absolute;top:23173;left:572"><nobr>34.9 G</nobr></div> | |
<div style="position:absolute;top:23173;left:718"><nobr>70 G</nobr></div> | |
<div style="position:absolute;top:23210;left:146"><nobr>Search2</nobr></div> | |
<div style="position:absolute;top:23210;left:297"><nobr>6.7 G</nobr></div> | |
<div style="position:absolute;top:23210;left:432"><nobr>11.7 G</nobr></div> | |
<div style="position:absolute;top:23210;left:572"><nobr>31.9 G</nobr></div> | |
<div style="position:absolute;top:23210;left:711"><nobr>64.7 G</nobr></div> | |
<div style="position:absolute;top:23247;left:149"><nobr>Storage</nobr></div> | |
<div style="position:absolute;top:23247;left:297"><nobr>4.6 G</nobr></div> | |
<div style="position:absolute;top:23247;left:437"><nobr>8.6 G</nobr></div> | |
<div style="position:absolute;top:23247;left:572"><nobr>17.8 G</nobr></div> | |
<div style="position:absolute;top:23247;left:711"><nobr>26.4 G</nobr></div> | |
<div style="position:absolute;top:23284;left:155"><nobr>Clang</nobr></div> | |
<div style="position:absolute;top:23284;left:297"><nobr>0.4 G</nobr></div> | |
<div style="position:absolute;top:23284;left:437"><nobr>2.9 G</nobr></div> | |
<div style="position:absolute;top:23284;left:576"><nobr>6.4 G</nobr></div> | |
<div style="position:absolute;top:23284;left:711"><nobr>33.9 G</nobr></div> | |
<div style="position:absolute;top:23322;left:135"><nobr>SPEC [min]</nobr></div> | |
<div style="position:absolute;top:23342;left:143"><nobr>(505.mcf)</nobr></div> | |
<div style="position:absolute;top:23322;left:298"><nobr>32 M</nobr></div> | |
<div style="position:absolute;top:23322;left:438"><nobr>36 M</nobr></div> | |
<div style="position:absolute;top:23322;left:577"><nobr>37 M</nobr></div> | |
<div style="position:absolute;top:23322;left:711"><nobr>30 MB</nobr></div> | |
<div style="position:absolute;top:23379;left:133"><nobr>SPEC [max]</nobr></div> | |
<div style="position:absolute;top:23399;left:116"><nobr>(523.xalancbmk)</nobr></div> | |
<div style="position:absolute;top:23379;left:294"><nobr>111 M</nobr></div> | |
<div style="position:absolute;top:23379;left:433"><nobr>161 M</nobr></div> | |
<div style="position:absolute;top:23379;left:573"><nobr>260 M</nobr></div> | |
<div style="position:absolute;top:23379;left:712"><nobr>822 M</nobr></div> | |
<div style="position:absolute;top:23451;left:108"><nobr>With Propeller, it is clear that the "List" configuration which creates basic block sections on</nobr></div> | |
<div style="position:absolute;top:23473;left:108"><nobr>demand does significantly better than "All". "List" is lower by more than 50 % compared to "All".</nobr></div> | |
<div style="position:absolute;top:23496;left:108"><nobr>This is primarily because a large percentage of the code is cold for large benchmarks (more</nobr></div> | |
<div style="position:absolute;top:23518;left:108"><nobr>than 80 %) and the "List" configuration eliminates the overhead of creating basic block sections</nobr></div> | |
<div style="position:absolute;top:23541;left:108"><nobr>for the cold parts. The code layout optimization ignores functions without samples and hence,</nobr></div> | |
<div style="position:absolute;top:23563;left:108"><nobr>this does not affect performance.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:23635;left:108"><nobr>Time Overhead</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:23677;left:108"><nobr>The time overheads correspond to the same actions for which memory overheads are</nobr></div> | |
<div style="position:absolute;top:23700;left:108"><nobr>presented.</nobr></div> | |
<div style="position:absolute;top:23745;left:108"><nobr>The data shows that Propeller, "list" configuration in particular, is much faster than BOLT in</nobr></div> | |
<div style="position:absolute;top:23767;left:108"><nobr>optimizing binaries but is slower than baseline. The additional time needed by Propeller is due</nobr></div> | |
<div style="position:absolute;top:23790;left:108"><nobr>to two reasons. First, the link action constructs the Inter-procedural control-flow graph (iCFG)</nobr></div> | |
</span></font> | |
<div style="position:absolute;top:23935;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="21"><b>Page 21</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:24044;left:108"><nobr>and runs the layout algorithm to discover an optimized ordering of basic blocks. Second, the</nobr></div> | |
<div style="position:absolute;top:24066;left:108"><nobr>input object file sizes are larger than the baseline due to basic block sections which impacts</nobr></div> | |
<div style="position:absolute;top:24089;left:108"><nobr>memory overhead and link time. Reducing CFI information bloat as discussed would reduce the</nobr></div> | |
<div style="position:absolute;top:24111;left:108"><nobr>link time considerably, by 10 % when building with "All" configuration</nobr></div> | |
<div style="position:absolute;top:24188;left:132"><nobr>Benchmarks</nobr></div> | |
<div style="position:absolute;top:24188;left:285"><nobr>Baseline</nobr></div> | |
<div style="position:absolute;top:24188;left:443"><nobr>List</nobr></div> | |
<div style="position:absolute;top:24188;left:586"><nobr>All</nobr></div> | |
<div style="position:absolute;top:24188;left:713"><nobr>BOLT</nobr></div> | |
<div style="position:absolute;top:24225;left:146"><nobr>Search1</nobr></div> | |
<div style="position:absolute;top:24225;left:302"><nobr>28 s</nobr></div> | |
<div style="position:absolute;top:24225;left:441"><nobr>87 s</nobr></div> | |
<div style="position:absolute;top:24225;left:576"><nobr>365 s</nobr></div> | |
<div style="position:absolute;top:24225;left:716"><nobr>675 s</nobr></div> | |
<div style="position:absolute;top:24262;left:146"><nobr>Search2</nobr></div> | |
<div style="position:absolute;top:24262;left:306"><nobr>7 s</nobr></div> | |
<div style="position:absolute;top:24262;left:437"><nobr>204 s</nobr></div> | |
<div style="position:absolute;top:24262;left:576"><nobr>565 s</nobr></div> | |
<div style="position:absolute;top:24262;left:716"><nobr>504 s</nobr></div> | |
<div style="position:absolute;top:24299;left:149"><nobr>Storage</nobr></div> | |
<div style="position:absolute;top:24299;left:302"><nobr>12 s</nobr></div> | |
<div style="position:absolute;top:24299;left:441"><nobr>49 s</nobr></div> | |
<div style="position:absolute;top:24299;left:576"><nobr>188 s</nobr></div> | |
<div style="position:absolute;top:24299;left:716"><nobr>500 s</nobr></div> | |
<div style="position:absolute;top:24336;left:155"><nobr>Clang</nobr></div> | |
<div style="position:absolute;top:24336;left:306"><nobr>5 s</nobr></div> | |
<div style="position:absolute;top:24336;left:441"><nobr>31 s</nobr></div> | |
<div style="position:absolute;top:24336;left:581"><nobr>53 s</nobr></div> | |
<div style="position:absolute;top:24336;left:716"><nobr>148 s</nobr></div> | |
<div style="position:absolute;top:24373;left:135"><nobr>SPEC [min]</nobr></div> | |
<div style="position:absolute;top:24394;left:118"><nobr>(531.deepsjeng)</nobr></div> | |
<div style="position:absolute;top:24373;left:295"><nobr>0.08 s</nobr></div> | |
<div style="position:absolute;top:24373;left:434"><nobr>0.29 s</nobr></div> | |
<div style="position:absolute;top:24373;left:574"><nobr>0.30 s</nobr></div> | |
<div style="position:absolute;top:24373;left:713"><nobr>0.98 s</nobr></div> | |
<div style="position:absolute;top:24431;left:133"><nobr>SPEC [max]</nobr></div> | |
<div style="position:absolute;top:24451;left:116"><nobr>(523.xalancbmk)</nobr></div> | |
<div style="position:absolute;top:24431;left:295"><nobr>0.31 s</nobr></div> | |
<div style="position:absolute;top:24431;left:434"><nobr>0.93 s</nobr></div> | |
<div style="position:absolute;top:24431;left:574"><nobr>2.08 s</nobr></div> | |
<div style="position:absolute;top:24431;left:713"><nobr>4.35 s</nobr></div> | |
<div style="position:absolute;top:24570;left:108"><nobr>An alternate design was considered where each module constructs a subset of the CFG and</nobr></div> | |
<div style="position:absolute;top:24593;left:108"><nobr>performs an intraprocedural code layout as part of the back-end action. While this would have</nobr></div> | |
<div style="position:absolute;top:24615;left:108"><nobr>been faster as it could be done in parallel or distributed, code layout would be limited to</nobr></div> | |
<div style="position:absolute;top:24638;left:108"><nobr>intra-procedural decisions. </nobr></div> | |
<div style="position:absolute;top:24683;left:108"><nobr>The time taken by Propeller and BOLT to profile the benchmark and convert the profiles was</nobr></div> | |
<div style="position:absolute;top:24705;left:108"><nobr>measured separately. For the Search1 benchmark, we profiled for 30 seconds, and the raw</nobr></div> | |
<div style="position:absolute;top:24728;left:108"><nobr>PMU profile size was ∼ 525M. Propeller needed a minute to convert this profile into its custom</nobr></div> | |
<div style="position:absolute;top:24750;left:108"><nobr>format, whereas, BOLT needed more than 4 minutes</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:24822;left:108"><nobr>Object and Binary Size Bloats with Propeller</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:24886;left:108"><nobr>This section presents data on the bloats to native object file sizes and the final optimized binary</nobr></div> | |
<div style="position:absolute;top:24909;left:108"><nobr>file sizes with Propeller. This data also shows the bloats to the peak optimized binary from</nobr></div> | |
<div style="position:absolute;top:24931;left:108"><nobr>labelling basic blocks with unique symbols. The SPEC benchmark sizes are not shown as they</nobr></div> | |
<div style="position:absolute;top:24954;left:108"><nobr>are much smaller in comparison. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:25123;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="22"><b>Page 22</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:25256;left:108"><nobr>Object File Sizes Bloat</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:25313;left:108"><nobr>The table shows the increase in object files sizes over baseline with the two Propeller</nobr></div> | |
<div style="position:absolute;top:25335;left:108"><nobr>configurations, “List” and “All”. The figures also show the increase from Call Frame Information</nobr></div> | |
<div style="position:absolute;top:25358;left:108"><nobr>(CFI) alone which ends up in the .eh_frame sections in the object files. CFI alone is responsible</nobr></div> | |
<div style="position:absolute;top:25380;left:108"><nobr>for 10 % of the bloat with "All" configuration and, as mentioned in 5.2, reducing CFI sizes would</nobr></div> | |
<div style="position:absolute;top:25403;left:108"><nobr>help a lot. Creating basic block sections on demand is the scalable approach to keep the bloats</nobr></div> | |
<div style="position:absolute;top:25425;left:108"><nobr>tractable.</nobr></div> | |
<div style="position:absolute;top:25547;left:120"><nobr>Benchmarks</nobr></div> | |
<div style="position:absolute;top:25547;left:289"><nobr>Baseline</nobr></div> | |
<div style="position:absolute;top:25547;left:488"><nobr>List</nobr></div> | |
<div style="position:absolute;top:25547;left:689"><nobr>All</nobr></div> | |
<div style="position:absolute;top:25584;left:241"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:25584;left:350"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:25584;left:425"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:25584;left:524"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:25584;left:609"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:25584;left:735"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:25621;left:136"><nobr>Search1</nobr></div> | |
<div style="position:absolute;top:25621;left:249"><nobr>83.3 M</nobr></div> | |
<div style="position:absolute;top:25621;left:349"><nobr>2.5 G</nobr></div> | |
<div style="position:absolute;top:25621;left:435"><nobr>117 M</nobr></div> | |
<div style="position:absolute;top:25621;left:519"><nobr>3.64 G</nobr></div> | |
<div style="position:absolute;top:25621;left:619"><nobr>996 M</nobr></div> | |
<div style="position:absolute;top:25621;left:729"><nobr>8.73 G</nobr></div> | |
<div style="position:absolute;top:25658;left:136"><nobr>Search2</nobr></div> | |
<div style="position:absolute;top:25658;left:249"><nobr>83.5 M</nobr></div> | |
<div style="position:absolute;top:25658;left:349"><nobr>2.5 G</nobr></div> | |
<div style="position:absolute;top:25658;left:435"><nobr>135 M</nobr></div> | |
<div style="position:absolute;top:25658;left:523"><nobr>3.7 G</nobr></div> | |
<div style="position:absolute;top:25658;left:619"><nobr>990 M</nobr></div> | |
<div style="position:absolute;top:25658;left:734"><nobr>8.7 G</nobr></div> | |
<div style="position:absolute;top:25695;left:137"><nobr>Storage</nobr></div> | |
<div style="position:absolute;top:25695;left:249"><nobr>24.4 M</nobr></div> | |
<div style="position:absolute;top:25695;left:349"><nobr>1.3 G</nobr></div> | |
<div style="position:absolute;top:25695;left:440"><nobr>45 M</nobr></div> | |
<div style="position:absolute;top:25695;left:523"><nobr>2.3 G</nobr></div> | |
<div style="position:absolute;top:25695;left:619"><nobr>165 M</nobr></div> | |
<div style="position:absolute;top:25695;left:734"><nobr>4.9 G</nobr></div> | |
<div style="position:absolute;top:25732;left:145"><nobr>Clang</nobr></div> | |
<div style="position:absolute;top:25732;left:249"><nobr>10.2 M</nobr></div> | |
<div style="position:absolute;top:25732;left:345"><nobr>351 M</nobr></div> | |
<div style="position:absolute;top:25732;left:433"><nobr>94.8 M</nobr></div> | |
<div style="position:absolute;top:25732;left:518"><nobr>64.7 M</nobr></div> | |
<div style="position:absolute;top:25732;left:619"><nobr>243 M</nobr></div> | |
<div style="position:absolute;top:25732;left:734"><nobr>1.4 G</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:25884;left:108"><nobr>Input Binary File Size Bloat</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:25941;left:108"><nobr>The table below shows the increase in the binary sizes of the labelled Propeller binary that is</nobr></div> | |
<div style="position:absolute;top:25963;left:108"><nobr>used to collect profiles. It also shows the bloat in the BOLT binary that is used as input to BOLT</nobr></div> | |
<div style="position:absolute;top:25986;left:108"><nobr>with static relocations enabled. With Propeller, the bloat is mainly due to the symbol table which</nobr></div> | |
<div style="position:absolute;top:26008;left:108"><nobr>now accommodates an entry for every basic block symbol. The string table ("strtab") is also</nobr></div> | |
<div style="position:absolute;top:26031;left:108"><nobr>bloated (∼ 8% for large binaries) due to the extra symbol names but the unary naming scheme</nobr></div> | |
<div style="position:absolute;top:26053;left:108"><nobr>keeps this small. However, these bloats do not affect performance as the symbol table and the</nobr></div> | |
<div style="position:absolute;top:26076;left:108"><nobr>string table are not loaded onto memory. For BOLT, the bloat mainly comes from saving all the</nobr></div> | |
<div style="position:absolute;top:26098;left:108"><nobr>static relocations in the final binary, which is very useful to disassemble the binary, and does not</nobr></div> | |
<div style="position:absolute;top:26121;left:108"><nobr>affect performance. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:26311;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="23"><b>Page 23</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:26451;left:65"><nobr>Benchmarks</nobr></div> | |
<div style="position:absolute;top:26451;left:245"><nobr>Baseline</nobr></div> | |
<div style="position:absolute;top:26451;left:476"><nobr>Labels</nobr></div> | |
<div style="position:absolute;top:26451;left:712"><nobr>BOLT</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:12px;font-family:Times"> | |
<div style="position:absolute;top:26489;left:187"><nobr>sym +</nobr></div> | |
<div style="position:absolute;top:26507;left:186"><nobr>str tab</nobr></div> | |
<div style="position:absolute;top:26489;left:257"><nobr>relocs</nobr></div> | |
<div style="position:absolute;top:26489;left:331"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:26489;left:398"><nobr>sym +</nobr></div> | |
<div style="position:absolute;top:26507;left:397"><nobr>str tab</nobr></div> | |
<div style="position:absolute;top:26489;left:475"><nobr>relocs</nobr></div> | |
<div style="position:absolute;top:26489;left:560"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:26489;left:629"><nobr>sym +</nobr></div> | |
<div style="position:absolute;top:26507;left:628"><nobr>str tab</nobr></div> | |
<div style="position:absolute;top:26489;left:701"><nobr>relocs</nobr></div> | |
<div style="position:absolute;top:26489;left:789"><nobr>Total</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:26541;left:81"><nobr>Search1</nobr></div> | |
<div style="position:absolute;top:26541;left:183"><nobr>251 M</nobr></div> | |
<div style="position:absolute;top:26541;left:271"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26541;left:327"><nobr>1.7 G</nobr></div> | |
<div style="position:absolute;top:26541;left:396"><nobr>579 M</nobr></div> | |
<div style="position:absolute;top:26541;left:491"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26541;left:557"><nobr>2.2 G</nobr></div> | |
<div style="position:absolute;top:26541;left:626"><nobr>253 M</nobr></div> | |
<div style="position:absolute;top:26541;left:698"><nobr>253 M</nobr></div> | |
<div style="position:absolute;top:26541;left:792"><nobr>2 G</nobr></div> | |
<div style="position:absolute;top:26578;left:81"><nobr>Search2</nobr></div> | |
<div style="position:absolute;top:26578;left:183"><nobr>252 M</nobr></div> | |
<div style="position:absolute;top:26578;left:271"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26578;left:327"><nobr>1.7 G</nobr></div> | |
<div style="position:absolute;top:26578;left:396"><nobr>577 M</nobr></div> | |
<div style="position:absolute;top:26578;left:491"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26578;left:557"><nobr>2.2 G</nobr></div> | |
<div style="position:absolute;top:26578;left:626"><nobr>257 M</nobr></div> | |
<div style="position:absolute;top:26578;left:698"><nobr>263 M</nobr></div> | |
<div style="position:absolute;top:26578;left:792"><nobr>2 G</nobr></div> | |
<div style="position:absolute;top:26616;left:82"><nobr>Storage</nobr></div> | |
<div style="position:absolute;top:26616;left:188"><nobr>65 M</nobr></div> | |
<div style="position:absolute;top:26616;left:271"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26616;left:327"><nobr>1.1 G</nobr></div> | |
<div style="position:absolute;top:26616;left:396"><nobr>209 M</nobr></div> | |
<div style="position:absolute;top:26616;left:491"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26616;left:557"><nobr>1.5 G</nobr></div> | |
<div style="position:absolute;top:26616;left:630"><nobr>68 M</nobr></div> | |
<div style="position:absolute;top:26616;left:698"><nobr>114 M</nobr></div> | |
<div style="position:absolute;top:26616;left:785"><nobr>1.2 G</nobr></div> | |
<div style="position:absolute;top:26653;left:90"><nobr>Clang</nobr></div> | |
<div style="position:absolute;top:26653;left:188"><nobr>15 M</nobr></div> | |
<div style="position:absolute;top:26653;left:271"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26653;left:329"><nobr>89 M</nobr></div> | |
<div style="position:absolute;top:26653;left:401"><nobr>80 M</nobr></div> | |
<div style="position:absolute;top:26653;left:491"><nobr>0</nobr></div> | |
<div style="position:absolute;top:26653;left:554"><nobr>154 M</nobr></div> | |
<div style="position:absolute;top:26653;left:630"><nobr>15 M</nobr></div> | |
<div style="position:absolute;top:26653;left:702"><nobr>34 M</nobr></div> | |
<div style="position:absolute;top:26653;left:783"><nobr>124 M</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:26796;left:108"><nobr>Final Optimized Binary Size Bloat</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:26853;left:108"><nobr>The table below shows the final optimized binary sizes with the Propeller configurations. BOLT’s</nobr></div> | |
<div style="position:absolute;top:26875;left:108"><nobr>binary sizes for Search1 and Storage benchmark are smaller than expected as there was a seg.</nobr></div> | |
<div style="position:absolute;top:26898;left:108"><nobr>fault with BOLT when trying to build with DebugInfo.</nobr></div> | |
<div style="position:absolute;top:26943;left:108"><nobr>With Propeller, the bloat in final binary sizes is mainly due to CFI information. This overhead</nobr></div> | |
<div style="position:absolute;top:26965;left:108"><nobr>makes a very compelling case for CFI to support dis-contiguous ranges like DebugInfo and</nobr></div> | |
<div style="position:absolute;top:26988;left:108"><nobr>almost all of this overhead can be eliminated with that support. There is no overhead due to</nobr></div> | |
<div style="position:absolute;top:27010;left:108"><nobr>basic block labels as unnecessary label symbols are deleted and only the first basic block of</nobr></div> | |
<div style="position:absolute;top:27033;left:108"><nobr>every function part that has been split is retained for debugging.</nobr></div> | |
<div style="position:absolute;top:27109;left:61"><nobr>Benchmarks</nobr></div> | |
<div style="position:absolute;top:27109;left:214"><nobr>Baseline</nobr></div> | |
<div style="position:absolute;top:27109;left:398"><nobr>List</nobr></div> | |
<div style="position:absolute;top:27109;left:573"><nobr>All</nobr></div> | |
<div style="position:absolute;top:27109;left:745"><nobr>BOLT</nobr></div> | |
<div style="position:absolute;top:27147;left:174"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:27147;left:270"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:27147;left:340"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:27147;left:439"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:27147;left:513"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:27147;left:613"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:27147;left:693"><nobr>.ehframe</nobr></div> | |
<div style="position:absolute;top:27147;left:807"><nobr>Total</nobr></div> | |
<div style="position:absolute;top:27184;left:77"><nobr>Search1</nobr></div> | |
<div style="position:absolute;top:27184;left:181"><nobr>63.7 M</nobr></div> | |
<div style="position:absolute;top:27184;left:268"><nobr>1.7 G</nobr></div> | |
<div style="position:absolute;top:27184;left:348"><nobr>63.7 M</nobr></div> | |
<div style="position:absolute;top:27184;left:437"><nobr>1.9 G</nobr></div> | |
<div style="position:absolute;top:27184;left:515"><nobr>469.2 M</nobr></div> | |
<div style="position:absolute;top:27184;left:611"><nobr>2.7 G</nobr></div> | |
<div style="position:absolute;top:27184;left:708"><nobr>99 M</nobr></div> | |
<div style="position:absolute;top:27184;left:804"><nobr>1.1 G</nobr></div> | |
<div style="position:absolute;top:27221;left:77"><nobr>Search2</nobr></div> | |
<div style="position:absolute;top:27221;left:181"><nobr>63.9 M</nobr></div> | |
<div style="position:absolute;top:27221;left:268"><nobr>1.7 G</nobr></div> | |
<div style="position:absolute;top:27221;left:348"><nobr>72.1 M</nobr></div> | |
<div style="position:absolute;top:27221;left:437"><nobr>1.9 G</nobr></div> | |
<div style="position:absolute;top:27221;left:515"><nobr>466.4 M</nobr></div> | |
<div style="position:absolute;top:27221;left:611"><nobr>2.7 G</nobr></div> | |
<div style="position:absolute;top:27221;left:703"><nobr>116 M</nobr></div> | |
<div style="position:absolute;top:27221;left:804"><nobr>2.4 G</nobr></div> | |
<div style="position:absolute;top:27258;left:78"><nobr>Storage</nobr></div> | |
<div style="position:absolute;top:27258;left:181"><nobr>18.2 M</nobr></div> | |
<div style="position:absolute;top:27258;left:268"><nobr>1.1 G</nobr></div> | |
<div style="position:absolute;top:27258;left:348"><nobr>23.3 M</nobr></div> | |
<div style="position:absolute;top:27258;left:437"><nobr>1.4 G</nobr></div> | |
<div style="position:absolute;top:27258;left:515"><nobr>199.2 M</nobr></div> | |
<div style="position:absolute;top:27258;left:611"><nobr>1.9 G</nobr></div> | |
<div style="position:absolute;top:27258;left:708"><nobr>32 M</nobr></div> | |
<div style="position:absolute;top:27258;left:795"><nobr>717.8 M</nobr></div> | |
<div style="position:absolute;top:27295;left:86"><nobr>Clang</nobr></div> | |
<div style="position:absolute;top:27295;left:192"><nobr>7 M</nobr></div> | |
<div style="position:absolute;top:27295;left:259"><nobr>123.6 M</nobr></div> | |
<div style="position:absolute;top:27295;left:348"><nobr>46.4 M</nobr></div> | |
<div style="position:absolute;top:27295;left:428"><nobr>153.6 M</nobr></div> | |
<div style="position:absolute;top:27295;left:515"><nobr>121.6 M</nobr></div> | |
<div style="position:absolute;top:27295;left:609"><nobr>270 M</nobr></div> | |
<div style="position:absolute;top:27295;left:701"><nobr>11.3 M</nobr></div> | |
<div style="position:absolute;top:27295;left:795"><nobr>244.3 M </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:27499;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="24"><b>Page 24</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:21px;font-family:Times"> | |
<div style="position:absolute;top:27635;left:108"><nobr>Performance Experiments</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:27699;left:108"><nobr>We could build Storage with BOLT but we could not get it to run with BOLT correctly. For code</nobr></div> | |
<div style="position:absolute;top:27721;left:108"><nobr>layout optimizations, both BOLT and Propeller show very similar speed-ups as the same</nobr></div> | |
<div style="position:absolute;top:27744;left:108"><nobr>algorithm has been used to re-order the code. </nobr></div> | |
<div style="position:absolute;top:27789;left:108"><nobr>The table below shows the performance numbers of the large benchmarks with Propeller and</nobr></div> | |
<div style="position:absolute;top:27811;left:108"><nobr>BOLT comparing it to the baseline binary. The performance wins for these large benchmarks</nobr></div> | |
<div style="position:absolute;top:27834;left:108"><nobr>with Propeller are impressive, ThinLTO gave similar speed-ups in performance over a</nobr></div> | |
<div style="position:absolute;top:27856;left:108"><nobr>non-ThinLTO baseline. Further, the "List" configuration does as well as the "All" configuration in</nobr></div> | |
<div style="position:absolute;top:27879;left:108"><nobr>performance. It is clear from these results that using the "List" configuration keeps the memory</nobr></div> | |
<div style="position:absolute;top:27901;left:108"><nobr>and time overheads low without losing out on performance. BOLT’s slow-down of the Search1</nobr></div> | |
<div style="position:absolute;top:27924;left:108"><nobr>benchmark is primarily due to text huge pages getting disabled with BOLT as it keeps the</nobr></div> | |
<div style="position:absolute;top:27946;left:108"><nobr>optimized code in a separate segment. We ran an experiment to measure BOLT’s performance</nobr></div> | |
<div style="position:absolute;top:27969;left:108"><nobr>improvement over the baseline binary when hugepages were explicitly disabled for both. In this</nobr></div> | |
<div style="position:absolute;top:27991;left:108"><nobr>case, BOLT’s performance improvements over the baseline binary were closer, though still</nobr></div> | |
<div style="position:absolute;top:28014;left:108"><nobr>smaller, to Propeller’s improvements</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:28060;left:108"><nobr>Large Benchmarks</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:28169;left:132"><nobr>Benchmarks</nobr></div> | |
<div style="position:absolute;top:28169;left:295"><nobr>Metric</nobr></div> | |
<div style="position:absolute;top:28149;left:534"><nobr>% Improvements</nobr></div> | |
<div style="position:absolute;top:28186;left:493"><nobr>Propeller</nobr></div> | |
<div style="position:absolute;top:28186;left:713"><nobr>BOLT</nobr></div> | |
<div style="position:absolute;top:28223;left:443"><nobr>List</nobr></div> | |
<div style="position:absolute;top:28223;left:586"><nobr>All</nobr></div> | |
<div style="position:absolute;top:28260;left:146"><nobr>Search1</nobr></div> | |
<div style="position:absolute;top:28260;left:299"><nobr>QPS</nobr></div> | |
<div style="position:absolute;top:28260;left:435"><nobr>2.6 %</nobr></div> | |
<div style="position:absolute;top:28260;left:575"><nobr>2.6 %</nobr></div> | |
<div style="position:absolute;top:28260;left:714"><nobr>2.0 %</nobr></div> | |
<div style="position:absolute;top:28297;left:146"><nobr>Search2</nobr></div> | |
<div style="position:absolute;top:28297;left:299"><nobr>QPS</nobr></div> | |
<div style="position:absolute;top:28297;left:435"><nobr>5.8 %</nobr></div> | |
<div style="position:absolute;top:28297;left:575"><nobr>5.9 %</nobr></div> | |
<div style="position:absolute;top:28297;left:712"><nobr>SEGV</nobr></div> | |
<div style="position:absolute;top:28335;left:149"><nobr>Storage</nobr></div> | |
<div style="position:absolute;top:28335;left:288"><nobr>Latency</nobr></div> | |
<div style="position:absolute;top:28335;left:442"><nobr>6 %</nobr></div> | |
<div style="position:absolute;top:28335;left:582"><nobr>6 %</nobr></div> | |
<div style="position:absolute;top:28335;left:712"><nobr>SEGV</nobr></div> | |
<div style="position:absolute;top:28372;left:155"><nobr>Clang</nobr></div> | |
<div style="position:absolute;top:28372;left:263"><nobr>Bootstrap time</nobr></div> | |
<div style="position:absolute;top:28372;left:442"><nobr>9 %</nobr></div> | |
<div style="position:absolute;top:28372;left:582"><nobr>9 %</nobr></div> | |
<div style="position:absolute;top:28372;left:721"><nobr>9 % </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:28687;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="25"><b>Page 25</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:28920;left:108"><nobr>SPEC benchmarks and PMU data</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:28999;left:108"><nobr>The figures below show the PMU data of all the benchmarks for the events : icache missses,</nobr></div> | |
<div style="position:absolute;top:29022;left:108"><nobr>branches not taken and cycles. The percentage difference in the counts of each event relative</nobr></div> | |
<div style="position:absolute;top:29044;left:108"><nobr>to the baseline is shown. For SPEC, BOLT and Propeller have a similar regression effect on the</nobr></div> | |
<div style="position:absolute;top:29067;left:108"><nobr>cycle count. On average, BOLT, Propeller-List, and Propeller-All increase the cycle count by</nobr></div> | |
<div style="position:absolute;top:29089;left:108"><nobr>1.0%, 1.5%, and 1.6%, respectively. Across all 8 benchmarks, Bolt wins over Propeller by more</nobr></div> | |
<div style="position:absolute;top:29112;left:108"><nobr>than 0.5% on 5 benchmarks (perlbench, x264, deepsjeng, leela, and xz), while Propeller-List</nobr></div> | |
<div style="position:absolute;top:29134;left:108"><nobr>wins on 1 (xalancbmk). The code layout optimization tends to increase the number of non_taken</nobr></div> | |
<div style="position:absolute;top:29157;left:108"><nobr>branches which consistently increases on all benchmarks, with BOLT and Propeller being very</nobr></div> | |
<div style="position:absolute;top:29179;left:108"><nobr>similar. The icache miss data, Figure 5c also show a lower number of misses with Propeller and</nobr></div> | |
<div style="position:absolute;top:29202;left:108"><nobr>BOLT compared to baseline, which is an expected outcome from the optimization. However, the</nobr></div> | |
<div style="position:absolute;top:29224;left:108"><nobr>dynamic instruction counts increase on 4/8 SPEC benchmarks with mcf being the highest. This</nobr></div> | |
<div style="position:absolute;top:29247;left:108"><nobr>is because changing the code layout could replace an implicit fall through with a direct branch.</nobr></div> | |
<div style="position:absolute;top:29292;left:108"><nobr>For SPEC benchmarks, even though the code layout optimization seems to be doing the right</nobr></div> | |
<div style="position:absolute;top:29314;left:108"><nobr>things, the performance numbers show no improvements on most benchmarks and regressions</nobr></div> | |
<div style="position:absolute;top:29337;left:108"><nobr>on a couple. We did a top-down analysis on these benchmarks and looked at the % of issue</nobr></div> | |
<div style="position:absolute;top:29359;left:108"><nobr>slots wasted waiting on front-end events, which is a metric for their frontend boundness. For</nobr></div> | |
<div style="position:absolute;top:29382;left:108"><nobr>SPEC benchmarks, this value is about 4% − 12%, whereas for the large benchmarks it is more</nobr></div> | |
<div style="position:absolute;top:29404;left:108"><nobr>than 20% and up to 38% for clang. Code layout optimizations reduce front-end stalls and hence</nobr></div> | |
<div style="position:absolute;top:29427;left:108"><nobr>tend to positively impact benchmarks that are more front-end bound. </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:29875;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="26"><b>Page 26</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:30008;left:108"><nobr>Icache Data</nobr></div> | |
<div style="position:absolute;top:30447;left:108"><nobr>Branches Not taken Data </nobr></div> | |
</span></font> | |
<div style="position:absolute;top:31063;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="27"><b>Page 27</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343"> | |
<div style="position:absolute;top:31196;left:108"><nobr>Cycles Data</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:31660;left:108"><nobr>Conclusion</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:31732;left:108"><nobr>Propeller is a new framework for post link optimizations that has been built for scalability and</nobr></div> | |
<div style="position:absolute;top:31755;left:108"><nobr>has lower memory and time overheads compared to the state-of-the-art, Facebook’s BOLT.</nobr></div> | |
<div style="position:absolute;top:31777;left:108"><nobr>Propeller re-links the binary from optimized LLVM IR and achieves scalability by distributing the</nobr></div> | |
<div style="position:absolute;top:31800;left:108"><nobr>heavy weight optimization tasks to the compiler back-ends which can be executed in parallel</nobr></div> | |
<div style="position:absolute;top:31822;left:108"><nobr>with several threads or distributed across several machines. Propeller introduces basic block</nobr></div> | |
<div style="position:absolute;top:31845;left:108"><nobr>sections which keeps the actual link light-weight and hence the memory and time overheads</nobr></div> | |
<div style="position:absolute;top:31867;left:108"><nobr>smaller. Propeller has been implemented as a part of the LLVM framework and experimental</nobr></div> | |
<div style="position:absolute;top:31890;left:108"><nobr>results show that Propeller produces similar performance gains to BOLT but with much less</nobr></div> | |
<div style="position:absolute;top:31912;left:108"><nobr>memory and time overheads.</nobr></div> | |
</span></font> | |
<font size="4" face="Times"><span style="font-size:27px;font-family:Times"> | |
<div style="position:absolute;top:32010;left:108"><nobr>References</nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:32082;left:135"><nobr>1. Facebook BOLT presentation:</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc"> | |
<div style="position:absolute;top:32104;left:162"><nobr><a href="https://llvm.org/devmtg/2016-03/Presentations/BOLT_EuroLLVM_2016.pdf">https://llvm.org/devmtg/2016-03/Presentations/BOLT_EuroLLVM_2016.pdf</a></nobr></div> | |
</span></font> | |
<div style="position:absolute;top:32251;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="28"><b>Page 28</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:32360;left:135"><nobr>2. Facebook BOLT video: <a href="https://www.youtube.com/watch?v=gw3iDO3By5Y"></a><font color="#1155cc"><a href="https://www.youtube.com/watch?v=gw3iDO3By5Y">https://www.youtube.com/watch?v=gw3iDO3By5Y</a></font></nobr></div> | |
<div style="position:absolute;top:32382;left:135"><nobr>3. BOLT blog post:</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc"> | |
<div style="position:absolute;top:32405;left:162"><nobr><a href="https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-with-bolt/">https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-</a></nobr></div> | |
<div style="position:absolute;top:32427;left:162"><nobr><a href="https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-with-bolt/">with-bolt/</a></nobr></div> | |
</span></font> | |
<font size="3" face="Times"><span style="font-size:14px;font-family:Times"> | |
<div style="position:absolute;top:32450;left:135"><nobr>4. Improved Basic Block Reordering: <font color="#1155cc"><a href="https://arxiv.org/abs/1809.04676">https://arxiv.org/abs/1809.04676</a></font></nobr></div> | |
<div style="position:absolute;top:32472;left:135"><nobr>5. Optimizing function placement for large-scale data-center applications:</nobr></div> | |
</span></font> | |
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc"> | |
<div style="position:absolute;top:32495;left:162"><nobr><a href="https://dl.acm.org/citation.cfm?id=3049858">https://dl.acm.org/citation.cfm?id=3049858</a></nobr></div> | |
</span></font> | |
</div></body></html> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment