theme | class | highlighter | fonts | ||||||
---|---|---|---|---|---|---|---|---|---|
default |
text-center |
MaskRay |
|
MaskRay
- Worked on lld codebase since 2018. Maintaining it since lld 9 (now is 17)
- Debugged hundreds of ELF linker issues: internal, Android, ChromeOS, Linux kernel, misc llvm-dev questions
- Studied many
lld/test/ELF/
tests (more than half) - Read much of llvm-{nm,readobj,objcopy,objdump,...}/musl/FreeBSD/glibc ld.so code: they help understand how the linker output affects consumers
- Filed ~100 binutils bugs
- Not fancy as many other toolchain components
- Some basics are useful
- binary utilities
- Object file format, compiler instrumentations
Object file formats are very different.
- a.out (extensions): SunOS a.out ld, NetBSD a.out ld, GNU ld
- PE/COFF: MSVC link.exe, GNU ld, lld-link
- XCOFF: AIX linker
- Mach-O: Apple ld64, michaeleisel/zld, ld64.lld, mold/macho
- ELF: Solaris ld, GNU ld, GNU gold, MC linker, ld.lld, mold/elf
- WebAssembly: wasm-ld
graph TD
a.o --> Linker
b.o --> Linker
libc.so --> Linker
libd.a --> Linker
Linker --> a.out
Combine input sections
graph TD
file0 --> 0text{.text}
file0 --> 0data{.data}
file1 --> 1text{.text}
file1 --> 1data{.data}
0text --> .text
1text --> .text
0data --> .data
1data --> .data
.text --> a.out
.data --> a.out
Symbol resolution
- local symbols: X, Y, ...
- non-local (STB_GLOBAL or STB_WEAK) symbols: A, B, ...
graph TD
0.o --> X0[X] --> a.out
0.o --> Y0[Y] --> a.out
0.o --> A0["A"] --> A --> a.out
0.o --> B0["B (weak)"] -.-> B
1.o --> X1[X] --> a.out
1.o --> A1["A (undefined)"] -.-> A
1.o --> B1["B"] --> B --> a.out
1.o --> C1 --> C --> a.out
1.o --> D1["D (undefined)"] -.-> D
2.so --> C2 -.-> C
2.so --> D2 --> D["D (shared)"] --> a.out
- Parse command line options
- Find and scan input files (
.o
,.so
,.a
), interleaved with symbol resolution - Global transforms (section based garbage collection, identical code folding, etc)
- Create synthetic sections
- Map input sections and synthetic (linker-generated) sections into output sections
- Scan relocations
- Layout and assign addresses (thunks, sections with varying sizes, symbol assignments)
- Write headers
- Copy section contents to output and resolve relocations
Find input files
- Read and parse input files one by one
-L
provides search paths.-l
looks up them--sysroot
affectsINPUT
/GROUP
in linker scripts-Bdynamic
finds both .so and .a.-Bstatic
finds just .a
Scan input files, interleaved with symbol resolution
- input files: relocatable object files (
.o
), shared objects (.so
), archives (.a
), linker scripts (commonly.lds
and.t
) - File extensions are used by
-l
, but otherwise have no special semantics - symbols: defined (
.o
), shared (.so
defined), lazy (.a
defined), undefined - Relocatable object files contributes sections, symbols, and relocations
- Shared objects contributes just symbols
- Archives provide so-called lazy definitions: either like unused undefined symbol or archive member extraction
- input sections:
.rodata
,.text
,.data
,.bss
, ... - entries in symbol table (
SHT_SYMTAB
): defined, undefined - relocations: record them to be scanned later
The local part is file-local while the non-local part requires resolution with other files.
The non-local part consists of STB_GLOBAL
, STB_WEAK
, and (GCC specific) STB_GNU_UNIQUE
symbols.
For each entry, reserve a symbol in the global symbol table if not present yet.
A shared object contributes just symbols to the link process.
- entries in dynamic symbol table (
SHT_DYNSYM
): shared, undefined - symbol versioning:
SHT_GNU_versym
,SHT_GNU_verdef
, (for--no-allow-shlib-undefined
and non-default version symbol export)SHT_GNU_verneed
(mold < gold < GNU ld ~ ld.lld) DT_SONAME
, (for--no-allow-shlib-undefined
)DT_NEEDED
This is an ancient trick to decrease linker work: initially every archive member is in a lazy state, if an archive member provides a definition resolving an undefined symbol, it will be extracted.
Extraction may trigger recursive extraction within (default) the archive or (--start-group
) other archives.
https://maskray.me/blog/2021-06-20-symbol-processing#archive-processing
- GNU ld, gold style
- VMS, Apple ld64, MSVC link.exe, ld.lld, mold
GNU ld/gold style is not user-friendly because the undefined reference
diagnostic is difficult to act on.
However, it has a loose layering check property.
ld.lld --warn-backrefs
achieves goods of both worlds
How do a non-local symbol interract with another symbol of the name in other object files?
Resolution between two .o
defined/undefined symbols is associative.
In the absence of weak symbols and duplicate definition errors, the interaction is also commutative.
Resolution between a .o
and a .so
is commutative and associative.
Resolution between two .so
is associative.
If two .so
don't define the same symbol, the resolution is also commutative.
For an archive symbol, its interaction with any other file kind is neither commutative nor associative.
- type, binding, visibility, size
- defined (definition in
.o
): section, value; (for bitcode) file- to support linker scripts,
section
either an output section or an input section
- to support linker scripts,
- undefined (undefined in .o/.so): (for diagnostics) file
- shared (definition in .so): file, (for copy relocations) inferred alignment
- placeholder:
--trace-symbol
, (for symbol versioning) eliminated symbols
- initialization of local symbols (embarrassingly parallel)
- initialization of non-local symbols
- symbol resolution
To diagnose relocations referencing a local symbol defined in a discarded section (due to COMDAT rules), a local symbol may be represented as Undefined
.
We need COMDAT group resolution to decide whether a local symbol is Defined
or Undefined
.
Parallelizing section initialization is feasible for relocatable object files but may increase work for archives and lazy object files. If we know that many archive members will be eventually extracted, eager initializing sections may be a good idea.
- Symbol representation needs some atomics
- Symbol resolution needs multiple passes
.a .so
vs.so .a
semantics are not preserved as I like (it works in 99.9% cases)- In one large executable, I measured:
- 1 thread: 1.3x slower than ld.lld (atomics have costs)
- 2 threads: 1.75x faster than 1 thread
- 4 threads: 3x faster than 1 thread
- 8 threads: 5.3x faster than 1 thread
- 16 threads: 5.4x faster than 1 thread
--gc-sections
: section based garbage collection (mark and sweep).
Linker garbage collection
In ld.lld, for a -ffunction-sections
build, GC can take little time (2%) if there is debug information or a lot (10%) when there is little debug information, thanks to the monolithic .debug_*
design
--icf
: graph isomorphism problem
- link.exe: Identical COMDAT Folding
- gold/ld.lld/mold: Identical Code Folding
Algorithms
- pessimistic: gold joins identical sections (pessimistic approach cannot merge some mutual recursive functions)
- optimistic: ld.lld/mold split distinct sections
ld.lld computes a lightweight message digest for each section, then collects hashes into buckets. For each bucket, it does pairwise comparison which is easy to reason about and supports condition branches well.
Another idea is to use message digests to fully describe a section (mold).
.got
: Global Offset Table.plt
: Procedure Linkage Table.dynsym, .dynstr
: dynamic symbol table and string table.dynamic
: dynamic linkage information.rela.dyn, .rela.plt, .relr.dyn
: dynamic relocations.hash, .gnu.hash
: SysV hash table, GNU hash table.symtab, .strtab
: static symbol table and string table- section header table,
.shstrtab
Dynamic sections (.dynsym, dynstr, .dynamic, .rela.dyn, .rela.plt
) are for dynamic linking
- Find output section for each input section (.text.foo -> .text, .rodata.bar -> .bar)
- Combine input sections
graph TD
file0 --> 0text{.text}
file0 --> 0data{.data}
file1 --> 1text{.text}
file1 --> 1data{.data}
0text --> .text
1text --> .text
0data --> .data
1data --> .data
.text --> a.out
.data --> a.out
Linker scripts (a powerful domain specific language): output section descriptions, input section descriptions
.text : {
*(.text .text.*)
}
.data : {
a.o(.data)
*:*(.data)
KEEP(*(EXCLUDE_FILE(*notinclude) .data))
}
.init_array : {
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
...
}
Write a parser inside the linker!
Symbol value: absolute value, input section plus offset, output section plus offset
- An input section may have a dependent
SHT_REL
/SHT_RELA
- Scan relocation sections to decide link-time constants, GOT, PLT, copy relocations, canonical PLT entries, TLS dynamic relocations, etc
- Relocations have different kinds
They may result into:
- Link-time constants: resolved
- GOT: .got, .rela.dyn
- PLT: .plt, .got.plt, .rela.plt
- non-preemptible ifunc: GNU indirect function
Linker is a validation tool.
- Error checking: undefined reference, relocation overflow, PC-relative referencing
SHN_ABS
in PIC links, relocation referencing a discarded section
Some architectures have unfortunate rules: Hexagon TLS, Mips multi-GOT, PowerPC .got2
- GOT optimization: x86-64
R_X86_64_REX_GOTPCRELX
, (recent) AArch64, PowerPC64 TOC .plt.got
optimization- RISC-V linker relaxation may or may not be placed here
- stack machine: LoongArch, 3 old architectures in binutils
mold's parallel scan (some atomics, out-of-order diagnostics):
- 1 thread: 2.35x faster than ld.lld
- 2 threads: 1.88x faster than 1 thread
- 4 threads: 3.28x faster than 1 thread
- 16 threads: 11.13x faster than 1 thread
Why is ld.lld slow? I think fundamentally it is because ld.lld does more work. It has more dispatches and abstraction costs (e.g. virtual function), and provides more diagnostics and handles many tricky cases that mold doesn't do:
- support all of REL/RELA/mips64el for one architecture (even if the architecture typically doesn't use REL)
- use a generic relocation expression (RelExpr) design which costs two virtual functions processing one relocation
- GNU ifunc
- retain otherwise unused
.got
and.got.plt
in edge cases - ...
With little debug information, in ld.lld --threads={1,2,4,8}, relocation scan takes 11~15% of total time. IMO parallism does not pay off if we don't optimize other critical paths.
- Compute the initial address
- Get output section sizes
- Assign addresses to output sections and symbols
- Assign addresses to input sections, data commands, and symbols
- Text sections may be interspersed with range-extension thunks
- Variable size sections (
SHT_RELR
): Mach-O avoided the problem by placing load commands at the end
Linker scripts
.
and symbol assignments, data commands in linker scripts- virtual address and load address
- VMA/LMA memory region
for (SectionCommand *cmd : sectionCommands) {
if (auto *assign = dyn_cast<SymbolAssignment>(cmd)) {
assign->addr = dot;
assignSymbol(assign, false); // outside an output section description
} else {
assignOffsets(cast<OutputSection>(cmd));
}
}
The representation of an input section includes a parent
pointer to the output section and an offset within in output section.
for (SectionCommand *cmd : sec->commands) {
if (auto *assign = dyn_cast<SymbolAssignment>(cmd)) {
// .text : { ... sym = .; }
assign->addr = dot;
assignSymbol(assign, true); // in an output section description
} else if (auto *data = dyn_cast<ByteCommand>(cmd)) {
// .text : { ... BYTE(0) }
data->offset = dot - sec->addr;
dot += data->size;
} else {
// .text : { ... *(.text) }
for (InputSection *isec : cast<InputSectionDescription>(cmd)->sections) {
dot = alignTo(dot, isec->alignment);
isec->outSecOff = dot - sec->addr; // offset within the output section
dot += isec->getSize();
}
}
}
veneers/stub groups/range extension thunks
Every new arch needs to think about this
- stability: GNU ld < gold < ld.lld
- ld64 branch islands
- ELF header
- program headers: linker scripts can make the placement complex
- section header table
- Copy input section contents to output sections
- The content of a synthetic section needs computing
- Resolve relocations
- Compress sections
.eh_frame
SHF_LINK_ORDER
- section groups
-Map
: link map--cref
: cross reference table--why-extract
: why is an archive member extracted?
- Abstraction to support both ELFCLASS32 and ELFCLASS64: Do symbol/section structs need templates? Do functions need templates?
- Abstraction to support both little-endian and big-endian (big-endian systems have declined 90% in the last ten years)
- Support all architectures in one build? gold, ld.lld, and mold do. GNU ld needs -m
- Support both REL/RELA as input/output for one arch? ld.lld does. GNU ld, gold, and mold don't
- Global states vs explicit context parameter
- Portability, platform differences
- Standalone executable or library
- Report symbol resolution result to the compiler
- Pass optimization/code generation options to the compiler
Videos
- New LLD linker for ELF Rui Ueyama, 2016 EuroLLVM
- What makes LLD so fast? by Peter Smith, FOSDEM 2019
Books, articles
- Linkers and Loaders John R. Levine
- Ian Lance Taylor's 20 part Linker posts
- https://wiki.freebsd.org/LLD
- https://maskray.me/blog/tags/linker/
With multi-threading, why is it faster than ld.lld?
Ordered by my rough estimate of importance:
SHF_MERGE
duplicate elimination- parallel symbol table initialization (major)
- eager loading of archive members and lazy object files
- parallel relocation scan
- faster synthetic sections
- parallel
--gc-sections
- parallel symbol resolution