A Deep Dive into `libgcc` and `libsupc++`

According to the specifications of the C and the C++ programming languages, implementations of C and C++ can be classified into hosted ones and freestanding ones, depending on whether the implementation has access to functionalities that require operating system (OS) support, such as memory allocation and multi-threading. A hosted implementation has full access to these functionalities, and can provide the full range of features required by the language. A freestanding implementation, on the other hand, does not have access to any functionality that requires support from the execution environment. Such implementations are only required to provide a subset of the language features. Freestanding implementations are important when developing operating systems, the standard C library, bare-metal embedded applications, or anything that cannot depend on a system-provided standard C library.

The GCC compiler has long provided a "freestanding" mode that does not assume the existence of a standard C library. This is the mode used to compile the Linux kernel and the GNU C library. However, a somewhat counterintuitive aspect of the freestanding mode is that GCC may still emit calls to two libraries that come as part of the GCC installation: libgcc and libsupc++. There is no way to prevent GCC from emitting calls to these libraries. Therefore, it is important to understand the functionality these libraries supply, as well as when and why GCC emits calls to them.

The String Library

Before discussing libgcc and libsupc++, it is worth pointing out that, besides the functions exported by these two libraries, GCC may also emit calls to four functions that are normally provided by the standard C library: memset, memcpy, memmove, and memcmp. This is also the case with Clang. Since neither libgcc nor libsupc++ exports these functions, any bare-metal software project should be prepared to provide its own implementations. Fortunately, it is not hard to find high-quality implementations. For example, the Linux kernel provides generic versions of these functions in lib/string.c. It also contains architecture-specific optimized implementations written in assembly.
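
As a rough illustration of what such implementations look like, here is a minimal byte-at-a-time sketch. The bm_ prefix is our own invention so that the example does not clash with a hosted C library. A real bare-metal project would export the functions under their standard names with C linkage, and would probably want something faster.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names: a real project would define memset, memcpy, memmove
// and memcmp (extern "C") so that compiler-emitted calls resolve to them.

void *bm_memset(void *dst, int value, std::size_t n) {
    unsigned char *d = static_cast<unsigned char *>(dst);
    for (std::size_t i = 0; i < n; ++i) d[i] = static_cast<unsigned char>(value);
    return dst;
}

// The source and destination must not overlap.
void *bm_memcpy(void *dst, const void *src, std::size_t n) {
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
    return dst;
}

// Handles overlap by choosing the copy direction from the addresses.
void *bm_memmove(void *dst, const void *src, std::size_t n) {
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);
    if (reinterpret_cast<std::uintptr_t>(d) < reinterpret_cast<std::uintptr_t>(s)) {
        for (std::size_t i = 0; i < n; ++i) d[i] = s[i];          // copy forward
    } else {
        for (std::size_t i = n; i > 0; --i) d[i - 1] = s[i - 1];  // copy backward
    }
    return dst;
}

// Returns a negative, zero, or positive value, comparing bytes as unsigned.
int bm_memcmp(const void *a, const void *b, std::size_t n) {
    const unsigned char *x = static_cast<const unsigned char *>(a);
    const unsigned char *y = static_cast<const unsigned char *>(b);
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] != y[i]) return x[i] < y[i] ? -1 : 1;
    }
    return 0;
}
```

One caveat when using the standard names: at higher optimization levels GCC may recognize a copy loop like the one above and replace it with a call to memcpy, which would make the function call itself. Building the file with -fno-tree-loop-distribute-patterns is the usual way to avoid this.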

The Experiment Setting

We build a minimal version of GCC 14.2.0 targeting aarch64-none-elf with the following configuration:

#!/bin/sh
export TARGET=aarch64-none-elf
export PREFIX=/opt/aarch64-none-elf
export PROG_PREFIX=aarch64-none-elf-
../gcc-14.2.0/configure \
--target=$TARGET \
--prefix=$PREFIX \
--program-prefix=$PROG_PREFIX \
--with-as=/opt/aarch64-none-elf/bin/aarch64-none-elf-as \
--with-ld=/opt/aarch64-none-elf/bin/aarch64-none-elf-ld \
--without-headers \
--with-newlib \
--disable-nls \
--disable-gcov \
--disable-tls \
--disable-tm-clone-registry \
--disable-bootstrap \
--disable-shared \
--disable-multilib \
--disable-threads \
--disable-lto \
--disable-libatomic \
--disable-libgomp \
--disable-libquadmath \
--disable-libsanitizer \
--disable-libssp \
--disable-libvtv \
--disable-hosted-libstdcxx \
--disable-checking \
--enable-languages=c,c++

After building, under the aarch64-none-elf directory one can find two subdirectories libgcc and libstdc++-v3. Since the standard C library is not available, the complete libstdc++ is not built. Instead, a minimal subset called libsupc++ is built. It is located under libstdc++-v3/libsupc++. Both libgcc and libsupc++ can be statically linked. The library archives are libgcc/libgcc.a and libstdc++-v3/libsupc++/.libs/libsupc++.a.

Most of the remaining content of this blog post is based on experimenting with this build of GCC. We will first analyze libgcc, which is required by both gcc and g++, and then move on to libsupc++, which is required only by g++.

`libgcc` Part I: Arithmetic Functions

A major part of libgcc consists of functions that emulate arithmetic operations that cannot be easily realized on a given platform. Most of these functions are documented in https://gcc.gnu.org/onlinedocs/gccint/Libgcc.html. If a function is not documented there, its functionality can be found by reading the source code. However, note that GCC uses machine modes (https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html) to type the arguments of these functions, rather than the usual types like int.
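
To make this concrete, here are a few operations for which we would expect GCC to call libgcc on AArch64. The function names div_u128, mod_u128, and add_ld are ours. The libgcc routine named in each comment is what GCC typically emits for this target, and it is best confirmed by inspecting the generated assembly.

```cpp
// unsigned __int128 is a GCC/Clang extension; its machine mode is TImode,
// which is where the "ti" in the libgcc routine names comes from.
typedef unsigned __int128 u128;

// AArch64 has no 128-bit divide instruction, so this division is
// typically lowered to a call to __udivti3.
u128 div_u128(u128 a, u128 b) { return a / b; }

// Likewise, the remainder typically becomes a call to __umodti3.
u128 mod_u128(u128 a, u128 b) { return a % b; }

// On AArch64, long double is a 128-bit IEEE quad (TFmode) with no hardware
// arithmetic, so this addition typically becomes a call to __addtf3.
long double add_ld(long double a, long double b) { return a + b; }
```

Compiling this file with -S and searching the output for bl instructions is a quick way to see which libgcc symbols a given piece of code depends on.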

`libgcc` Part II: Atomic Operations

It might seem a bit surprising, but with GCC's default settings on AArch64, each atomic operation other than an atomic read or write compiles to a function call into libgcc. These operations include the read-modify-write operations, the swap operation, and the compare-and-swap (CAS) operation.

The reason is that on AArch64 platforms there are two ways to perform complex atomic operations. On ARMv8.0-A platforms, the only possible way is to use the load-link and store-conditional (LL/SC) instructions. See my other blog post (https://github.com/CharlieQiu2017/anderson_moir) for an introduction to these instructions. On systems with many CPU cores, the LL/SC-based implementation has been found to be inefficient. For this reason, ARM introduced the Large System Extension (LSE) in ARMv8.1-A that contains single-instruction atomic operations.

Now when the compiler encounters a complex atomic operation, it must choose between these two ways of implementing the operation. Unless a specific CPU target is selected by -mcpu or -march, the compiler does not know whether the target supports LSE. The decision made by GCC developers is to introduce a function that detects at runtime whether the platform supports LSE. This function is contained in lse-init.o, but it is not compiled for bare-metal targets. If LSE support is detected, then each atomic operation uses the LSE implementation. Otherwise, it falls back to the LL/SC implementation.

Calls to the libgcc atomic functions can be disabled with -mno-outline-atomics. In this case all complex atomic operations use the LL/SC implementation. Alternatively, if -mcpu or -march specifies a target that supports LSE, then all complex atomic operations use the LSE implementation.
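
The following sketch shows the three kinds of operations side by side, using GCC's __atomic builtins so that no headers are needed. The names counter, bump, claim, and peek are ours, and the helper names in the comments are what we would expect from GCC's naming scheme rather than a guarantee.

```cpp
int counter = 0;

// Read-modify-write. With outline atomics enabled, this is expected to
// become a call to a libgcc helper such as __aarch64_ldadd4_acq_rel
// (4-byte operand; sequentially consistent ordering maps to acq_rel).
int bump() {
    return __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
}

// Compare-and-swap. Expected to call a helper such as __aarch64_cas4_acq_rel.
// Returns true if counter equalled `expected` and was replaced by `desired`.
bool claim(int expected, int desired) {
    return __atomic_compare_exchange_n(&counter, &expected, desired,
                                       /*weak=*/false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

// A plain atomic load is emitted inline and needs no libgcc call.
int peek() {
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}
```

Recompiling the same file with -mno-outline-atomics, or with an -march value that includes LSE, and comparing the assembly shows the helper calls being replaced by inline LL/SC loops or by single LSE instructions.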

`libgcc` Part III: C Runtime Initialization and Termination

The files to be considered here are crti.o, crtn.o, crtbegin.o, crtend.o, crtfastmath.o, and _ctors.o. However, the GNU C library also supplies the files crti.o and crtn.o, which leads to some confusion. Good introductions to this topic are https://gcc.gnu.org/onlinedocs/gccint/Initialization.html, https://maskray.me/blog/2021-11-07-init-ctors-init-array, and https://stackoverflow.com/questions/22160888/what-is-the-difference-between-crtbegin-o-crtbegint-o-and-crtbegins-o.

On targets where GCC expects a standard C library to be present, it expects the standard C library to supply the files crti.o and crtn.o. Consequently, the versions provided by libgcc are not installed. However, on bare-metal targets like aarch64-none-elf, the libgcc versions are used.

The purpose of all these files is to arrange for global constructors to run at the beginning of process execution, and global destructors to run at the end of process execution. This is especially important in C++ where global variables may have non-trivial constructors. To keep things simple, we consider only static executables here.

ELF process initialization roughly works as follows. The process entry point is a function called _start(). It is supplied by the standard C library and lies in a file called crt0.o (on some platforms, crt1.o and even crt2.o). There is nothing special about the _start() function. It lies in the .text section just like any other function. After the Linux kernel maps the executable into memory, it jumps to the _start function immediately.

In the very old days, the executable also contained two special sections named .init and .fini. Each section contains a single function, called _init() and _fini(), respectively. The _start function typically calls a function called __libc_start_main(). This function first calls the _init() function, then calls the main() function of the executable, and finally calls the _fini() function.

The actual situation is a bit more complex, because:

If the main() function does not return, but calls exit() instead, the global destructors must still be executed;
The standard C library allows programs to dynamically register global destructors via atexit().

In the GNU C library, __libc_start_main() is defined in csu/libc-start.c:

STATIC int LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
			    int argc,
			    char **argv,
#ifdef LIBC_START_MAIN_AUXVEC_ARG
			    ElfW(auxv_t) *auxvec,
#endif
			    __typeof (main) init, // Unused
			    void (*fini) (void), // Unused
			    void (*rtld_fini) (void),
			    void *stack_end)
     __attribute__ ((noreturn));

This function will execute __cxa_atexit (call_fini, NULL, NULL), where __cxa_atexit() is an advanced version of atexit() (see https://stackoverflow.com/questions/42912038/what-is-the-difference-between-cxa-atexit-and-atexit). It arranges for a function called call_fini() to run when exit() is called or main() returns. The call_fini() function indirectly executes the _fini() function.

After registering the global destructor, the __libc_start_main() function calls call_init() to run the _init() function. Finally, it calls the main() function.

There is an obvious drawback with the _init() and _fini() approach, which is that only a single constructor and a single destructor function can be defined. Nevertheless, people figured out an extremely clever way to multiplex the _init() and the _fini() functions. First, the compiler or the standard C library supplies two files called crti.o and crtn.o. When the compiler invokes the linker, it places crt0.o and crti.o at the very beginning of the list of object files. The object file crtn.o is placed at the very end.

The compiler constructs a .init section and a .fini section for each object file. Each constructor and destructor function is expected to have the signature void func (void). If a function f is marked as a constructor, then the compiler emits a single instruction that calls f in the .init section. If it is a destructor then the instruction is emitted in the .fini section.

The linker concatenates the .init sections from each object file together into a single .init section. It also concatenates the .fini sections into a single .fini section. The .init section thus looks like:

# .init of crti.o
# Something magic goes here

# .init of your program
call constructor1
call constructor2
...

# .init of crtn.o
# Something magic goes here

The trick is to provide a function prologue in crti.o, and a function epilogue in crtn.o, so that the concatenated .init and .fini sections become complete functions that can be executed. See https://ahl.dtrace.org/2005/09/15/the-mysteries-of-_init/ for how this used to work.

As you can see, this scheme relies on a great deal of hackery and is extremely fragile. A slightly improved solution is to keep arrays of pointers to constructor and destructor functions in two additional sections called .ctors and .dtors. Concatenating arrays is much safer and much less bug-prone than concatenating function fragments. Then the .init and .fini sections only need to contain a single call to a function that walks through these lists. The functions that do this work are __do_global_ctors_aux() and __do_global_dtors_aux(). These functions are defined in crtstuff.c and get compiled into crtbegin.o. For targets that still use the .init and .fini mechanism, crtstuff.c contains statements that cause GCC to emit calls to these functions in .init and .fini.

Finally, in 1999 a new initialization and termination scheme was introduced. Each object file now contains two sections called .init_array and .fini_array. Like .ctors and .dtors described above, they are arrays of pointers to the constructor and destructor functions. The linker concatenates the .init_array sections from all object files into a single section. Then the __libc_start_main() function (instead of the compiler-supplied __do_global_ctors_aux()) walks through the array, calling each function in sequence. The same thing happens to the .fini_array sections. If GCC believes the target supports initialization through this mechanism then it will not emit calls into the .init and the .fini sections. Thus on modern systems the crti.o and crtn.o files can be mostly ignored. The .ctors and .dtors sections are also unused.
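
On a bare-metal target there is no __libc_start_main(), so the startup code has to walk these arrays itself. The sketch below shows the idea under a few assumptions: the helper names are ours, the entries are treated as functions taking no arguments, and the array bounds are passed in explicitly. In a real project the bounds would typically be symbols defined by the linker script (GNU ld's default scripts conventionally call them __init_array_start and __init_array_end, and similarly for .fini_array).

```cpp
typedef void (*init_fn)(void);

// Calls every entry of [begin, end) in order, as startup code would do
// for the concatenated .init_array section.
void run_init_array(init_fn *begin, init_fn *end) {
    for (init_fn *p = begin; p != end; ++p) (*p)();
}

// Calls every entry of [begin, end) in reverse order, which is the order
// used for .fini_array.
void run_fini_array(init_fn *begin, init_fn *end) {
    for (init_fn *p = end; p != begin;) (*--p)();
}

// Demonstration only: two stand-in constructors that record call order.
int order_log = 0;
void ctor_a() { order_log = order_log * 10 + 1; }
void ctor_b() { order_log = order_log * 10 + 2; }
```

For example, running run_init_array over the table { ctor_a, ctor_b } calls ctor_a and then ctor_b, while run_fini_array over the same table calls them in the opposite order.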

Today, if you compile a C++ program under AArch64, then the global constructors and destructors are probably handled as follows. The compiler emits a function called _GLOBAL__sub_I_something() where something is a unique identifier chosen by the compiler. The address of this function gets placed into the .init_array section. This function calls a function called __static_initialization_and_destruction_0(), which calls the constructors of each global variable, and also registers their destructors via __cxa_atexit(). The function __do_global_ctors_aux() is unused (in fact, not compiled into libgcc). However, crtbegin.o contains another constructor function called frame_dummy() that will be explained later. The function __do_global_dtors_aux() still exists and is placed into .fini_array, but it does not call the functions in .fini_array (those are handled by the standard C library). Instead, it is used to destruct some data structures constructed by frame_dummy(). This will also be explained later.
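
A small example of code that goes through this path is shown below. The names Resource, global_resource, and early_hook are ours. The comments describe what we expect GCC to generate, which can be checked by looking for _GLOBAL__sub_I in the assembly output.

```cpp
// Plain counters are zero-initialized before any constructor runs.
int ctor_runs = 0;
int dtor_runs = 0;

struct Resource {
    int id;
    explicit Resource(int i) : id(i) { ++ctor_runs; }  // non-trivial constructor
    ~Resource() { ++dtor_runs; }                       // non-trivial destructor
};

// Needs dynamic initialization. GCC is expected to emit a
// _GLOBAL__sub_I_... function for this translation unit, put its address
// into .init_array, and register ~Resource through __cxa_atexit().
Resource global_resource(7);

// The constructor attribute (a GCC extension) also places a plain function
// into .init_array, without involving any C++ object.
__attribute__((constructor)) void early_hook() { ++ctor_runs; }
```

By the time main() starts, both hooks have run, so ctor_runs is 2 and dtor_runs is still 0. On a bare-metal target this only holds if the startup code actually walks .init_array.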

The file crtend.o is mostly empty. It defines a single symbol called __FRAME_END__ that gets placed at the very end of the .eh_frame section. We will explain this later.

Finally, crtfastmath.o contains a single constructor function called set_fast_math(). This file gets linked into the executable if GCC is called with -ffast-math. On each specific platform, it executes instructions that place the floating-point processing unit (FPU) into a mode that performs floating-point computations faster, possibly at the cost of some precision.

The file _ctors.o is compiled from libgcc2.c and just exports two symbols called __CTOR_LIST__ and __DTOR_LIST__. They exist to avoid certain linker errors:

/* Provide default definitions for the lists of constructors and
   destructors, so that we don't get linker errors.  These symbols are
   intentionally bss symbols, so that gld and/or collect will provide
   the right values.  */

/* We declare the lists here with two elements each,
   so that they are valid empty lists if no other definition is loaded.

   If we are using the old "set" extensions to have the gnu linker
   collect ctors and dtors, then we __CTOR_LIST__ and __DTOR_LIST__
   must be in the bss/common section.

   Long term no port should use those extensions.  But many still do.  */

This file may be ignored.

`libgcc` Part IV: Stack Unwinding and Exception Handling

This part concerns the files unwind-c.o, unwind-dw2.o, and unwind-dw2-fde.o. They provide building blocks for implementing C++ exception handling. There is also unwind-sjlj.o but that file is empty. Here sjlj stands for setjmp/longjmp, but current versions of GCC do not use sjlj exception handling by default. It can still be enabled through a special build flag (see https://gcc.gnu.org/install/configure.html).

The best sources of information for this part of GCC are https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html, https://gcc.gnu.org/wiki/Dwarf2EHNewbiesHowto, and https://maskray.me/blog/2020-12-12-c++-exception-handling-abi. There are lots of other posts on the Internet describing the details of this process. However, when developing bare-metal applications the usual recommendation is to disable exception handling (see https://alex-robenko.gitbook.io/bare_metal_cpp/compiler_output/exceptions). Therefore, it is totally fine to use -fno-exceptions everywhere and ignore this aspect of libgcc.

The C++ exception handling framework used by GCC is called the "Itanium ABI." It is defined in two layers. The first layer ("The Base ABI") is implemented in libgcc. This part is what we will cover here. It is intended to be generic enough to support exception handling in many different programming languages, although "some parts are clearly intended to support C++ features" (see https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html). The second layer ("The C++ ABI") defines how C++ exception handling should be implemented upon the Base ABI. It is implemented in libsupc++ and will be covered later.

An Abstract Model of Exception Handling

Before we dive into the technical details, it is useful to first understand how exceptions work in an abstract model of computation. We first consider an abstract model of computation without exceptions. Let us call it the K model. The state of the K model consists of two parts:

A global state s, which is intended to capture all non-local state variables that the program may access;
A frame stack stk, which is a finite list of frames. The frame stack captures the local state variables that the program has access to. The program is only allowed to access the topmost frame on the frame stack. In other words, each time a new frame is pushed onto the frame stack, the program loses access to all previous frames. When the topmost frame is popped out of the stack, the program regains access to the second topmost frame.

We intentionally do not define the content of the "global state" and each "frame." This makes the model abstract rather than tied to any specific implementation of C or C++.

The reader might question our characterization of the "frame stack" above. For example, in C programs one may take a pointer to a local variable and pass it to some function call, which gives the callee access to variables not in the topmost frame of the stack. This can be emulated in our abstract model by storing the actual local variables in the global state s, with each frame recording only pointers to its local variables.

The initial state of the K model contains only a single frame in the frame stack. This corresponds to the main() function of your program. We don't otherwise specify the content of the initial global state and frame stack.

We assume there is a function eval which takes a global state s and a frame f as input. This is a mathematical function, not a function in the C/C++ sense. It always returns some value, and has no side-effects. We assume eval may return one of three kinds of values:

Push(f'), which represents a request to push a new frame f' onto the frame stack;
Pop, which represents a request to pop the topmost frame from the frame stack;
Step(s', f'), which represents a request to replace the current global state with s', and replace the current topmost frame with f'.

The operational semantics of the K model can thus be represented as follows:

s <- initial global state
stk <- [f] where f is the initial frame

Loop until stk == []:
    f <- first element of stk
    if eval(s, f) == Push(f'):
        stk <- f' :: stk // x :: y means adding x to the beginning of the list y
    else if eval(s, f) == Pop:
        stk <- stk with its first element removed
    else if eval(s, f) == Step(s', f'):
        s <- s'
        stk <- stk with its first element replaced with f'

Thus the execution of each statement that is not a function call or a return corresponds to eval(s, f) returning the updated global and local states. A function call corresponds to eval(s, f) returning a new frame f' to be pushed onto the stack. A return statement corresponds to popping one frame off the stack. We did not explicitly say how arguments are passed to the callee and how the return value is passed to the caller. We shall assume there is some state variable in s that stores these values.
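
The loop above can be transcribed almost literally into C++. The sketch below is only a toy: the model leaves the global state and the frames abstract, so the concrete Global and Frame types, and the sum_eval function that recursively computes 1 + 2 + ... + n, are our own choices for illustration. Note how the return value travels through the global state, as assumed above.

```cpp
#include <functional>
#include <variant>
#include <vector>

// Toy stand-ins for the abstract global state s and frame f.
struct Global { int sum; bool returned; };   // `returned` signals a finished callee
struct Frame  { int n; bool done; };

struct Push { Frame f; };
struct Pop  {};
struct Step { Global s; Frame f; };
using Action = std::variant<Push, Pop, Step>;
using Eval   = std::function<Action(const Global &, const Frame &)>;

// The operational semantics of the K model; returns the final global state.
Global run_k(Global s, Frame initial, const Eval &eval) {
    std::vector<Frame> stk{initial};          // stk.back() is the topmost frame
    while (!stk.empty()) {
        Action a = eval(s, stk.back());
        if (const Push *p = std::get_if<Push>(&a)) {
            stk.push_back(p->f);
        } else if (std::holds_alternative<Pop>(a)) {
            stk.pop_back();
        } else {
            const Step &st = std::get<Step>(a);
            s = st.s;
            stk.back() = st.f;
        }
    }
    return s;
}

// Toy program: a frame with n > 0 "calls" a frame with n - 1, waits for it
// to return, adds n to the running sum, and then returns itself.
Action sum_eval(const Global &s, const Frame &f) {
    if (f.done) return Pop{};                                        // return
    if (f.n == 0) return Step{Global{s.sum, true}, Frame{f.n, true}};
    if (!s.returned) return Push{Frame{f.n - 1, false}};             // call
    return Step{Global{s.sum + f.n, true}, Frame{f.n, true}};
}
```

Running run_k(Global{0, false}, Frame{3, false}, sum_eval) pushes four frames, pops them one by one, and ends with sum equal to 6.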

The execution terminates when the frame stack becomes empty, which corresponds to the main() function returning. The exit() syscall can be emulated as a special kind of exception after we introduce a model of exceptions.

We now describe the C++ model of exception handling as a modified version of the K model described above. The modified model can be called the XK model (K with eXceptions).

The basic idea of exception handling is to look for a frame in the frame stack that is capable of handling the exception, and transfer the control flow to that frame. However, since C++ objects have destructors, even if a frame is not capable of handling the exception, we still need to give it a chance to clean up its resources. This process is known as stack unwinding. Moreover, since destructors are also C++ programs, they might also throw exceptions. A formal model of exception handling needs to accommodate all corner cases that may arise during this process.

To make the basic idea precise we need to modify the notion of a "frame" a bit. In the context of compilers and debuggers each "frame" usually corresponds to a function invocation. However, for the purpose of C++ exception handling, we shall define a "frame" as a layer of execution that either (1) may handle an exception; or (2) needs to clean up some resource during stack unwinding. In particular, since every variable with automatic storage needs to have its destructor executed when we leave its scope, the creation of every variable with automatic storage corresponds to pushing a new frame onto the frame stack.

Suppose that T is a type with non-trivial constructor and destructor. Consider the following code snippet:

T t1, t2, t3;
/* At this point, at least three "frames" exist. */

If the constructor of t1 throws, then we do not need to execute the destructors of t2 and t3 since they do not exist yet. However, if the constructor of t3 throws, then we need to execute the destructors of t1 and t2. As such, we consider the above code snippet to introduce at least three frames, one after each instance of T is constructed. We say "at least three," because the members of T might have their own constructors and destructors which might also throw. The general rule of C++ is that destructors are always called in the reverse order of constructors. Therefore, in general the list of variables with automatic storage can be considered a stack, and resource cleanup proceeds by popping objects off this stack.
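
This behavior is easy to observe on a hosted system. In the sketch below, the type T, the event_log string, and the build() function are ours, and the third constructor can be made to throw on demand.

```cpp
#include <stdexcept>
#include <string>

std::string event_log;   // records construction and destruction order

struct T {
    int id;
    explicit T(int i, bool fail = false) : id(i) {
        if (fail) throw std::runtime_error("constructor failed");
        event_log += "C" + std::to_string(id) + " ";
    }
    ~T() { event_log += "D" + std::to_string(id) + " "; }
};

// Constructs three objects; the third constructor throws if fail_third is set.
// Returns true if all three objects were constructed.
bool build(bool fail_third) {
    try {
        T t1(1), t2(2), t3(3, fail_third);
        return true;
    } catch (const std::runtime_error &) {
        return false;
    }
}
```

After build(true), event_log reads "C1 C2 D2 D1 ": t3 never came into existence, so only t2 and t1 are destroyed, in reverse order of construction. After build(false) on an empty log it reads "C1 C2 C3 D3 D2 D1 ".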

To summarize, for C++ we consider a "frame" to be pushed onto the frame stack whenever one of the following two events occurs:

A variable with automatic storage is constructed;
Execution enters a try/catch construct.

The state of the XK model is modified from the K model as follows:

We modify the frame stack stk so that each element is now a pair of values (f, status). The status field may be either Normal or ExceptionCleanup.
We allow the eval function to return RaiseException(e), in addition to the three values introduced above, where e represents an exception object. Like the "global state" and the "frame", we do not define the content of e.
We assume a new (mathematical) function handle_exception(s, f, e) which returns a tuple (b, f') where b is a boolean value and f' is a frame. If b == true this means the frame f is capable of handling the exception e, and the local state of frame f should transition to f'. If b == false this means the frame f is not capable of handling the exception, and it should transition to f' to perform resource cleanup.
We introduce a new component to the state of the K model, called the "exception stack" and denoted by ex_stk. Its purpose will be made clear as we introduce the operational semantics of the XK model. In the initial state, ex_stk is the empty list.

The operational semantics of the XK model can be represented as follows:

s <- initial global state
stk <- [(f, Normal)] where f is the initial frame
ex_stk <- []

Loop until stk == []:
    (f, status) <- first element of stk

    if eval(s, f) == Push(f'):
        stk <- (f', Normal) :: stk

    else if eval(s, f) == Pop:
        stk <- stk with its first element removed
        if status == ExceptionCleanup:
            // Resume stack unwinding
            if stk == []:
                terminate execution
            (f', status') <- first element of stk
            if status' == ExceptionCleanup:
                terminate execution
            e <- first element of ex_stk // There must be at least one element in ex_stk
            (b, f'') <- handle_exception (s, f', e)
            if b == true:
                stk <- stk with its first element replaced with (f'', Normal)
                ex_stk <- ex_stk with its first element removed
            else:
                stk <- stk with its first element replaced with (f'', ExceptionCleanup)

    else if eval(s, f) == Step(s', f'):
        s <- s'
        stk <- stk with its first element replaced with (f', status)

    else if eval(s, f) == RaiseException(e):
        if status == ExceptionCleanup:
            terminate execution
        // Initiate stack unwinding
        (b, f') <- handle_exception (s, f, e)
        if b == true:
            stk <- stk with its first element replaced with (f', Normal)
        else:
            ex_stk <- e :: ex_stk
            stk <- stk with its first element replaced with (f', ExceptionCleanup)

// Note: It can be seen that the above loop preserves the invariant that the number of frames with status == ExceptionCleanup is equal to the number of elements in ex_stk.

When an exception is raised during execution, we first check whether we are in normal execution or performing resource cleanup due to an exception. In the latter case, the execution terminates because the exception handling mechanism cannot handle two simultaneous exceptions. Since the only functions called during resource cleanup are destructors, the general wisdom is that destructors in C++ should not throw. See https://blog.knatten.org/2010/06/18/why-you-shouldnt-throw-in-destructors/.

We then check whether the current topmost frame is capable of handling the exception. If so, the control flow is immediately transferred to the exception handler code. Otherwise, the topmost frame is put into resource cleanup mode. When the topmost frame finishes calling destructors, we check whether the second topmost frame is capable of handling the exception. The loop continues until we encounter a frame capable of handling the exception, in which case we say the exception is caught, or the entire frame stack is emptied, in which case we say the exception is uncaught, and the execution terminates.
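
As with the K model, the XK loop can be transcribed into C++. The sketch below is a toy under our own assumptions: exceptions are plain integers, a frame is either a try block or an automatic object, and the toy_eval and toy_handle functions describe a tiny program in which an object inside a try block throws 42. Termination is reported by returning an empty optional.

```cpp
#include <functional>
#include <optional>
#include <utility>
#include <variant>
#include <vector>

enum class Kind { Try, Object };
enum class Status { Normal, ExceptionCleanup };
using Exception = int;
struct Global { int destroyed; int caught; };
struct Frame  { Kind kind; int pc; Exception pending; };

struct Push  { Frame f; };
struct Pop   {};
struct Step  { Global s; Frame f; };
struct Raise { Exception e; };
using Action = std::variant<Push, Pop, Step, Raise>;
using Eval   = std::function<Action(const Global &, const Frame &)>;
using Handle = std::function<std::pair<bool, Frame>(const Global &, const Frame &, Exception)>;

// Returns the final global state, or std::nullopt if execution terminated.
std::optional<Global> run_xk(Global s, Frame initial, const Eval &eval, const Handle &handle) {
    std::vector<std::pair<Frame, Status>> stk;      // stk.back() is the topmost frame
    stk.emplace_back(initial, Status::Normal);
    std::vector<Exception> ex_stk;                  // ex_stk.back() is the newest exception
    while (!stk.empty()) {
        const Frame f = stk.back().first;
        const Status status = stk.back().second;
        Action a = eval(s, f);
        if (const Push *p = std::get_if<Push>(&a)) {
            stk.emplace_back(p->f, Status::Normal);
        } else if (std::holds_alternative<Pop>(a)) {
            stk.pop_back();
            if (status == Status::ExceptionCleanup) {   // resume stack unwinding
                if (stk.empty() || stk.back().second == Status::ExceptionCleanup)
                    return std::nullopt;
                std::pair<bool, Frame> r = handle(s, stk.back().first, ex_stk.back());
                if (r.first) ex_stk.pop_back();         // the exception is caught
                stk.back() = {r.second, r.first ? Status::Normal : Status::ExceptionCleanup};
            }
        } else if (const Step *st = std::get_if<Step>(&a)) {
            s = st->s;
            stk.back() = {st->f, status};
        } else {                                        // RaiseException(e)
            if (status == Status::ExceptionCleanup) return std::nullopt;
            const Exception e = std::get<Raise>(a).e;
            std::pair<bool, Frame> r = handle(s, f, e); // initiate stack unwinding
            if (!r.first) ex_stk.push_back(e);
            stk.back() = {r.second, r.first ? Status::Normal : Status::ExceptionCleanup};
        }
    }
    return s;
}

// Toy program. pc 0: initial state; pc 1: handler or cleanup code; pc 2: leave.
Action toy_eval(const Global &s, const Frame &f) {
    if (f.pc == 2) return Pop{};
    if (f.kind == Kind::Try) {
        if (f.pc == 0) return Push{Frame{Kind::Object, 0, 0}};            // enter the try body
        return Step{Global{s.destroyed, f.pending}, Frame{Kind::Try, 2, 0}};   // catch clause
    }
    if (f.pc == 0) return Raise{42};                                       // the object's scope throws
    return Step{Global{s.destroyed + 1, s.caught}, Frame{Kind::Object, 2, 0}}; // destructor
}

// Try frames catch everything; object frames only clean up.
std::pair<bool, Frame> toy_handle(const Global &, const Frame &f, Exception e) {
    if (f.kind == Kind::Try) return {true, Frame{Kind::Try, 1, e}};
    return {false, Frame{Kind::Object, 1, 0}};
}
```

Starting from a Try frame, the run ends normally with destroyed equal to 1 and caught equal to 42. Starting directly from an Object frame, the exception is uncaught, the frame stack empties during cleanup, and run_xk returns an empty optional.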

Consider the following code snippet, which comes from https://stackoverflow.com/questions/75876597/can-be-the-number-of-uncaught-exceptions-be-more-than-one:

#include <iostream>
#include <exception>

struct tracker {
    ~tracker() {
        std::cout << std::uncaught_exceptions() << "\n";
    }
};

struct foo {
    ~foo() {
        try {
            tracker t;
            throw 123;
        } catch(...) {
            std::cout << std::uncaught_exceptions() << "\n";
        }
    }
};


int main() {
    try {
        foo f;
        throw 42;
    } catch(...) {}
    return 0;
}

When we execute throw 42, the object f will be destructed, which causes throw 123 to be executed. In general, if a destructor throws during stack unwinding, the execution will terminate. In this case, however, it is perfectly fine, because throw 123 will be caught by the catch(...) clause in ~foo(). What happens here can be explained in our formal model as follows. Recall that every try/catch construct and every automatic variable introduces a frame in our abstract model. Thus when throw 42 executes there are two frames. The topmost frame corresponds to the object f, while the second frame corresponds to the try construct in main():

-----------------------
| foo f               |
-----------------------
| try/catch in main() |
-----------------------

Since only try constructs may possibly handle an exception, the topmost frame is put into resource cleanup mode, which causes ~foo() to be called. This introduces two new frames. The topmost frame is now the object t, and the second topmost frame is the try construct in ~foo():

-----------------------
| tracker t           |
-----------------------
| try/catch in ~foo() |
-----------------------
| foo f (cleaning up) |
-----------------------
| try/catch in main() |
-----------------------

When throw 123 is executed, the topmost frame is put into cleanup mode. At this point ex_stk contains two exceptions (42 and 123). Hence ~tracker() prints 2. After ~tracker() returns, the second topmost frame handles the exception 123. As soon as we enter the catch clause in ~foo(), the exception 123 is considered caught and removed from the ex_stk. Hence ~foo() prints 1. At this point, the frame corresponding to the try construct in ~foo() is still in normal mode, not cleanup mode. Therefore, when ~foo() returns the program continues execution normally. The catch clause in main() handles the exception 42.

Overview of the Itanium Base ABI

The goal of the "Base ABI" is to provide a generic infrastructure for implementing the exception handling model described above.

In C++, whether an exception can be handled by a catch clause depends strictly on the type of the exception and the type specified by the catch handler. There is also a catch-all clause catch(...) which can catch all exceptions. As such, it is possible to first search through the stack to find the frame capable of handling the exception, and unwind the stack right to that point. For this reason, the Itanium ABI specifies a two-phase process for stack unwinding:

In the search phase, the framework repeatedly calls the
personality routine, with the _UA_SEARCH_PHASE flag as
described below, first for the current PC and register state,
and then unwinding a frame to a new PC at each step, until the
personality routine reports either success (a handler found in
the queried frame) or failure (no handler) in all frames. It
does not actually restore the unwound state, and the
personality routine must access the state through the API.

If the search phase reports failure, e.g. because no handler
was found, it will call terminate() rather than commence phase 2.

If the search phase reports success, the framework restarts in
the cleanup phase. Again, it repeatedly calls the personality
routine, with the _UA_CLEANUP_PHASE flag as described below,
first for the current PC and register state, and then unwinding
a frame to a new PC at each step, until it gets to the frame
with an identified handler. At that point, it restores the
register state, and control is transferred to the user landing
pad code.

The specification goes on to explain that

A two-phase exception-handling model is not strictly necessary
to implement C++ language semantics, but it does provide some
benefits. For example, the first phase allows an
exception-handling mechanism to dismiss an exception before
stack unwinding begins, which allows resumptive exception
handling (correcting the exceptional condition and resuming
execution at the point where it was raised). While C++ does not
support resumptive exception handling, other languages do, and
the two-phase model allows C++ to coexist with those languages
on the stack.

However, this possibility seems to be little explored in today's programming languages. Today, features like resumptive exception handling would probably be implemented via coroutines or algebraic effect handlers.

The entry point for raising an exception is _Unwind_RaiseException(), which takes an argument struct _Unwind_Exception *exception_object corresponding to the exception object e in our abstract model. The _Unwind_RaiseException() function will execute the two phases explained above to handle the exception. However, in some cases one already knows the exact point the stack needs to be unwound to. For example, if one is implementing destructor-aware setjmp()/longjmp() for C++, then the exception handler frame is simply the frame that called setjmp(). For such cases the Itanium ABI provides an alternative path that skips the search phase. The entry point is _Unwind_ForcedUnwind(), which unwinds the frames one by one and allows the caller to determine exactly when stack unwinding should stop.

The exception object needs to be allocated on the heap, since the stack is being unwound and may get overwritten by the destructor functions. However, the memory pool used to allocate these objects should be relatively isolated from the usual memory pool used by malloc/free and new/delete. This is because new/delete may themselves throw out-of-memory exceptions, and in such scenarios we still want to propagate the error information to the user. The C++ specification says (https://eel.is/c++draft/basic.stc.dynamic.allocation):

In particular, a global allocation function is not called to allocate storage for ... an exception object.

The ABI provides a helper function called _Unwind_DeleteException() to delete an exception object. The procedure that initially allocated the exception object should specify how to deallocate it.

How the stack unwinding library finds the frames in the stack is implementation-defined. Information about a frame is contained in an opaque structure struct _Unwind_Context that is variously called the "unwind info block", "unwind descriptor block", etc. The structure contains in particular a pointer to a function called the "personality routine". The personality routine inspects the frame and the exception object, and determines whether the frame can handle the exception or not. If the frame cannot handle the exception, it is given a chance to clean up its resources. After cleaning up, the frame should call _Unwind_Resume() to continue unwinding.

The intention of this design is that different programming language runtimes can install different personality routines for each frame, so that exception handling for different languages can interoperate. However, exception handling interoperation has rarely worked well in practice, even between different compilers for the same target.

There are a few more functions intended to support the personality routine. These are _Unwind_GetGR, _Unwind_SetGR, _Unwind_GetIP, _Unwind_SetIP, _Unwind_GetLanguageSpecificData, and _Unwind_GetRegionStart.
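To get a feel for how these accessors are used, here is a minimal sketch that walks the current call stack and records the program counter of every frame with _Unwind_GetIP. It relies on _Unwind_Backtrace(), a libgcc extension declared in <unwind.h> that is not discussed above; the names record_frame and capture_return_addresses are made up for this illustration, and the sketch assumes a hosted GCC target where unwind tables are available.

```cpp
#include <unwind.h>   // _Unwind_Backtrace, _Unwind_GetIP (shipped with GCC)
#include <cassert>
#include <cstdint>
#include <vector>

// Callback invoked by the unwinder once per frame (hypothetical name).
// It reads the frame's program counter through the opaque context.
static _Unwind_Reason_Code record_frame(struct _Unwind_Context* context, void* arg) {
    auto* pcs = static_cast<std::vector<std::uintptr_t>*>(arg);
    pcs->push_back(static_cast<std::uintptr_t>(_Unwind_GetIP(context)));
    return _URC_NO_REASON;  // tell the unwinder to continue with the next frame
}

// Hypothetical helper: returns one program counter per frame on the stack,
// starting with the frame of this function itself.
__attribute__((noinline)) std::vector<std::uintptr_t> capture_return_addresses() {
    std::vector<std::uintptr_t> pcs;
    _Unwind_Backtrace(record_frame, &pcs);
    return pcs;
}
```

A personality routine uses the same accessors, but it receives the context from the unwinder instead of requesting a backtrace itself.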

The `libgcc` Implementation

The libgcc implementation of the base ABI (specifically, the dw2 implementation) is based on data structures that are similar to, but not exactly the same as, DWARF debug information. See https://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/ehframechpt.html. The AArch64-specific portions of DWARF (in particular, the mapping from register numbers to actual registers) are specified in https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst. In the remaining part of this section, a "frame" corresponds to an invocation of a C/C++ function, not the abstract notion of "frame" we introduced previously.

When GCC builds an object file with exceptions enabled, it emits a special section called .eh_frame. For each function in the object file, there is a corresponding entry in .eh_frame called a "Frame Description Entry" (FDE). FDE records the following information:

The memory range of instructions constituting this function (functions are assumed to be non-overlapping);
How to retrieve the return address of this function (for AArch64, usually by reading the x30 register, if it is unclobbered);
How to restore the value of registers that are clobbered by this function.

The third point is the most important part. If a function clobbers a callee-saved register, then the unwinding library must restore its value before returning the control flow to its caller. It must interpret the FDE to learn how to restore these values.

For each callee-saved register, there are three possible locations for its value:

The register is not clobbered: the value is still in the same register;
The value is stored in another register;
The value is stored on the stack. In this case the location is expressed as the Canonical Frame Address (CFA) plus an offset. The CFA is typically defined as the value of the stack pointer at the call site in the caller, so it plays a role similar to what we usually call the "frame pointer". Even if the program is compiled with -fomit-frame-pointer, the CFA can still be recovered via the FDE.

Moreover, the location of the value may change during the execution of the function. Therefore, these locations must be defined at each instruction of the function. This is achieved by interspersing the assembly code of the function with Call Frame Information (CFI) directives. See https://sourceware.org/binutils/docs/as/CFI-directives.html and https://refspecs.linuxbase.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/dwarfext.html. Will Cohen has written an excellent blog post on how CFI works: https://opensource.com/article/23/3/gdb-debugger-call-frame-active-function-calls. You can compile a C++ program with g++ -S to see how CFI directives are inserted. The resulting call frame information of an object file can be inspected with readelf --debug-dump=frames.

Each FDE additionally contains a pointer to a structure called the Language Specific Data Area (LSDA). The LSDA structure is stored in a section called .gcc_except_table. This is where GCC stores type information about the exception handlers. The personality routine (__gxx_personality_v0 defined in libsupc++) reads this information to determine whether a catch clause is capable of handling an exception object.

The .eh_frame section of each object file begins with a Common Information Entry (CIE) record. The CIE record contains information common to all subsequent FDE records. When the object files are linked together, the linker automatically merges .eh_frame sections with identical CIE. Normally, all object files should have the same CIE. The linker also sorts the FDE entries by the memory location of each function. See the bfd/elf-eh-frame.c file of binutils. The file crtbegin.o supplies a symbol called __EH_FRAME_BEGIN__ that is placed at the beginning of the concatenated .eh_frame. The file crtend.o supplies a symbol __FRAME_END__ that marks the end of .eh_frame.

The file crtbegin.o supplies a global constructor function called frame_dummy(). This function calls __register_frame_info() with &__EH_FRAME_BEGIN__ as a parameter. If dynamic libraries are loaded, each library should also call __register_frame_info() to register its own .eh_frame section. When a dynamic library is unloaded, it should call __deregister_frame_info() to deregister that section. The __do_global_dtor_aux() global destructor function (supplied by crtbegin.o) calls __deregister_frame_info() for the executable.

The function _Unwind_Find_FDE() defined in unwind-dw2-fde.c looks up the list of registered .eh_frame sections to find the FDE entry for a given function. It also attempts to build a binary search tree for the FDE entries. This requires dynamic memory allocation, which is why libgcc depends on an implementation of malloc and free.

The above description applies mostly to bare-metal targets without a standard C library. If a POSIX-compliant standard C library is available, there is an additional way for libgcc to locate the .eh_frame sections. With a command-line flag called --eh-frame-hdr, the linker builds an additional section called .eh_frame_hdr and describes it with a program header of type PT_GNU_EH_FRAME. Then libgcc calls dl_iterate_phdr to find this segment in every loaded object. The section contains a sorted table of the FDE records that can be binary-searched, so there is no need for libgcc to build a lookup structure on its own.
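The lookup through dl_iterate_phdr can be sketched as follows. This is a simplified illustration rather than the code in libgcc: it only counts the loaded objects that carry a PT_GNU_EH_FRAME program header (the segment type defined in <elf.h> for .eh_frame_hdr), and it assumes a Linux system with glibc. The function names are made up for the example.

```cpp
#include <link.h>    // dl_iterate_phdr, struct dl_phdr_info (glibc; pulls in <elf.h>)
#include <cassert>
#include <cstddef>

// Callback invoked once per loaded object (executable, shared libraries, vDSO).
static int count_eh_frame_hdr(struct dl_phdr_info* info, std::size_t, void* data) {
    int* count = static_cast<int*>(data);
    for (int i = 0; i < info->dlpi_phnum; ++i) {
        if (info->dlpi_phdr[i].p_type == PT_GNU_EH_FRAME) {
            ++*count;   // this object was linked with --eh-frame-hdr
            break;
        }
    }
    return 0;           // returning 0 continues the iteration
}

// Hypothetical helper: number of loaded objects that expose .eh_frame_hdr.
int loaded_objects_with_eh_frame_hdr() {
    int count = 0;
    dl_iterate_phdr(count_eh_frame_hdr, &count);
    return count;
}
```

On a typical Linux toolchain the GCC driver passes --eh-frame-hdr to the linker by default, so the executable itself is normally among the objects counted.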

The GCC compiler provides several built-in functions to support stack unwinding.

__builtin_unwind_init(): If this function appears in the source code of a function, then GCC will store the value of all callee-saved registers on the stack in the function prologue. Thus in the FDE record, the location of every callee-saved register will be the CFA plus some offset. See the expand_builtin_unwind_init() function defined in except.cc in the GCC source code.
__builtin_return_address(): This function returns the return address of the current frame. See https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html.
__builtin_dwarf_cfa(): This function returns the CFA of the current frame. See the builtins.cc file in the GCC source code (grep for BUILT_IN_DWARF_CFA).

The stack unwinding entry points _Unwind_RaiseException() and _Unwind_ForcedUnwind() call the built-in function __builtin_unwind_init(). Then they call a function uw_init_context_1() which is marked as noinline. Inside the uw_init_context_1() function, we call __builtin_return_address() to get the address of the instruction that called uw_init_context_1(). We then decode the FDE record of _Unwind_RaiseException() or _Unwind_ForcedUnwind() up to that instruction. The uw_init_context_1() function also relies on the caller to provide its CFA. We thus have everything ready to crawl through the stack and restore the registers of each frame. The rest is mostly straightforward, modulo technical details.
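The two query builtins can be tried out directly. The following sketch (with made-up names FrameInfo, describe_current_frame, and callee_cfa_is_below_caller_cfa) reads the return address and the CFA of a frame, and checks that a callee's CFA lies below its caller's CFA. The comparison assumes a target whose stack grows downward, such as AArch64 or x86-64, and both functions are marked noinline so that each really has its own frame.

```cpp
#include <cassert>
#include <cstdint>

struct FrameInfo {
    std::uintptr_t return_address;  // where this frame will return to
    std::uintptr_t cfa;             // canonical frame address of this frame
};

// Must not be inlined, otherwise there is no separate frame to describe.
__attribute__((noinline)) FrameInfo describe_current_frame() {
    FrameInfo info;
    info.return_address =
        reinterpret_cast<std::uintptr_t>(__builtin_return_address(0));
    info.cfa = reinterpret_cast<std::uintptr_t>(__builtin_dwarf_cfa());
    return info;
}

// Assumption: the stack grows toward lower addresses, so the callee's CFA
// is numerically smaller than the CFA of the function that called it.
__attribute__((noinline)) bool callee_cfa_is_below_caller_cfa() {
    std::uintptr_t caller_cfa =
        reinterpret_cast<std::uintptr_t>(__builtin_dwarf_cfa());
    FrameInfo callee = describe_current_frame();
    return callee.return_address != 0 && callee.cfa < caller_cfa;
}
```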

`libgcc` Part V: Architecture-specific Functionalities

The files under this category are __aarch64_have_sme.o, __arm_sme_state.o, __arm_sme_state_s.o, __arm_tpidr2_restore.o, __arm_tpidr2_restore_s.o, __arm_tpidr2_save.o, __arm_tpidr2_save_s.o, __arm_za_disable.o, __arm_za_disable_s.o. They are all related to the Scalable Matrix Extension (SME) functionality of AArch64, which is similar to the Advanced Matrix Extensions (AMX) of x86-64.

Aside from the usual vector registers introduced by SIMD instruction sets, the SME functionality introduces a new register called ZA. This is a matrix register with a size of SVL_B x SVL_B bytes, where SVL_B depends on the implementation. See https://developer.arm.com/documentation/109246/0100/Introduction/The-Scalable-Matrix-Extensions/Streaming-SVE-mode-and-ZA-storage.

The ZA register needs to be explicitly enabled. Clang has introduced function attributes including __arm_in, __arm_out, __arm_inout, __arm_new, __arm_preserves, __arm_agnostic to support this feature. See https://clang.llvm.org/docs/AttributeReference.html. In GCC, the corresponding attributes are called arm::in, arm::out, arm::inout, arm::preserves, but they seem to be little documented. See https://patchwork.ozlabs.org/project/gcc/patch/[email protected]/.

The ZA register introduces special problems for context switching, both for OS developers and userspace developers (due to exception handling). The x86 developers already saw this kind of mess when integrating support for AMX into the Linux kernel. See https://www.phoronix.com/news/Linux-5.13-More-XSTATE-Mess. The issue is that the ZA register could be very large. By the usual register-clobbering rules, every function that uses ZA is required to save the old content of ZA. However, since nested uses of ZA are expected to be rare, most of the stores and loads are a waste of time and stack space. See https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst (Section 6.6).

Therefore, ARM introduced a scheme known as lazy saving. It involves a register called TPIDR2_EL0. When a function makes a function call, the caller can set up a buffer for the callee to use if it needs to clobber ZA. When the callee returns, the caller can check the content of TPIDR2_EL0 to learn whether ZA has been clobbered or not. If it has, the old content has been written into the buffer that the caller set up for the callee.

The difference between the object files with and without the _s suffix is that, in the files with _s suffix the functions have default visibility, whereas in the files without the suffix the functions have hidden visibility. See https://glandium.org/blog/?p=2510 for why this matters.

`libgcc` Part VI: Emulated Thread-local Storage (TLS)

Starting from C11 and C++11, global objects can be marked as thread-local using _Thread_local (C11) and thread_local (C++11). A new instance of the object is created upon thread creation, and destroyed when the thread exits. How TLS is implemented is described in https://www.akkadia.org/drepper/tls.pdf. In short, the ELF file provides template .data and .bss sections, called .tdata and .tbss. Each time a thread is created, we allocate some memory for these sections, copy over the template, apply some relocation fixups, and run the constructors.

For targets without a standard C library that supports TLS, libgcc provides an "emulated" implementation of TLS in emutls.o. See https://gcc.gnu.org/onlinedocs/gccint/Emulated-TLS.html. Each TLS object is associated with a "control object". On platforms with a POSIX-compliant standard C library, emutls.o uses pthread_getspecific and pthread_setspecific to manage a per-thread memory area of type struct __emutls_array. The first time any thread accesses a TLS object, an offset into the memory area is assigned to that TLS object. The offset is written into the control object. From that point on, every thread uses the same offset into its own struct __emutls_array as the storage area for the object.
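The offset scheme can be modelled in a few lines. The sketch below is a deliberately simplified, single-threaded model and not the code in emutls.o: ControlObject, ThreadArray, and emulated_tls_address are hypothetical stand-ins, the per-thread array is passed in explicitly instead of being fetched with pthread_getspecific, and the locking that a real implementation needs around the global offset counter is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-in for the per-variable control object.
struct ControlObject {
    std::size_t size;         // size of the TLS variable in bytes
    const void* initializer;  // initial image, or nullptr for zero-initialization
    std::size_t offset;       // 1-based slot index; 0 means "not assigned yet"
};

// Hypothetical stand-in for struct __emutls_array: one instance per thread.
struct ThreadArray {
    std::vector<std::unique_ptr<unsigned char[]>> slots;
};

// Shared by all threads; a real implementation must protect it with a lock.
static std::size_t next_offset = 0;

void* emulated_tls_address(ControlObject& obj, ThreadArray& thread) {
    if (obj.offset == 0)
        obj.offset = ++next_offset;              // first access by any thread
    if (thread.slots.size() < obj.offset)
        thread.slots.resize(obj.offset);         // grow this thread's array
    std::unique_ptr<unsigned char[]>& slot = thread.slots[obj.offset - 1];
    if (!slot) {                                 // first access by this thread
        slot.reset(new unsigned char[obj.size]());
        if (obj.initializer)
            std::memcpy(slot.get(), obj.initializer, obj.size);
    }
    return slot.get();
}
```

Two ThreadArray instances stand in for two threads: both resolve the same control object to the same slot index, but each gets its own copy of the storage.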

On aarch64-none-elf, emutls.o is almost a dummy. It assumes there is only one thread. The first time a TLS object is accessed, it allocates space for the object by calling malloc. After that, it always returns the same dynamically allocated instance.

`libgcc` Part VII: Trampolines and Executable Stacks

The C compiler (but not the C++ compiler) of GCC has traditionally supported a feature called nested functions (https://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html). Nested functions can access variables declared in the outer function, a feature known as lexical scoping. This is achieved by passing an additional argument to the nested function which is the CFA of the outer function. The argument is passed via a dedicated register called the static chain register.

A tricky issue arises when we want to take the pointer to a nested function and pass it around like a normal function pointer. The non-local caller of the function would not be aware that the function is nested, and would not be able to set up the static chain register.

The GCC solution to this problem is to generate a trampoline on the stack (https://gcc.gnu.org/onlinedocs/gccint/Trampolines.html) whenever the address of a nested function is taken. The trampoline is a small piece of executable code that sets up the static chain register and jumps to the function. Since the CFA of the outer function may be different at each invocation, a new trampoline must be generated at each outer function invocation.

On some targets, GCC needs a helper function for executing trampolines. This function is called __transfer_from_trampoline(). A grep through the source code of GCC shows that this functionality is now very little used. If it is present, it should be compiled into _trampoline.o, but this file is now empty for most platforms.

According to a changelog entry in 2004, the functionality that __transfer_from_trampoline() used to provide is now moved to __enable_execute_stack() in enable-execute-stack.o. For security reasons, most operating systems forbid executing code on the stack by default. The program needs to make a special syscall to allow this behavior. The __enable_execute_stack() function is a helper function to make a given page of stack executable. On UNIX platforms it is implemented via a call to mprotect(). See enable-execute-stack-mprotect.c.

`libgcc` Part VIII: Stack Scrubbing

Stack scrubbing ("strub") is a relatively new feature of GCC that allows automatically zeroing the stack used by a function after leaving. See https://blog.adacore.com/adacore-enhances-gcc-security-with-innovative-features. This is very useful for cryptographic applications manipulating sensitive data. In particular, it has been reported that implementing the Federal Information Processing Standard (FIPS) 140-3 requires cleaning up all sensitive keys in memory, including public keys (see https://www.redhat.com/en/blog/openssl-fips-140-2-upstream-140-3-downstream).

The stack scrubbing feature does not seem to be fully documented. There is some explanation of the internal functions it uses (https://gcc.gnu.org/onlinedocs/gcc/Stack-Scrubbing.html). There is also some documentation on enabling it for the Ada language (https://gcc.gnu.org/onlinedocs/gnat_rm/Stack-Scrubbing.html). Finally, the compiler switches related to this feature are explained in https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html (see -fstrub). However, there are no examples of how to use this feature in C/C++ programs. The best source of information for this feature is probably the patchset that introduced it (https://patchwork.sourceware.org/project/gcc/cover/[email protected]/).

The stack scrubbing feature is implemented in three parts. First, a new attribute called __strub__ is introduced to C and C++ that marks functions for stack scrubbing. The attribute can be applied to both functions and global variables. If it is applied to a global variable, then all functions that access the variable will be marked for stack scrubbing. If it is applied to a function, then the attribute may optionally take an argument specifying the scrubbing mode. Possible choices are __strub__("at-calls") and __strub__("internal"). The default is __strub__("at-calls"). There is also __strub__("disabled") to explicitly disable stack scrubbing for a function, and __strub__("callable") whose purpose will be explained later.

Second, depending on the scrubbing mode, the compiler inserts calls to some builtin functions that mark stack regions for scrubbing. These builtin functions should not be explicitly called in the source code.

Third, the actual scrubbing procedures are implemented in libgcc. They are contained in the file strub.o.

The basic operation of stack scrubbing is as follows. When a normal function makes a call to a function that demands stack scrubbing, before making the call the caller initializes a data-structure called a "watermark". The watermark structure records the stack top address just before the call, and also the amount of stack used by the callee. Since the caller does not know how much stack the callee will use, this information must be filled in by the callee itself. Thus, the caller must pass a pointer to the watermark structure to the callee. This is a change to the ABI of the callee.

There are two ways to implement the change. The first way is called "at-calls" where the function signature is changed to take the pointer argument. However, this approach is not viable if backward-compatibility of the callee ABI is required. The second way is called "internal" where the function signature is not changed. However, the function body is moved into a clone. The function itself becomes a wrapper over the clone that initializes a watermark structure.

We say a function is a strub context if its signature includes a pointer to the watermark structure. For a function with internal scrubbing mode, the wrapper (the function itself) is not a strub context, but the clone (the actual function body) is a strub context. However, there is no way to call the clone directly. You must call it through the wrapper.

If a function gets marked with __strub__ because it accesses a variable marked with __strub__, and is not otherwise marked with __strub__, then GCC gets to pick a scrubbing mode for the function. Without optimizations enabled, the scrubbing mode would normally be "internal". However, if optimizations are enabled and the function is static (accessible only within the current translation unit), then the scrubbing mode may become "at-calls". This behavior should not be relied upon.

There are restrictions on the control-flow of strub contexts. Enforcement of the restrictions can be controlled by the switches -fstrub=strict and -fstrub=relaxed. If -fstrub=strict is applied, then strub contexts may only call other strub contexts. In particular, they may not call functions marked with __strub__("internal") since the wrapper is not a strub context. There is a special exception: if a function is marked with both __strub__("internal") and always_inline, then it can be called from a strub context, since the wrapper is skipped and the clone (the function body) is directly inlined. However, such functions cannot be called from non-strub contexts.

The -fstrub=relaxed switch relaxes this restriction so that strub contexts are only forbidden from calling functions that are explicitly marked with __strub__("disabled"). Still, if a strub context calls a non-strub context, then the stack of the non-strub context is not scrubbed, which might be a security vulnerability.

In some cases one may want to mark a function as not needing stack scrubbing. This means the function is not a strub context, but is nevertheless safe to call from strub contexts. GCC provides an attribute __strub__("callable") for such functions. A strub context may always call a __strub__("callable") function, but the stack of the callee is not scrubbed.
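As a rough illustration of how the attribute might be applied in C++, consider the sketch below. It is an untested guess at idiomatic usage rather than an official example: it assumes a GCC release recent enough to know the attribute (other compilers are expected to ignore it with a warning), the function name mix_key and its mixing arithmetic are invented for the example, and the "internal" mode is chosen so that the visible signature, and hence the ABI seen by callers, stays unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical function that handles sensitive data on its stack.
// With __strub__("internal"), GCC is expected to move the body into a clone
// and turn this symbol into a wrapper that scrubs the clone's stack on return.
__attribute__((__strub__("internal")))
std::uint32_t mix_key(const unsigned char* key, std::size_t len) {
    unsigned char scratch[32] = {0};            // sensitive copy of the key
    for (std::size_t i = 0; i < len && i < sizeof scratch; ++i)
        scratch[i] = key[i] ^ 0x5a;
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < sizeof scratch; ++i)
        acc = acc * 31u + scratch[i];           // toy mixing, not a real hash
    return acc;                                 // body makes no calls at all
}
```

The body deliberately calls no other functions, which keeps it valid even under -fstrub=strict, where a strub context may only call other strub contexts or functions marked __strub__("callable"). Callers need no annotation, since an unmarked function may call a function in "internal" mode.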

A summary of the control-flow rules of strub contexts is as follows. If -fstrub=strict is in effect, then:

	Callee unmarked	Callee is `__strub__("at-calls")`	Callee is `__strub__("internal")`	Callee is `__strub__("internal")` and `always_inline`	Callee is `__strub__("callable")`	Callee is `__strub__("disabled")`
Caller is unmarked	✅	✅	✅	❌	✅	✅
Caller is `__strub__("at-calls")`	❌	✅	❌	✅	✅	❌
Caller is `__strub__("internal")`	❌	✅	❌	✅	✅	❌
Caller is `__strub__("internal")` and `always_inline`	❌	✅	❌	✅	✅	❌
Caller is `__strub__("callable")`	✅	✅	✅	❌	✅	✅
Caller is `__strub__("disabled")`	✅	✅	✅	❌	✅	✅

If -fstrub=relaxed is in effect, then:

	Callee unmarked	Callee is `__strub__("at-calls")`	Callee is `__strub__("internal")`	Callee is `__strub__("internal")` and `always_inline`	Callee is `__strub__("callable")`	Callee is `__strub__("disabled")`
Caller is unmarked	✅	✅	✅	❌	✅	✅
Caller is `__strub__("at-calls")`	✅	✅	✅	✅	✅	❌
Caller is `__strub__("internal")`	✅	✅	✅	✅	✅	❌
Caller is `__strub__("internal")` and `always_inline`	✅	✅	✅	✅	✅	❌
Caller is `__strub__("callable")`	✅	✅	✅	❌	✅	✅
Caller is `__strub__("disabled")`	✅	✅	✅	❌	✅	✅

There is also a flag -fzero-call-used-regs for GCC to zero the caller-saved registers. See https://lwn.net/Articles/870045/, https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, and https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html.

`libgcc` Part IX: Control-flow Redundancy

Control-flow Redundancy (CFR) is another security hardening feature added to GCC by AdaCore. It can be explicitly enabled with -fharden-control-flow-redundancy. The idea is to use a bit-array to record the basic blocks of a function that have been executed during an invocation. Upon return, the function checks that the bit-array is consistent. Specifically, it checks that if a block is executed, then at least one of its predecessors is executed, and at least one of its successors is executed. See https://gcc.gnu.org/onlinedocs/gnat_rm/Control-Flow-Redundancy.html.
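The consistency rule can be written down as a small model. The sketch below is not the encoding used by libgcc; Block and path_is_consistent are hypothetical names, the control-flow graph is stored as explicit predecessor and successor lists, and blocks without predecessors (the entry) or without successors (the exits) are assumed to satisfy the corresponding half of the rule trivially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation of one basic block in the control-flow graph.
struct Block {
    std::vector<std::size_t> preds;  // indices of predecessor blocks
    std::vector<std::size_t> succs;  // indices of successor blocks
};

static bool any_visited(const std::vector<std::size_t>& ids,
                        const std::vector<bool>& visited) {
    for (std::size_t id : ids)
        if (visited[id]) return true;
    return false;
}

// Returns true if every visited block has a visited predecessor and a
// visited successor, treating entry and exit blocks as vacuously fine.
bool path_is_consistent(const std::vector<Block>& cfg,
                        const std::vector<bool>& visited) {
    for (std::size_t b = 0; b < cfg.size(); ++b) {
        if (!visited[b]) continue;
        if (!cfg[b].preds.empty() && !any_visited(cfg[b].preds, visited))
            return false;   // block was reached from nowhere
        if (!cfg[b].succs.empty() && !any_visited(cfg[b].succs, visited))
            return false;   // block was left to nowhere
    }
    return true;
}
```

For a diamond-shaped graph (block 0 branches to 1 or 2, both join in 3), visiting {0, 1, 3} is consistent, whereas visiting only {0, 3} is flagged because block 0 has no visited successor.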

CFR requires some helper functions that are implemented in hardcfr.c. The source code is straightforward to read.

`libgcc` Part X: Miscellaneous Files

_eprintf.o: When GCC is compiled with standard C library headers provided, this file provides a function called __eprintf(). It is similar to printf(), but prints to stderr instead of stdout. The comments for this function in libgcc/libgcc2.c read:
```
/* __eprintf used to be used by GCC's private version of <assert.h>.
 We no longer provide that header, but this routine remains in libgcc.a
 for binary backward compatibility.  Note that it is not included in
 the shared version of libgcc.  */
```
On bare-metal targets, _eprintf.o is empty.
__gcc_bcmp.o: bcmp() is a POSIX function that compares two memory regions. It returns 0 if the two regions are identical, and a non-zero value if the two regions are not identical. For all purposes, it has been superseded by memcmp(). For some reason, libgcc provides a function called __gcc_bcmp() that is functionally identical to memcmp():
```
/* Like bcmp except the sign is meaningful.
 Result is negative if S1 is less than S2,
 positive if S1 is greater, 0 if S1 and S2 are equal.  */
 int __gcc_bcmp (const unsigned char *s1, const unsigned char *s2, size_t size);
```
A grep through the source code of GCC shows that GCC does not emit calls to this function. Like __eprintf(), it probably exists only for backward compatibility.
__main.o: On platforms that do not have a standard way to run constructor and destructor functions, GCC modifies the main() function of the executable to call a special function called __main() at the beginning. The __main() function walks the constructor array and calls those functions. The __main.o file is empty if the target supports either .init and .fini, or .init_array and .fini_array:
```
/* Subroutine called automatically by `main'.
 Compiling a global function named `main'
 produces an automatic call to this function at the beginning.

 For many systems, this routine calls __do_global_ctors.
 For systems which support a .init section we use the .init section
 to run __do_global_ctors, so we need not do anything here.  */
```
_clear_cache.o and sync-cache.o: GCC provides a builtin function called __builtin___clear_cache() to invalidate the CPU instruction cache for a given memory region. The file _clear_cache.o contains only a single jump to sync-cache.o, which holds the AArch64-specific implementation of __clear_cache(). This function is necessary for self-modifying code, or programs that use Just-In-Time (JIT) compilation.

`libsupc++` Part I: Dynamic Memory Allocation

The files under this category are:

new_handler.o
new_op.o
new_opnt.o
new_opv.o
new_opvnt.o
new_opa.o
new_opant.o
new_opva.o
new_opvant.o
del_op.o
del_ops.o
del_opnt.o
del_opv.o
del_opvs.o
del_opvnt.o
del_opa.o
del_opant.o
del_opsa.o
del_opva.o
del_opvant.o
del_opvsa.o

The file new_handler.o is related to the "new-handler", a function that is called by the default implementation of the new operator when memory allocation fails. See https://en.cppreference.com/w/cpp/memory/new/set_new_handler.html.

The other files are various versions of the new/delete operator. They are mostly wrappers over malloc, free, and aligned_alloc. See https://en.cppreference.com/w/cpp/memory/new/operator_new.html and https://en.cppreference.com/w/cpp/memory/new/operator_delete.html.
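The interaction between the allocation loop and the new-handler can be sketched as follows. This is a minimal model, not the libsupc++ source: flaky_malloc is a hypothetical allocator that fails a configurable number of times so that the retry path can be observed, and allocate_with_new_handler mirrors the documented contract of the default operator new (retry after each handler call, throw std::bad_alloc if no handler is installed).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Test scaffolding (hypothetical): an allocator that fails on demand.
static int failures_left = 0;
static int handler_calls = 0;

static void* flaky_malloc(std::size_t size) {
    if (failures_left > 0) { --failures_left; return nullptr; }
    return std::malloc(size);
}

// A new-handler that just counts how often it is asked to free up memory.
static void counting_handler() { ++handler_calls; }

// Sketch of the retry loop behind the default operator new.
void* allocate_with_new_handler(std::size_t size) {
    if (size == 0) size = 1;                 // a zero-size request still
                                             // needs a unique pointer
    void* p;
    while ((p = flaky_malloc(size)) == nullptr) {
        std::new_handler handler = std::get_new_handler();
        if (!handler) throw std::bad_alloc();
        handler();                           // may free memory, throw, or abort
    }
    return p;
}
```

Installing counting_handler with std::set_new_handler() and forcing two failures makes the loop call the handler twice before the third attempt succeeds.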

The versions with suffix _op only are the basic versions. A suffix of nt means the operator should not throw. However, the delete operator never throws. A delete operator with the nt suffix merely means it is called by the corresponding noexcept version of the new operator upon initialization failure. (In https://en.cppreference.com/w/cpp/memory/new/operator_delete.html, these versions are called "placement deallocation functions" which is confusing and incorrect.)

A suffix of v means an array is allocated or deallocated. A suffix of a means the allocation has custom alignment requirements. A suffix of s means the delete operator has a size argument. See https://isocpp.org/files/papers/n3778.html.
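To connect the suffixes to source code, the following sketch calls several of the global allocation and deallocation functions directly. The mapping from calls to object files in the comments follows the naming scheme described above; the helper name exercise_new_delete_variants is made up, and the aligned and sized variants are guarded by the standard feature-test macros because they require C++17 and C++14 support respectively.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

bool exercise_new_delete_variants() {
    // new_op.o: plain operator new; del_op.o: plain operator delete.
    void* a = ::operator new(32);
    ::operator delete(a);

    // new_opnt.o: non-throwing new returns nullptr instead of throwing.
    void* b = ::operator new(32, std::nothrow);
    bool nothrow_ok = (b != nullptr);
#if defined(__cpp_sized_deallocation)
    ::operator delete(b, std::size_t{32});   // del_ops.o: sized delete
#else
    ::operator delete(b);
#endif

    // new_opv.o / del_opv.o: the array forms.
    void* c = ::operator new[](32);
    ::operator delete[](c);

    bool aligned_ok = true;
#if defined(__cpp_aligned_new)
    // new_opa.o / del_opa.o: over-aligned allocation and deallocation.
    void* d = ::operator new(32, std::align_val_t{64});
    aligned_ok = reinterpret_cast<std::uintptr_t>(d) % 64 == 0;
    ::operator delete(d, std::align_val_t{64});
#endif
    return nothrow_ok && aligned_ok;
}
```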

`libsupc++` Part II: Runtime Type Information (RTTI)

The files under this category are:

tinfo.o
tinfo2.o (empty)
array_type_info.o
class_type_info.o
enum_type_info.o
function_type_info.o
fundamental_type_info.o
pbase_type_info.o
pmem_type_info.o
pointer_type_info.o
si_class_type_info.o
vmi_class_type_info.o
dyncast.o

In general, GCC follows the Itanium ABI (https://itanium-cxx-abi.github.io/cxx-abi/abi.html#rtti) to implement RTTI. The file tinfo.o implements the base class std::type_info. The file dyncast.o implements the dynamic cast algorithm. The other files implement the subclasses of std::type_info for different kinds of types.
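Which subclass describes a given type can be observed from user code, because std::type_info is itself polymorphic. The sketch below assumes GCC with libstdc++, where <cxxabi.h> exposes the abi::__*_type_info classes named by the Itanium ABI; the sample types and the helper described_by are invented for the example.

```cpp
#include <cassert>
#include <cxxabi.h>
#include <typeinfo>

struct Plain {};                       // class without bases
struct Single : Plain {};              // one public non-virtual base
struct Other {};
struct Multiple : Plain, Other {};     // more than one base
enum Color { Red };

// Returns true if the type_info object generated for T is an instance of
// the RTTI class Descriptor. typeid applied to a polymorphic lvalue yields
// the dynamic type, which here is the concrete type_info subclass.
template <typename Descriptor, typename T>
bool described_by() {
    const std::type_info& descriptor = typeid(T);
    return typeid(descriptor) == typeid(Descriptor);
}
```

For instance, described_by<abi::__si_class_type_info, Single>() holds because Single has a single, public, non-virtual base, which is the case covered by si_class_type_info.o, whereas Multiple needs the more general abi::__vmi_class_type_info.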

Because std::type_info has a virtual destructor, the destructors for all subclasses of it are compiled twice. See https://stackoverflow.com/questions/44558119/why-do-i-have-two-destructor-implementations-in-my-assembly-output.

`libsupc++` Part III: The Exception Classes

The files under this category are:

eh_exception.o
bad_alloc.o
bad_array_length.o (only for backward compatibility)
bad_array_new.o
bad_cast.o
bad_typeid.o
guard_error.o
nested_exception.o

The file eh_exception.o implements the types std::exception and std::bad_exception. The type std::exception is the base of all other exceptions in the standard library.

The next five files are self-explanatory. They are the only exceptions that might be thrown by libsupc++.

The exception guard_error requires some explanation. If you declare a static variable inside a function, then the variable is constructed the first time control passes through its declaration. If there are multiple threads, the static variable must still be constructed only once. Therefore, the GCC compiler uses a guard object to ensure that only one thread calls the constructor of the object. The exception defined in guard_error.o (__gnu_cxx::recursive_init_error) is thrown when the guard detects that the initialization has been re-entered recursively.

The use of guard objects for local statics can be omitted with the compiler flag -fno-threadsafe-statics, but any race condition will result in undefined behavior.
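The guard protocol can be imitated by hand. The sketch below approximates what the compiler generates for a function-local static, using the Itanium ABI entry points __cxa_guard_acquire() and __cxa_guard_release() declared in <cxxabi.h>. These functions are not covered in the text above; the names guarded_value, expensive_init, and initializer_runs are hypothetical, and the real generated code additionally tests the first byte of the guard inline before calling into the runtime.

```cpp
#include <cassert>
#include <cxxabi.h>

static int initializer_runs = 0;

// Hypothetical initializer whose executions we want to count.
static int expensive_init() {
    ++initializer_runs;
    return 42;
}

// Hand-written approximation of
//     int guarded_value() { static int value = expensive_init(); return value; }
int guarded_value() {
    static abi::__guard guard;   // zero-initialized: "not yet constructed"
    static int value;
    if (abi::__cxa_guard_acquire(&guard)) {   // nonzero: we must initialize
        value = expensive_init();
        abi::__cxa_guard_release(&guard);     // mark as done, wake up waiters
    }
    return value;
}
```

If the initializer could throw, the generated code would call __cxa_guard_abort() on the exceptional path so that a later call can retry the initialization.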

The libstdc++ library (in freestanding mode) can throw a few more kinds of exceptions. These are invalid_argument, out_of_range, runtime_error, overflow_error, regex_error. See bits/functexcept.h in the installed headers. The libstdc++ library does not throw them directly, but indirectly through functions called __throw_X() where X is the exception name.

The nested_exception.o file implements std::nested_exception. It allows an exception handler to rethrow a caught exception with some added information. Its implementation is based on std::exception_ptr, which is a reference-counted smart pointer to the exception object (see below).
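A short usage sketch may help here. It uses only the standard helpers std::throw_with_nested() and std::rethrow_if_nested(), which are built on std::nested_exception; the functions load_config, start_service, collect, and explain_failure, as well as the error messages, are invented for the example.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

static void load_config() {
    throw std::runtime_error("file not found");
}

static void start_service() {
    try {
        load_config();
    } catch (...) {
        // Wrap the exception being handled inside a new, more descriptive one.
        std::throw_with_nested(std::runtime_error("service failed to start"));
    }
}

// Records e.what() and then recurses into the nested exception, if any.
static void collect(const std::exception& e, std::vector<std::string>& out) {
    out.push_back(e.what());
    try {
        std::rethrow_if_nested(e);   // no-op if e carries no nested exception
    } catch (const std::exception& inner) {
        collect(inner, out);
    }
}

std::vector<std::string> explain_failure() {
    std::vector<std::string> chain;
    try {
        start_service();
    } catch (const std::exception& e) {
        collect(e, chain);
    }
    return chain;
}
```

The returned chain lists the outer message first and the original cause second.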

`libsupc++` Part IV: Exception Handling

The files considered here are:

eh_alloc.o
eh_arm.o (empty)
eh_aux_runtime.o
eh_call.o
eh_catch.o
eh_exception.o
eh_globals.o
eh_personality.o
eh_ptr.o
eh_term_handler.o
eh_terminate.o
eh_tm.o
eh_throw.o
eh_type.o
eh_unex_handler.o

These files implement the second layer of exception handling in the Itanium ABI (see https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html#cxx-abi), together with the exception functionality of the C++ standard library.

The second layer of the Itanium ABI consists of the following functions:

__cxa_get_globals()
__cxa_allocate_exception()
__cxa_free_exception()
__cxa_get_exception_ptr()
__cxa_current_exception_type()
__cxa_throw()
__cxa_begin_catch()
__cxa_end_catch()
__cxa_rethrow()

The GCC compiler emits __cxa_allocate_exception() and __cxa_throw() at each throw statement. Depending on the situation it might emit __cxa_free_exception() if construction of the exception object may fail (see cp/except.cc in the GCC source code). It emits __cxa_get_exception_ptr(), __cxa_begin_catch() and __cxa_end_catch() at each catch clause (the first one is optional). It emits __cxa_rethrow() at each rethrow statement (throw without argument).

A grep through the source code of GCC did not reveal places where it would emit __cxa_get_globals() or __cxa_current_exception_type().

When a throw statement is executed, the first action is to evaluate the throw expression. At this point we are not "officially" performing the throw yet. If evaluating the expression throws another exception, that exception preempts the exception to be thrown.

Next a buffer for the exception object is allocated via __cxa_allocate_exception(). The object is prepended with the following header:

struct __cxa_exception { 
    std::type_info * exceptionType;
    void (*exceptionDestructor) (void *); 
    unexpected_handler unexpectedHandler;
    terminate_handler terminateHandler;
    __cxa_exception * nextException;

    int handlerCount;
    int handlerSwitchValue;
    const char * actionRecord;
    const char * languageSpecificData;
    void * catchTemp;
    void * adjustedPtr;

    _Unwind_Exception unwindHeader;
};

If the throw expression is an lvalue, the exception object is copy-constructed or move-constructed into the buffer. Alternatively, in some cases the copy-constructor can be elided (see https://en.cppreference.com/w/cpp/language/copy_elision.html). If it is a prvalue, then it is directly constructed in the buffer. If the constructor throws, then the exception is preempted.

When the constructor of the exception object returns, the exception is officially considered "uncaught". From this point on, the C++ standard specifies that if stack unwinding calls any user function (such as a destructor) that exits with an exception, std::terminate() is called.
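The "uncaught" state is visible to user code through std::uncaught_exceptions() (C++17). The following sketch uses a hypothetical Probe type whose destructor records that count, so the value seen during unwinding can be compared with the value seen on a normal scope exit.

```cpp
#include <exception>

struct Probe {
    int &slot;
    explicit Probe(int &s) : slot(s) { }
    // Runs either on normal scope exit or as a cleanup during stack unwinding.
    // If this destructor exited with an exception during unwinding,
    // std::terminate() would be called.
    ~Probe() { slot = std::uncaught_exceptions(); }
};

int uncaught_during_unwinding() {
    int seen = -1;
    try {
        Probe p(seen);
        throw 1;          // p is destroyed while the exception is uncaught
    } catch (int) { }
    return seen;
}

int uncaught_during_normal_exit() {
    int seen = -1;
    { Probe p(seen); }    // p is destroyed with no exception in flight
    return seen;
}
```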

The program then calls __cxa_throw(), passing the exception object, its std::type_info, and its destructor. __cxa_throw() fills in the header (see the Itanium ABI) and then calls _Unwind_RaiseException() to start unwinding.
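To make the sequence concrete, here is a hand-written approximation of what a throw expression expands to, using the declarations from cxxabi.h. This is only a sketch: MyError, destroy_my_error, and throw_by_hand are hypothetical names, and the code assumes a target where the destructor argument of __cxa_throw() has the plain `void (*)(void *)` type.

```cpp
#include <cxxabi.h>
#include <new>
#include <typeinfo>

struct MyError {
    int code;
};

// Destructor thunk with the shape __cxa_throw() expects (assumed void(void *)).
static void destroy_my_error(void *p) {
    static_cast<MyError *>(p)->~MyError();
}

// Hand-written approximation of `throw MyError{code};`.
[[noreturn]] static void throw_by_hand(int code) {
    // 1. Allocate the buffer; the runtime places the header in front of it.
    void *buf = abi::__cxa_allocate_exception(sizeof(MyError));
    // 2. Construct the exception object in the buffer. This construction
    //    cannot throw, so no __cxa_free_exception() path is needed here.
    new (buf) MyError{code};
    // 3. Hand the object, its type_info and its destructor to the runtime.
    abi::__cxa_throw(buf, const_cast<std::type_info *>(&typeid(MyError)),
                     destroy_my_error);
}

int catch_by_hand() {
    try {
        throw_by_hand(42);
    } catch (const MyError &e) {   // an ordinary handler sees an ordinary exception
        return e.code;
    }
}
```

An ordinary catch clause cannot tell this apart from a compiler-generated throw, which is a reasonable sanity check that the three steps match what the compiler does.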

A catch handler first calls __cxa_get_exception_ptr() to get a pointer to the exception object. After copy-initializing the handler argument, it calls __cxa_begin_catch(). The rationale for this two-step process is:

/* If the C++ object needs constructing, we need to do that before
   calling __cxa_begin_catch, so that std::uncaught_exception
   gets the right value during the copy constructor.  */

Accordingly, if no copy construction is needed (the handler catches by reference, or the copy constructor is trivial), the compiler emits only __cxa_begin_catch().
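The effect of the two-step process can be observed from user code. In the sketch below (Noisy and Observation are hypothetical names, and C++17 is assumed for std::uncaught_exceptions()), the copy constructor that initializes a by-value handler parameter records the uncaught-exception count, which is then compared with the count inside the handler body.

```cpp
#include <exception>

struct Noisy {
    int uncaught_in_copy = -1;
    Noisy() = default;
    // Records how many exceptions are uncaught while the copy is being made.
    Noisy(const Noisy &) : uncaught_in_copy(std::uncaught_exceptions()) { }
};

struct Observation {
    int during_copy;      // count seen while copy-initializing the parameter
    int inside_handler;   // count seen once the handler is active
};

Observation observe_catch_by_value() {
    try {
        throw Noisy();
    } catch (Noisy e) {
        // The parameter e was copy-constructed before the exception was
        // considered caught; the handler body runs after that.
        return { e.uncaught_in_copy, std::uncaught_exceptions() };
    }
}
```

The exception still counts as uncaught during the copy and as caught inside the handler body, which is the behavior the quoted comment is meant to guarantee.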

A catch handler can either exit normally, rethrow the exception, or throw a new exception. In each case the handler must call __cxa_end_catch(). If a new exception is thrown, it is thrown after __cxa_end_catch(). If an exception gets rethrown, the official procedure in the Itanium ABI is as follows:

1. Call __cxa_rethrow();
2. Call __cxa_end_catch();
3. Call _Unwind_Resume().

This is a bit confusing because the specification for _Unwind_Resume() explicitly says:

_Unwind_Resume should not be used to implement rethrowing. To the
unwinding runtime, the catch code that rethrows was a handler, and
the previous unwinding session was terminated before entering it.
Rethrowing is implemented by calling _Unwind_RaiseException again
with the same exception object.

What actually happens seems to be the following: __cxa_rethrow() calls _Unwind_RaiseException() with the current exception object. The frame containing the handler is then entered in cleanup mode, and its cleanup code calls __cxa_end_catch(). Every landing pad that performs cleanup rather than handling the exception must call _Unwind_Resume() when it is done, and this includes the landing pad of the rethrowing handler.
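One observable consequence of raising "the same exception object" again is that the outer handler sees the very object the inner handler saw. A small sketch, with hypothetical names Failure and Result:

```cpp
struct Failure {
    int code;
};

struct Result {
    bool same_object;    // did both handlers see the same exception object?
    int handlers_run;    // number of handlers entered
};

Result rethrow_and_compare() {
    const Failure *seen_inner = nullptr;
    Result r{false, 0};
    try {
        try {
            throw Failure{7};
        } catch (const Failure &e) {
            seen_inner = &e;
            ++r.handlers_run;
            throw;   // rethrow: the current exception object is raised again
        }
    } catch (const Failure &e) {
        // The exception object is still alive here, so comparing addresses is valid.
        r.same_object = (seen_inner == &e);
        ++r.handlers_run;
    }
    return r;
}
```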

`libsupc++` Part V: Miscellaneous Files

vec.o: This file contains the following functions, declared in cxxabi.h:
```
__cxa_vec_new
__cxa_vec_new2
__cxa_vec_new3
__cxa_vec_ctor
__cxa_vec_cctor
__cxa_vec_dtor
__cxa_vec_cleanup
__cxa_vec_delete
__cxa_vec_delete2
__cxa_vec_delete3
```
A grep through the source code of GCC did not reveal any place where calls to these functions are emitted. They likely exist only for backward compatibility.
vterminate.o: This file is supposed to contain the "verbose terminate handler", a terminate handler that can be optionally installed to provide detailed error information when std::terminate() is called. On bare-metal targets, this file is empty since I/O functionality is not available.
atexit_thread.o: This file contains an implementation of __cxa_thread_atexit(), a function that registers functions to be called at thread exit rather than at process exit. Like emutls.o, on bare-metal targets this file is a dummy: it simply assumes there is only one thread and maintains a linked list of all functions to be called.
atexit_arm.o: On arm (not aarch64) targets, this file implements __aeabi_atexit(). On the aarch64-none-elf target, this file is empty.
pure.o: This file implements __cxxabiv1::__cxa_pure_virtual() and __cxxabiv1::__cxa_deleted_virtual(). They are used to fill in the pure virtual and deleted entries of each virtual table, and they simply call std::terminate().
guard.o: This file implements __cxa_guard_acquire(), __cxa_guard_release(), __cxa_guard_abort(). See Part III above for explanation of their purpose.
hash_bytes.o: This file implements std::_Hash_bytes and std::_Fnv_hash_bytes. They correspond to the Murmur hash and the FNV hash algorithms, respectively (see https://en.wikipedia.org/wiki/MurmurHash and https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function). std::_Hash_bytes is used to implement std::type_info::hash_code().
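Although the compiler does not seem to call the vec.o functions listed above, they can be called by hand to see what they do. The sketch below is an illustration under assumptions: Item, construct_item, and destroy_item are hypothetical names, and the code assumes a target where abi::__cxa_cdtor_type is `void (*)(void *)` (this is not the case on 32-bit ARM EABI, where constructors and destructors return a pointer).

```cpp
#include <cxxabi.h>
#include <cstddef>
#include <new>

static int constructed = 0;
static int destroyed = 0;

struct Item {
    int value;
};

// Constructor and destructor thunks in the void(void *) shape assumed above.
static void construct_item(void *p) { new (p) Item{++constructed}; }
static void destroy_item(void *) { ++destroyed; }

// Roughly what `new Item[n]` followed by `delete[]` amounts to when
// expressed through the vec.o entry points.
bool vec_round_trip(std::size_t n) {
    constructed = destroyed = 0;
    // Padding in front of the array, used as a cookie holding the element count.
    const std::size_t cookie = sizeof(std::size_t);
    void *raw = abi::__cxa_vec_new(n, sizeof(Item), cookie,
                                   construct_item, destroy_item);
    Item *items = static_cast<Item *>(raw);
    bool ok = constructed == static_cast<int>(n) &&
              (n == 0 || items[n - 1].value == static_cast<int>(n));
    // Reads the element count back from the cookie, destroys every element,
    // then releases the storage.
    abi::__cxa_vec_delete(raw, sizeof(Item), cookie, destroy_item);
    return ok && destroyed == static_cast<int>(n);
}
```

The destructor passed to __cxa_vec_new() is only used if a constructor throws partway through, in which case the elements built so far are destroyed before the exception propagates.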

CharlieQiu2017/libgcc.md
