@nwf
Forked from nwf-msr/Exceptional Sequence Text Pages
Created September 17, 2024 20:21
###############################
Exceptional-Sequence Text Pages
###############################
Introduction
============
Occasionally, software wants to impose interesting out-of-band
constraints on small, "exceptional" sequences of instructions. Examples
of such constraints include needing to run...
- to completion without preemption: if thread scheduling takes this
sequence off-core, the sequence should be restarted, rather than
continued, upon scheduling resuming this thread.
- without asynchronous interruption, more generally than preemption, by
either restarting the sequence (as with preemption) or by *catching
the interruption* with specialized response.
- with specialized responses to *synchronous* faults.
Such sequences are generally short, hand-written in assembler, and have
few basic blocks. In general, the set of such sequences is static per
loadable object, but it is not necessarily constant across a *process's*
lifetime. It seems likely that, if there are more than zero such
sequences within a library or process, there will also be more than one.
Desiderata
==========
1. Multiple pieces of software, unknown to each other, within one
process should be able to avail themselves of such exceptional
sequences. This extends to dynamically loaded (or perhaps even JIT
generated) code.
2. These sequences are generally very short, perhaps as short as a
single instruction, and so entry and exit from exceptional sequences
should be as close as possible to 0-instruction overhead and,
especially, should avoid mutating memory. (The brevity of these
sequences means that we do not expect them to nest or to include
calls to other functions.)
3. Context switches happen very often and are already generally considered
"too expensive" as it stands, so testing *whether* the currently
executing code needs special response for preemption should be as
fast as possible, particularly in the (most common) case that the
answer is "no."
This is less of a concern for non-preemption events, such as
synchronous faults that cannot be transparently handled or
asynchronous interrupts which redirect control flow, as these events
are rare and, typically, already have fairly high response costs.
4. The exceptionalness of these sequences is intrinsic to the sequence
of instructions; it should not be possible to sometimes run the
instructions as part of an exceptional sequence and sometimes not.
This is perhaps merely aesthetic.
Existing Approaches
===================
Idempotent Sequences, Historically
----------------------------------
Apparently first introduced as "restartable atomic sequences" by
`Bershad et al. Fast Mutual Exclusion for Uniprocessors (1992)
<https://dl.acm.org/doi/pdf/10.1145/143371.143523>`__, these are regions
of code that run to completion entirely on one side or the other of
preemption. The two implementations discussed in that paper are Mach's,
which permits an address space to have exactly one restartable atomic
sequence, and Taos's, which permits multiple sequences but recognizes
them by parsing the instruction stream.
On quite old ARM processors, a popular trick was to use idempotent
sequence machinery to implement atomic operations. See, for example,
`Linux's documentation of their implementation
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`__.
Existing Special Handling within the FreeBSD Kernel
---------------------------------------------------
Within the FreeBSD kernel, there are two mechanisms used for specialized
subsets of our aim here.
"pcb_onfault"
~~~~~~~~~~~~~
Each thread control block contains a "``pcb_onfault``" field. If this
field is not ``NULL`` when the processor, running in the kernel,
encounters a nonresolvable fault, control flow will be redirected to
this address. This mechanism is not used to synchronize with preemption
or for asynchronous events. That aside, considering our desiderata,
1. This mechanism supports modular software.
2. This mechanism requires two stores per executed exceptional sequence,
and in many cases those sequences are shorter than this combined
entry and exit cost. In other cases, the entry and exit costs are
amortized and include many instructions that either cannot fault or
for which a fault should indeed panic the machine.
3. This mechanism does not interact with context switch. It interacts
only with the nonresolvable fault path, on which it performs a load,
a conditional branch, and a possible store, making it quite
lightweight.
4. It is, in principle, possible to bypass the entry instructions for an
exceptional sequence, but it is hard to imagine what utility would be
gained.
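To make the entry/exit cost concrete, here is a hedged userspace sketch of the pattern, with ``setjmp``/``longjmp`` standing in for the kernel's trap-frame rewrite; the names and the one-shot clearing policy are illustrative, not FreeBSD's actual implementation.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for the thread control block's pcb_onfault field:
 * NULL means "no special handling"; non-NULL is where the fault path
 * redirects control. */
static jmp_buf *pcb_onfault = NULL;

/* Stand-in for the nonresolvable-fault path: one load, one conditional
 * branch, and a longjmp in place of rewriting the trap frame. */
static void fault(void) {
    if (pcb_onfault != NULL) {
        jmp_buf *target = pcb_onfault;
        pcb_onfault = NULL;          /* one-shot */
        longjmp(*target, 1);
    }
    assert(0 && "unhandled kernel fault: panic");
}

/* A copyin-like routine: two stores (set and clear) bracket the
 * fault-prone work -- the entry/exit cost discussed above. */
static int copyin_sim(int should_fault, int *out) {
    jmp_buf env;
    if (setjmp(env) != 0)
        return -1;                   /* EFAULT-like result */
    pcb_onfault = &env;              /* store #1: entry */
    if (should_fault)
        fault();                     /* simulated bad user access */
    *out = 42;
    pcb_onfault = NULL;              /* store #2: exit */
    return 0;
}
```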
Hard-coded Program Counter Address Comparisons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the platform support code for arm64, the low-level "external abort"
handling machinery involves a function named "``test_bs_fault``" that
explicitly compares the faulting program counter to the address of eight
symbols from within the kernel.
1. This is patently non-modular, requiring a priori knowledge of all
code needing special handling.
2. Entry and exit are, indeed, free, and these symbols, once added to
the set considered, can be located anywhere in the kernel.
3. This mechanism, similarly, does not interact with context switch. It
sits entirely on a very rare path used generally while probing
hardware and not at all in the steady state.
4. The special handling is tied to the instruction's location and so
cannot be severed.
Linux's rseq API
----------------
The relevant kernel code is spread across several files:
- `linux/uapi/rseq.h <https://github.com/torvalds/linux/blob/master/include/uapi/linux/rseq.h>`__
describes the userspace interface
- `rseq.c <https://github.com/torvalds/linux/blob/master/kernel/rseq.c>`__
contains the bulk of the associated kernel code
- `membarrier.c <https://github.com/torvalds/linux/blob/master/kernel/sched/membarrier.c>`__
also has grown support for restarting rseqs, either generally or on a
specific CPU.
See `The 5-year journey to bring restartable sequences to Linux -
EfficiOS <https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/>`__
and `Restartable sequences restarted
[LWN.net] <https://lwn.net/Articles/697979/>`__ for writeups that cover
some older editions of the API, as well.
Each thread has at most one "``struct rseq``" associated with it, though
repeated registrations are *counted*.[#rseq-sign]_ Each "``struct
rseq``" points to at most one "``struct rseq_cs``" which describes the
code region for the restartable sequence, giving its start, end, and
abort locations in the executable. Software must store into the ``struct
rseq_cs`` pointer within its registered ``struct rseq`` before entering
each restartable sequence and should perform an additional store to
clear the pointer after exiting a restartable sequence.
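The store protocol can be sketched with deliberately simplified stand-ins for the structures; the real definitions in ``include/uapi/linux/rseq.h`` carry more fields (version, flags, ``cpu_id``, the signature, and so on).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketches; see include/uapi/linux/rseq.h for the real layout. */
struct rseq_cs {
    uint64_t start_ip;            /* first instruction of the sequence */
    uint64_t post_commit_offset;  /* length: start_ip + this = end */
    uint64_t abort_ip;            /* where the kernel restarts the thread */
};
struct rseq {
    uint64_t rseq_cs;             /* address of the active rseq_cs, or 0 */
};

/* One store on entry to each restartable sequence... */
static void rseq_enter(struct rseq *rs, const struct rseq_cs *cs) {
    rs->rseq_cs = (uint64_t)(uintptr_t)cs;
}
/* ...and one recommended (but not strictly required) store on exit. */
static void rseq_exit(struct rseq *rs) {
    rs->rseq_cs = 0;
}
```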
The kernel consults the registered ``struct rseq`` only on preemption,
migration, and signal delivery. Flags within the ``struct rseq`` (not ``struct
rseq_cs``!) can disable the control-flow transfer response of each event
source; this feature of the API is intended for use by debuggers.
Speaking of debuggers, ptrace has been extended to support reading the
current rseq registration.
.. [#rseq-sign]
Registrations are also "signed" with a 32-bit value provided to the
system call that must also be provided at un-registration and must
equal the 32-bit value prior to the rseq's abort address. Diagnostic
messages in the Linux kernel suggest that this is an attempt to
prevent attackers from clobbering an active ``struct rseq_cs``
pointer.
Checking Desiderata
~~~~~~~~~~~~~~~~~~~
1. By slightly extending the *ABI*, so that there is a well-known name
for *the* "``struct rseq``" managed by TLS machinery, multiple
disparate pieces of software can use this mechanism. However,
changing the kernel's notion of registered "``struct rseq``" is an
ambient authority (a system call) guarded by a 32-bit number that
must also be shared between the disparate pieces of software, rather
like the "secret" address space layout randomization (ASLR) slide.
2. This API requires one store on the entry to a rseq-guarded sequence:
the pointer to the "``struct rseq_cs``" must be stored into the
"``struct rseq``". Because the "``struct rseq_cs``" describes not
just the restart address but also the start and end of the sequence,
it is not strictly necessary to clear this pointer at the end of the
sequence, but nevertheless the API documentation suggests doing so.
Even one store is more than we would like to pay, but it is
apparently tolerable.
3. With this API, preemption now must
a. Test whether the thread has a "``struct rseq``" associated with it
and, if so,
b. Load the "``struct rseq_cs``" pointer,
c. Update the event counters within the "``struct rseq``",
d. If there is a "``struct rseq_cs``" indicated, load it,
e. Check whether the preempted address is within the indicated
bounds, and, if so
f. Replace the preempted address with the indicated restart address.
This is fairly lightweight: a few loads, a store, and a few
conditional branches.
4. It is, in principle, possible to bypass the store of the "``struct
rseq_cs``" pointer and still execute the instructions of the
sequence, disassociating them from their intended rseq use, but,
again, it is not clear what advantage this would give.
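The preemption-time checks in point 3 can be sketched as pure logic; the structures are simplified stand-ins and only loosely follow the real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real structures live in linux/uapi/rseq.h. */
struct rseq_cs { uint64_t start_ip, post_commit_offset, abort_ip; };
struct rseq { const struct rseq_cs *cs; uint32_t event_count; };

/* Given the thread's registration (or NULL) and its preempted PC, return
 * the PC at which execution should resume; the steps track (a)-(f) above. */
static uint64_t on_preempt(struct rseq *rs, uint64_t pc) {
    if (rs == NULL)                          /* (a) no registration */
        return pc;
    const struct rseq_cs *cs = rs->cs;       /* (b) load the cs pointer */
    rs->event_count++;                       /* (c) count the event */
    if (cs == NULL)                          /* (d) no active sequence */
        return pc;
    if (pc >= cs->start_ip &&                /* (e) bounds check */
        pc < cs->start_ip + cs->post_commit_offset)
        return cs->abort_ip;                 /* (f) restart address */
    return pc;
}
```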
FreeBSD's rseq proposal
-----------------------
See `D32505 Add rseq(2) <https://reviews.freebsd.org/D32505>`__. It is a
direct port of the Linux API.
Windows' Structured Exception Handling (SEH)
--------------------------------------------
Windows SEH is useful for managing synchronous faults and may in
principle be useful for asynchronous interrupts, but it is far too
heavyweight for interacting with preemption a la rseq. In any case, at
least as of x64, it uses lookaside tables to drive its response and
seems to require ahead-of-time registration of loaded code segments
(`RtlAddFunctionTable <https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-rtladdfunctiontable>`__)
or at least callbacks that can generate those tables on demand
(`RtlInstallFunctionTableCallback <https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-rtlinstallfunctiontablecallback>`__).
As faults are (probably correctly) presumed rare, the expense of all the
table walks was seen as preferable to Win32's much more stack-based
mechanism.
Win32 SEH used[#win32-seh]_ an ABI-defined offset (specifically, 0) within the
per-thread control block reachable via the %fs segment selector to point, in
turn, at a chain of exception handler registration records. These registration
records were dynamically constructed on the stack and were pushed and popped
from the chain. Given the rarity of exceptions, it is understandable that this
runtime cost was seen as too high.
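Structurally, that chain amounts to a stack-linked list hanging off a per-thread head slot; a sketch with illustrative (non-Windows) types, where a static variable stands in for fs:[0]:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types only; the real Windows definitions differ. */
struct seh_record {
    struct seh_record *prev;
    int (*handler)(void);              /* grossly simplified signature */
};

static struct seh_record *seh_head = NULL;   /* stand-in for fs:[0] */

/* Per guarded scope: a stack-resident record is pushed... */
static void seh_push(struct seh_record *r, int (*h)(void)) {
    r->handler = h;
    r->prev = seh_head;
    seh_head = r;
}
/* ...and popped again on normal exit -- the runtime cost the text notes. */
static void seh_pop(void) {
    seh_head = seh_head->prev;
}

/* Exception dispatch walks the chain until some handler claims it. */
static int seh_dispatch(void) {
    for (struct seh_record *r = seh_head; r != NULL; r = r->prev)
        if (r->handler() == 0)
            return 0;                  /* handled */
    return -1;                         /* unhandled */
}

/* Two toy handlers for demonstration. */
static int seh_decline(void) { return 1; }
static int seh_accept(void)  { return 0; }
```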
.. [#win32-seh]
At least, as far as I understand `A Crash Course on the Depths of
Win32 Structured Exception
Handling <http://bytepointer.com/resources/pietrek_crash_course_depths_of_win32_seh.htm>`__.
New Proposal
============
To enable both O(1) runtime costs per event *and* a dynamic multiplicity
of idempotent sequences within a program, I propose to introduce
**Exceptional-Sequence Text Pages** (ESTP).[#estp-name]_ Such a page
begins with a structured data header (defined below) and then contains
the instructions of these exceptional sequences.
.. [#estp-name]
I'm more than open to suggestions for better names.
Header Layout
-------------
Each such page (the size of pages being defined by the platform ABI)
begins with a descriptor table, which is composed of an ABI-defined
metadata word followed by an array of ESTP descriptors. The metadata
word includes the number of descriptors on this
page.[#estp-header-limit]_
Each descriptor is itself an ABI-defined structure, but is likely a
64-bit value. The fields of a descriptor are as follows. (Bit-widths
given in parentheses are suggestions for ABIs with paging granularity of
up to 64KiB; for 4KiB-page architectures, the top four bits of each
offset would also be zero.)
+-------------+------------+-------------+-------------+-------------+
| Event | Zeros (13) | Start | End offset | Response |
| selector | | offset (16) | (16) | offset (16) |
| (3) | | | | |
+-------------+------------+-------------+-------------+-------------+
The start (resp. end) offset points to the first byte of the first
instruction within (resp. after) the sequence on this page; response
offsets likewise point to instructions within the current page.
Because it may enable optimization in the kernel and is easy to do while
constructing the ESTP page, the array of descriptors must be sorted
lexicographically using the field order shown.
The event selector is one of
+---------------------+-----------------------------------------------+
| Selector (value) | Description |
+=====================+===============================================+
| ``ESTP_PREEMPTION`` | Involuntary preemption (time quantum expired |
| (0) | or thread synchronization) |
+---------------------+-----------------------------------------------+
| ``ESTP_INTERRUPT`` | Attempted asynchronous interrupt / signal |
| (1) | delivery |
+---------------------+-----------------------------------------------+
| ``ESTP_FAULT`` | Synchronous fault that cannot be |
| (2) | transparently resolved |
+---------------------+-----------------------------------------------+
When the table is scanned, the first descriptor that matches a given
(PC, event) pair, if any, will be the one chosen for further
processing.[#estp-span-overlap]_
.. [#estp-header-limit]
And we should perhaps insist that there is some sensible upper limit
on the number of descriptors per page; a 4KiB page can, strictly, fit
512 descriptors of the suggested 64-bit format, but that leaves no
room for code.
.. [#estp-span-overlap]
Span overlap is likely best kept to equal spans and differing event
selectors, but this is not required.
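A sketch of one plausible packing of the suggested 64-bit descriptor and of the first-match scan; the actual bit placement would be fixed by the ABI, and the linear scan below ignores the sortedness the text requires.

```c
#include <assert.h>
#include <stdint.h>

enum { ESTP_PREEMPTION = 0, ESTP_INTERRUPT = 1, ESTP_FAULT = 2 };

/* One plausible packing, most significant field first:
 * event(3) | zeros(13) | start(16) | end(16) | response(16). */
static uint64_t estp_pack(unsigned ev, uint16_t start, uint16_t end,
                          uint16_t resp) {
    return ((uint64_t)(ev & 7) << 61) | ((uint64_t)start << 32) |
           ((uint64_t)end << 16) | resp;
}

/* First-match scan over the page's descriptor array: returns the
 * response offset, or -1 if no descriptor covers (pc_off, ev). */
static int32_t estp_scan(const uint64_t *desc, unsigned n,
                         uint16_t pc_off, unsigned ev) {
    for (unsigned i = 0; i < n; i++) {
        unsigned d_ev  = (unsigned)(desc[i] >> 61) & 7;
        uint16_t start = (uint16_t)(desc[i] >> 32);
        uint16_t end   = (uint16_t)(desc[i] >> 16);
        uint16_t resp  = (uint16_t)desc[i];
        if (d_ev == ev && pc_off >= start && pc_off < end)
            return resp;
    }
    return -1;
}
```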
Aside: An Alternative Table Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is also possible that we could have the header word describe the
number of entries for each selector. This would remove the selector
from the descriptors themselves and would slightly optimize the case
where a page has no descriptors for a particular event. On the other
hand, it limits the number of events we can use as triggers to fairly
few; then again, perhaps only a few are wanted, in which case this is
the right trade-off.
Events and Responses
--------------------
The various event selectors request transfer of control flow
thus:[#estp-cheri-monotonicity]_
- For a ``PREEMPTION`` descriptor, control flow will move to the
response offset upon resumption following a preemption event or
similar *invisible* interrupt.[#estp-preemption-visibility]_ No other
changes to the register file will be made.
It is not required, but one would expect it to be common, that the
response offset is equal to the start offset. In this case, the
exceptional code sequence must behave in an *idempotent* manner;
typically, the last instruction within the sequence will be a store to
"commit" the aggregate effect of the sequence.
``PREEMPTION`` can be useful for accessing per-CPU data structures using
such idempotent sequences, as cross-core migration necessarily
requires preemption.
Presently, we **do not** propose to check the user instruction pointer
on preemption events that happen inside system calls (just as we do
not check return addresses within the user program). That is,
sequences associated with ``PREEMPTION`` descriptors should not make
system calls.
- For a ``FAULT`` descriptor, control flow moves to the response offset
if any instruction within the associated sequence triggers a
synchronous fault. No other changes to the register file are made; if
the identity of the faulting instruction matters, multiple descriptors
and response offsets should be used.
``FAULT`` descriptors are intended for constructing idempotent
sequences that involve memory loads from unsynchronized memory. If a
sequence's commitment point can "retroactively" validate that these
loads have produced a sensible value, the lack of synchronization may
not be a concern.
- For an ``INTERRUPT`` descriptor, control flow moves to the response
offset whenever the system *resumes* from a *visible*, asynchronous
interrupt (such as a UNIX signal with a signal handler
registered)[#estp-signal-visibility]_ while the thread's program
counter was within the sequence spanned by this
descriptor.[#estp-interrupt-address]_
The word(s) immediately prior to the response offset includes a
bitmask of registers to clear before the system *invokes* the
asynchronous interrupt handler as well as the length of the response
code.[#estp-interrupt-length]_ Tentatively, a suggested layout for a
system with 32 registers could be
+------------+------------------------------------+-----------------------------+
| Zeros (16) | Response text length in bytes (16) | Register clear bitmask (32) |
+------------+------------------------------------+-----------------------------+
The intended use case for ``INTERRUPT`` descriptors is to be able to
front-run the system's usual signaling mechanism, to allow, for
example, a lightweight *safebox* to expunge sensitive contents from
its register file before execution exits the safebox. This mechanism
alone is unlikely to suffice for especially elaborate safeboxes, but it
should allow the safebox to reliably engage
additional interrupt-deferring mechanisms to cover the remainder of
its execution.[#morello-safebox-interrupts]_
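The metadata word suggested above for ``INTERRUPT`` responses can be packed and unpacked as follows; again, the bit placement is one plausible choice rather than anything ABI-fixed, and the register numbering is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* One plausible packing, most significant field first:
 * zeros(16) | response length in bytes(16) | register-clear bitmask(32). */
static uint64_t estp_irq_meta(uint16_t resp_len, uint32_t clear_mask) {
    return ((uint64_t)resp_len << 32) | clear_mask;
}
static uint16_t estp_irq_len(uint64_t w)  { return (uint16_t)(w >> 32); }
static uint32_t estp_irq_mask(uint64_t w) { return (uint32_t)w; }

/* Helper for an inclusive register range, e.g. bits for c10..c17. */
static uint32_t reg_range_mask(unsigned lo, unsigned hi) {
    return (uint32_t)(((uint64_t)1 << (hi + 1)) - ((uint64_t)1 << lo));
}
```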
.. [#estp-cheri-monotonicity]
On CHERI architectures, to ensure monotonicity of execution, this
transfer of control flow must be done by deriving a capability from
within the current program counter capability.
.. [#estp-preemption-visibility]
Invisible interrupts include broadcast synchronization events like
CheriBSD's thread_single() mechanism which is used, for example, by
the Cornucopia implementation of CHERI revocation. Thus,
``PREEMPTION`` descriptors may also be used to ensure that the
exceptional sequence completes, after some number of restarts,
entirely within one revocation epoch.
.. [#estp-signal-visibility]
On UNIX, one can treat SIG_DFL signals as invisible -- they generally
either perform no action or kill the program and in any case do not
expose the thread's register file -- and even "old API" signal
handlers do not, by a strict reading of POSIX, gain access to the
registers, either.
.. [#estp-interrupt-address]
That is, the interrupt/signal/... handler will receive the response
address as the address of the interrupted instruction rather than the
address of the truly interrupted instruction within the exceptional
sequence.
.. [#estp-interrupt-length]
On CHERI, the length of the response code will be used when
constructing the PCC exposed to the interrupt handler: the PCC
exposed to the handler will have its address set to the indicated
response address and its length set to the indicated length, thereby
losing access to PC-relative data outside this region.
.. [#morello-safebox-interrupts]
On CHERI, when entering a safebox using, for example, Morello's BLR,
BLRR, BLRS, BR, BRS, LDPBLR, LDPBR, RET, RETR, or RETS instructions,
the unsealed PCC (and possibly IDC) are *immediately* available in
the register file and subject to capture by asynchronous event
handlers (for example, within the ucontext_t structures passed to
CheriBSD signal handlers). The domain-call ABI should ensure that the
operands of these instructions are preserved by their action so that
the sealed forms are available within ``INTERRUPT`` responses to
facilitate retrying domain entry.
Registering an ESTP With the Kernel
-----------------------------------
We wish there were general per-virtual-page state in the FreeBSD VM
subsystem, as objects look to be too coarse and physical pages are
shared across address spaces. For the moment, then, we propose to use a
per-address-space hash set of registered ESTP pages.[#estp-why-hash]_
Two new ``madvise()`` behavior values can insert
(``MADV_ESTP_REGISTER``) or remove (``MADV_ESTP_UNREGISTER``) one or
more page(s) from this set.[#estp-why-madvise]_
User software using ``ESTP`` must arrange to call ``madvise()`` with
``MADV_ESTP_REGISTER`` on the appropriate virtual pages before executing
any exceptional-sequence code. This is likely readily done with the
existing ELF ``ctor`` machinery. One could imagine also teaching the
image activator about ESTP pages, but this seems excessive.
.. [#estp-why-hash]
Earlier versions of this proposal used a flag within a ``vm_page_t``
structure to designate ESTP pages, but this necessitated unsharing
text pages across address spaces, which is far from ideal.
.. [#estp-why-madvise]
We propose the use of ``madvise()`` rather than ``mprotect()`` or
``mmap()`` to avoid chewing up bits in flags fields. The argument to
``madvise()`` is a selector rather than a composite, and so we feel
more comfortable reserving entries in that namespace.
Kernel Changes
~~~~~~~~~~~~~~
Upon an event of interest to the ESTP machinery (preemption, signal
delivery, or unhandled fault), the kernel should look up the user
program counter's page in this set. If found, the kernel must traverse
the descriptors at the front of the ESTP page; it first loads the 0th
word of the page to read the ESTP descriptor table metadata, and then
scans the array of descriptors. We suspect it will be simplest to use
the ``fu[e]\*`` functions to load each descriptor in a loop rather than
wire the page. We are requiring the ESTP descriptors to be sorted in a
way that is amenable to binary search, but this is unlikely to hugely
matter and may in fact be detrimental given the expected small size of
the array.
Using ESTPs Within the Kernel
-----------------------------
This mechanism may also be of interest to the kernel itself. The various
``copy{in,out}`` and ``f[su]\.*`` functions may be placed on an ESTP and
so avoid the need to register and unregister themselves with the
``pcb_onfault`` mechanism. The O(1) lookup in the fault path is also
significantly nicer than the proposed "fault lookaside table" of `this
earlier pull request
<https://github.com/CTSRD-CHERI/cheribsd/pull/1159>`__.
Care would be required to ensure that ESTPs are not used within the code
resolving ESTPs or in code that holds the same locks, especially if
``PREEMPTION`` descriptors are desired. However, the kernel has other,
well-trodden mechanisms for interacting with its own preemptive
scheduler, and it has little use for ``INTERRUPT`` descriptors'
responses, and so in-kernel ESTPs are likely sensibly limited to
``FAULT`` descriptors. At least the FreeBSD kernel is already
architected to avoid triggering non-resolvable faults while accessing
its own internal data structures, so the fault-handling code is
generally free to acquire the pmap locks required for ESTP resolution.
As an aside, if ``PREEMPTION`` descriptors do, in fact, turn out to be
desirable in the kernel, then the kernel should manipulate its own ESTP
hash set with preemption disabled. That way, the preemptive
switcher can read from this set without what would amount to recursive
access.
Checking Desiderata
-------------------
1. ESTP works with disparate software within the same process: each
consumer of ESTP provides its own page(s). Registration with the
kernel need be done only at load time, and deregistration is implicit
at unload time, when the pages are unmapped.
2. There is no instruction overhead for entering or exiting an ESTP
sequence, though there is cost in branching to and from. See below.
3. Preemption must test whether the page holding the preempted program
counter is an ESTP and, if so, whether there is a ``PREEMPTION``
descriptor covering that address. This is O(1), if with a larger than
desirable constant; see below.
4. It is not possible to execute instructions on an ESTP page without
also becoming subject to the ESTP descriptors on that page.
Reasons All is Not Obviously Well
---------------------------------
The use of page-aligned descriptor tables complicates code generation
and linkage, probably limiting ESTPs to hand-written assembler with
implied entry and exit costs of branching to and from these exceptional
sequences. It is possible to have code not covered by ESTP descriptors
within an ESTP page, so it may be possible to amortize the branch costs
by moving tight loops around the exceptional sequence, but again, this
relies on hand-written assembler. (On the other hand, in some use cases
which ESTP would replace, the existing approaches also use hand-written
assembler and incur these branch costs.)
This drawback stems from the central role played by pages in this
design. Fundamentally, we are exploiting page *metadata* to recognize
exceptional sequences and page *alignment* as a convenient map from
program counters to the set of possible governing descriptors. It is
possible that superior recognition and mapping strategies exist, and in
some sense the proposal here is severable into its notion of descriptors
tightly coupled to particular code sequences and the particular layout
of memory that performs said coupling.
Worked Examples
===============
General Interlocks With Preemption
----------------------------------
One can use ``PREEMPTION`` ESTP descriptors to cover the existing rseq
use cases, and the only real challenges are...
- the necessity of moving all restartable sequence(s) onto one or more
dedicated pages, together with their ESTP descriptors and
- registering the ESTP page(s) with the kernel sufficiently early in
program startup, which can be done with ctors for most code or
special-purpose code in rtld should it use ESTPs in startup.
These do not seem insurmountable.
.. _snmalloc-pagemap-race:
Revoker-interlocked Type Stability in snmalloc PageMap
------------------------------------------------------
Racing double-frees in snmalloc pose an especially interesting
challenge, since the *type* of data in the PageMap's MetaEntry for a
given region may change once the owning thread (which is uniquely
allowed to mutate the MetaEntry) has learned that there are no *live*
objects within that range. In the case of a racing double-free, this
type change may happen even while another thread is accessing the
MetaEntry. If a chain of pointers from the MetaEntry must be followed
early in deallocation (say, to access the associated arena or shadow
capabilities), then this leaves racing deallocations by threads that do
not own the region in question open to UAR of the allocator's metadata.
The race can almost be closed, modulo a small residual TOCTTOU gap, by
permitting differently typed reuse of a range's MetaEntry only after
revocation has expunged all capabilities into that range, and by having
deallocation check the revokedness of its argument. The TOCTTOU gap may
then be closed by an idempotent sequence, guarded by a ``PREEMPTION``
ESTP descriptor, consisting of this check and the chained loads, as the
beginning of a revocation epoch will act like a preemption event.
snmalloc Safebox Entry on Morello
---------------------------------
Let's assume that we control both sides of the function entry. In
practice, we will want to (or have to) adhere to the domain-crossing
ABI, but for the moment there isn't one, so we can pretend.
On the one hand, snmalloc is an ideal stress test of this machinery
since it's going to exercise basically everything. On the other hand,
it's a little tough because either we're going to have to teach the
compiler how to generate these exceptional sequences or we're going to
have to (re)write a whole lot of what is currently a very pleasantly
portable C++ codebase into assembler.
A caller that considers snmalloc trusted will...
1. fetch its sealed Allocator ``this`` pointer from TLS into c8,
2. fetch its sealed code pointer to, say, dealloc(), into c9, and
3. use BLRS (pair) to jump and unseal both (BLRS c29, c8, c9).
BLRS will store back a sentry to the link register (c30), unseal c8 into
c29, and unseal c9 into pc. (Despite the surface syntax, c29 is a fixed
choice of the Morello architecture, like c30, and is not a selectable
operand of BLRS). Importantly, c8 and c9 are not clobbered and are
already defined by the aarch64 ABI as caller-save.[#safebox-abi-regs]_
.. [#safebox-abi-regs]
It may make more sense to use c16 and c17, reserved by the ABI for
call veneers?
Immediately upon entry to the snmalloc safebox, we must be defensive
against both faults and interrupts (async signals), as either event
risks exposing the unsealed interior (but preemption is not a concern).
If we are careful about the program we write, we can decide that faults
are impossible and should be fatal, bringing down the entire process.
However, async signals do not offer such a convenient story. This early
in the entry, we simply want async signals to have happened *before*
domain transition rather than after, and so we need to arrange to rewind
time.
Our entry-point must fit all its storage requirements into the
caller-saved registers (other than c8 and c9, so c10 through c15) and/or
the intra-procedure scratch registers (c16 and c17). Its ``INTERRUPT`` ESTP
descriptor must direct the kernel to clear c10 through c17 and c29, to erase
its unsealed capabilities and any progeny thereof, and must point the
response at a single (4-byte) instruction: BRS c29, c8, c9. Note that
this instruction does not alter c30, so the state of the machine after
its execution is equivalent to the state after the *caller*\ 's BLRS
c29, c8, c9: the link register continues to be a sentry capability back
to the caller. The kernel's derivation and bounding of the exposed
interrupted PCC to exactly the 4-byte span of the BRS instruction means
that sigreturn() can execute this instruction without the signal handler
getting its hands on any PC-relative state from snmalloc's safebox.
Unfortunately, extending the ESTP machinery to cover the entirety of
snmalloc's code is tricky. Therefore, we expect the entry trampoline to
exit from ESTP's cover once it has...
- acquired a stack using an atomic xchg operation with a known offset
within the Allocator structure. This should store NULL; if it reads
NULL, the API is being abused and a nominally single-threaded
Allocator has been invoked by two threads at once.
- arranged for the
`sigfastblock(2) <https://www.freebsd.org/cgi/man.cgi?query=sigfastblock>`__
mechanism to be in effect, deferring async signals until exit from
the safebox.[#sigfastblock-recurse]_
ESTP may be used again, though for ``FAULT`` or ``PREEMPTION`` handling
only, when snmalloc traverses some internal data structures.
``PREEMPTION`` allows snmalloc to cheaply synchronize with the revoker,
ensuring that the ESTP-covered sequence(s) run to completion within a
revocation epoch. It is unlikely that ESTP is useful to exit the
safebox: the exit path can restore the stack, clear all sensitive state,
and then drop the sigfastblock guard against async signals before
returning to the caller.
.. [#sigfastblock-recurse]
sigfastblock is recursive, and so even in the case that snmalloc
enters rtld, it should all work out.
Approximating Conditional Load Instructions
-------------------------------------------
It does not come up often, but there are cases where one is working with
elevated authority on behalf of some other agent and so one may want to
perform a load through capability A only if capability B is also tagged
and permissive of loads. ESTP ``FAULT`` descriptors permit emulation of such a
conditional load instruction: the ``FAULT`` response path can check
capability B and respond appropriately, by continuing execution at the
next instruction or jumping out of the exceptional sequence or, if B is
indeed still tagged, executing a copy of the faulting load instruction
without ESTP guard to commit to the fault (!).
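As a concrete sketch, in a hypothetical CHERI-RISC-V-flavoured
assembler (the mnemonics and the ESTP record placement here are
illustrative, not normative), the guarded load and its ``FAULT``
response might look like:

```asm
    # ca0 = capability A (load target), ca1 = capability B (the condition).
    # A single ESTP FAULT record covers exactly the clc at "guarded",
    # with its response offset pointing at "response".
guarded:
    clc     ct0, 0(ca0)      # the "conditional" load through A
resume:
    ...                      # success: fall through with the loaded value

response:                    # reached only if the guarded clc faults
    cgettag t1, ca1          # inspect capability B
    beqz    t1, skip         # B untagged: treat the load as a no-op
    clc     ct0, 0(ca0)      # B still tagged: replay the load without
                             # ESTP cover, committing to the fault (!)
skip:
    j       resume           # or jump out of the exceptional sequence
```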

Similarly, one can use ESTP ``FAULT`` records to approximate conditional
load instructions that test the LL reservation rather than the
permissiveness of a second capability. (Modern architectures are
generally forgiving of additional memory operations during LL/SC, so
long as those additional operations avoid the reservation.) Here, an
ordinary load instruction sits under an ESTP ``FAULT`` record whose
response offset points at code which breaks the reservation and then
resumes execution at the instruction after the load. The result
under-approximates the set of faults that a true "conditional load"
could deliver, in that it treats every fault as a no-op due to the
reservation being broken (breaking it explicitly if needed), but it may
still suffice.
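A sketch of this reservation-testing variant, again in illustrative
RISC-V assembler. The deliberately failing ``sc`` relies on the RISC-V
rule that any ``sc`` invalidates the hart's reservation, and that an
``sc`` to an address outside the reservation set must fail:

```asm
    # a0 = address under LL/SC, a1 = address of the "conditional" load.
retry:
    lr.d    t0, (a0)         # open the reservation
    ld      t1, 0(a1)        # guarded by an ESTP FAULT record
after_load:
    ...                      # use t1; avoid touching the reservation
    sc.d    t2, t0, (a0)     # fails if the reservation was broken
    bnez    t2, retry

response:                    # FAULT response for the guarded ld
    sc.d    zero, zero, (sp) # sc to an unreserved address: guaranteed
                             # to fail, but it breaks the reservation
    j       after_load       # resume; the sc.d above will now fail
```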

Alternative Proposals and Rebuttals
===================================

In-Band NOPs
------------

An earlier proposal suggested placing NOPs after instructions to encode
what this proposal encodes in the ESTP descriptors. Assessing the
desiderata...
1. This proposal works with modular software.
2. There is no entry or exit cost; exceptional sequences can be inlined
anywhere, given a sufficiently rich taxonomy of NOPs.
3. It's very easy for the fault and preemption paths to test for the
need for special handling (a single copyin of the next instruction).
4. The metadata is overtly intrinsic to the instruction stream.
The advantages of such a scheme over ESTP are...
- Nothing funky going on with pages and alignment
- Very short instruction sequences need very little metadata, possibly
less than an ESTP descriptor!
- Reading a virtual memory location adjacent to PC is faster and easier
than reading per-page metadata.
while the disadvantages are...
- The need for architectural NOPs reserved for software's
introspection. Some architectures, like RISC-V, which have a large
swath of NOPs available as HINTs, may be willing to perform such
carveouts in their ISAs; some, like Arm, seem much less likely to do
so without a very compelling use, and it's not clear that this rises
to that threshold.
- Increased I$ pressure from these NOPs (though in many cases the
  impact would be less than or equal to the existing I$ cost of stores
  to perform sequence entry and exit, never mind the additional D$
  operations).
Other points of note:
- If too few NOPs are available to encode the entirety of the response,
  indirection may be an option: have the in-band NOPs point at
  something like an ESTP descriptor (or descriptor table). An even more
  extreme form would be to use a single sigil NOP, either

  - followed by an unconditional jump over the response descriptor(s), or
  - used to indicate that a full ESTP descriptor table exists at the
    start of this page (!) and should be consulted for the instruction
    in question.
- This scheme is, like ESTP, viable for both user and kernel code.
- The existing proposal for this design placed the NOPs after the
instruction in question. It may be slightly better to place them
before, as that would not require the kernel to parse out the
instruction width.
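One possible in-band encoding, sketched in illustrative RISC-V
assembler. The choice of HINT instruction and the descriptor layout are
invented here for exposition; RISC-V reserves writes to ``x0`` as
HINT-encoding space, which is the "large swath of NOPs" alluded to
above:

```asm
    # A HINT-space NOP acts as the sigil; the jump keeps the inline
    # descriptor out of the executed path.  Placing all of this before
    # the guarded instruction spares the kernel from having to decode
    # that instruction's width.
    slli    x0, x0, 1        # hypothetical reserved HINT: "ESTP sigil"
    j       1f               # skip the inline response descriptor
    .word   0x00000001       # invented encoding: response kind = FAULT
    .word   response - .     # offset to the response path
1:
    ld      t0, 0(a0)        # the guarded instruction itself
    ...

response:
    ...                      # the response path proper
```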

Look-Aside Tables for Exceptional Sequences in the Kernel
---------------------------------------------------------

Abandoning the rseq use case of synchronizing user code against
preemption and focusing, instead, only *within the kernel* and on
synchronous faults and asynchronous interrupts, an earlier proposal
generated an out-of-band linkerset of structures that approximate the
ESTP descriptors described herein. Assessing the desiderata...

1. Modularity was achievable but relied on being able to find the
linkerset for each module. This was just shy of straightforward in
the targeted use case of the kernel.
2. There is no entry or exit cost and the sequences can be inlined
anywhere.
3. Because the mechanism focused only on faults, its only impact was in
the very slow-path case where the fault could not be transparently
handled by the kernel. Adding asynchronous event support would have
impacted signal delivery, a somewhat less rare event. The kernel has
other mechanisms to deal with preemption, and so its absence was not
seen as a problem.
4. The lookaside linkerset is reliably associated with the code of
exceptional sequences.
This mechanism probably cannot be extended either to user code or to
include preemption:
- It would be significantly harder in the case of a userspace program
to find the appropriate object's linkerset in the first place; the
kernel has very little idea about "objects" in this sense. The result
would be somewhat akin to Windows SEH, of either the x86 or x64
variety.
- The cost of finding and scanning the linkerset, on every preemption
event especially, seems prohibitive.