@nwf
Forked from nwf-msr/Exceptional Sequence Text Pages
Created September 17, 2024 20:21
###############################
Exceptional-Sequence Text Pages
###############################
Introduction
============
Occasionally, software wants to impose interesting out-of-band
constraints on small, "exceptional" sequences of instructions. Examples
of such constraints include needing to run...
- to completion without preemption: if thread scheduling takes this
sequence off-core, the sequence should be restarted, rather than
continued, upon scheduling resuming this thread.
- without asynchronous interruption, more generally than preemption, by
either restarting the sequence (as with preemption) or by *catching
the interruption* with specialized response.
- with specialized responses to *synchronous* faults.
Such sequences are generally short, hand-written in assembler, and have
few basic blocks. In general, the set of such sequences is static per
loadable object, but it is not necessarily constant across a *process's*
lifetime. It seems likely that, if there are more than zero such
sequences within a library or process, there will also be more than one.
Desiderata
==========
1. Multiple pieces of software, unknown to each other, within one
process should be able to avail themselves of such exceptional
sequences. This extends to dynamically loaded (or perhaps even JIT
generated) code.
2. These sequences are generally very short, perhaps as short as a
single instruction, and so entry and exit from exceptional sequences
should be as close as possible to 0-instruction overhead and,
especially, should avoid mutating memory. (The brevity of these
sequences means that we do not expect them to nest or to include
calls to other functions.)
3. Context switches happen very often and are already generally considered
"too expensive" as it stands, so testing *whether* the currently
executing code needs special response for preemption should be as
fast as possible, particularly in the (most common) case that the
answer is "no."
This is less of a concern for non-preemption events, such as
synchronous faults that cannot be transparently handled or
asynchronous interrupts which redirect control flow, as these events
are rare and, typically, already have fairly high response costs.
4. The exceptionalness of these sequences is intrinsic to the sequence
of instructions; it should not be possible to sometimes run the
instructions as part of an exceptional sequence and sometimes not.
This is perhaps merely aesthetic.
Existing Approaches
===================
Idempotent Sequences, Historically
----------------------------------
Apparently first introduced as "restartable atomic sequences" by
`Bershad et al. Fast Mutual Exclusion for Uniprocessors (1992)
<https://dl.acm.org/doi/pdf/10.1145/143371.143523>`__, these are regions
of code that run to completion entirely on one side or the other of
preemption. The two implementations discussed in that paper are Mach's,
which permits an address space to have exactly one restartable atomic
sequence, and Taos's, which permits multiple sequences but recognizes
them by parsing the instruction stream.
On quite old ARM processors, a popular trick was to use idempotent
sequence machinery to implement atomic operations. See, for example,
`Linux's documentation of their implementation
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`__.
Existing Special Handling within the FreeBSD Kernel
---------------------------------------------------
Within the FreeBSD kernel, there are two mechanisms used for specialized
subsets of our aim here.
"pcb_onfault"
~~~~~~~~~~~~~
Each thread control block contains a "``pcb_onfault``" field. If this
field is not ``NULL`` when the processor, running in the kernel,
encounters a nonresolvable fault, control flow will be redirected to
this address. This mechanism is not used to synchronize with preemption
or for asynchronous events. That aside, considering our desiderata,
1. This mechanism supports modular software.
2. This mechanism requires two stores per executed exceptional sequence,
and in many cases those sequences are shorter than this combined
entry and exit cost. In other cases, the entry and exit costs are
amortized and include many instructions that either cannot fault or
for which a fault should indeed panic the machine.
3. This mechanism does not interact with context switch. It interacts
only with the nonresolvable fault path, on which it performs a load,
a conditional branch, and a possible store, making it quite
lightweight.
4. It is, in principle, possible to bypass the entry instructions for an
exceptional sequence, but it is hard to imagine what utility would be
gained.
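To make the entry/exit cost concrete, here is a hedged userspace sketch of the pattern, with ``setjmp``/``longjmp`` standing in for the kernel's trap-frame rewrite; the names and the one-shot clearing policy are illustrative, not FreeBSD's actual implementation.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for the thread control block's pcb_onfault field:
 * NULL means "no special handling"; non-NULL is where the fault path
 * redirects control. */
static jmp_buf *pcb_onfault = NULL;

/* Stand-in for the nonresolvable-fault path: one load, one conditional
 * branch, and a longjmp in place of rewriting the trap frame. */
static void fault(void) {
    if (pcb_onfault != NULL) {
        jmp_buf *target = pcb_onfault;
        pcb_onfault = NULL;          /* one-shot */
        longjmp(*target, 1);
    }
    assert(0 && "unhandled kernel fault: panic");
}

/* A copyin-like routine: two stores (set and clear) bracket the
 * fault-prone work -- the entry/exit cost discussed above. */
static int copyin_sim(int should_fault, int *out) {
    jmp_buf env;
    if (setjmp(env) != 0)
        return -1;                   /* EFAULT-like result */
    pcb_onfault = &env;              /* store #1: entry */
    if (should_fault)
        fault();                     /* simulated bad user access */
    *out = 42;
    pcb_onfault = NULL;              /* store #2: exit */
    return 0;
}
```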
Hard-coded Program Counter Address Comparisons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the platform support code for arm64, the low-level "external abort"
handling machinery involves a function named "``test_bs_fault``" that
explicitly compares the faulting program counter to the address of eight
symbols from within the kernel.
1. This is patently non-modular, requiring a priori knowledge of all
code needing special handling.
2. Entry and exit are, indeed, free, and these symbols, once added to
the set considered, can be located anywhere in the kernel.
3. This mechanism, similarly, does not interact with context switch. It
sits entirely on a very rare path used generally while probing
hardware and not at all in the steady state.
4. The special handling is tied to the instruction's location and so
cannot be severed.
Linux's rseq API
----------------
The relevant kernel code is spread across several files:
- `linux/uapi/rseq.h <https://github.com/torvalds/linux/blob/master/include/uapi/linux/rseq.h>`__
describes the userspace interface
- `rseq.c <https://github.com/torvalds/linux/blob/master/kernel/rseq.c>`__
contains the bulk of the associated kernel code
- `membarrier.c <https://github.com/torvalds/linux/blob/master/kernel/sched/membarrier.c>`__
also has grown support for restarting rseqs, either generally or on a
specific CPU.
See `The 5-year journey to bring restartable sequences to Linux -
EfficiOS <https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/>`__
and `Restartable sequences restarted
[LWN.net] <https://lwn.net/Articles/697979/>`__ for writeups that cover
some older editions of the API, as well.
Each thread has at most one "``struct rseq``" associated with it, though
repeated registrations are *counted*.[#rseq-sign]_ Each "``struct
rseq``" points to at most one "``struct rseq_cs``" which describes the
code region for the restartable sequence, giving its start, end, and
abort locations in the executable. Software must store into the ``struct
rseq_cs`` pointer within its registered ``struct rseq`` before entering
each restartable sequence and should perform an additional store to
clear the pointer after exiting a restartable sequence.
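The store protocol can be sketched with deliberately simplified stand-ins for the structures; the real definitions in ``include/uapi/linux/rseq.h`` carry more fields (version, flags, ``cpu_id``, the signature, and so on).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketches; see include/uapi/linux/rseq.h for the real layout. */
struct rseq_cs {
    uint64_t start_ip;            /* first instruction of the sequence */
    uint64_t post_commit_offset;  /* length: start_ip + this = end */
    uint64_t abort_ip;            /* where the kernel restarts the thread */
};
struct rseq {
    uint64_t rseq_cs;             /* address of the active rseq_cs, or 0 */
};

/* One store on entry to each restartable sequence... */
static void rseq_enter(struct rseq *rs, const struct rseq_cs *cs) {
    rs->rseq_cs = (uint64_t)(uintptr_t)cs;
}
/* ...and one recommended (but not strictly required) store on exit. */
static void rseq_exit(struct rseq *rs) {
    rs->rseq_cs = 0;
}
```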
The kernel consults the registered ``struct rseq`` only on preemption,
migration, and signal delivery. Flags within the ``struct rseq`` (not ``struct
rseq_cs``!) can disable the control-flow transfer response of each event
source; this feature of the API is intended for use by debuggers.
Speaking of debuggers, ptrace has been extended to support reading the
current rseq registration.
.. [#rseq-sign]
Registrations are also "signed" with a 32-bit value provided to the
system call that must also be provided at un-registration and must
equal the 32-bit value prior to the rseq's abort address. Diagnostic
messages in the Linux kernel suggest that this is an attempt to
prevent attackers from clobbering an active ``struct rseq_cs``
pointer.
Checking Desiderata
~~~~~~~~~~~~~~~~~~~
1. By slightly extending the *ABI*, so that there is a well-known name
for *the* "``struct rseq``" managed by TLS machinery, multiple
disparate pieces of software can use this mechanism. However,
changing the kernel's notion of registered "``struct rseq``" is an
ambient authority (a system call) guarded by a 32-bit number that
must also be shared between the disparate pieces of software, rather
like the "secret" address space layout randomization (ASLR) slide.
2. This API requires one store on the entry to a rseq-guarded sequence:
the pointer to the "``struct rseq_cs``" must be stored into the
"``struct rseq``". Because the "``struct rseq_cs``" describes not
just the restart address but also the start and end of the sequence,
it is not strictly necessary to clear this pointer at the end of the
sequence, but nevertheless the API documentation suggests doing so.
Even one store is more than we would like to pay, but it is
apparently tolerable.
3. With this API, preemption now must
a. Test whether the thread has a "``struct rseq``" associated with it
and, if so,
b. Load the "``struct rseq_cs``" pointer,
c. Update the event counters within the "``struct rseq``",
d. If there is a "``struct rseq_cs``" indicated, load it,
e. Check whether the preempted address is within the indicated
bounds, and, if so
f. Replace the preempted address with the indicated restart address.
This is fairly lightweight: a few loads, a store, and a few
conditional branches.
4. It is, in principle, possible to bypass the store of the "``struct
rseq_cs``" pointer and still execute the instructions of the
sequence, disassociating them from their intended rseq use, but,
again, it is not clear what advantage this would give.
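The preemption-time checks in point 3 can be sketched as pure logic; the structures are simplified stand-ins and only loosely follow the real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real structures live in linux/uapi/rseq.h. */
struct rseq_cs { uint64_t start_ip, post_commit_offset, abort_ip; };
struct rseq { const struct rseq_cs *cs; uint32_t event_count; };

/* Given the thread's registration (or NULL) and its preempted PC, return
 * the PC at which execution should resume; the steps track (a)-(f) above. */
static uint64_t on_preempt(struct rseq *rs, uint64_t pc) {
    if (rs == NULL)                          /* (a) no registration */
        return pc;
    const struct rseq_cs *cs = rs->cs;       /* (b) load the cs pointer */
    rs->event_count++;                       /* (c) count the event */
    if (cs == NULL)                          /* (d) no active sequence */
        return pc;
    if (pc >= cs->start_ip &&                /* (e) bounds check */
        pc < cs->start_ip + cs->post_commit_offset)
        return cs->abort_ip;                 /* (f) restart address */
    return pc;
}
```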
FreeBSD's rseq proposal
-----------------------
See `D32505 Add rseq(2) <https://reviews.freebsd.org/D32505>`__. It is a
direct port of the Linux API.
Windows' Structured Exception Handling (SEH)
--------------------------------------------
Windows SEH is useful for managing synchronous faults and may in
principle be useful for asynchronous interrupts, but it is far too
heavyweight for interacting with preemption a la rseq. In any case, at
least as of x64, it uses lookaside tables to drive its response and
seems to require ahead-of-time registration of loaded code segments
(`RtlAddFunctionTable <https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-rtladdfunctiontable>`__)
or at least callbacks that can generate those tables on demand
(`RtlInstallFunctionTableCallback <https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-rtlinstallfunctiontablecallback>`__).
As faults are (probably correctly) presumed rare, the expense of all the
table walks was seen as preferable to Win32's much more stack-based
mechanism.
Win32 SEH used[#win32-seh]_ an ABI-defined offset (specifically, 0) within the
per-thread control block reachable via the %fs segment selector to point, in
turn, at a chain of exception handler registration records. These registration
records were dynamically constructed on the stack and were pushed and popped
from the chain. Given the rarity of exceptions, it is understandable that this
runtime cost was seen as too high.
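Structurally, that chain amounts to a stack-linked list hanging off a per-thread head slot; a sketch with illustrative (non-Windows) types, where a static variable stands in for fs:[0]:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types only; the real Windows definitions differ. */
struct seh_record {
    struct seh_record *prev;
    int (*handler)(void);              /* grossly simplified signature */
};

static struct seh_record *seh_head = NULL;   /* stand-in for fs:[0] */

/* Per guarded scope: a stack-resident record is pushed... */
static void seh_push(struct seh_record *r, int (*h)(void)) {
    r->handler = h;
    r->prev = seh_head;
    seh_head = r;
}
/* ...and popped again on normal exit -- the runtime cost the text notes. */
static void seh_pop(void) {
    seh_head = seh_head->prev;
}

/* Exception dispatch walks the chain until some handler claims it. */
static int seh_dispatch(void) {
    for (struct seh_record *r = seh_head; r != NULL; r = r->prev)
        if (r->handler() == 0)
            return 0;                  /* handled */
    return -1;                         /* unhandled */
}

/* Two toy handlers for demonstration. */
static int seh_decline(void) { return 1; }
static int seh_accept(void)  { return 0; }
```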
.. [#win32-seh]
At least, as far as I understand `A Crash Course on the Depths of
Win32 Structured Exception
Handling <http://bytepointer.com/resources/pietrek_crash_course_depths_of_win32_seh.htm>`__.
New Proposal
============
To enable both O(1) runtime costs per event *and* a dynamic multiplicity
of idempotent sequences within a program, I propose to introduce
**Exceptional-Sequence Text Pages** (ESTP).[#estp-name]_ Such a page
begins with a structured data header (defined below) and then contains
the instructions of these exceptional sequences.
.. [#estp-name]
I'm more than open to suggestions for better names.
Header Layout
-------------
Each such page (the size of pages being defined by the platform ABI)
begins with a descriptor table, which is composed of an ABI-defined
metadata word followed by an array of ESTP descriptors. The metadata
word includes the number of descriptors on this
page.[#estp-header-limit]_
Each descriptor is itself an ABI-defined structure, but is likely a
64-bit value. The fields of a descriptor are as follows. (Bit-widths
given in parentheses are suggestions for ABIs with paging granularity of
up to 64KiB; for 4KiB-page architectures, the top four bits of each
offset would also be zero.)
+-------------+------------+-------------+-------------+-------------+
| Event | Zeros (13) | Start | End offset | Response |
| selector | | offset (16) | (16) | offset (16) |
| (3) | | | | |
+-------------+------------+-------------+-------------+-------------+
The start (resp. end) offset points to the first byte of the first
instruction within (resp. after) the sequence on this page; response
offsets likewise point to instructions within the current page.
Because it may enable optimization in the kernel and is easy to do while
constructing the ESTP page, the array of descriptors must be sorted
lexicographically using the field order shown.
The event selector is one of
+---------------------+-----------------------------------------------+
| Selector (value) | Description |
+=====================+===============================================+
| ``ESTP_PREEMPTION`` | Involuntary preemption (time quantum expired |
| (0) | or thread synchronization) |
+---------------------+-----------------------------------------------+
| ``ESTP_INTERRUPT`` | Attempted asynchronous interrupt / signal |
| (1) | delivery |
+---------------------+-----------------------------------------------+
| ``ESTP_FAULT`` | Synchronous fault that cannot be |
| (2) | transparently resolved |
+---------------------+-----------------------------------------------+
When the table is scanned, the first descriptor that matches a given
(PC, event) pair, if any, will be the one chosen for further
processing.[#estp-span-overlap]_
.. [#estp-header-limit]
And we should perhaps insist that there is some sensible upper limit
on the number of descriptors per page; a 4KiB page can, strictly, fit
512 descriptors of the suggested 64-bit format, but that leaves no
room for code.
.. [#estp-span-overlap]
Span overlap is likely best kept to equal spans and differing event
selectors, but this is not required.
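A sketch of one plausible packing of the suggested 64-bit descriptor and of the first-match scan; the actual bit placement would be fixed by the ABI, and the linear scan below ignores the sortedness the text requires.

```c
#include <assert.h>
#include <stdint.h>

enum { ESTP_PREEMPTION = 0, ESTP_INTERRUPT = 1, ESTP_FAULT = 2 };

/* One plausible packing, most significant field first:
 * event(3) | zeros(13) | start(16) | end(16) | response(16). */
static uint64_t estp_pack(unsigned ev, uint16_t start, uint16_t end,
                          uint16_t resp) {
    return ((uint64_t)(ev & 7) << 61) | ((uint64_t)start << 32) |
           ((uint64_t)end << 16) | resp;
}

/* First-match scan over the page's descriptor array: returns the
 * response offset, or -1 if no descriptor covers (pc_off, ev). */
static int32_t estp_scan(const uint64_t *desc, unsigned n,
                         uint16_t pc_off, unsigned ev) {
    for (unsigned i = 0; i < n; i++) {
        unsigned d_ev  = (unsigned)(desc[i] >> 61) & 7;
        uint16_t start = (uint16_t)(desc[i] >> 32);
        uint16_t end   = (uint16_t)(desc[i] >> 16);
        uint16_t resp  = (uint16_t)desc[i];
        if (d_ev == ev && pc_off >= start && pc_off < end)
            return resp;
    }
    return -1;
}
```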
Aside: An Alternative Table Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is also possible that we could have the header word describe the
number of entries for each selector. This would remove the selector
from the descriptors themselves and would slightly optimize the case
where a page has no descriptors for a particular event. On the other
hand, it limits the number of events we can use as triggers to fairly
few; then again, perhaps only a few are wanted, in which case this is
the right trade-off.
Events and Responses
--------------------
The various event selectors request transfer of control flow
thus:[#estp-cheri-monotonicity]_
- For a ``PREEMPTION`` descriptor, control flow will move to the
response offset upon resumption following a preemption event or
similar *invisible* interrupt.[#estp-preemption-visibility]_ No other
changes to the register file will be made.
It is not required, but one would expect it to be common, that the
response offset is equal to the start offset. In this case, the
exceptional code sequence must behave in an *idempotent* manner;
typically, the last instruction within the sequence will be a store to
"commit" the aggregate effect of the sequence.
``PREEMPTION`` can be useful for accessing per-CPU data structures using
such idempotent sequences, as cross-core migration necessarily
requires preemption.
Presently, we **do not** propose to check the user instruction pointer
on preemption events that happen inside system calls (just as we do
not check return addresses within the user program). That is,
sequences associated with ``PREEMPTION`` descriptors should not make
system calls.
- For a ``FAULT`` descriptor, control flow moves to the response offset
if any instruction within the associated sequence triggers a
synchronous fault. No other changes to the register file are made; if
the identity of the faulting instruction matters, multiple descriptors
and response offsets should be used.
``FAULT`` descriptors are intended for constructing idempotent
sequences that involve memory loads from unsynchronized memory. If a
sequence's commitment point can "retroactively" validate that these
loads have produced a sensible value, the lack of synchronization may
not be a concern.
- For an ``INTERRUPT`` descriptor, control flow moves to the response
offset whenever the system *resumes* from a *visible*, asynchronous
interrupt (such as a UNIX signal with a signal handler
registered)[#estp-signal-visibility]_ while the thread's program
counter was within the sequence spanned by this
descriptor.[#estp-interrupt-address]_
The word(s) immediately prior to the response offset includes a
bitmask of registers to clear before the system *invokes* the
asynchronous interrupt handler as well as the length of the response
code.[#estp-interrupt-length]_ Tentatively, a suggested layout for a
system with 32 registers could be
+------------+------------------------------------+-----------------------------+
| Zeros (16) | Response text length in bytes (16) | Register clear bitmask (32) |
+------------+------------------------------------+-----------------------------+
The intended use case for ``INTERRUPT`` descriptors is to be able to
front-run the system's usual signaling mechanism, to allow, for
example, a lightweight *safebox* to expunge sensitive contents from
its register file before execution exits the safebox. This mechanism
alone is unlikely to suffice for especially elaborate safeboxes, but it
should allow the safebox to reliably engage
additional interrupt-deferring mechanisms to cover the remainder of
its execution.[#morello-safebox-interrupts]_
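The metadata word suggested above for ``INTERRUPT`` responses can be packed and unpacked as follows; again, the bit placement is one plausible choice rather than anything ABI-fixed, and the register numbering is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* One plausible packing, most significant field first:
 * zeros(16) | response length in bytes(16) | register-clear bitmask(32). */
static uint64_t estp_irq_meta(uint16_t resp_len, uint32_t clear_mask) {
    return ((uint64_t)resp_len << 32) | clear_mask;
}
static uint16_t estp_irq_len(uint64_t w)  { return (uint16_t)(w >> 32); }
static uint32_t estp_irq_mask(uint64_t w) { return (uint32_t)w; }

/* Helper for an inclusive register range, e.g. bits for c10..c17. */
static uint32_t reg_range_mask(unsigned lo, unsigned hi) {
    return (uint32_t)(((uint64_t)1 << (hi + 1)) - ((uint64_t)1 << lo));
}
```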
.. [#estp-cheri-monotonicity]
On CHERI architectures, to ensure monotonicity of execution, this
transfer of control flow must be done by deriving a capability from
within the current program counter capability.
.. [#estp-preemption-visibility]
Invisible interrupts include broadcast synchronization events like
CheriBSD's thread_single() mechanism which is used, for example, by
the Cornucopia implementation of CHERI revocation. Thus,
``PREEMPTION`` descriptors may also be used to ensure that the
exceptional sequence completes, after some number of restarts,
entirely within one revocation epoch.
.. [#estp-signal-visibility]
On UNIX, one can treat SIG_DFL signals as invisible -- they generally
either perform no action or kill the program and in any case do not
expose the thread's register file -- and even "old API" signal
handlers do not, by a strict reading of POSIX, gain access to the
registers, either.
.. [#estp-interrupt-address]
That is, the interrupt/signal/... handler will receive the response
address as the address of the interrupted instruction rather than the
address of the truly interrupted instruction within the exceptional
sequence.
.. [#estp-interrupt-length]
On CHERI, the length of the response code will be used when
constructing the PCC exposed to the interrupt handler: the PCC
exposed to the handler will have its address set to the indicated
response address and its length set to the indicated length, thereby
losing access to PC-relative data outside this region.
.. [#morello-safebox-interrupts]
On CHERI, when entering a safebox using, for example, Morello's BLR,
BLRR, BLRS, BR, BRS, LDPBLR, LDPBR, RET, RETR, or RETS instructions,
the unsealed PCC (and possibly IDC) are *immediately* available in
the register file and subject to capture by asynchronous event
handlers (for example, within the ucontext_t structures passed to
CheriBSD signal handlers). The domain-call ABI should ensure that the
operands of these instructions are preserved by their action so that
the sealed forms are available within ``INTERRUPT`` responses to
facilitate retrying domain entry.
Registering an ESTP With the Kernel
-----------------------------------
We wish there were general per-virtual-page state in the FreeBSD VM
subsystem, as objects look to be too coarse and physical pages are
shared across address spaces. For the moment, then, we propose to use a
per-address-space hash set of registered ESTP pages.[#estp-why-hash]_
Two new ``madvise()`` behavior values can insert
(``MADV_ESTP_REGISTER``) or remove (``MADV_ESTP_UNREGISTER``) one or
more page(s) from this set.[#estp-why-madvise]_
User software using ``ESTP`` must arrange to call ``madvise()`` with
``MADV_ESTP_REGISTER`` on the appropriate virtual pages before executing
any exceptional-sequence code. This is likely readily done with the
existing ELF ``ctor`` machinery. One could imagine also teaching the
image activator about ESTP pages, but this seems excessive.
.. [#estp-why-hash]
Earlier versions of this proposal used a flag within a ``vm_page_t``
structure to designate ESTP pages, but this necessitated unsharing
text pages across address spaces, which is far from ideal.
.. [#estp-why-madvise]
We propose the use of ``madvise()`` rather than ``mprotect()`` or
``mmap()`` to avoid chewing up bits in flags fields. The argument to
``madvise()`` is a selector rather than a composite, and so we feel
more comfortable reserving entries in that namespace.
Kernel Changes
~~~~~~~~~~~~~~
Upon an event of interest to the ESTP machinery (preemption, signal
delivery, or unhandled fault), the kernel should look up the user
program counter's page in this set. If found, the kernel must traverse
the descriptors at the front of the ESTP page; it first loads the 0th
word of the page to read the ESTP descriptor table metadata, and then
scans the array of descriptors. We suspect it will be simplest to use
the ``fu[e]\*`` functions to load each descriptor in a loop rather than
wire the page. We are requiring the ESTP descriptors to be sorted in a
way that is amenable to binary search, but this is unlikely to hugely
matter and may in fact be detrimental given the expected small size of
the array.
Using ESTPs Within the Kernel
-----------------------------
This mechanism may also be of interest to the kernel itself. The various
``copy{in,out}`` and ``f[su]\.*`` functions may be placed on an ESTP and
so avoid the need to register and unregister themselves with the
``pcb_onfault`` mechanism. The O(1) lookup in the fault path is also
significantly nicer than the proposed "fault lookaside table" of `this
earlier pull request
<https://github.com/CTSRD-CHERI/cheribsd/pull/1159>`__.
Care would be required to ensure that ESTPs are not used within the code
resolving ESTPs or in code that holds the same locks, especially if
``PREEMPTION`` descriptors are desired. However, the kernel has other,
well-trodden mechanisms for interacting with its own preemptive
scheduler, and it has little use for ``INTERRUPT`` descriptors'
responses, and so in-kernel ESTPs are likely sensibly limited to
``FAULT`` descriptors. At least the FreeBSD kernel is already
architected to avoid triggering non-resolvable faults while accessing
its own internal data structures, so the fault-handling code is
generally free to acquire the pmap locks required for ESTP resolution.
As an aside, if ``PREEMPTION`` descriptors do, in fact, turn out to be
desirable in the kernel, then the kernel should manipulate its own ESTP
hash set with preemption disabled. That way, the preemptive
switcher can read from this set without what would amount to recursive
access.
Checking Desiderata
-------------------
1. ESTP works with disparate software within the same process: each
consumer of ESTP provides its own page(s). Registration with the
kernel need be done only at load time, and deregistration is implicit
at unload time, when the pages are unmapped.
2. There is no instruction overhead for entering or exiting an ESTP
sequence, though there is cost in branching to and from. See below.
3. Preemption must test whether the page holding the preempted program
counter is an ESTP and, if so, whether there is a ``PREEMPTION``
descriptor covering that address. This is O(1), if with a larger than
desirable constant; see below.
4. It is not possible to execute instructions on an ESTP page without
also becoming subject to the ESTP descriptors on that page.
Reasons All is Not Obviously Well
---------------------------------
The use of page-aligned descriptor tables complicates code generation
and linkage, probably limiting ESTPs to hand-written assembler with
implied entry and exit costs of branching to and from these exceptional
sequences. It is possible to have code not covered by ESTP descriptors
within an ESTP page, so it may be possible to amortize the branch costs
by moving tight loops around the exceptional sequence, but again, this
relies on hand-written assembler. (On the other hand, in some use cases
which ESTP would replace, the existing approaches also use hand-written
assembler and incur these branch costs.)
This drawback stems from the central role played by pages in this
design. Fundamentally, we are exploiting page *metadata* to recognize
exceptional sequences and page *alignment* as a convenient map from
program counters to the set of possible governing descriptors. It is
possible that superior recognition and mapping strategies exist, and in
some sense the proposal here is severable into its notion of descriptors
tightly coupled to particular code sequences and the particular layout
of memory that performs said coupling.
Worked Examples
===============
General Interlocks With Preemption
----------------------------------
One can use ``PREEMPTION`` ESTP descriptors to cover the existing rseq
use cases, and the only real challenges are...
- the necessity of moving all restartable sequence(s) onto one or more
dedicated pages, together with their ESTP descriptors and
- registering the ESTP page(s) with the kernel sufficiently early in
program startup, which can be done with ctors for most code or
special-purpose code in rtld should it use ESTPs in startup.
These do not seem insurmountable.
.. _snmalloc-pagemap-race:
Revoker-interlocked Type Stability in snmalloc PageMap
------------------------------------------------------
Racing double-frees in snmalloc pose an especially interesting
challenge, since the *type* of data in the PageMap's MetaEntry for a
given region may change once the owning thread (which is uniquely
allowed to mutate the MetaEntry) has learned that there are no *live*
objects within that range. In the case of a racing double-free, this
type change may happen even while another thread is accessing the
MetaEntry. If a chain of pointers from the MetaEntry must be followed
early in deallocation (say, to access the associated arena or shadow
capabilities), then this leaves racing deallocations by threads that do
not own the region in question open to UAR of the allocator's metadata.
The race can almost be closed, modulo a small residual TOCTTOU gap, by
permitting differently typed reuse of a range's MetaEntry only after
revocation has expunged all capabilities into that range, and by having
deallocation check the revokedness of its argument. The TOCTTOU gap may
then be closed by an idempotent sequence, guarded by a ``PREEMPTION``
ESTP descriptor, consisting of this check and the chained loads, as the
beginning of a revocation epoch will act like a preemption event.
snmalloc Safebox Entry on Morello
---------------------------------
Let's assume that we control both sides of the function entry. In
practice, we will want to (or have to) adhere to the domain-crossing
ABI, but for the moment there isn't one, so we can pretend.
On the one hand, snmalloc is an ideal stress test of this machinery
since it's going to exercise basically everything. On the other hand,
it's a little tough because either we're going to have to teach the
compiler how to generate these exceptional sequences or we're going to
have to (re)write a whole lot of what is currently a very pleasantly
portable C++ codebase into assembler.
A caller that considers snmalloc trusted will...
1. fetch its sealed Allocator ``this`` pointer from TLS into c8,
2. fetch its sealed code pointer to, say, dealloc(), into c9, and
3. use BLRS (pair) to jump and unseal both (BLRS c29, c8, c9).
BLRS will store back a sentry to the link register (c30), unseal c8 into
c29, and unseal c9 into pc. (Despite the surface syntax, c29 is a fixed
choice of the Morello architecture, like c30, and is not a selectable
operand of BLRS). Importantly, c8 and c9 are not clobbered and are
already defined by the aarch64 ABI as caller-save.[#safebox-abi-regs]_
.. [#safebox-abi-regs]
It may make more sense to use c16 and c17, reserved by the ABI for
call veneers?
Immediately upon entry to the snmalloc safebox, we must be defensive
against both faults and interrupts (async signals), as either event
risks exposing the unsealed interior (but preemption is not a concern).
If we are careful about the program we write, we can decide that faults
are impossible and should be fatal, bringing down the entire process.
However, async signals do not offer such a convenient story. This early
in the entry, we simply want async signals to have happened *before*
domain transition rather than after, and so we need to arrange to rewind
time.
Our entry-point must fit all its storage requirements into the
caller-saved registers (other than c8 and c9, so c10 through c15) and/or
the intra-procedure scratch registers (c16 and c17). Its ``INTERRUPT`` ESTP
descriptor must direct the kernel to clear c10 through c17 and c29, to erase
its unsealed capabilities and any progeny thereof, and must point the
response at a single (4-byte) instruction: BRS c29, c8, c9. Note that
this instruction does not alter c30, so the state of the machine after
its execution is equivalent to the state after the *caller*\ 's BLRS
c29, c8, c9: the link register continues to be a sentry capability back
to the caller. The kernel's derivation and bounding of the exposed
interrupted PCC to exactly the 4-byte span of the BRS instruction means
that sigreturn() can execute this instruction without the signal handler
getting its hands on any PC-relative state from snmalloc's safebox.
Unfortunately, extending the ESTP machinery to cover the entirety of
snmalloc's code is tricky. Therefore, we expect the entry trampoline to
exit from ESTP's cover once it has...
- acquired a stack using an atomic xchg operation with a known offset
within the Allocator structure. This should store NULL; if it reads
NULL, the API is being abused and a nominally single-threaded
Allocator has been invoked by two threads at once.
- arranged for the
`sigfastblock(2) <https://www.freebsd.org/cgi/man.cgi?query=sigfastblock>`__
mechanism to be in effect, deferring async signals until exit from
the safebox.[#sigfastblock-recurse]_
ESTP may be used again, though for ``FAULT`` or ``PREEMPTION`` handling
only, when snmalloc traverses some internal data structures.
``PREEMPTION`` allows snmalloc to cheaply synchronize with the revoker,
ensuring that the ESTP-covered sequence(s) run to completion within a
revocation epoch. It is unlikely that ESTP is useful to exit the
safebox: the exit path can restore the stack, clear all sensitive state,
and then drop the sigfastblock guard against async signals before
returning to the caller.
.. [#sigfastblock-recurse]
sigfastblock is recursive, and so even in the case that snmalloc
enters rtld, it should all work out.
Approximating Conditional Load Instructions
-------------------------------------------
It does not come up often, but there are cases where one is working with
elevated authority on behalf of some other agent and so one may want to
perform a load through capability A only if capability B is also tagged
and permissive of loads. ESTP ``FAULT`` descriptors permit emulation of such a
conditional load instruction: the ``FAULT`` response path can check
capability B and respond appropriately, by continuing execution at the
next instruction or jumping out of the exceptional sequence or, if B is
indeed still tagged, executing a copy of the faulting load instruction
without ESTP guard to commit to the fault (!).
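As a concrete sketch, in a hypothetical CHERI-RISC-V-flavoured
assembler (the mnemonics and the ESTP record placement here are
illustrative, not normative), the guarded load and its ``FAULT``
response might look like:

```asm
    # ca0 = capability A (load target), ca1 = capability B (the condition).
    # A single ESTP FAULT record covers exactly the clc at "guarded",
    # with its response offset pointing at "response".
guarded:
    clc     ct0, 0(ca0)      # the "conditional" load through A
resume:
    ...                      # success: fall through with the loaded value

response:                    # reached only if the guarded clc faults
    cgettag t1, ca1          # inspect capability B
    beqz    t1, skip         # B untagged: treat the load as a no-op
    clc     ct0, 0(ca0)      # B still tagged: replay the load without
                             # ESTP cover, committing to the fault (!)
skip:
    j       resume           # or jump out of the exceptional sequence
```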

Similarly, one can use ESTP ``FAULT`` records to approximate conditional
load instructions that test the LL reservation rather than the
permissiveness of a second capability. (Modern architectures are
generally forgiving of additional memory operations during LL/SC, so
long as those additional operations avoid the reservation.) Here, an
ordinary load instruction sits under an ESTP ``FAULT`` record whose
response offset points at code which breaks the reservation and then
resumes execution at the instruction after the load. The result
under-approximates the set of faults that a true "conditional load"
could deliver, in that it treats every fault as a no-op due to the
reservation being broken (breaking it explicitly if needed), but it may
still suffice.
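A sketch of this reservation-testing variant, again in illustrative
RISC-V assembler. The deliberately failing ``sc`` relies on the RISC-V
rule that any ``sc`` invalidates the hart's reservation, and that an
``sc`` to an address outside the reservation set must fail:

```asm
    # a0 = address under LL/SC, a1 = address of the "conditional" load.
retry:
    lr.d    t0, (a0)         # open the reservation
    ld      t1, 0(a1)        # guarded by an ESTP FAULT record
after_load:
    ...                      # use t1; avoid touching the reservation
    sc.d    t2, t0, (a0)     # fails if the reservation was broken
    bnez    t2, retry

response:                    # FAULT response for the guarded ld
    sc.d    zero, zero, (sp) # sc to an unreserved address: guaranteed
                             # to fail, but it breaks the reservation
    j       after_load       # resume; the sc.d above will now fail
```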

Alternative Proposals and Rebuttals
===================================

In-Band NOPs
------------

An earlier proposal suggested placing NOPs after instructions to encode
what this proposal encodes in the ESTP descriptors. Assessing the
desiderata...
1. This proposal works with modular software.
2. There is no entry or exit cost; exceptional sequences can be inlined
anywhere, given a sufficiently rich taxonomy of NOPs.
3. It's very easy for the fault and preemption paths to test for the
need for special handling (a single copyin of the next instruction).
4. The metadata is overtly intrinsic to the instruction stream.
The advantages of such a scheme over ESTP are...
- Nothing funky going on with pages and alignment
- Very short instruction sequences need very little metadata, possibly
less than an ESTP descriptor!
- Reading a virtual memory location adjacent to PC is faster and easier
than reading per-page metadata.
while the disadvantages are...
- The need for architectural NOPs reserved for software's
introspection. Some architectures, like RISC-V, which have a large
swath of NOPs available as HINTs, may be willing to perform such
carveouts in their ISAs; some, like Arm, seem much less likely to do
so without a very compelling use, and it's not clear that this rises
to that threshold.
- Increased I$ pressure from these NOPs (though in many cases the
  impact would be less than or equal to the existing I$ cost of stores
  to perform sequence entry and exit, never mind the additional D$
  operations).
Other points of note:
- If too few NOPs are available to encode the entirety of the response,
  indirection may be an option: have the in-band NOPs point at
  something like an ESTP descriptor (or descriptor table). An even more
  extreme form would be to use a single sigil NOP, either

  - followed by an unconditional jump over the response descriptor(s), or
  - used to indicate that a full ESTP descriptor table exists at the
    start of this page (!) and should be consulted for the instruction
    in question.
- This scheme is, like ESTP, viable for both user and kernel code.
- The existing proposal for this design placed the NOPs after the
instruction in question. It may be slightly better to place them
before, as that would not require the kernel to parse out the
instruction width.
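One possible in-band encoding, sketched in illustrative RISC-V
assembler. The choice of HINT instruction and the descriptor layout are
invented here for exposition; RISC-V reserves writes to ``x0`` as
HINT-encoding space, which is the "large swath of NOPs" alluded to
above:

```asm
    # A HINT-space NOP acts as the sigil; the jump keeps the inline
    # descriptor out of the executed path.  Placing all of this before
    # the guarded instruction spares the kernel from having to decode
    # that instruction's width.
    slli    x0, x0, 1        # hypothetical reserved HINT: "ESTP sigil"
    j       1f               # skip the inline response descriptor
    .word   0x00000001       # invented encoding: response kind = FAULT
    .word   response - .     # offset to the response path
1:
    ld      t0, 0(a0)        # the guarded instruction itself
    ...

response:
    ...                      # the response path proper
```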

Look-Aside Tables for Exceptional Sequences in the Kernel
---------------------------------------------------------

Abandoning the rseq use case of synchronizing user code against
preemption and focusing, instead, only *within the kernel* and on
synchronous faults and asynchronous interrupts, an earlier proposal
generated an out-of-band linkerset of structures that approximate the
ESTP descriptors described herein. Assessing the desiderata...

1. Modularity was achievable but relied on being able to find the
linkerset for each module. This was just shy of straightforward in
the targeted use case of the kernel.
2. There is no entry or exit cost and the sequences can be inlined
anywhere.
3. Because the mechanism focused only on faults, its only impact was in
the very slow-path case where the fault could not be transparently
handled by the kernel. Adding asynchronous event support would have
impacted signal delivery, a somewhat less rare event. The kernel has
other mechanisms to deal with preemption, and so its absence was not
seen as a problem.
4. The lookaside linkerset is reliably associated with the code of
exceptional sequences.
This mechanism probably cannot be extended either to user code or to
include preemption:
- It would be significantly harder in the case of a userspace program
to find the appropriate object's linkerset in the first place; the
kernel has very little idea about "objects" in this sense. The result
would be somewhat akin to Windows SEH, of either the x86 or x64
variety.
- The cost of finding and scanning the linkerset, on every preemption
event especially, seems prohibitive.