Forked from nwf-msr/Exceptional Sequence Text Pages
Created
September 17, 2024 20:21
-
-
Save nwf/cbd790433ddaa3319060efc23c73874b to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
############################### | |
Exceptional-Sequence Text Pages | |
############################### | |
Introduction | |
============ | |
Occasionally, software wants to impose interesting out-of-band | |
constraints on small, "exceptional" sequences of instructions. Examples | |
of such constraints include needing to run... | |
- to completion without preemption: if thread scheduling takes this | |
sequence off-core, the sequence should be restarted, rather than | |
continued, upon scheduling resuming this thread. | |
- without asynchronous interruption, more generally than preemption, by | |
either restarting the sequence (as with preemption) or by *catching | |
the interruption* with specialized response. | |
- with specialized responses to *synchronous* faults. | |
Such sequences are generally short, hand-written in assembler, and have | |
few basic blocks. In general, the set of such sequences is static per | |
loadable object, but it is not necessarily constant across a *process's* | |
lifetime. It seems likely that, if there are more than zero such | |
sequences within a library or process, there will also be more than one. | |
Desiderata | |
========== | |
1. Multiple pieces of software, unknown to each other, within one | |
process should be able to avail themselves of such exceptional | |
sequences. This extends to dynamically loaded (or perhaps even JIT | |
generated) code. | |
2. These sequences are generally very short, perhaps as short as a | |
single instruction, and so entry and exit from exceptional sequences | |
should be as close as possible to 0-instruction overhead and, | |
especially, should avoid mutating memory. (The brevity of these | |
sequences means that we do not expect them to nest or to include | |
calls to other functions.) | |
3. Context switch happens very often and is already generally considered | |
"too expensive" as it stands, so testing *whether* the currently | |
executing code needs special response for preemption should be as | |
fast as possible, particularly in the (most common) case that the | |
answer is "no." | |
This is less of a concern for non-preemption events, such as | |
synchronous faults that cannot be transparently handled or | |
asynchronous interrupts which redirect control flow, as these events | |
are rare and, typically, already have fairly high response costs. | |
4. The exceptionalness of these sequences is intrinsic to the sequence | |
of instructions; it should not be possible to sometimes run the | |
instructions as part of an exceptional sequence and sometimes not. | |
This is perhaps merely aesthetic. | |
Existing Approaches | |
=================== | |
Idempotent Sequences, Historically | |
---------------------------------- | |
Apparently first introduced as "restartable atomic sequences" by | |
`Bershad et al. Fast Mutual Exclusion for Uniprocessors (1992) | |
<https://dl.acm.org/doi/pdf/10.1145/143371.143523>`__, these are regions | |
of code that run to completion entirely on one side or the other of | |
preemption. The two implementations discussed in that paper are Mach's, | |
which permits an address space to have exactly one restartable atomic | |
sequence, and Taos's, which permits multiple sequences but recognizes | |
them by parsing the instruction stream. | |
On quite old ARM processors, a popular trick was to use idempotent | |
sequence machinery to implement atomic operations. See, for example, | |
`Linux's documentation of their implementation | |
<https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt>`__. | |
Existing Special Handling within the FreeBSD Kernel | |
--------------------------------------------------- | |
Within the FreeBSD kernel, there are two mechanisms used for specialized | |
subsets of our aim here. | |
"pcb_onfault" | |
~~~~~~~~~~~~~ | |
Each thread control block contains a "``pcb_onfault``" field. If this is | |
not ``NULL``, the processor is running in the kernel, and the system | |
encounters a nonresolvable fault, control flow will be redirected to | |
this address. This mechanism is not used to synchronize with preemption | |
or for asynchronous events. That aside, considering our desiderata, | |
1. This mechanism supports modular software. | |
2. This mechanism requires two stores per executed exceptional sequence, | |
and in many cases those sequences are shorter than this combined | |
entry and exit cost. In other cases, the entry and exit costs are | |
amortized and include many instructions that either cannot fault or | |
for which a fault should indeed panic the machine. | |
3. This mechanism does not interact with context switch. It interacts | |
only with the nonresolvable fault path, on which it performs a load, | |
a conditional branch, and a possible store, making it quite | |
lightweight. | |
4. It is, in principle, possible to bypass the entry instructions for an | |
exceptional sequence, but it is hard to imagine what utility would be | |
gained. | |
Hard-coded Program Counter Address Comparisons | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
In the platform support code for arm64, the low-level "external abort" | |
handling machinery involves a function named "``test_bs_fault``" that | |
explicitly compares the faulting program counter to the address of eight | |
symbols from within the kernel. | |
1. This is patently non-modular, requiring a priori knowledge of all | |
code needing special handling. | |
2. Entry and exit are, indeed, free, and these symbols, once added to | |
the set considered, can be located anywhere in the kernel. | |
3. This mechanism, similarly, does not interact with context switch. It | |
sits entirely on a very rare path used generally while probing | |
hardware and not at all in the steady state. | |
4. The special handling is tied to the instruction's location and so | |
cannot be severed. | |
Linux's rseq API | |
---------------- | |
The kernel code is available in | |
- `linux/uapi/rseq.h <https://github.com/torvalds/linux/blob/master/include/uapi/linux/rseq.h>`__ | |
describes the userspace interface | |
- `rseq.c <https://github.com/torvalds/linux/blob/master/kernel/rseq.c>`__ | |
contains the bulk of the associated kernel code | |
- `membarrier.c <https://github.com/torvalds/linux/blob/master/kernel/sched/membarrier.c>`__ | |
also has grown support for restarting rseqs, either generally or on a | |
specific CPU. | |
See `The 5-year journey to bring restartable sequences to Linux - | |
EfficiOS <https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/>`__ | |
and `Restartable sequences restarted | |
[LWN.net] <https://lwn.net/Articles/697979/>`__ for writeups that cover | |
some older editions of the API, as well. | |
Each thread has at most one "``struct rseq``" associated with it, though | |
repeated registrations are *counted*.[#rseq-sign]_ Each "``struct | |
rseq``" points to at most one "``struct rseq_cs``" which describes the | |
code region for the restartable sequence, giving its start, end, and | |
abort locations in the executable. Software must store into the ``struct | |
rseq_cs`` pointer within its registered ``struct rseq`` before entering | |
each restartable sequence and should perform an additional store to | |
clear the pointer after exiting a restartable sequence. | |
The kernel consults the registered ``struct rseq`` only on preemption, | |
migration, and signal delivery. Flags within the ``struct rseq`` (not ``struct | |
rseq_cs``!) can disable the control-flow transfer response of each event | |
source; this feature of the API is intended for use by debuggers. | |
Speaking of debuggers, ptrace has been extended to support reading the | |
current rseq registration. | |
.. [#rseq-sign] | |
Registrations are also "signed" with a 32-bit value provided to the | |
system call that must also be provided at un-registration and must | |
equal the 32-bit value prior to the rseq's abort address. Diagnostic | |
messages in the Linux kernel suggest that this is an attempt to | |
prevent attackers from clobbering an active ``struct req_cs`` | |
pointer. | |
Checking Desiderata | |
~~~~~~~~~~~~~~~~~~~ | |
1. By slightly extending the *ABI*, so that there is a well-known name | |
for *the* "``struct rseq``" managed by TLS machinery, multiple | |
disparate pieces of software can use this mechanism. However, | |
changing the kernel's notion of registered "``struct rseq``" is an | |
ambient authority (a system call) guarded by a 32-bit number that | |
must also be shared between the disparate pieces of software, rather | |
like the "secret" address space layout randomization (ASLR) slide. | |
2. This API requires one store on the entry to a rseq-guarded sequence: | |
the pointer to the "``struct rseq_cs``" must be stored into the | |
"``struct rseq``". Because the "``struct rseq_cs``" describes not | |
just the restart address but also the start and end of the sequence, | |
it is not strictly necessary to clear this pointer at the end of the | |
sequence, but nevertheless the API documentation suggests doing so. | |
Even one store is more than we would like to pay, but it is | |
apparently tolerable. | |
3. With this API, preemption now must | |
a. Test whether the thread has a "``struct rseq``" associated with it | |
and, if so, | |
b. Load the "``struct rseq_cs``" pointer, | |
c. Update the event counters within the "``struct rseq``", | |
d. If there is a "``struct rseq_cs``" indicated, load it, | |
e. Check whether the preempted address is within the indicated | |
bounds, and, if so | |
f. Replace the preempted address with the indicated restart address. | |
This is fairly lightweight: a few loads, a store, and a few | |
conditional branches. | |
4. It is, in principle, possible to bypass the store of the "``struct | |
rseq_cs``" pointer and still execute the instructions of the | |
sequence, disassociating them from their intended rseq use, but, | |
again, it is not clear what advantage this would give. | |
FreeBSD's rseq proposal | |
----------------------- | |
See `D32505 Add rseq(2) <https://reviews.freebsd.org/D32505>`__. It is a | |
direct port of the Linux API. | |
Windows' Structured Exception Handling (SEH) | |
-------------------------------------------- | |
Windows SEH is useful for managing synchronous faults and may be useful | |
for asynchronous interrupts in principle, but it is far too heavy weight | |
for interacting with preemption a la rseq. In any case, at least as of | |
x64, it uses lookaside tables to drive its response and seems to require | |
ahead of time registration of loaded code segments | |
(`RtAddFunctionTable <https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-rtladdfunctiontable>`__) | |
or at least callbacks that can generate those tables on demand | |
(`RtInstallFunctionTableCallback <https://docs.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-rtlinstallfunctiontablecallback>`__). | |
As faults are (probably correctly) presumed rare, the expense of all the | |
table walks was seen as preferable to x32's much more stack-based | |
mechanism. | |
x32 SEH used[#win32-seh]_ an ABI-defined offset (specifically, 0) within the | |
per-thread control block reachable via the %fs segment selector to point, in | |
turn, at a chain of exception handler registration records. These registration | |
records were dynamically constructed on the stack and were pushed and popped | |
from the chain. Given the rarity of exceptions, it is understandable that this | |
runtime cost was seen as too high. | |
.. [#win32-seh] | |
At least, as far as I understand `A Crash Course on the Depths of | |
Win32 Structured Exception | |
Handling <http://bytepointer.com/resources/pietrek_crash_course_depths_of_win32_seh.htm>`__. | |
New Proposal | |
============ | |
To enable both O(1) runtime costs per event *and* a dynamic multiplicity | |
of idempotent sequences within a program, I propose to introduce | |
**Exceptional-Sequence Text Pages** (ESTP).[#estp-name]_ Such a page | |
begins with a structured data header (defined below) and then contains | |
the instructions of these exceptional sequences. | |
.. [#estp-name] | |
I'm more than open to suggestions for better names. | |
Header Layout | |
------------- | |
Each such page (the size of pages being defined by the platform ABI) | |
begins with a descriptor table, which is composed of an ABI-defined | |
metadata word followed by an array of ESTP descriptors. The metadata | |
word includes the number of descriptors on this | |
page.[#estp-header-limit]_ | |
Each descriptor is itself an ABI-defined structure, but is likely a | |
64-bit value. The fields of a descriptor are as follows. (Bit-widths | |
given in parentheses are suggestions for ABIs with paging granularity of | |
up to 64KiB bytes; for 4KiB page architectures, the top four bits of | |
each offset would also be zero.) | |
+-------------+------------+-------------+-------------+-------------+ | |
| Event | Zeros (13) | Start | End offset | Response | | |
| selector | | offset (16) | (16) | offset (16) | | |
| (3) | | | | | | |
+-------------+------------+-------------+-------------+-------------+ | |
The start (resp. end) offset points to the first byte of the first | |
instruction within (resp. after) the sequence on this page; responses | |
again are offsets that point to instructions with the current page. | |
Because it may enable optimization in the kernel and is easy to do while | |
constructing the ESTP page, the array of descriptors must be sorted | |
lexicographically using the field order shown. | |
The event selector is one of | |
+---------------------+-----------------------------------------------+ | |
| Selector (value) | Description | | |
+=====================+===============================================+ | |
| ``ESTP_PREEMPTION`` | Involuntary preemption (time quantum expired | | |
| (0) | or thread synchronization) | | |
+---------------------+-----------------------------------------------+ | |
| ``ESTP_INTERRUPT`` | Attempted asynchronous interrupt / signal | | |
| (1) | delivery | | |
+---------------------+-----------------------------------------------+ | |
| ``ESTP_FAULT`` | Synchronous fault that cannot be | | |
| (2) | transparently resolved | | |
+---------------------+-----------------------------------------------+ | |
When the table is scanned, the first descriptor that matches a given | |
(PC, event) pair, if any, will be the one chosen for further | |
processing.[#estp-span-overlap]_ | |
.. [#estp-header-limit] | |
And we should perhaps insist that there is some sensible upper limit | |
on the number of descriptors per page; a 4KiB page can, strictly, fit | |
512 descriptors of the suggested 64-bit format, but that leaves no | |
room for code. | |
.. [#estp-span-overlap] | |
Span overlap is likely best kept to equal spans and differing event | |
selectors, but this is not required. | |
Aside: An Alternative Table Layout | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
It is also possible that we could have the header word describe the | |
number of entries for each selector. This would remove the selector | |
from the descriptors themselves and would slightly optimize the case | |
that there were no descriptors of a particular event. On the other | |
hand, it limits the number of events we can use as triggers to fairly | |
few; perhaps there are only a few we want, though, so this is the right | |
idea. | |
Events and Responses | |
-------------------- | |
The various event selectors request transfer of control flow | |
thus:[#estp-cheri-monotonicity]_ | |
- For a ``PREEMPTION`` descriptor, control flow will move to the | |
response offset upon resumption following a preemption event or | |
similar *invisible* interrupt.[#estp-preemption-visibility]_ No other | |
changes to the register file will be made. | |
It is not required, but one would expect it to be common, that the | |
response offset is equal to the start offset. In this case, the | |
exceptional code sequence must behave in an *idempotent* manner; | |
typically, the last instruction within the sequence will be a store to | |
"commit" the aggregate effect of the sequence. | |
``PREEMPTION`` can be useful for accessing per-CPU data structures using | |
such idempotent sequences, as cross-core migration necessarily | |
requires preemption. | |
Presently, we **do not** propose to check the user instruction pointer | |
on preemption events that happen inside system calls (just as we do | |
not check return addresses within the user program). That is, | |
sequences associated with ``PREEMPTION`` descriptors should not make | |
system calls. | |
- For a ``FAULT`` descriptor, control flow moves to the response offset | |
if any instruction within the associated sequence triggers a | |
synchronous fault. No other changes to the register file are made; if | |
the identity of the faulting instruction matters, multiple descriptors | |
and response offsets should be used. | |
``FAULT`` descriptors are intended for constructing idempotent | |
sequences that involve memory loads from unsynchronized memory. If a | |
sequence's commitment point can "retroactively" validate that these | |
loads have produced a sensible value, the lack of synchronization may | |
not be a concern. | |
- For an ``INTERRUPT`` descriptor, control flow moves to the response | |
offset whenever the system *resumes* from a *visible*, asynchronous | |
interrupt (such as a UNIX signal with a signal handler | |
registered)[#estp-signal-visibility]_ while the thread's program | |
counter was within the sequence spanned by this | |
descriptor.[#estp-interrupt-address]_ | |
The word(s) immediately prior to the response offset includes a | |
bitmask of registers to clear before the system *invokes* the | |
asynchronous interrupt handler as well as the length of the response | |
code.[#estp-interrupt-length]_ Tentatively, a suggested layout for a | |
system with 32 registers could be | |
+------------+------------------------------------+-----------------------------+ | |
| Zeros (16) | Response text length in bytes (16) | Register clear bitmask (32) | | |
+------------+------------------------------------+-----------------------------+ | |
The intended use case for ``INTERRUPT`` descriptors is to be able to | |
front-run the system's usual signaling mechanism, to allow, for | |
example, a lightweight *safebox* to expunge sensitive contents from | |
its register file before execution exits the safebox. It is unlikely | |
that this mechanism is sufficient for sufficiently elaborate | |
safeboxes, but it should allow the safebox to reliably engage | |
additional interrupt-deferring mechanisms to cover the remainder of | |
its execution.[#morello-safebox-interrupts]_ | |
.. [#estp-cheri-monotonicity] | |
On CHERI architectures, to ensure monotonicity of execution, this | |
transfer of control flow must be done by deriving a capability from | |
within the current program counter capability. | |
.. [#estp-preemption-visibility] | |
Invisible interrupts include broadcast synchronization events like | |
CheriBSD's thread_single() mechanism which is used, for example, by | |
the Cornucopia implementation of CHERI revocation. Thus, | |
``PREEMPTION`` descriptors may also be used to ensure that the | |
exceptional sequence completes, after some number of restarts, | |
entirely within one revocation epoch. | |
.. [#estp-signal-visibility] | |
On UNIX, one can treat SIG_DFL signals as invisible -- they generally | |
either perform no action or kill the program and in any case do not | |
expose the thread's register file -- and even "old API" signal | |
handlers do not, by a strict reading of POSIX, gain access to the | |
registers, either. | |
.. [#estp-interrupt-address] | |
That is, the interrupt/signal/... handler will receive the response | |
address as the address of the interrupted instruction rather than the | |
address of the truly interrupted instruction within the exceptional | |
sequence. | |
.. [#estp-interrupt-length] | |
On CHERI, the length of the response code will be used when | |
constructing the PCC exposed to the interrupt handler: the PCC | |
exposed to the handler will have its address set to the indicated | |
response address and its length set to the indicated length, thereby | |
losing access to PC-relative data outside this region. | |
.. [#morello-safebox-interrupts] | |
On CHERI, when entering a safebox using, for example, Morello's BLR, | |
BLRR, BLRS, BR, BRS, LDPBLR, LDPBR, RET, RETR, or RETS instructions, | |
the unsealed PCC (and possibly IDC) are *immediately* available in | |
the register file and subject to capture by asynchronous event | |
handlers (for example, within the ucontext_t structures passed to | |
CheriBSD signal handlers). The domain-call ABI should ensure that the | |
operands of these instructions are preserved by their action so that | |
the sealed forms are available within ``INTERRUPT`` responses to | |
facilitate retrying domain entry. | |
Registering an ESTP With the Kernel | |
----------------------------------- | |
We wish there were general per-virtual-page state in the FreeBSD VM | |
subsystem, as objects look to be too coarse and physical pages are | |
shared across address spaces. For the moment, then, we propose to use a | |
per-address-space hash set of registered ESTP pages.[#estp-why-hash]_ | |
Two new ``madvise()`` behavior values can insert | |
(``MADV_ESTP_REGISTER``) or remove (``MADV_ESTP_UNREGISETER``) one or | |
more page(s) from this set.[#estp-why-madvise]_ | |
User software using ``ESTP`` must arrange to call ``madvise()`` | |
``MADV_ESTP_REGISTER`` on the appropriate virtual pages before executing | |
any exceptional-sequence code. This is likely readily done with the | |
existing ELF ``ctor`` machinery. One could imagine also teaching the | |
image activator about ESTP pages, but this seems excessive. | |
.. [#estp-why-hash] | |
Earlier versions of this proposal used a flag within a ``vm_page_t`` | |
structure to designate ESTP pages, but this necessitated unsharing | |
text pages across address spaces, which is far from ideal. | |
.. [#estp-why-madvise] | |
We propose the use of ``madvise()`` rather than ``mprotect()`` or | |
``mmap()`` to avoid chewing up bits in flags fields. The argument to | |
``madvise()`` is a selector rather than a composite, and so we feel | |
more comfortable reserving entries in that namespace. | |
Kernel Changes | |
~~~~~~~~~~~~~~ | |
Upon an event of interest to the ESTP machinery (preemption, signal | |
delivery, or unhandled fault), the kernel should look up the user | |
program counter's page in this set. If found, the kernel must traverse | |
the descriptors at the front of the ESTP page; it first loads the 0th | |
word of the page to read the ESTP descriptor table metadata, and then | |
scans the array of descriptors. We suspect it will be simplest to use | |
the ``fu[e]\*`` functions to load each descriptor in a loop rather than | |
wire the page. We are requiring the ESTP descriptors to be sorted in a | |
way that is amenable to binary search, but this is unlikely to hugely | |
matter and may in fact be detrimental given the expected small size of | |
the array. | |
Using ESTPs Within the Kernel | |
----------------------------- | |
This mechanism may also be of interest to the kernel itself. The various | |
``copy{in,out}`` and ``f[su]\.*`` functions may be placed on an ESTP and | |
so avoid the need to register and unregister themselves with the | |
``pcb_onfault`` mechanism. The O(1) lookup in the fault path is also | |
significantly nicer than the proposed "fault lookaside table" of `this | |
earlier pull request | |
<https://github.com/CTSRD-CHERI/cheribsd/pull/1159>`__. | |
Care would be required to ensure that ESTPs are not used within the code | |
resolving ESTPs or in code that holds the same locks, especially if | |
``PREEMPTION`` descriptors are desired. However, the kernel has other, | |
well-trodden mechanisms for interacting with its own preemptive | |
scheduler, and it has little use for ``INTERRUPT`` descriptors' | |
responses, and so in-kernel ESTPs are likely sensibly limited to | |
``FAULT`` descriptors. At least the FreeBSD kernel is already | |
architected to avoid triggering non-resolvable faults while accessing | |
its own internal data structures, so the fault-handling code is | |
generally free to acquire the pmap locks required for ESTP resolution. | |
As an aside, if ``PREEMPTION`` descriptors do, in fact, turn out to be | |
desirable in the kernel, then the kernel should manipulate its own ESTP | |
hash table set with preemption disabled. That way, the preemptive | |
switcher can read from this set without what would amount to recursive | |
access. | |
Checking Desiderata | |
------------------- | |
1. ESTP works with disparate software within the same process: each | |
consumer of ESTP provides its own page(s). Registration with the | |
kernel need be done only at load time, and deregistration is implicit | |
at unload time, when the pages are unmapped. | |
2. There is no instruction overhead for entering or exiting an ESTP | |
sequence, though there is cost in branching to and from. See below. | |
3. Preemption must test whether the page holding the preempted program | |
counter is an ESTP and, if so, whether there is a ``PREEMPTION`` | |
descriptor covering that address. This is O(1), if with a larger than | |
desirable constant; see below. | |
4. It is not possible to execute instructions on an ESTP page without | |
also becoming subject to the ESTP descriptors on that page. | |
Reasons All is Not Obviously Well | |
--------------------------------- | |
The use of page-aligned descriptor tables complicates code generation | |
and linkage, probably limiting ESTPs to hand-written assembler with | |
implied entry and exit costs of branching to and from these exceptional | |
sequences. It is possible to have code not covered by ESTP descriptors | |
within an ESTP page, so it may be possible to amortize the branch costs | |
by moving tight loops around the exceptional sequence, but again, this | |
relies on hand-written assembler. (On the other hand, in some use cases | |
which ESTP would replace, the existing approaches also use hand-written | |
assembler and incur these branch costs.) | |
This drawback stems from the central role played by pages in this | |
design. Fundamentally, we are exploiting page *metadata* to recognize | |
exceptional sequences and page *alignment* as a convenient map from | |
program counters to the set of possible governing descriptors. It is | |
possible that superior recognition and mapping strategies exist, and in | |
some sense the proposal here is severable into its notion of descriptors | |
tightly coupled to particular code sequences and the particular layout | |
of memory that performs said coupling. | |
Worked Examples | |
=============== | |
General Interlocks With Preemption | |
---------------------------------- | |
One can use ``PREEMPTION`` ESTP descriptors to cover the existing rseq | |
use cases, and the only real challenges are... | |
- the necessity of moving all restartable sequence(s) onto one or more | |
dedicated pages, together with their ESTP descriptors and | |
- registering the ESTP page(s) with the kernel sufficiently early in | |
program startup, which can be done with ctors for most code or | |
special-purpose code in rtld should it use ESTPs in startup. | |
These do not seem insurmountable. | |
.. _snmalloc-pagemap-race: | |
Revoker-interlocked Type Stability in snmalloc PageMap | |
------------------------------------------------------ | |
Racing double-frees in snmalloc pose an especially interesting | |
challenge, since the *type* of data in the PageMap's MetaEntry for a | |
given region may change once the owning thread (which is uniquely | |
allowed to mutate the MetaEntry) has learned that there are no *live* | |
objects within that range. In the case of a racing double-free, this | |
type change may happen even while another thread is accessing the | |
MetaEntry. If a chain of pointers from the MetaEntry must be followed | |
early in deallocation (say, to access the associated arena or shadow | |
capabilities), then this leaves racing deallocations by threads that do | |
not own the region in question open to UAR of the allocator's metadata. | |
The race can almost be closed, modulo a small residual TOCTTOU gap, by | |
ensuring that these type shifts occur only after revocation has expunged | |
all capabilities into a range before permitting differently typed reuse | |
of its MetaEntry and having deallocation check the revokedness of its | |
argument. The TOCTTOU gap may then be closed by use of an idempotent | |
sequence consisting of this check and the chained loads guarded under a | |
``PREEMPTION`` ESTP descriptor, as the beginning of a revocation epoch | |
will act like a preemption event. | |
snmalloc Safebox Entry on Morello | |
--------------------------------- | |
Let's assume that we control both sides of the function entry. In | |
practice, we will want to (or have to) ascribe to the domain-crossing | |
ABI, but for the moment there isn't one, so we can pretend. | |
On the one hand, snmalloc is an ideal stress test of this machinery | |
since it's going to exercise basically everything. On the other hand, | |
it's a little tough because either we're going to have to teach the | |
compiler how to generate these exceptional sequences or we're going to | |
have to (re)write a whole lot of what is currently a very pleasantly | |
portable C++ codebase into assembler. | |
A caller that considers snmalloc trusted will... | |
1. fetch its sealed Allocator this pointer from TLS into c8, | |
2. fetch its sealed code pointer to, say, dealloc(), into c9, and | |
3. use BLRS (pair) to jump and unseal both (BLRS c29, c8, c9). | |
BLRS will store back a sentry to the link register (c30), unseal c8 into | |
c29, and unseal c9 into pc. (Despite the surface syntax, c29 is a fixed | |
choice of the Morello architecture, like c30, and is not a selectable | |
operand of BLRS). Importantly, c8 and c9 are not clobbered and are | |
already defined by the aarch64 ABI as caller-save.[#safebox-abi-regs]_ | |
.. [#safebox-abi-regs] | |
It may make more sense to use c16 and c17, reserved by the ABI for | |
call veneers? | |
Immediately upon entry to the snmalloc safebox, we must be defensive | |
against both faults and interrupts (async signals), as either event | |
risks exposing the unsealed interior (but preemption is not a concern). | |
If we are careful about the program we write, we can decide that faults | |
are impossible and should be fatal, bringing down the entire process. | |
However, async signals do not offer such a convenient story. This early | |
in the entry, we simply want async signals to have happened *before* | |
domain transition rather than after, and so we need to arrange to rewind | |
time. | |
Our entry-point must fit all its storage requirements into the | |
caller-saved registers (other than c8 and c9, so c10 through c15) and/or | |
the intra-procedure scratch registers (c16 and c17). Its ``INTERRUPT`` ESTP | |
descriptor must direct the kernel to clear c10 through c17 and c29, to erase | |
its unsealed capabilities and any progeny thereof, and must point the | |
response at a single (4-byte) instruction: BRS c29, c8, c9. Note that | |
this instruction does not alter c30, so the state of the machine after | |
its execution is equivalent to the state after the *caller*\ 's BLRS | |
c29, c8, c9: the link register continues to be a sentry capability back | |
to the caller. The kernel's derivation and bounding of the exposed | |
interrupted PCC to exactly the 4-byte span of the BRS instruction means | |
that sigreturn() can execute this instruction without the signal handler | |
getting its hands on any PC-relative state from snmalloc's safebox. | |
Unfortunately, extending the ESTP machinery to cover the entirety of | |
snmalloc's code is tricky. Therefore, we expect to see the entry | |
trampoline to exit out from ESTP's cover once it has... | |
- acquired a stack using an atomic xchg operation with a known offset | |
within the Allocator structure. This should store NULL; if it reads | |
NULL, the API is being abused and a nominally single-threaded | |
Allocator has been invoked by two threads at once. | |
- arranged for the | |
`sigfastblock(2) <https://www.freebsd.org/cgi/man.cgi?query=sigfastblock>`__ | |
mechanism to be in effect, deferring async signals until exit from | |
the safebox.[#sigfastblock-recurse]_ | |
ESTP may be used again, though for ``FAULT`` or ``PREEMPTION`` handling | |
only, when snmalloc traverses some internal data structures. | |
``PREEMPTION`` allows snmalloc to cheaply synchronize with the revoker, | |
ensuring that the ESTP-covered sequence(s) run to completion within a | |
revocation epoch. It is unlikely that ESTP is useful to exit the | |
safebox: the exit path can restore the stack, clear all sensitive state, | |
and then drop the sigfastblock guard against async signals before | |
returning to the caller. | |
.. [#sigfastblock-recurse] | |
sigfastblock is recursive, and so even in the case that snmalloc | |
enters rtld, it should all work out. | |
Approximating Conditional Load Instructions | |
------------------------------------------- | |
It does not come up often, but there are cases where one is working with | |
elevated authority on behalf of some other agent and so one may want to | |
perform a load through capability A only if capability B is also tagged | |
and permissive of loads. ESTP ``FAULT`` records permit emulation of such a | |
conditional load instruction: the ``FAULT`` response path can check | |
capability B and respond appropriately, by continuing execution at the | |
next instruction or jumping out of the exceptional sequence or, if B is | |
indeed still tagged, executing a copy of the faulting load instruction | |
without ESTP guard to commit to the fault (!). | |
Similarly, one can use ESTP ``FAULT`` records to approximate conditional | |
load instructions that test the LL reservation rather than the | |
permissiveness of a second capability. (Modern architectures are | |
generally forgiving of additional memory operations during LL/SC, so | |
long as those additional operations avoid the reservation.) This can be | |
approximated with an ordinary load instruction under an ESTP ``FAULT`` | |
record whose response offset points at code which breaks the reservation | |
and then resumes execution at the instruction after the load. The result | |
under-approximates the set of faults that "conditional load" could | |
deliver, in that it approximates all faults as no-ops due to the | |
reservation being broken (by explicitly breaking it if needed), but it | |
may still suffice. | |
Alternative Proposals and Rebuttals | |
=================================== | |
In-Band NOPs | |
------------ | |
An earlier proposal suggested placing NOPs after instructions to encode | |
what this proposal encodes in the ESTP descriptors. Assessing the | |
desiderata... | |
1. This proposal works with modular software. | |
2. There is no entry or exit cost; exceptional sequences can be inlined | |
anywhere, given a sufficiently rich taxonomy of NOPs. | |
3. It's very easy for the fault and preemption paths to test for the | |
need for special handling (a single copyin of the next instruction). | |
4. The metadata is overtly intrinsic to the instruction stream. | |
The advantages of such a scheme over ESTP are... | |
- Nothing funky going on with pages and alignment | |
- Very short instruction sequences need very little metadata, possibly | |
less than an ESTP descriptor! | |
- Reading a virtual memory location adjacent to PC is faster and easier | |
than reading per-page metadata. | |
while the disadvantages are... | |
- The need for architectural NOPs reserved for software's | |
introspection. Some architectures, like RISC-V, which have a large | |
swath of NOPs available as HINTs, may be willing to perform such | |
carveouts in their ISAs; some, like Arm, seem much less likely to do | |
so without a very compelling use, and it's not clear that this rises | |
to that threshold. | |
- Increased I$ pressure from these NOPs (though in many cases the | |
impact would be less than or equal to the existing I$ cost of stores | |
to perform sequence entry and exit, nevermind the additional D$ | |
operations). | |
Other points of note: | |
- If too few NOPs are available to encode the entirety of the response, | |
indirection may be an option: have the in-band NOPs point at | |
something like an ESTP descriptor (or descriptor table). Even more | |
extreme forms may be to use a single sigil NOP either | |
- followed by an unconditional jump over the response descriptor(s) | |
- to indicate that a full ESTP descriptor table exists at the start | |
of this page(!) and should be consulted for the instruction in | |
question. | |
- This scheme is, like ESTP, viable for both user and kernel code. | |
- The existing proposal for this design placed the NOPs after the | |
instruction in question. It may be slightly better to place them | |
before, as that would not require the kernel to parse out the | |
instruction width. | |
Look-Aside Tables for Exceptional Sequences in the Kernel | |
--------------------------------------------------------- | |
Abandoning the rseq use case for synchronizing user code against | |
preemption and, instead, focusing only *within the kernel* and on | |
synchronous faults and asynchronous interrupts, an earlier proposal | |
generated an out-of-band linkerset for structures that approximate the | |
ESTP descriptors described herein. Assessing the desiderata... | |
1. Modularity was achievable but relied on being able to find the | |
linkerset for each module. This was just shy of straightforward in | |
the targeted use case of the kernel. | |
2. There is no entry or exit cost and the sequences can be inlined | |
anywhere. | |
3. Because the mechanism focused only on faults, its only impact was in | |
the very slow-path case where the fault could not be transparently | |
handled by the kernel. Adding asynchronous event support would have | |
impacted signal delivery, a somewhat less rare event. The kernel has | |
other mechanisms to deal with preemption, and so its absence was not | |
seen as a problem. | |
4. The lookaside linkerset is reliably associated with the code of | |
exceptional sequences. | |
This mechanism probably cannot be extended either to user code or to | |
include preemption: | |
- It would be significantly harder in the case of a userspace program | |
to find the appropriate object's linkerset in the first place; the | |
kernel has very little idea about "objects" in this sense. The result | |
would be somewhat akin to Windows SEH, of either the x86 or x64 | |
variety. | |
- The cost of finding and scanning the linkerset, on every preemption | |
event especially, seems prohibitive. | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment