Skip to content

Instantly share code, notes, and snippets.

@jimblandy
Created September 21, 2016 01:23
Show Gist options
  • Save jimblandy/ab80fd54029fa5302fa8807e270ce03d to your computer and use it in GitHub Desktop.
Save jimblandy/ab80fd54029fa5302fa8807e270ce03d to your computer and use it in GitHub Desktop.
Header file from Minor Scheme for memory barriers
/* barriers.h --- interface to memory and code translation barriers */
#ifndef MN__GC_BARRIERS_H
#define MN__GC_BARRIERS_H
/* The following two functions provide the memory-coherence behavior
of mutexes, but without any blocking or communication.
On many modern multi-threaded systems, one thread may see writes to
memory by another thread occur in an order different than the one
they were written in. For example, suppose 'x' and 'y' are global
variables, both initially zero, shared between two threads A and B.
Suppose thread A executes the code:
x = 1;
y = 2;
while thread B executes the code:
y1 = y;
x1 = x;
Everyone understands that, since this code doesn't lock or unlock
any mutexes, or use other synchronization mechanisms, the two
threads' actions could be interleaved arbitrarily. For example, we
have no idea whether B will read x before or after A sets it to 1.
But what may be surprising is that reads and writes may not even
occur in the order in which they're written. For example, suppose
thread B ends up with y1 set to two:
- Since y1 == 2, thread A had clearly executed the 'y = 2'
assignment before thread B executed 'y1 = y'.
- Thread A should have executed 'y = 2' after 'x = 1', since
that's the order in which they're written.
- Thread B should have executed 'x1 = x' after 'y1 = y', since
that's the order in which they're written.
- Pulling all that together, we get the ordering:
'x = 1'; then 'y = 2'; then 'y1 = y'; then 'x1 = x'.
So x1 must be 1.
But on many modern systems, x1 may end up set to zero! Clearly,
things are not actually happening in the order we wrote them in.
Who is at fault? Everybody.
First, the compiler is free to rearrange most statements if doing
so wouldn't violate any apparent data dependency. The compiler
looking at A's code sees no relationship between the assignment to
x and the assignment to y, so it can do them in whatever order it
finds convenient. The only restrictions on the compiler's freedom
are:
- It must not re-order, delete, or add system calls. (And by
extension, it must not reorder calls to functions that it can't
prove won't perform system calls, etc. etc.)
- It must not re-order, delete, or add reads or writes of volatile
variables across sequence points (function calls; comma
operators; etc.)
But the code doesn't make any system calls, and x and y aren't
volatile, so the compiler can do as it pleases.
Another way to put this is that the compiler's reordering need only
respect data dependencies from a *single thread's* execution of the
code: it need not consider dependencies that arise from multiple
threads' interleaved execution. Our example shows that having
multiple threads running at a time can reveal rearrangements of
code that would be invisible in a single-threaded program.
Similar cases can arise with signal handlers that consult global
variables: the compiler has the right to exchange the order of two
stores, even though a signal handler might catch those stores
happening in the wrong order.
Second, each processor in an SMP system has an analogous freedom to
reorder loads and stores to the memory it shares with the other
processors, if doing so wouldn't break the code that processor is
running.
For example, suppose a program loads the value of some byte from
shared memory into a register. A processor might actually read and
cache the full block of 64 bytes (or whatever) containing that
referenced byte, and then answer loads of other bytes in that block
directly from its cache, without going to shared memory again.
From the shared memory's point of view, that processor is doing
those subsequent loads earlier than they are written in the
program: the processor has moved the loads earlier.
Or, going the other direction, a processor might do a store only to
its cache, and then write the modified words in that cache block
out to shared memory at some later time. But it might not do the
same to other references. From the shared memory's point of view,
the processor has moved that store after other memory references.
The point is, as long as the code running on that processor sees
all its loads and stores taking effect in the right order *from its
point of view*, the processor has met its obligations. If two
different processors need to see each other's stores in any
particular order, the code must include some sort of
architecture-specific memory synchronization instructions. To
extend the analogy between processors and compilers, these
synchronization instructions are the machine-language equivalent of
'volatile' declarations.
This freedom to rearrange is a good thing --- it allows the
compiler to generate much better code, and it allows the processor
to run much faster than it could otherwise, at the cost of
complicating the (hopefully) rare cases: signal handlers and
inter-thread communication. But it does make writing and reasoning
about multi-threaded programs harder. If your threads were only
allowed to communicate by assigning to variables, the guarantees
are so weak that inter-thread communication would be practically
impossible.
To make things simpler, POSIX says that you can assume that certain
functions "synchronize memory". For example, suppose a thread
acquires a mutex, makes some changes to a data structure, and then
releases the mutex:
- The compiler is *not* free to move loads or stores across the
mutex operations, even though there's no apparent data dependency
between the mutex operation and whatever tree / hash table /
cheese danish it protects.
- Processors are *not* free to move loads back across the mutex
locking, nor may they move stores forward across the mutex
release. [I can't figure out whether those restrictions are
strong enough. Do mutex operations need to restrict other sorts
of reordering, too?]
This means that, if you always protect data shared between
different threads by using mutexes in the traditional way, then you
can program as if all loads and stores will always occur in the
order you write them. By writing in the mutex operations, you've
restricted the actual ordering enough to get the job done.
The problem is that POSIX defines only a rather small set of
functions to be memory-synchronizing. This isn't normally a
problem --- just use one of the other functions that does
synchronize memory. But Minor needs to pause threads with signals
and then synchronize between the paused threads and the collector,
and there are no reasonable functions that both synchronize memory
and are safe to use in signals handlers. The acts of entering and
leaving a signal handler are not memory-synchronizing.
So we define here two operations, to be implemented as appropriate
for each platform that Minor is ported to, that have the memory-
synchronizing effects of mutex locking and unlocking, but don't do
any of the blocking or inter-thread communication. That is, if you
can guarantee by other means that one will be executed before the
other (say, you do one before sending a signal, and you do the
other in a signal handler), then these will make sure the loads and
stores to shared memory act that way, too.
And these functions are async-safe; they may be used within signal
handlers.
mn__thread_memory_release () tells the compiler and processor not
to move loads or stores forward across the call. By the time a
thread returns from this function, all stores written before it are
visible in main memory, and all loads written before it have
completed.
mn__thread_memory_acquire () tells the compiler and processor not
to move loads or stores backward across the call. All loads
written after a call to this function will see the effects of
stores that occurred before it. All stores written after a call to
this function will be performed after it returns.
The general idea is that, if thread A calls mn__thread_memory_release
before thread B calls mn__thread_memory_acquire, then that has the
same effect on memory synchronization as A releasing a mutex and
then B acquiring that mutex. */
void mn__thread_memory_release (void);
void mn__thread_memory_acquire (void);
/* Ensure that the calling thread will execute the machine code
currently present in memory, even if it has been modified or newly
generated.
If a program writes new executable code to memory, or modifies
existing code, many architectures don't promise to run such code
correctly unless the processor first executes some sort of special
instruction.
On many processors, this is simply an instruction cache flush:
their icaches aren't automatically invalidated by writes to the
cached addresses. Later IA-32 processors actually internally
recompile IA-32 machine code on the fly to an internal instruction
set, and cache the translations in a "trace cache", which needs to
be flushed if the IA-32 "source" code changes.
On a multi-processor system, each processor that may execute the
modified or freshly-generated code must call this function.
This really isn't a threading primitive at all: it's not required
to do any inter-processor memory synchronization. All it's
required to do is make the processor properly execute the
instructions from memory that a read would see. */
void mn__thread_machine_code_changed (void);
#endif /* MN__GC_BARRIERS_H */
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment