Header file from Minor Scheme for memory barriers
/* barriers.h --- interface to memory and code translation barriers */
#ifndef MN__GC_BARRIERS_H
#define MN__GC_BARRIERS_H
/* The following two functions provide the memory-coherence behavior
of mutexes, but without any blocking or communication.
On many modern multi-threaded systems, one thread may see writes to
memory by another thread occur in an order different than the one
they were written in. For example, suppose 'x' and 'y' are global
variables, both initially zero, shared between two threads A and B.
Suppose thread A executes the code:

    x = 1;
    y = 2;

while thread B executes the code:

    y1 = y;
    x1 = x;
Everyone understands that, since this code doesn't lock or unlock
any mutexes, or use other synchronization mechanisms, the two
threads' actions could be interleaved arbitrarily. For example, we
have no idea whether B will read x before or after A sets it to 1.
But what may be surprising is that reads and writes may not even
occur in the order in which they're written. For example, suppose
thread B ends up with y1 set to two:
- Since y1 == 2, thread A had clearly executed the 'y = 2'
  assignment before thread B executed 'y1 = y'.
- Thread A should have executed 'y = 2' after 'x = 1', since
  that's the order in which they're written.
- Thread B should have executed 'x1 = x' after 'y1 = y', since
  that's the order in which they're written.
- Pulling all that together, we get the ordering:
  'x = 1'; then 'y = 2'; then 'y1 = y'; then 'x1 = x'.
So x1 must be 1.
But on many modern systems, x1 may end up set to zero! Clearly,
things are not actually happening in the order we wrote them in.
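One way to observe this is with a small test harness along the
following lines; the sketch is illustrative, and not part of Minor:

    static volatile int x, y, x1, y1;

    static void *thread_a (void *arg) { x = 1; y = 2; return 0; }
    static void *thread_b (void *arg) { y1 = y; x1 = x; return 0; }

Started with pthread_create and run repeatedly on a weakly ordered
processor such as ARM or POWER, this will occasionally finish with
y1 == 2 but x1 == 0, even though the 'volatile' qualifiers keep the
compiler itself from reordering the accesses.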
Who is at fault? Everybody.
First, the compiler is free to rearrange most statements if doing
so wouldn't violate any apparent data dependency. The compiler
looking at A's code sees no relationship between the assignment to
x and the assignment to y, so it can do them in whatever order it
finds convenient. The only restrictions on the compiler's freedom
are:
- It must not re-order, delete, or add system calls. (And by
  extension, it must not reorder calls to functions that it can't
  prove won't perform system calls, etc. etc.)
- It must not re-order, delete, or add reads or writes of volatile
  variables across sequence points (function calls; comma
  operators; etc.)
But the code doesn't make any system calls, and x and y aren't
volatile, so the compiler can do as it pleases.
Another way to put this is that the compiler's reordering need only
respect data dependencies from a *single thread's* execution of the
code: it need not consider dependencies that arise from multiple
threads' interleaved execution. Our example shows that having
multiple threads running at a time can reveal rearrangements of
code that would be invisible in a single-threaded program.
Similar cases can arise with signal handlers that consult global
variables: the compiler has the right to exchange the order of two
stores, even though a signal handler might catch those stores
happening in the wrong order.
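(As an aside: under GCC and compatible compilers, one common way to
forbid just the compiler's reordering at a chosen point is an empty
asm with a "memory" clobber. This is an illustration, not part of
Minor's interface:

    x = 1;
    __asm__ __volatile__ ("" : : : "memory");   /* compiler barrier */
    y = 2;

The empty asm emits no instructions, but the compiler must assume it
might read or write any memory, so it cannot move the two stores
across it. The processor, of course, still can.)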
Second, each processor in an SMP system has an analogous freedom to
reorder loads and stores to the memory it shares with the other
processors, if doing so wouldn't break the code that processor is
running.
For example, suppose a program loads the value of some byte from
shared memory into a register. A processor might actually read and
cache the full block of 64 bytes (or whatever) containing that
referenced byte, and then answer loads of other bytes in that block
directly from its cache, without going to shared memory again.
From the shared memory's point of view, that processor is doing
those subsequent loads earlier than they are written in the
program: the processor has moved the loads earlier.
Or, going the other direction, a processor might do a store only to
its cache, and then write the modified words in that cache block
out to shared memory at some later time, while letting other loads
and stores reach shared memory right away. From the shared memory's
point of view, the processor has moved that store after other memory
references.
The point is, as long as the code running on that processor sees
all its loads and stores taking effect in the right order *from its
point of view*, the processor has met its obligations. If two
different processors need to see each other's stores in any
particular order, the code must include some sort of
architecture-specific memory synchronization instructions. To
extend the analogy between processors and compilers, these
synchronization instructions are the machine-language equivalent of
'volatile' declarations.
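(For example, on IA-32 processors with SSE2, a full fence written as
GCC inline assembly might look like this; again, an illustration, not
Minor's code:

    __asm__ __volatile__ ("mfence" : : : "memory");

The "memory" clobber also keeps the compiler from moving loads and
stores across the asm, so this one line restrains both the compiler
and the processor.)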
This freedom to rearrange is a good thing --- it allows the
compiler to generate much better code, and it allows the processor
to run much faster than it could otherwise, at the cost of
complicating the (hopefully) rare cases: signal handlers and
inter-thread communication. But it does make writing and reasoning
about multi-threaded programs harder. If your threads were only
allowed to communicate by assigning to variables, the guarantees
would be so weak that inter-thread communication would be
practically impossible.
To make things simpler, POSIX says that you can assume that certain
functions "synchronize memory". For example, suppose a thread
acquires a mutex, makes some changes to a data structure, and then
releases the mutex:
- The compiler is *not* free to move loads or stores across the
  mutex operations, even though there's no apparent data dependency
  between the mutex operation and whatever tree / hash table /
  cheese danish it protects.
- Processors are *not* free to move loads back across the mutex
  locking, nor may they move stores forward across the mutex
  release. [I can't figure out whether those restrictions are
  strong enough. Do mutex operations need to restrict other sorts
  of reordering, too?]
This means that, if you always protect data shared between
different threads by using mutexes in the traditional way, then you
can program as if all loads and stores will always occur in the
order you write them. By writing in the mutex operations, you've
restricted the actual ordering enough to get the job done.
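(In code, the traditional pattern is simply this; the names are made
up for illustration:

    pthread_mutex_lock (&table_lock);
    table->num_entries++;     /* protected loads and stores */
    pthread_mutex_unlock (&table_lock);

Neither the compiler nor the processor may let the accesses to the
table escape from between the lock and unlock calls.)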
The problem is that POSIX defines only a rather small set of
functions to be memory-synchronizing. This isn't normally a
problem --- just use one of the other functions that does
synchronize memory. But Minor needs to pause threads with signals
and then synchronize between the paused threads and the collector,
and there are no reasonable functions that both synchronize memory
and are safe to use in signal handlers. The acts of entering and
leaving a signal handler are not memory-synchronizing.
So we define here two operations, to be implemented as appropriate
for each platform that Minor is ported to, that have the memory-
synchronizing effects of mutex locking and unlocking, but don't do
any of the blocking or inter-thread communication. That is, if you
can guarantee by other means that one will be executed before the
other (say, you do one before sending a signal, and you do the
other in a signal handler), then these will make sure the loads and
stores to shared memory act that way, too.
And these functions are async-signal-safe; they may be used within
signal handlers.
mn__thread_memory_release () tells the compiler and processor not
to move loads or stores forward across the call. By the time a
thread returns from this function, all stores written before it are
visible in main memory, and all loads written before it have
completed.
mn__thread_memory_acquire () tells the compiler and processor not
to move loads or stores backward across the call. All loads
written after a call to this function will see the effects of
stores that occurred before it. All stores written after a call to
this function will be performed after it returns.
The general idea is that, if thread A calls mn__thread_memory_release
before thread B calls mn__thread_memory_acquire, then that has the
same effect on memory synchronization as A releasing a mutex and
then B acquiring that mutex. */
void mn__thread_memory_release (void);
void mn__thread_memory_acquire (void);
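/* As a sketch of how a port might implement the two functions above,
   assuming a C11 compiler; this is an illustration, not Minor's
   actual code for any platform:

       #include <stdatomic.h>

       void
       mn__thread_memory_release (void)
       {
         atomic_thread_fence (memory_order_release);
       }

       void
       mn__thread_memory_acquire (void)
       {
         atomic_thread_fence (memory_order_acquire);
       }

   These fences involve no blocking or locking, so they may be used
   within signal handlers. As a sketch of their intended use: the
   collector can store a request in a shared variable, call
   mn__thread_memory_release, and then send the signal; the handler
   calls mn__thread_memory_acquire before loading the request, and is
   then guaranteed to see the collector's stores.  */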
/* Ensure that the calling thread will execute the machine code
currently present in memory, even if it has been modified or newly
generated.
If a program writes new executable code to memory, or modifies
existing code, many architectures don't promise to run such code
correctly unless the processor first executes some sort of special
instruction.
On many processors, this is simply an instruction cache flush:
their icaches aren't automatically invalidated by writes to the
cached addresses. Later IA-32 processors actually internally
recompile IA-32 machine code on the fly to an internal instruction
set, and cache the translations in a "trace cache", which needs to
be flushed if the IA-32 "source" code changes.
On a multi-processor system, each processor that may execute the
modified or freshly-generated code must call this function.
This really isn't a threading primitive at all: it's not required
to do any inter-processor memory synchronization. All it's
required to do is make the processor properly execute the
instructions from memory that a read would see. */
void mn__thread_machine_code_changed (void);
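/* As a sketch of what one port might do here, assuming IA-32 and GCC
   inline assembly; illustrative only, not Minor's actual code. On
   IA-32, executing a serializing instruction such as CPUID forces
   the processor to discard stale prefetched and translated
   instructions:

       void
       mn__thread_machine_code_changed (void)
       {
         unsigned int eax = 0, ebx, ecx, edx;
         __asm__ __volatile__ ("cpuid"
                               : "+a" (eax), "=b" (ebx),
                                 "=c" (ecx), "=d" (edx));
       }

   Ports for other architectures might instead flush the instruction
   cache for the affected address range, for example with GCC's
   __builtin___clear_cache.  */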
#endif /* MN__GC_BARRIERS_H */