Last active
September 22, 2021 14:16
-
-
Save vext01/c7a95428fad7cab58201636116431055 to your computer and use it in GitHub Desktop.
Blocking Optimsations
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
TODO: convert to markdown. | |
Consider the scenario where we want to benchmark some code: | |
``` | |
start_time = time(); | |
bench_output = run_bench(bench_inputs); | |
result = time() - start_time; | |
``` | |
The problem is that since there is no dependency between `time()` and the data | |
being passed in and out of `run_bench()`, the compiler may try to re-order the | |
code to this: | |
``` | |
bench_output = run_bench(bench_inputs); | |
start_time = time(); | |
result = time() - start_time; | |
``` | |
Or this: | |
``` | |
start_time = time(); | |
result = time() - start_time; | |
bench_output = run_bench(bench_inputs); | |
``` | |
Or even this: | |
``` | |
start_time = time(); | |
result = time() - start_time; | |
``` | |
(since we didn't do anything side-effecting with `bench_output`) | |
There are therefore two problems. We need to: | |
1) Stop the benchmark disappearing altogether. | |
2) Ensure the benchmark stays anchored between the two calls to `time()`. | |
There have been many historical incarnations of inline assembler code which | |
could be used to tackle this. | |
Let's look at how Google does this when benchmarking C++ code [gbench]. | |
First, there's `ClobberMemory()` [clobber]. It's definition (when using clang) is: | |
``` | |
inline BENCHMARK_ALWAYS_INLINE void ClobberMemory() { | |
asm volatile("" : : : "memory"); | |
} | |
``` | |
What does this do? Well first, let's remind ourself of GNU extended inline asm | |
syntax [asm]: | |
``` | |
asm asm-qualifiers ( AssemblerTemplate | |
: OutputOperands | |
[ : InputOperands | |
[ : Clobbers ] ]) | |
``` | |
First it's `volatile` [volatile]. This means that the compiler may not | |
"optimise" the contents of the asm block or remove it altogether. It must | |
include it verbatim. In this example it prevents the obviously empty asm block | |
from being deleted (`AssemblerTemplate` is empty). | |
Second, it clobbers memory [asm-clobber]. This means that the compiler must | |
assume that each and every address in the virtual address space is *both* read | |
and mutated when the asm block is executed. Of course in reality no memory | |
address is read or mutated, but the compiler now doesn't know that. | |
How could this help? Let's sprinkle some clobbers into our benchmark harness: | |
``` | |
1 start_time = time(); | |
2 ClobberMemory(); | |
3 bench_output = run_bench(bench_inputs); | |
4 ClobberMemory(); | |
5 result = time() - start_time; | |
``` | |
Assuming that `bench_inputs` is stored in memory (i.e. stack allocated, not in | |
a register), the compiler must now assume that the first `ClobberMemory()` may | |
mutate it. And since `run_bench(bench_inputs)` is expected to observe this | |
"mutation", this places an ordering on the computations. Line 2 must come | |
before line 1. The compiler may not reorder those. | |
Similarly, assuming `start_time` is stack allocated, since the first call to | |
`ClobberMemory()` may read `start_time`, this places an ordering on the | |
computation of `start_time` and the call to `ClobberMemory()` on line 2. The | |
compiler may not reorder lines 1 and 2. | |
So what we've learned from the last two paragraphs is that, assuming that | |
`start_time` and `bench_inputs` are in memory, the computations on lines 1, 2 | |
and 3 must appear precisely in that order. In other words, the benchmark may | |
not be moved up beyond the call to `time()`. | |
We apply the same reasoning to understand why the second `ClobberMemory()` | |
prevents and code moving below the second call to `time()`. In sum, these two | |
anchorings ensure the benchmark remains inside the critical section. | |
XXX: The above argument relies on the idea that clobbering memory both writes | |
*and reads* to/from every memory address. The GNU asm docs confirms this | |
[clobber]. Without this condition I don't see how line 3 is anchored below line | |
1 in the example. There is an ongoing discussion on a stackoverflow thread | |
where someone asked (independently) exactly this [so-comment], but there is no | |
mention of clobber *reading* all memory. Not clear... | |
But wait. This assumes our variables are stored in memory. What if they are | |
promoted into registers? No amount of memory clobbering can fool the optimiser | |
in this case. | |
This leads us to `DoNotOptimise` [donotopt] which is defined like this (for | |
clang): | |
``` | |
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) { | |
asm volatile("" : "+r,m"(value) : : "memory"); | |
} | |
``` | |
What's different? First it takes a value by (C++) reference. This means that a | |
pointer to `value` is passed. You can't make a pointer to a register, so this | |
forces `value` to be stack allocated. By kicking the value to the stack we now | |
ensure that memory clobbering *will* affect the value. | |
The `+r,m` output constraint is new: | |
- `r,m` tells the compiler that the pointer itself may live in memory or in a | |
register and that the pointer itself (not the memory it points to) may be | |
mutated. | |
- the `+` tells the compiler that the pointer itself may also be read. | |
Note that `DoNotOptimise()` still clobbers memory. | |
So our benchmark is better written as: | |
``` | |
start_time = time(); | |
DoNotOptimise(bench_inputs); | |
bench_output = run_bench(bench_inputs); | |
DoNotOptimise(bench_output); | |
result = time() - start_time; | |
``` | |
XXX: In this example it's not clear to me why `run_bench()` cannot be put | |
before the first call to `time()`. Assume `start_time` is allocated in a | |
register. Then `DoNotOptimise()` clobbers `bench_inputs`, what it points to, | |
and all other memory, but *not* start_time... I have a feeling this relates to | |
my first XXX. | |
# When to use DoNotOptimise and when to use `ClobberMemory`? | |
XXX TODO | |
References: | |
[so-comment] https://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c/56865717#comment71928341_38025837 | |
[clobber] https://github.com/google/benchmark/blob/e451e50e9b8af453f076dec10bd6890847f1624e/include/benchmark/benchmark.h#L358-L368 | |
[donotopt] https://github.com/google/benchmark/blob/e451e50e9b8af453f076dec10bd6890847f1624e/include/benchmark/benchmark.h#L349-L356 | |
[gbench] https://github.com/google/benchmark/blob/e451e50e9b8af453f076dec10bd6890847f1624e/include/benchmark/benchmark.h#L339-L368 | |
[asm] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html | |
[out] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#OutputOperands | |
[volatile] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile | |
[asm-clobber] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers | |
Further reading: | |
https://www.youtube.com/watch?v=nXaxk27zwlk | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment