Why do compilers even bother with exploiting the undefinedness of signed overflow? And what are those
mysterious cases where it helps?
A lot of people (myself included) are against transforms that aggressively exploit undefined behavior, but
I think it's useful to know what compiler writers are accomplishing by this.
TL;DR: C doesn't work very well if int!=register width, but (for backwards compat) int is 32-bit on all | |
major 64-bit targets, and this causes quite hairy problems for code generation and optimization in some | |
fairly common cases. The signed overflow UB exploitation is an attempt to work around this. | |
To be more precise: the #1 transform that compilers really want to get out of this is to replace int32 | |
loop counters with int64 loop counters in 64-bit code whenever possible, because int32s are bad news | |
for the purposes of memory access and address generation. | |
To show the actual problem, let's look at a simple loop: | |
int x[N]; // initialized elsewhere | |
sum = 0; | |
for (int i = 0; i < count; ++i) | |
sum += x[i]; | |
which can be naively compiled (on x86-64) into something like: | |
; ecx = count | |
; rsi = points to x[] | |
xor eax, eax ; clear sum | |
test ecx, ecx | |
jng done ; don't even enter loop if count <= 0 | |
xor ebx, ebx ; i | |
lp: | |
movsxd rdx, ebx ; * sign-extend i to 64 bits | |
add eax, [rsi+rdx*4] ; sum += x[i] | |
inc ebx ; ++i | |
cmp ebx, ecx | |
jl lp | |
done: | |
That MOVSXD (marked *) is indicative of the issue: we're on a 64-bit machine trying to use a 32-bit index to | |
refer to an array, and because addresses are 64-bit (and we could absolutely have an array larger than 4GB but | |
with less than 2 billion entries), we need to sign-extend that index to 64-bit before we can do our addressing | |
calculation. | |
That sign-extend is extra work in the inner loop that would not exist in the 32-bit version of this snippet. | |
It also causes other problems. For example, the 32-bit equivalent of the code above always accesses memory at | |
esi + ebx*4 | |
and ebx is what's called an "induction variable" - it changes by a constant value in every loop iteration | |
(incrementing by 1 in this case). A 32-bit compiler can realize this and replace it with pointer arithmetic | |
(incrementing esi by 4 every iter, say). "esi" and "ebx" are both 32-bit values subject to 32-bit wraparound, | |
but since we're working modulo 2^32 anyway, this is not a problem. | |
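Written back out as C (just a sketch with a hypothetical function name, not literal compiler output), the
strength-reduced form a 32-bit compiler can use looks roughly like this:
/* Sketch: strength reduction replaces the scaled index with a pointer
   that advances by one element per iteration. */
int sum_loop(const int *x, int count)
{
    int sum = 0;
    const int *p = x;               /* plays the role of esi */
    for (int i = 0; i < count; ++i) {
        sum += *p;                  /* one plain load, no index scaling */
        ++p;                        /* advance by 4 bytes per iteration */
    }
    return sum;
}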
A 64-bit compiler needs to work with the less friendly address | |
rsi + sext32to64(ebx)*4 | |
with a sign extend in the middle (that "sext32to64(ebx)" is what the "movsxd rdx, ebx" does). And in
particular, should ebx ever overflow from 0x7fffffff to -0x80000000, the resulting address will change | |
dramatically, by (4 - 2^(32+2)). In 32-bit mode (all addresses modulo 2^32), this is just an increase by 4 | |
as usual; the "wraparound" part is a multiple of 2^32 and hence invisible. In 64-bit mode, it isn't; if the | |
compiler wants to turn this into address arithmetic, it needs to either: | |
a) prove that "i" can never overflow/wrap around in this loop, | |
b) insert extra code that detects wrap-around of "i" and adjusts the addresses accordingly, | |
c) not use pointer arithmetic and recompute addresses every time (as in the example). | |
Option a) is relatively straightforward in this particular example, but it can get hairy, and my point is | |
that 32-bit compilers never need to worry about any of this to begin with (and neither do 64-bit compilers | |
when dealing with 64-bit integer indices). | |
And if the compiler *isn't* able to do these proofs, it can get pretty bad. For example, the loop above
is basically all loop overhead, and most compilers will end up unrolling it (or vectorizing it, but I'll
ignore that here). Now we'd prefer to see an inner loop like this (I won't show the setup or tail loop here):
lp: | |
add eax, [rsi+0] ; sum += x[i+0] | |
add eax, [rsi+4] ; sum += x[i+1] | |
add eax, [rsi+8] ; sum += x[i+2] | |
add eax, [rsi+12] ; sum += x[i+3] | |
add rsi, 16 ; ptr += 4 | |
dec ecx ; (trip count in ecx, was calculated outside loop) | |
jnz lp | |
(only showing the addressing parts here; in practice there are other transforms we'd like to see too, but
let's stay focused.)
If the compiler can't (for whatever reason) prove that the index expressions are not going to overflow | |
as 32-bit values, it needs to do something much worse like: | |
lp: | |
movsxd rdi, ebx ; sext32to64(i) | |
add eax, [rsi+rdi*4] | |
lea edi, [ebx+1] ; i+1 | |
movsxd rdi, edi | |
add eax, [rsi+rdi*4] | |
lea edi, [ebx+2] | |
movsxd rdi, edi | |
add eax, [rsi+rdi*4] | |
lea edi, [ebx+3] | |
movsxd rdi, edi | |
add eax, [rsi+rdi*4] | |
add ebx, 4 | |
cmp ebx, ecx ; (precomputed spot to stop 4x unrolled iters) | |
jl lp | |
This is a silly example, but all you need to do is look at some assembly listings for | |
generated x64 code for hot loops with two's complement overflow semantics and search for | |
"movsxd"; usually you'll be able to find actual addressing code that does stuff like this | |
in less than a minute. | |
Anyway, as said, compilers use the signed-overflow UB as a free ticket to promote int32 | |
loop counters to int64, which fixes this type of problem (and introduces exciting new | |
problems, especially when it's applied to all signed integer arithmetic and not just to cases
like these loop counters where there's actually a specific reason). | |
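Expressed back in C (a sketch with a hypothetical function name; the compiler does this on its IR, not on
your source), the promotion is roughly this rewrite of the example loop:
/* Sketch: the int32 -> int64 counter promotion. Rewriting the counter
   as 64-bit is only valid because overflow of the original signed
   32-bit "i" would have been undefined behavior anyway. */
int sum_loop_promoted(const int *x, int count)
{
    int sum = 0;
    for (long long i = 0; i < count; ++i)   /* 64-bit counter, no movsxd in the loop */
        sum += x[i];
    return sum;
}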
But really the core issue is that all of C's semantics (everything narrower auto-widened | |
to int etc.) work pretty well when "int" is the width of machine registers and pretty | |
badly when it's not. Newer languages (e.g. Go and Rust) solve this problem by just making | |
"int" be 64-bit on 64-bit platforms, which removes most of the motivation for this kind | |
of thing. | |
What I'd *like* to see compilers spend time on is: | |
a) you can do profile-guided optimizations; what about profile-guided performance warnings? | |
Eating the sign extends (and resulting extra work) on cold code is not *that* bad if | |
there's a way for the compiler to tell the programmer about the hot loops that really | |
should probably use a different loop counter type. | |
b) decreasing cost of sign extends. Note that my examples allocate a different register | |
for the sign-extended quantities, increasing register pressure. This is typical and | |
often as much of a problem as the extra instructions, or more. On x64, it is in
fact fine and often preferable to sign-extend a 32-bit register to itself: | |
movsxd rbx, ebx | |
or similar. The top 32 bits of "rbx" will get zero-cleared the next time a 32-bit ALU | |
op writes to ebx, but that's fine. At least we didn't use an extra register. | |
c) compilers already expend a lot of effort on theorem prover-style analysis that can | |
e.g. show when overflows can't happen in a loop such as the example loop (in this | |
case: if we can prove "count" stays fixed throughout the loop, we only enter the loop | |
if "0 < count", and we only stay in the loop until "i >= count"; with i increasing at | |
a rate of 1 per iteration, this is guaranteed to happen before i overflows). This is | |
a powerful tool for enabling better optimizations (like transitioning to address | |
arithmetic in this loop). Unfortunately it's also pretty much a black box to the | |
programmer, and it can be really hard to understand what the compiler can or cannot | |
prove at any given point in time. So it's tricky. Better analysis catches more and | |
more cases, but when it doesn't work, it fails badly. | |
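As a hypothetical example of how fragile this is (a sketch, not taken from any particular compiler): make
the stride a variable instead of 1, and the proof genuinely fails, because for some inputs (count near
INT_MAX and step > 1) the counter really would overflow before the exit test fires:
/* Sketch: with a variable stride, non-overflow of "i" can no longer be
   proven from the loop bounds alone, so the compiler has to fall back
   to the slow form unless it leans on the signed-overflow-is-UB rule. */
int sum_strided(const int *x, int count, int step)
{
    int sum = 0;
    for (int i = 0; i < count; i += step)
        sum += x[i];
    return sum;
}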
If in doubt, it's almost always preferable to have semantics defined in such a way | |
that you don't *need* a theorem-prover to generate efficient code. (Easier said | |
than done in some cases!) | |
Anyway, what can you do?
If you're a compiler writer: | |
- Don't copy C's backwards-compat compromise that caused this mess. Make sure that your | |
idiomatic loop counter is register width. | |
Basically, the less common these sign extends are on critical paths, the less pressure | |
there is to do questionable things to make them faster. | |
If you're someone writing C/C++ code: It depends on the platform. On x86-64, 32-bit | |
arithmetic ops are all zero-extending. So anything that works with uint32s is fine. | |
ARM AArch64 likewise has 32- and 64-bit variants of everything, *and* some fairly good | |
sign/zero-extend support for narrower types built in, even in addressing calculations.
This helps reduce these costs. On 64-bit PowerPC/POWER, you only get 64-bit versions | |
of many arithmetic operations, and your life sucks no matter what. You really want to | |
be working with 64-bit integer types whenever possible on these machines. | |
On most of the machines you're likely to use, "size_t" for loop counters is a good idea | |
where signed values aren't required. It's unsigned and generally as wide as addresses, | |
so there's normally no extra code for zero/sign extends, and the type is standard.
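For instance (a sketch, with a hypothetical function name), the example loop from the top written with a
size_t counter:
/* Sketch: size_t counter is already pointer-width, so no extension is
   needed in the addressing calculation. */
#include <stddef.h>
int sum_loop_sizet(const int *x, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; ++i)
        sum += x[i];
    return sum;
}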
I really enjoyed this, thanks. Do you have pointers to any sort of write-up about why C chose to go the "leave int
at 32 bits" route? (And why wasn't the same done for the 16- to 32-bit transition?)
@jacobsa For the 16 to 32-bit transition, changing from 16-bit to 32-bit made sense; it only added two bytes per variable, and quite a lot of counters in typical programs could very easily exceed 65536 (i.e. there were already lots of programs that either had arbitrary limits or actual brokenness, and those that didn't already had to use 32-bit values all over the place anyway). So the effect of the change was minimal. Furthermore, not all of the "16-bit" platforms were really 16-bit in the first place; the MC68K-based machines were really 32-bit from the get-go, and so even on 68000- and 68010-based systems, some compilers used a 32-bit "int" by default anyway (the main downside, until the 68020, being that there was a performance penalty, particularly if multiplication was involved; but since programmers knew that, they could always declare things as "short int" if they knew values would fit in that range).
For the 32-bit to 64-bit transition, things were rather different. Most values in most programs will happily fit in 32 bits (i.e. they are always going to be smaller than ±2bn or +4bn respectively), so extending them all to 64-bits wastes four bytes each time. This creates extra pressure on processor caches and on system memory and has very little benefit otherwise.
As an aside, Cray went for 64-bit int, as did early Alpha-based machines (though later ones switched to 32-bit int).
One other issue not mentioned in the above write-up is that loop analysis is easier if you are allowed to assume that signed overflow can’t happen, which means that compilers can more aggressively optimize some loops under this assumption. This applies even on platforms (like ARMv8) that are able to cope easily with mixed-size arithmetic.
@jacobsa There's also an issue that there are only two integer types below "int" in standard C, if you want to support all power-of-two sized integers with an 8-bit minimum then you're forced to use a 32-bit "int" or add a non-standard type to fill in the gap:
- signed/unsigned char: 8-bit
- short: 16/32-bit ??
- int: 64-bit
- long: 128-bit? Also 64-bit?
A 32-bit int just makes more sense under that constraint.
@jacobsa a doc about the actual decision makers' thought process is here: http://www.unix.org/version2/whatsnew/lp64_wp.html
("LP64" is the name for the mode everyone uses, because long and ptr types are 64-bit (and nothing else.))
There's also an issue that there are only two integer types below "int" in standard C, if you want to support all power-of-two sized integers with an 8-bit minimum then you're forced to use a 32-bit "int" or add a non-standard type to fill in the gap
That hasn't been the case for a good 15 years; a C99-compliant implementation can provide intN_t types for whichever values of N it wants, usually N=8, 16, 32 and 64, but e.g. Clang also provides types for N=24, 40, 48 and 56.
Also worth noting that if the 32b instruction results were sign-extended to 64b instead of zero-extended, this particular thing wouldn't be a problem (there'd be a different, but somewhat less frequent problem).
"LP64" is the name for the mode everyone uses
Doesn't Windows use LLP64, where long long and pointers are 64-bit, but long is not?
because long and ptr types are 64-bit (and nothing else.)
In both LP64 and LLP64, long long is 64-bit.
Could you convert to Markdown (basically change the extension) so this is readable on mobile?
You may want to look at Chris Lattner's "What Every C Programmer Should Know About Undefined Behavior" series (starting at http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html)
I'd say that the optimizations that become possible if we exploit the existence of UB go far beyond the case you've shown. In fact those are architecture-agnostic and don't need to know about the machine word size on the target arch.
I was searching for this today and thought it was on your blog, but ended up finding the gist. May I suggest that it would make a good blog post!
To be pedantic: C didn't go the "int is 32 bits" route. Many compiler writers did. Others didn't: you can find real implementations that just went with "int is 64 bits", and those also work quite nicely. And C99 and later allow the vendor to provide int32_t without any requirement that it be one of short/int/long.
Thanks for this little writeup, it was a good read!
Really small correction though, simply because you mentioned Go and Rust: Rust doesn't have an int type; it has explicit i32/u32 and i64/u64 types. It does, however, have a usize type that is "the pointer-sized unsigned integer type", which is almost always used for indexing and loop counters (and of course also isize, its signed counterpart). This aligns with what you're saying here and is generally a Good Thing™, especially the "Make sure that your idiomatic loop counter is register width" part.