I want to speed up integer overflow checking for hardening purposes by keeping a sticky overflow flag and only trapping when necessary. I want to keep it super simple while hopefully giving the optimizers room to do their thing.
In the codegen part of clang:
- each function gets an i1 for storing overflow information, initialized to 0
- each integer overflow check ORs its result into the overflow flag
- before each function call, return instruction, or other side-effecting operation, execute ud2 if overflow is set
Reasonable?
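For concreteness, here is roughly what the proposed scheme looks like at the source level. This is only a sketch under assumptions: sum_sticky and sum_or_trap are hypothetical names, the flag is a plain C bool rather than an i1 emitted by clang's codegen, and the GCC/Clang __builtin_add_overflow and __builtin_trap builtins stand in for the generated checks and the ud2.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical example: sums n ints, ORing every overflow check into one
   sticky flag instead of branching after each operation. */
int sum_sticky(const int *v, int n, bool *overflowed) {
    assert(n >= 0);
    bool ovf = false;   /* the per-function flag, initialized to 0 */
    int acc = 0;
    for (int i = 0; i < n; i++) {
        /* The builtin stores the wrapped result and returns true on overflow. */
        ovf |= __builtin_add_overflow(acc, v[i], &acc);
    }
    *overflowed = ovf;
    return acc;
}

/* Hardened wrapper: a single trap check right before the return. */
int sum_or_trap(const int *v, int n) {
    bool ovf;
    int r = sum_sticky(v, n, &ovf);
    if (ovf)
        __builtin_trap();   /* stands in for the ud2 */
    return r;
}
```

Once an addition overflows, the flag stays set even if every later addition is fine, so the one check before the return catches it.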
There is one primary problem I see here, at least on x86: you have to put that i1 somewhere, and that will be too expensive.
The thing is, we have the overflow bit in a flag that gets set automatically. But getting that bit out of a flag register and stored somewhere else will be dramatically more expensive than branching to a trap instruction. Here is an example for Haswell using the Intel Architecture Code Analyzer:
This assumes there is some magical memory at a constant offset from %ebp (I didn't spend long enough to get a real use of, say, the red zone on x86-64, but it seems representative). Look at the number of micro-ops: 3. At least one of those three gets micro-op fusion applied (the ^ indicates that). But still, this is crazy expensive.
We could make it less expensive by using a register operand, but now we have to occupy an entire register for this! Not good.
But look at how fast branching to some ud2 instruction elsewhere is:
The branch gets completely fused away. x86 processors at least are just ridiculously good at this.
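At the source level, the branch-per-check alternative looks roughly like the sketch below. The function name sum_branchy is hypothetical, and the builtins are the GCC/Clang ones. Compilers typically lower each check to the arithmetic instruction followed by a conditional jump on the overflow flag to an out-of-line trap, but the exact code depends on the compiler and target.

```c
#include <assert.h>

/* Hypothetical example: sums n ints, with every overflow check branching
   straight to a trap. No flag has to be kept in a register or in memory. */
int sum_branchy(const int *v, int n) {
    assert(n >= 0);
    int acc = 0;
    for (int i = 0; i < n; i++) {
        if (__builtin_add_overflow(acc, v[i], &acc))
            __builtin_trap();   /* cold path, never taken unless overflow */
    }
    return acc;
}
```

The overflow bit never leaves the flags register here, which is the cost the sticky-flag scheme has to pay on every check.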
What might be useful is a CPU feature which would provide for a saturating overflow flag register. This would allow you to dramatically reduce the number of branches, and it seems very easy to implement in hardware. We should start lobbying with Intel to get this.