Note on changes to the box pruning code.

Brief explanation of what I did to get the speed-up, and the thought process behind it.
The original code went:

EnterLoop:
    movaps xmm3, xmmword ptr [edx+ecx*2]    // Box1YZ
    cmpnltps xmm3, xmm2
    movmskps eax, xmm3
    cmp eax, 0Ch
    jne NoOverlap
    <code to report overlap (this runs rarely)>
NoOverlap:
    add ecx, 8
    comiss xmm1, xmmword ptr [esi+ecx]      // [esi] = BoxListX[Index1].mMinX, compared to MaxLimit
    jae EnterLoop
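
For orientation, here is roughly what that inner loop corresponds to at the source
level. This is only a sketch with made-up names (Prune, BoxYZ, BoxX, ReportOverlap
are all hypothetical), not the actual box pruning source:

    #include <xmmintrin.h>

    struct BoxX { float mMinX; float mMaxX; };   // 8 bytes per entry, hence the "add ecx, 8"

    static void ReportOverlap(int index)         // cold path, body elided
    {
        (void)index;
    }

    static void Prune(const __m128* BoxYZ, const BoxX* BoxListX,
                      __m128 box0YZ, float maxLimit, int index)
    {
        // BoxListX is sorted by mMinX, so we stop at the first candidate whose
        // min X lies past maxLimit (the asm checks this at the bottom of the loop).
        while (maxLimit >= BoxListX[index].mMinX)            // comiss + jae
        {
            __m128 box1YZ = BoxYZ[index];                    // movaps
            __m128 ge = _mm_cmpnlt_ps(box1YZ, box0YZ);       // cmpnltps
            int mask = _mm_movemask_ps(ge);                  // movmskps
            if (mask == 0x0C)                                // cmp eax, 0Ch + branch
                ReportOverlap(index);                        // rarely taken
            index++;                                         // "add ecx, 8": advance one 8-byte entry
        }
    }
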
My first suggestion was to restructure the loop slightly so the hot "no overlap" path is
straight-line and the cold "found overlap" path has the extra jumps. This can help
instruction fetch behavior, although in this case it didn't make a difference.
Nevertheless, I'll do it here because it makes things easier to follow:

EnterLoop:
    movaps xmm3, xmmword ptr [edx+ecx*2]    // Box1YZ
    cmpnltps xmm3, xmm2
    movmskps eax, xmm3
    cmp eax, 0Ch
    je FoundOverlap
ResumeAfterOverlap:
    add ecx, 8
    comiss xmm1, xmmword ptr [esi+ecx]      // [esi] = BoxListX[Index1].mMinX, compared to MaxLimit
    jae EnterLoop

Alright, so that's a nice, sweet, simple loop. Now a lot of people will tell you that
out-of-order cores are hard to optimize for since they're "unpredictable" or "fuzzy"
or whatever. I disagree: optimizing for out-of-order cores is *easy* and far less
tedious than, say, manual scheduling for in-order machines. It's true that for OoO,
you can't just give a fixed "clock cycles per iteration" number, but the same thing
is already true for *anything* with a cache, so who are we kidding? The reality of
the situation is that while predicting the exact flow uops are gonna take through the
machine is hard (and also fairly pointless unless you're trying to build an exact
pipeline simulator), quantifying the overall statistical behavior of loops on OoO
cores is often easier than it is for in-order machines. Because for nice simple loops
like this, it boils down to operation counting - total number of instructions, and
total amount of work going to different types of functional units. We don't need
to worry about scheduling; the cores can take care of that themselves, and the
loop above has no tricky data dependencies between iterations (the only inter-iteration
change is the "add ecx, 8", which doesn't depend on anything else in the loop), so
everything is gonna work fine on that front.

So, on to the counting. I'm counting two things here: 1. "fused domain" uops (to a
first-order approximation, this means "instructions as broken down by the CPU
front-end") and 2. un-fused uops going to specific groups of functional units
("ports"), which is what the CPU back-end deals with. When I write "unfused p0",
I mean an unfused uop that has to go to port 0. "unfused 1 p23" is an unfused uop
that can go to ports 2 or 3 (whichever happens to be free). I'm using stats for the
i7-2600K in my machine (Intel Sandy Bridge); newer CPUs have slightly different
(but still similar) breakdowns. Now, without further ado, we have:

EnterLoop:
    movaps xmm3, xmmword ptr [edx+ecx*2]    // 1 fused uop=unfused 1 p23
    cmpnltps xmm3, xmm2                     // 1 fused uop=unfused 1 p1
    movmskps eax, xmm3                      // 1 fused uop=unfused 1 p0
    cmp eax, 0Ch                            // \
    je FoundOverlap                         // / cmp+je=1 fused uop=unfused 1 p5
ResumeAfterOverlap:
    add ecx, 8                              // 1 fused uop=unfused 1 p015
    comiss xmm1, xmmword ptr [esi+ecx]      // 2 fused uops=unfused 1 p0 + 1 p1 + 1 p23
    jae EnterLoop                           // 1 fused uop=unfused 1 p5

(Yes, the pair of x86 instructions cmp+je combines into one fused uop.)

Fused uops are the currency the CPU frontend deals in. It can process at most 4
of these per cycle, under ideal conditions, although in practice (for various
reasons) it's generally hard to average much more than 3 fused uops/cycle unless
the loop is relatively short (which, luckily, this one is). Each port can accept
one unfused uop per cycle.
So in total, we have:
- 8 fused uops -> at 4/cycle, at least 2 cycles/iter
- 7 uops going to ports 0,1,5
  - 2 port 0 only
  - 2 port 1 only
  - 2 port 5 only
  - 1 "wildcard" that can go anywhere
  -> at least 2.33 cycles/iter from port 0,1,5 pressure (7 uops spread over 3 ports)
- 2 uops going to ports 2,3 -> at 2/cycle, at least 1 cycle/iter
And of that total, the actual box pruning test (the first 5 x86 instructions)
comes to 4 fused uops, 3 unfused p015 and 1 unfused p23 - a single cycle's worth
of work. In other words, we spend more than half of our execution bandwidth
on loop overhead. That's no good.
Hence, unroll 4x. With that, provided there *are* at least 4 boxes to test
against in the current cluster, we end up with:
- 4*4 + 4 = 20 fused uops -> at 4/cycle, at least 5 cycles/iter
- 4*3 + 4 = 16 uops going to ports 0,1,5
  - 4*1+1=5 p0
  - 4*1+1=5 p1
  - 4*1+1=5 p5
  - 1 "wildcard"
  -> at least 5.33 cycles/iter from port 0,1,5 pressure
- 5 uops to ports 2,3 -> at 2/cycle, at least 2.5 cycles/iter
Our bottleneck is once again ports 0,1,5, but they now process 4 candidate
pairs in 5.33 cycles worth of work, whereas they took 9.33 cycles worth of
work before. So from that analysis, we expect something like a 42.8%
reduction in execution time, theoretical. Actual observed reduction was
34.4% on my home i7-2600K (12038 Kcyc -> 7893 Kcyc) and 42.9% on my
work i7-3770S (8990 Kcyc -> 5131 Kcyc). So the Sandy Bridge i7-2600K also
runs into some other limits not accounted for in this (very simplistic!)
analysis, whereas the i7-3770S behaves *exactly* as predicted.
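
For concreteness, continuing the hypothetical source-level sketch from above, the
4x unroll might look roughly like this (again not the actual code; it leans on the
candidates being sorted by mMinX, and assumes sentinel entries or an explicit count
check keep index+3 in bounds):

    // One set of loop-overhead instructions now covers four YZ tests. Because
    // BoxListX is sorted by mMinX, if candidate index+3 is still within
    // maxLimit, then so are index..index+2.
    while (maxLimit >= BoxListX[index + 3].mMinX)
    {
        for (int i = 0; i < 4; i++)        // fully unrolled in the real thing
        {
            int mask = _mm_movemask_ps(_mm_cmpnlt_ps(BoxYZ[index + i], box0YZ));
            if (mask == 0x0C)
                ReportOverlap(index + i);
        }
        index += 4;
    }
    // The 0-3 leftover candidates go through the 1-at-a-time loop from before.
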
The other tweak I tried later was to switch things around so the box X
coordinates are converted to integers. The issue is our 2-fused-uop COMISS,
which we'd like to replace with a 1-fused-uop compare. Not only does the
integer version take fewer uops, the CMP uop is also p015 instead of the more
constrained p0+p1 for COMISS.
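
In case the conversion itself is not obvious: IEEE-754 floats can be mapped to
unsigned integers so that integer ordering matches float ordering. Below is a
sketch of that standard trick (EncodeFloat is a made-up name, and this is not
necessarily the exact encoding the box pruning code uses); with the mMinX values
stored pre-encoded, the loop-termination test can become a plain unsigned CMP
against memory followed by JAE:

    #include <cstdint>
    #include <cstring>

    // Maps a float to a uint32_t such that a < b  <=>  EncodeFloat(a) < EncodeFloat(b)
    // (NaNs aside; -0.0f and +0.0f get distinct codes, which is fine for bounds).
    static uint32_t EncodeFloat(float f)
    {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));   // grab the raw IEEE-754 bits
        // Negative floats: flip all bits (larger magnitude -> smaller code).
        // Non-negative floats: set the sign bit (so they sort above negatives).
        return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
    }

If all coordinates are known to be non-negative, the raw float bits already compare
correctly as unsigned integers and no remapping is needed.
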
What would we expect from that? Our new totals are:
- 4*4 + 4 = 20 fused uops -> at 4/cycle, at least 5 cycles/iter
- 4*3 + 3 = 15 uops going to ports 0,1,5
  - 4*1=4 p0
  - 4*1=4 p1
  - 4*1+1=5 p5
  - 2 "wildcard" -> note these can go to p0/p1, and now we're perfectly balanced!
  -> at least 5 cycles/iter from port 0,1,5 pressure
- 5 uops to ports 2,3 -> at 2/cycle, at least 2.5 cycles/iter
From the back-of-the-envelope estimate, we now go from purely backend limited
to simultaneously backend and frontend limited, and we'd expect to go from
about 5.33 cycles/iter to 5 cycles/iter, for a 6.2% reduction.
And indeed, on my work i7-3770S, this change gets us from 5131 Kcyc -> 4762
Kcyc, reducing the cycle count by 7.2%. Close enough, and actually a bit
better than expected!
This example happens to work out very nicely (since it has a lot of arithmetic
and few branch mispredictions or cache misses), but the same general ideas
apply elsewhere. Who says that out-of-order cores are so hard to predict?