Note on changes to the box pruning code.

Brief explanation of what I did to get the speed-up, and the thought process behind it.
The original code went:

EnterLoop:
    movaps xmm3, xmmword ptr [edx+ecx*2]    // Box1YZ
    cmpnltps xmm3, xmm2
    movmskps eax, xmm3
    cmp eax, 0Ch
    jne NoOverlap
    <code to report overlap (this runs rarely)>
NoOverlap:
    add ecx, 8
    comiss xmm1, xmmword ptr [esi+ecx]      // [esi] = BoxListX[Index1].mMinX, compared to MaxLimit
    jae EnterLoop
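
For orientation, here is roughly what that inner loop corresponds to at the source
level. This is only a sketch with made-up names (Prune, BoxYZ, BoxX, ReportOverlap
are all hypothetical), not the actual box pruning source:

    #include <xmmintrin.h>

    struct BoxX { float mMinX; float mMaxX; };   // 8 bytes per entry, hence the "add ecx, 8"

    static void ReportOverlap(int index)         // cold path, body elided
    {
        (void)index;
    }

    static void Prune(const __m128* BoxYZ, const BoxX* BoxListX,
                      __m128 box0YZ, float maxLimit, int index)
    {
        // BoxListX is sorted by mMinX, so we stop at the first candidate whose
        // min X lies past maxLimit (the asm checks this at the bottom of the loop).
        while (maxLimit >= BoxListX[index].mMinX)            // comiss + jae
        {
            __m128 box1YZ = BoxYZ[index];                    // movaps
            __m128 ge = _mm_cmpnlt_ps(box1YZ, box0YZ);       // cmpnltps
            int mask = _mm_movemask_ps(ge);                  // movmskps
            if (mask == 0x0C)                                // cmp eax, 0Ch + branch
                ReportOverlap(index);                        // rarely taken
            index++;                                         // "add ecx, 8": advance one 8-byte entry
        }
    }
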
My first suggestion was to restructure the loop slightly so the hot "no overlap" path is
straight-line and the cold "found overlap" path has the extra jumps. This can help
instruction fetch behavior, although in this case it didn't make a difference.
Nevertheless, I'll do it here because it makes things easier to follow:

EnterLoop:
    movaps xmm3, xmmword ptr [edx+ecx*2]    // Box1YZ
    cmpnltps xmm3, xmm2
    movmskps eax, xmm3
    cmp eax, 0Ch
    je FoundOverlap
ResumeAfterOverlap:
    add ecx, 8
    comiss xmm1, xmmword ptr [esi+ecx]      // [esi] = BoxListX[Index1].mMinX, compared to MaxLimit
    jae EnterLoop

Alright, so that's a nice, sweet, simple loop. Now a lot of people will tell you that
out-of-order cores are hard to optimize for since they're "unpredictable" or "fuzzy"
or whatever. I disagree: optimizing for out-of-order cores is *easy* and far less
tedious than, say, manual scheduling for in-order machines. It's true that for OoO,
you can't just give a fixed "clock cycles per iteration" number, but the same thing
is already true for *anything* with a cache, so who are we kidding? The reality of
the situation is that while predicting the exact flow uops are gonna take through the
machine is hard (and also fairly pointless unless you're trying to build an exact
pipeline simulator), quantifying the overall statistical behavior of loops on OoO
cores is often easier than it is for in-order machines. Because for nice simple loops
like this, it boils down to operation counting - total number of instructions, and
total amount of work going to different types of functional units. We don't need
to worry about scheduling; the cores can take care of that themselves, and the
loop above has no tricky data dependencies between iterations (the only inter-iteration
change is the "add ecx, 8", which doesn't depend on anything else in the loop), so
everything is gonna work fine on that front.

So, on to the counting. I'm counting two things here: 1. "fused domain" uops (to a
first-order approximation, this means "instructions as broken down by the CPU
front-end") and 2. un-fused uops going to specific groups of functional units
("ports"), which is what the CPU back-end deals with. When I write "unfused p0",
I mean an unfused uop that has to go to port 0. "unfused 1 p23" is an unfused uop
that can go to ports 2 or 3 (whichever happens to be free). I'm using stats for the
i7-2600K in my machine (Intel Sandy Bridge); newer CPUs have slightly different
(but still similar) breakdowns. Now, without further ado, we have:

EnterLoop:
    movaps xmm3, xmmword ptr [edx+ecx*2]    // 1 fused uop=unfused 1 p23
    cmpnltps xmm3, xmm2                     // 1 fused uop=unfused 1 p1
    movmskps eax, xmm3                      // 1 fused uop=unfused 1 p0
    cmp eax, 0Ch                            // \
    je FoundOverlap                         // / cmp+je=1 fused uop=unfused 1 p5
ResumeAfterOverlap:
    add ecx, 8                              // 1 fused uop=unfused 1 p015
    comiss xmm1, xmmword ptr [esi+ecx]      // 2 fused uops=unfused 1 p0 + 1 p1 + 1 p23
    jae EnterLoop                           // 1 fused uop=unfused 1 p5

(Yes, the pair of x86 instructions cmp+je combines into one fused uop.)

Fused uops are the currency the CPU frontend deals in. It can process at most 4
of these per cycle, under ideal conditions, although in practice (for various
reasons) it's generally hard to average much more than 3 fused uops/cycle unless
the loop is relatively short (which, luckily, this one is). Each port can accept
one unfused uop per cycle.
So in total, we have:
- 8 fused uops -> at 4/cycle, at least 2 cycles/iter
- 7 uops going to ports 0,1,5
  - 2 port 0 only
  - 2 port 1 only
  - 2 port 5 only
  - 1 "wildcard" that can go anywhere
  -> at least 2.33 cycles/iter from port 0,1,5 pressure (7 uops spread over 3 ports)
- 2 uops going to ports 2,3 -> at 2/cycle, at least 1 cycle/iter
And of that total, the actual box pruning test (the first 5 x86 instructions)
comes to 4 fused uops, 3 unfused p015 and 1 unfused p23 - a single cycle's worth
of work. In other words, we spend more than half of our execution bandwidth
on loop overhead. That's no good.
Hence, unroll 4x. With that, provided there *are* at least 4 boxes to test
against in the current cluster, we end up with:
- 4*4 + 4 = 20 fused uops -> at 4/cycle, at least 5 cycles/iter
- 4*3 + 4 = 16 uops going to ports 0,1,5
  - 4*1+1=5 p0
  - 4*1+1=5 p1
  - 4*1+1=5 p5
  - 1 "wildcard"
  -> at least 5.33 cycles/iter from port 0,1,5 pressure
- 5 uops to ports 2,3 -> at 2/cycle, at least 2.5 cycles/iter
Our bottleneck is once again ports 0,1,5, but they now process 4 candidate
pairs in 5.33 cycles worth of work, whereas they took 9.33 cycles worth of
work before. So from that analysis, we expect something like a 42.8%
reduction in execution time, theoretical. Actual observed reduction was
34.4% on my home i7-2600K (12038 Kcyc -> 7893 Kcyc) and 42.9% on my
work i7-3770S (8990 Kcyc -> 5131 Kcyc). So the Sandy Bridge i7-2600K also
runs into some other limits not accounted for in this (very simplistic!)
analysis, whereas the i7-3770S behaves *exactly* as predicted.
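
For concreteness, continuing the hypothetical source-level sketch from above, the
4x unroll might look roughly like this (again not the actual code; it leans on the
candidates being sorted by mMinX, and assumes sentinel entries or an explicit count
check keep index+3 in bounds):

    // One set of loop-overhead instructions now covers four YZ tests. Because
    // BoxListX is sorted by mMinX, if candidate index+3 is still within
    // maxLimit, then so are index..index+2.
    while (maxLimit >= BoxListX[index + 3].mMinX)
    {
        for (int i = 0; i < 4; i++)        // fully unrolled in the real thing
        {
            int mask = _mm_movemask_ps(_mm_cmpnlt_ps(BoxYZ[index + i], box0YZ));
            if (mask == 0x0C)
                ReportOverlap(index + i);
        }
        index += 4;
    }
    // The 0-3 leftover candidates go through the 1-at-a-time loop from before.
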
The other tweak I tried later was to switch things around so the box X
coordinates are converted to integers. The issue is our 2-fused-uop COMISS,
which we'd like to replace with a 1-fused-uop compare. Not only does the
integer version take fewer uops, the CMP uop is also p015 instead of the more
constrained p0+p1 for COMISS.
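
In case the conversion itself is not obvious: IEEE-754 floats can be mapped to
unsigned integers so that integer ordering matches float ordering. Below is a
sketch of that standard trick (EncodeFloat is a made-up name, and this is not
necessarily the exact encoding the box pruning code uses); with the mMinX values
stored pre-encoded, the loop-termination test can become a plain unsigned CMP
against memory followed by JAE:

    #include <cstdint>
    #include <cstring>

    // Maps a float to a uint32_t such that a < b  <=>  EncodeFloat(a) < EncodeFloat(b)
    // (NaNs aside; -0.0f and +0.0f get distinct codes, which is fine for bounds).
    static uint32_t EncodeFloat(float f)
    {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));   // grab the raw IEEE-754 bits
        // Negative floats: flip all bits (larger magnitude -> smaller code).
        // Non-negative floats: set the sign bit (so they sort above negatives).
        return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
    }

If all coordinates are known to be non-negative, the raw float bits already compare
correctly as unsigned integers and no remapping is needed.
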
What would we expect from that? Our new totals are:
- 4*4 + 4 = 20 fused uops -> at 4/cycle, at least 5 cycles/iter
- 4*3 + 3 = 15 uops going to ports 0,1,5
  - 4*1=4 p0
  - 4*1=4 p1
  - 4*1+1=5 p5
  - 2 "wildcard" -> note these can go to p0/p1, and now we're perfectly balanced!
  -> at least 5 cycles/iter from port 0,1,5 pressure
- 5 uops to ports 2,3 -> at 2/cycle, at least 2.5 cycles/iter
From the back-of-the-envelope estimate, we now go from purely backend limited
to simultaneously backend and frontend limited, and we'd expect to go from
about 5.33 cycles/iter to 5 cycles/iter, for a 6.2% reduction.
And indeed, on my work i7-3770S, this change gets us from 5131 Kcyc -> 4762
Kcyc, reducing the cycle count by 7.2%. Close enough, and actually a bit
better than expected!
This example happens to work out very nicely (since it has a lot of arithmetic
and few branch mispredictions or cache misses), but the same general ideas
apply elsewhere. Who says that out-of-order cores are so hard to predict?