Binary search is theoretically optimal, but it's possible to speed it up substantially using AVX2 and branchless code even in .NET Core.
Memory access is the limiting factor for binary search. When we access each element for comparison a cache line is loaded, so we could load a 32-byte vector almost free, check if it contains the target value, and if not - reduce the search space by 32/sizeof(T)
elements instead of 1 element. This gives quite good performance improvement (code in BinarySearch1.cs
and results in the table 1 below).
However, for larger N the search space reduction is quite minimal and the most gains come from switching to linear search. After an interesting discussion in Twitter (especially with @trav_downs), and trying to widen the pivot area to use 2 AVX2 vectors it became clear that just switching to linear search sooner is more important than using AVX2 vectors as pivots.
The linear search was not using AVX2, and for linear AVX2 should definitely work, shouldn't it!? With vectorized linear search and some additional branching optimization the performance is improved by additional 30-50% for the most relevant N (code in BinarySearch2.cs
and results in the table 2 below).
The final results vs classic binary search are faster by 65% on average, with near 2x improvement for N=[512,1024]
:
N | Classic | Avx | Avx+ | Avx/Classic | Avx+/Avx | Avx+/Classic |
---|---|---|---|---|---|---|
1 | 569.0 | 390.3 | 630.0 | -31% | 61% | 11% |
2 | 537.5 | 287.0 | 616.1 | -47% | 115% | 15% |
4 | 286.0 | 209.5 | 629.5 | -27% | 201% | 120% |
8 | 185.2 | 290.4 | 247.5 | 57% | -15% | 34% |
16 | 120.4 | 215.3 | 199.4 | 79% | -7% | 66% |
32 | 99.5 | 144.2 | 153.8 | 45% | 7% | 55% |
64 | 76.4 | 119.2 | 129.5 | 56% | 9% | 69% |
128 | 61.2 | 101.5 | 111.1 | 66% | 10% | 82% |
256 | 50.2 | 81.0 | 83.3 | 61% | 3% | 66% |
512 | 29.1 | 43.8 | 74.8 | 50% | 71% | 157% |
1024 | 22.5 | 31.0 | 43.7 | 38% | 41% | 94% |
4096 | 19.0 | 23.3 | 30.7 | 23% | 32% | 62% |
16384 | 17.7 | 20.5 | 28.3 | 16% | 38% | 60% |
65536 | 16.7 | 19.6 | 24.6 | 17% | 26% | 47% |
131072 | 15.9 | 19.1 | 23.1 | 20% | 21% | 45% |
Avg | 28% | 41% | 65% | |||
Min | -47% | -15% | 11% | |||
Max | 79% | 201% | 157% |
AVX512 with _mm256_cmpge_epi64_mask
instruction should improve it even more, but it is not available on .NET yet.
- Avx - using AVX2 vector as pivot
- Avx+ - same as Avx + using AVX2 for linear search
Results for long. Searching [0,1,2,3,...] in [0,2,4,6,...] array (50% hit rate)
Table 1: use AVX2 for pivot
Case | MOPS | Elapsed | GC0 | GC1 | GC2 | Memory |
---|---|---|---|---|---|---|
BS_Classic_2 | 570.38 | 15 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_1 | 569.90 | 15 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_1 | 505.60 | 17 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_2 | 504.03 | 17 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_4 | 503.77 | 17 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_4 | 288.84 | 29 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_8 | 285.34 | 29 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_16 | 186.61 | 45 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_8 | 183.58 | 46 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_32 | 151.42 | 55 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_64 | 117.92 | 71 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_16 | 111.73 | 75 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_128 | 95.59 | 88 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_32 | 90.19 | 93 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_64 | 71.43 | 117 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_256 | 63.05 | 133 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_128 | 60.40 | 139 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_256 | 47.07 | 178 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_512 | 40.02 | 210 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_1024 | 31.25 | 268 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_512 | 28.62 | 293 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_4096 | 22.84 | 367 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_1024 | 22.35 | 375 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_16384 | 19.77 | 424 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_4096 | 19.03 | 441 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_16384 | 17.71 | 474 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
Table 2: use AVX2 for linear search
Case | MOPS | Elapsed | GC0 | GC1 | GC2 | Memory |
---|---|---|---|---|---|---|
BS_Avx+_4 | 519.18 | 40 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_1 | 507.19 | 41 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_2 | 498.41 | 42 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_1 | 372.98 | 56 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_2 | 283.01 | 74 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_8 | 260.14 | 81 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_8 | 244.84 | 86 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_4 | 241.00 | 87 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_16 | 201.39 | 104 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_16 | 197.56 | 106 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_32 | 157.66 | 133 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_32 | 145.97 | 144 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_64 | 127.95 | 164 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_64 | 122.45 | 171 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_128 | 108.82 | 193 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_128 | 98.70 | 212 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_256 | 89.20 | 235 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_256 | 81.84 | 256 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_512 | 72.26 | 290 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_1024 | 44.33 | 473 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_512 | 36.27 | 578 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_4096 | 30.81 | 681 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_1024 | 28.90 | 726 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_16384 | 28.55 | 735 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_65536 | 24.31 | 863 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_131072 | 22.53 | 931 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_4096 | 22.53 | 931 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_16384 | 19.56 | 1,072 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_65536 | 18.17 | 1,154 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_131072 | 17.37 | 1,207 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
Table 3: combined results (different run from table 2)
Case | MOPS | Elapsed | GC0 | GC1 | GC2 | Memory |
---|---|---|---|---|---|---|
BS_Avx+_1 | 629.99 | 33 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_4 | 629.49 | 33 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_2 | 616.09 | 34 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_1 | 569.00 | 37 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_2 | 537.47 | 39 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_1 | 390.32 | 54 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_8 | 290.42 | 72 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_2 | 286.97 | 73 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_4 | 286.03 | 73 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_8 | 247.45 | 85 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_16 | 215.32 | 97 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_4 | 209.48 | 100 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_16 | 199.38 | 105 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_8 | 185.15 | 113 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_32 | 153.79 | 136 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_32 | 144.24 | 145 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_64 | 129.52 | 162 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_16 | 120.41 | 174 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_64 | 119.15 | 176 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_128 | 111.14 | 189 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_128 | 101.45 | 207 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_32 | 99.50 | 211 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_256 | 83.33 | 252 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_256 | 81.02 | 259 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_64 | 76.43 | 274 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_512 | 74.81 | 280 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_128 | 61.20 | 343 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_256 | 50.18 | 418 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_512 | 43.84 | 478 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_1024 | 43.72 | 480 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_1024 | 31.00 | 676 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_4096 | 30.71 | 683 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_512 | 29.13 | 720 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_16384 | 28.26 | 742 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_65536 | 24.60 | 853 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_4096 | 23.28 | 901 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx+_131072 | 23.08 | 909 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_1024 | 22.48 | 933 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_16384 | 20.52 | 1,022 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_65536 | 19.57 | 1,072 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Avx_131072 | 19.06 | 1,100 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_4096 | 18.95 | 1,106 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_16384 | 17.68 | 1,186 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_65536 | 16.71 | 1,255 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
BS_Classic_131072 | 15.87 | 1,321 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |