While working on an algorithm, at some point I wanted to measure just how much better it is than the serial method. With SIMD you can process multiple elements of an array in parallel, at a granularity set by the size of the vector registers. So if I have algorithms that can process 16, 8, 4, 2, or 1 elements in parallel, what is the optimal way to fire off each one to process an array? This is basically a greedy algorithm: divide N (the size of the array) by 16, then divide what's left over by 8, then divide what's left over by 4, and so on until you've touched every part of the array. You'd fire off the 16-element algorithm as much as you can before having to resort to the smaller ones, eventually reaching 1, where you're processing one element at a time.
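As a quick sketch of that dispatch (with a hypothetical ProcessChunk stub standing in for the real SIMD kernels), it looks something like this:

#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the real SIMD kernels: each instantiation
// would process exactly `Width` elements starting at `Data` per call.
template<std::size_t Width>
void ProcessChunk(const float* Data)
{
	std::printf("Processing %zu elements\n", Width);
}

void ProcessArray(const float* Data, std::size_t N)
{
	std::size_t i = 0;
	// Greedily take the biggest "bites" first, falling back to smaller ones
	for( ; i + 16 <= N; i += 16 ) ProcessChunk<16>(Data + i);
	for( ; i +  8 <= N; i +=  8 ) ProcessChunk< 8>(Data + i);
	for( ; i +  4 <= N; i +=  4 ) ProcessChunk< 4>(Data + i);
	for( ; i +  2 <= N; i +=  2 ) ProcessChunk< 2>(Data + i);
	for( ; i +  1 <= N; i +=  1 ) ProcessChunk< 1>(Data + i);
}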
Say you have an array of 91 elements, and can only take {16,8,4,2,1} sized "bites" out of it. I want to minimize the total number of "bites" I would have to take to cover all elements of the array.
How many 16s? How many 8s? How many 4s?
Here's a pretty obvious greedy algorithm approach:
Exhaust the 16-chunks first, then the 8s, then the 4s, then the 2s, then eventually the 1s.
Starting with 91, fit as many of the largest chunk within it as possible. Easy integer arithmetic:
91 / 16 = 5
So five 16s, with 11 elements left to process. How many 8s can I fit into what the 16s left behind? More integer arithmetic (modulo!):
(91 % 16) / 8 = 1
So I take 91 modulo 16 to find out how much the 16-chunks left over, and then divide that by 8 to see how many 8s I can fit in there. This leaves 3 elements left over.
Now for the 4s: divide whatever the previously exhausted chunks couldn't process by 4 to find out how many times the 4s should go.
(91 % 16 % 8) / 4 = 0
There are only 3 elements left, so no 4s fit in here. It will just be a 2 followed by a 1. This totals 8 iterations, in which I would fire the 16-chunk algorithm 5 times, the 8-chunk algorithm once, the 4-chunk algorithm never, the 2-chunk algorithm once, and the 1-chunk algorithm once.
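The whole chain can be written generically. Here's a minimal sketch (the ChunkCounts name and the vector return are my own, not the final version further below) of the divide-then-remainder loop:

#include <cstddef>
#include <vector>

// For each chunk size 16, 8, 4, 2, 1 (largest first), count how many
// whole chunks fit into what the larger chunks left behind.
std::vector<std::size_t> ChunkCounts(std::size_t N)
{
	std::vector<std::size_t> Counts;
	for( std::size_t Size = 16; Size >= 1; Size /= 2 )
	{
		Counts.push_back(N / Size); // how many of this chunk size fit
		N %= Size;                  // remainder goes to the smaller chunks
	}
	return Counts; // {5, 1, 0, 1, 1} for N = 91
}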
This is exactly the Change-Making problem, with the "coins" here always being powers of 2; we have just done the cashier's algorithm with a powers-of-two coin system.
- If I have to make 91 cents in change, and I have an unlimited supply of {16,8,4,2,1} coins, what is the optimal amount of each coin I should give?
The general greedy approach can be used to try and solve this, but it may not always provide an optimal solution: it only does if every step of the algorithm produces an optimal partial result, which can be a bit of a task to prove depending on your coin system.
Greedy method
For the so-called canonical coin systems, like the one used in the US and many other countries, a greedy algorithm of picking the largest denomination of coin which is not greater than the remaining amount to be made will produce the optimal result. This is not the case for arbitrary coin systems, though: if the coin denominations were 1, 3 and 4, then to make 6, the greedy algorithm would choose three coins (4,1,1) whereas the optimal solution is two coins (3,3).
David Pearson's short paper provides a polynomial-time way to check whether there is any case in which the greedy algorithm fails to produce an optimal result. If there is no counterexample (a case where the greedy algorithm produces a suboptimal result), then the coin system is considered "canonical", meaning the greedy solution is always an optimal one.
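To make that concrete, here's a minimal sketch of the general cashier's algorithm over an arbitrary coin system (the GreedyCoinCount name is mine), reproducing the (4,1,1)-versus-(3,3) failure from the quote above:

#include <cstddef>
#include <cstdio>
#include <vector>

// Greedy cashier's algorithm: take as many of each coin as possible,
// largest coin first. Coins must be sorted in descending order.
std::size_t GreedyCoinCount(std::size_t Amount, const std::vector<std::size_t>& Coins)
{
	std::size_t Count = 0;
	for( std::size_t Coin : Coins )
	{
		Count  += Amount / Coin;
		Amount %= Coin;
	}
	return Count;
}

int main()
{
	// Canonical powers-of-two system: greedy is optimal
	std::printf("%zu\n", GreedyCoinCount(91, {16, 8, 4, 2, 1})); // 8
	// Non-canonical system {4,3,1}: greedy makes 6 with three coins (4,1,1),
	// but the optimal answer is two coins (3,3)
	std::printf("%zu\n", GreedyCoinCount(6, {4, 3, 1}));         // 3
}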
Without getting too much into the hairy details, a coin system of {2^0, 2^1, 2^2, ..., 2^n} will always produce an optimal result with the greedy algorithm:
So if I had 91 elements and can take {16,8,4,2,1} chunks (always powers of two), here's how many iterations I would have in total:
5x 16 | ( (91) // 16 ) = 5 "Coins"
1x 8 | ( (91 % 16) // 8 ) = 1 "Coins"
0x 4 | ( (91 % 16 % 8) // 4 ) = 0 "Coins"
1x 2 | ( (91 % 16 % 8 % 4) // 2 ) = 1 "Coins"
1x 1 | ( (91 % 16 % 8 % 4 % 2) // 1 ) = 1 "Coins"
For a total of 5 + 1 + 1 + 1 = 8 total "Coins"
If it was 31 elements:
|L| = 31
|-----------------31----------------|
|------16--------|---------...------| 31 / 16 = 1
|------16--------|--8-----|----...--| 31 % 16 / 8 = 1
|------16--------|--8-----|-4--|....| 31 % 16 % 8 / 4 = 1
|------16--------|--8-----|-4--|-2|.| 31 % 16 % 8 % 4 / 2 = 1
|------16--------|--8-----|-4--|-2|1| 31 % 16 % 8 % 4 % 2 / 1 = 1
Total iterations : 5
If it was 47 elements:
|L| = 47
|---------------------47-------------------------|
|------16--------|------16--------|-----...------| 47 / 16 = 2
|------16--------|------16--------|--8---|--...--| 47 % 16 / 8 = 1
|------16--------|------16--------|--8---|-4-|...| 47 % 16 % 8 / 4 = 1
|------16--------|------16--------|--8---|-4-|2|.| 47 % 16 % 8 % 4 / 2 = 1
|------16--------|------16--------|--8---|-4-|2|1| 47 % 16 % 8 % 4 % 2 / 1 = 1
Total iterations : 6
The general case of the cashier's algorithm has a complexity of O(m*n), where m is |L| and n is the number of types of "coins" in the coin system (|{16,8,4,2,1}| = 5). Since I'm dealing with binary modulo and division, this can be reduced to plain bit arithmetic, and a pattern can be induced.
The algorithm for determining iteration counts for a 91-element array, using chunk sizes that are the powers of two up to 2^n with n = 4:
16-chunks x 5 | ( (91 ) >> 4 ) = 5 < This can be greater than 1
8-chunks x 1 | ( (91 & 0b1111 ) >> 3 ) = 1 < These will always be 0 or 1!
4-chunks x 0 | ( (91 & 0b111 ) >> 2 ) = 0 <---^ ^ ^
2-chunks x 1 | ( (91 & 0b11 ) >> 1 ) = 1 <------^ |
1-chunks x 1 | ( (91 & 0b1 ) >> 0 ) = 1 <---------^
Some of the bit-tricks involved here:
- Modulo by 2^n is the same as a bitwise "and" (&) with 2^n - 1
  - x % 2^n == x & (2^n - 1)
  - x % 2^4 == x & (2^4 - 1)
  - x % 16 == x & 15 == x & 0b1111
- Division by 2^n is the same as bit-shifting to the right by n bits
  - x / 2^n == x >> n
  - x / 2^4 == x >> 4
  - x / 16 == x >> 4
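A couple of quick asserts sanity-check these identities (purely illustrative):

#include <cassert>

int main()
{
	// Modulo by a power of two is a bitwise AND with (2^n - 1)
	assert( (91 % 16) == (91 & 0b1111) );
	assert( (91 %  8) == (91 & 0b111 ) );
	// Division by a power of two is a right-shift by n bits
	assert( (91 / 16) == (91 >> 4) );
	assert( (91 % 16) / 8 == ((91 & 0b1111) >> 3) );
}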
After the highest "coin" is exhausted, determining whether each of the smaller algorithms should fire off becomes a simple yes-or-no binary result.
(91 & 0b1111 ) >> 3 | This is just getting the 4th bit!
(91 & 0b111 ) >> 2 | This is just getting the 3rd bit!
(91 & 0b11 ) >> 1 | This is just getting the 2nd bit!
(91 & 0b1 ) >> 0 | This is just getting the 1st bit!
These will always be some combination of:
0 + 0 + 0 + 0 = 0
0 + 0 + 0 + 1 = 1
0 + 0 + 1 + 0 = 1
0 + 0 + 1 + 1 = 2
0 + 1 + 0 + 0 = 1
0 + 1 + 0 + 1 = 2
0 + 1 + 1 + 0 = 2
...
Basically just counting bits!
So after realizing this pattern, and realizing that we are always working with a finite number of integer bits (8, 16, 32, 64), a very fast O(1) shortcut can be induced:
- Given an array L of size |L|, and a grouping of elements by {2^0, 2^1, 2^2, 2^3, ..., 2^n}, the optimal number of groups can be calculated very fast by:
  - Integer-divide |L| by 2^n
    - This will always just be a logical bit-shift to the right by n bits: ( 91 / 16 ) -> ( 0b01011011 >> 4 ) -> 0b0101 -> 5
  - Add the number of 1 bits in the lower n bits of |L| to the result from Step 1
    - This is known as a popcnt in x86: popcnt(0b1011) -> 3
All these operations are O(1).
#include <cstddef>
#include <cstdint>

// be sure to use -mpopcnt on gcc to enable the popcnt instruction
std::size_t ChunkCount( std::size_t Length, std::size_t Powers )
{
	// How many of the largest chunk (2^Powers) fit: a single right-shift
	const std::size_t HighCount = Length >> Powers;
	// Every smaller chunk fires once per set bit in the low `Powers` bits
	const std::size_t LowCount = __builtin_popcountll(
		Length % (1uLL << Powers)
	);
	return HighCount + LowCount;
}
This compiles down to a handful of branchless instructions (x86-64 gcc):
mov rcx, rsi
mov rdx, -1
sal rdx, cl
not rdx
and rdx, rdi
shr rdi, cl
popcnt rdx, rdx
lea rax, [rdi+rdx]
ret
ChunkCount(91,4) = 8
ChunkCount(31,4) = 5
ChunkCount(47,4) = 6
So now, say I had an array of size 913 and wanted to compare algorithms. A serial algorithm would have to run 913 times, while a cool SIMD-enabled algorithm that can process {1,2,4,8,16,32,64} elements in parallel would only take ChunkCount(913,6) = 16 iterations. This pattern also arms you with the knowledge that the smaller algorithms are either going to run once or not at all. Meaning if you can process {1,2,4,8,16,32,64} elements in parallel, the 64-element one can run any number of times, while the {1,2,4,8,16,32} ones will each run either once or not at all, which you can determine simply by checking the individual bits of the array size. With this you can re-order your logic to test and shift bits for your high-granularity options, and then use the rest of the upper bits as a counter for the lower-granularity options (exhaust the serial and SSE options first, then finish things off with the more expensive AVX/AVX512 ones).
one can be ran any number of times, while the {1,2,4,8,16,32} ones would either be ran once, or not at all simply by checking the individual bits of the array size. With this you can re-order your logic to test and shift bits for your high granularity options, and then use the rest of the upper bits as a counter for the lower granularity options(exhaust serial and SSE options first, then finish things off with the more expensive AVX/AVX512 ones).