While working on an algorithm, at some point I wanted to measure just how much better it is than the serial method. With SIMD you can process multiple elements of an array in parallel, at a granularity set by the size of the vector registers. So if I have algorithms that can process 16, 8, 4, 2, or 1 elements in parallel, what is the optimal way to fire off each one to process an array? This is basically a greedy algorithm: divide N (the size of the array) by 16, then divide what's left over by 8, then divide what's left over by 4, and so on until you've touched every part of the array. You'd fire off the 16-element algorithm as much as you can before having to resort to the smaller ones, eventually reaching 1, where you're processing one element at a time.
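As a quick sketch of that dispatch (with a hypothetical ProcessChunk stub standing in for the real SIMD kernels), it looks something like this:

#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the real SIMD kernels: each instantiation
// would process exactly `Width` elements starting at `Data` per call.
template<std::size_t Width>
void ProcessChunk(const float* Data)
{
	std::printf("Processing %zu elements\n", Width);
}

void ProcessArray(const float* Data, std::size_t N)
{
	std::size_t i = 0;
	// Greedily take the biggest "bites" first, falling back to smaller ones
	for( ; i + 16 <= N; i += 16 ) ProcessChunk<16>(Data + i);
	for( ; i +  8 <= N; i +=  8 ) ProcessChunk< 8>(Data + i);
	for( ; i +  4 <= N; i +=  4 ) ProcessChunk< 4>(Data + i);
	for( ; i +  2 <= N; i +=  2 ) ProcessChunk< 2>(Data + i);
	for( ; i +  1 <= N; i +=  1 ) ProcessChunk< 1>(Data + i);
}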
Say you have an array of 91 elements, and can only take {16,8,4,2,1} sized "bites" out of it. I want to minimize the total number of "bites" I would have to take to cover all elements of the array.
How many 16s? How many 8s? How many 4s?
Here's a pretty obvious greedy algorithm approach:
Exhaust the 16-chunks first, then the 8s, then the 4s, then the 2s, then eventually the 1s.
Starting with 91, fit as many of the largest chunk within it as possible. Easy integer arithmetic:
91 / 16 = 5
So five 16s, with 11 elements left to process. How many 8s can I fit into what the 16s left behind? More integer arithmetic (modulo!):
(91 % 16) / 8 = 1
So I take 91 modulo 16 to find out how much the 16-chunks left over, and then divide that by 8 to see how many 8s I can fit in there. This leaves 3 elements left over.
Now for the 4s: divide whatever the previously exhausted chunks couldn't process by 4 to find out how many times the 4s should go.
(91 % 16 % 8) / 4 = 0
There are only 3 elements left, so no 4s fit in here. It will just be a 2 followed by a 1. This totals 8 iterations, in which I would fire the 16-chunk algorithm 5 times, the 8-chunk algorithm once, the 4-chunk algorithm never, the 2-chunk algorithm once, and the 1-chunk algorithm once.
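The whole chain can be written generically. Here's a minimal sketch (the ChunkCounts name and the vector return are my own, not the final version further below) of the divide-then-remainder loop:

#include <cstddef>
#include <vector>

// For each chunk size 16, 8, 4, 2, 1 (largest first), count how many
// whole chunks fit into what the larger chunks left behind.
std::vector<std::size_t> ChunkCounts(std::size_t N)
{
	std::vector<std::size_t> Counts;
	for( std::size_t Size = 16; Size >= 1; Size /= 2 )
	{
		Counts.push_back(N / Size); // how many of this chunk size fit
		N %= Size;                  // remainder goes to the smaller chunks
	}
	return Counts; // {5, 1, 0, 1, 1} for N = 91
}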
This is exactly the Change-Making problem, with the "coins" here always being powers of 2; we have just done the cashier's algorithm with a powers-of-two coin system.
- If I have to make 91 cents in change, and I have an unlimited supply of {16,8,4,2,1} coins, what is the optimal amount of each coin I should give?
The general greedy approach can be used to try and solve this, but it may not always provide an optimal solution: it only does if every step of the algorithm produces an optimal partial result, which can be a bit of a task to prove depending on your coin system.
Greedy method
For the so-called canonical coin systems, like the one used in the US and many other countries, a greedy algorithm of picking the largest denomination of coin which is not greater than the remaining amount to be made will produce the optimal result. This is not the case for arbitrary coin systems, though: if the coin denominations were 1, 3 and 4, then to make 6, the greedy algorithm would choose three coins (4,1,1) whereas the optimal solution is two coins (3,3).
David Pearson's short paper provides a polynomial-time way to check whether there is any case in which the greedy algorithm fails to produce an optimal result. If there is no counterexample (a case where the greedy algorithm produces a suboptimal result), then the coin system is considered "canonical", meaning the greedy solution is always an optimal one.
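To make that concrete, here's a minimal sketch of the general cashier's algorithm over an arbitrary coin system (the GreedyCoinCount name is mine), reproducing the (4,1,1)-versus-(3,3) failure from the quote above:

#include <cstddef>
#include <cstdio>
#include <vector>

// Greedy cashier's algorithm: take as many of each coin as possible,
// largest coin first. Coins must be sorted in descending order.
std::size_t GreedyCoinCount(std::size_t Amount, const std::vector<std::size_t>& Coins)
{
	std::size_t Count = 0;
	for( std::size_t Coin : Coins )
	{
		Count  += Amount / Coin;
		Amount %= Coin;
	}
	return Count;
}

int main()
{
	// Canonical powers-of-two system: greedy is optimal
	std::printf("%zu\n", GreedyCoinCount(91, {16, 8, 4, 2, 1})); // 8
	// Non-canonical system {4,3,1}: greedy makes 6 with three coins (4,1,1),
	// but the optimal answer is two coins (3,3)
	std::printf("%zu\n", GreedyCoinCount(6, {4, 3, 1}));         // 3
}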
Without getting too much into the hairy details, a coin system of {2^0, 2^1, 2^2, ..., 2^n} will always produce an optimal result with the greedy algorithm:
So if I had 91 elements and can take {16,8,4,2,1} chunks (always powers of two), here's how many iterations I would have in total:
5x 16 | ( (91) // 16 ) = 5 "Coins"
1x 8 | ( (91 % 16) // 8 ) = 1 "Coins"
0x 4 | ( (91 % 16 % 8) // 4 ) = 0 "Coins"
1x 2 | ( (91 % 16 % 8 % 4) // 2 ) = 1 "Coins"
1x 1 | ( (91 % 16 % 8 % 4 % 2) // 1 ) = 1 "Coins"
For a total of 5 + 1 + 1 + 1 = 8 total "Coins"
If it was 31 elements:
|L| = 31
|-----------------31----------------|
|------16--------|---------...------| 31 / 16 = 1
|------16--------|--8-----|----...--| 31 % 16 / 8 = 1
|------16--------|--8-----|-4--|....| 31 % 16 % 8 / 4 = 1
|------16--------|--8-----|-4--|-2|.| 31 % 16 % 8 % 4 / 2 = 1
|------16--------|--8-----|-4--|-2|1| 31 % 16 % 8 % 4 % 2 / 1 = 1
Total iterations : 5
If it was 47 elements:
|L| = 47
|---------------------47-------------------------|
|------16--------|------16--------|-----...------| 47 / 16 = 2
|------16--------|------16--------|--8---|--...--| 47 % 16 / 8 = 1
|------16--------|------16--------|--8---|-4-|...| 47 % 16 % 8 / 4 = 1
|------16--------|------16--------|--8---|-4-|2|.| 47 % 16 % 8 % 4 / 2 = 1
|------16--------|------16--------|--8---|-4-|2|1| 47 % 16 % 8 % 4 % 2 / 1 = 1
Total iterations : 6
The general case of the cashier's algorithm has a complexity of O(m*n), where m is |L| and n is the number of types of "coins" in the coin system (|{16,8,4,2,1}| = 5). Since I'm dealing with binary modulo and division, this can be reduced to plain bit arithmetic, and a pattern can be induced.
The algorithm for determining iteration counts for a 91-element array, using chunk sizes that are the powers of two up to 2^n with n = 4:
16-chunks x 5 | ( (91 ) >> 4 ) = 5 < This can be greater than 1
8-chunks x 1 | ( (91 & 0b1111 ) >> 3 ) = 1 < These will always be 0 or 1!
4-chunks x 0 | ( (91 & 0b111 ) >> 2 ) = 0 <---^ ^ ^
2-chunks x 1 | ( (91 & 0b11 ) >> 1 ) = 1 <------^ |
1-chunks x 1 | ( (91 & 0b1 ) >> 0 ) = 1 <---------^
Some of the bit-tricks involved here:
- Modulo by 2^n is the same as a bitwise "and" (&) with 2^n - 1
  - x % 2^n == x & (2^n - 1)
  - x % 2^4 == x & (2^4 - 1)
  - x % 16 == x & 15 == x & 0b1111
- Division by 2^n is the same as bit-shifting to the right by n bits
  - x / 2^n == x >> n
  - x / 2^4 == x >> 4
  - x / 16 == x >> 4
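A couple of quick asserts sanity-check these identities (purely illustrative):

#include <cassert>

int main()
{
	// Modulo by a power of two is a bitwise AND with (2^n - 1)
	assert( (91 % 16) == (91 & 0b1111) );
	assert( (91 %  8) == (91 & 0b111 ) );
	// Division by a power of two is a right-shift by n bits
	assert( (91 / 16) == (91 >> 4) );
	assert( (91 % 16) / 8 == ((91 & 0b1111) >> 3) );
}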
After the highest "coin" is exhausted, determining whether each of the smaller algorithms should fire off becomes a simple yes-or-no binary result.
(91 & 0b1111 ) >> 3 | This is just getting the 4th bit!
(91 & 0b111 ) >> 2 | This is just getting the 3rd bit!
(91 & 0b11 ) >> 1 | This is just getting the 2nd bit!
(91 & 0b1 ) >> 0 | This is just getting the 1st bit!
These will always be some combination of:
0 + 0 + 0 + 0 = 0
0 + 0 + 0 + 1 = 1
0 + 0 + 1 + 0 = 1
0 + 0 + 1 + 1 = 2
0 + 1 + 0 + 0 = 1
0 + 1 + 0 + 1 = 2
0 + 1 + 1 + 0 = 2
...
Basically just counting bits!
So after realizing this pattern, and realizing that we are always working with a finite number of integer bits (8, 16, 32, 64), a very fast O(1) shortcut can be induced:
- Given an array L of size |L|, and a grouping of elements by {2^0, 2^1, 2^2, 2^3, ..., 2^n}, the optimal number of groups can be calculated very fast by:
  - Integer-divide |L| by 2^n
    - This will always just be a logical bit-shift to the right by n bits: ( 91 / 16 ) -> ( 0b01011011 >> 4 ) -> 0b0101 -> 5
  - Add the number of 1 bits in the lower n bits of |L| to the result from Step 1
    - This is known as a popcnt in x86: popcnt(0b1011) -> 3
All these operations are O(1).
#include <cstddef>
#include <cstdint>

// be sure to use -mpopcnt on gcc to enable the popcnt instruction
std::size_t ChunkCount( std::size_t Length, std::size_t Powers )
{
	// How many of the largest chunk (2^Powers) fit: a single right-shift
	const std::size_t HighCount = Length >> Powers;
	// Every smaller chunk fires once per set bit in the low `Powers` bits
	const std::size_t LowCount = __builtin_popcountll(
		Length % (1uLL << Powers)
	);
	return HighCount + LowCount;
}
This compiles down to a handful of branchless instructions (x86-64 gcc):
mov rcx, rsi
mov rdx, -1
sal rdx, cl
not rdx
and rdx, rdi
shr rdi, cl
popcnt rdx, rdx
lea rax, [rdi+rdx]
ret
ChunkCount(91,4) = 8
ChunkCount(31,4) = 5
ChunkCount(47,4) = 6
So now, say I had an array of size 913 and wanted to compare algorithms. A serial algorithm would have to run 913 times, while a cool SIMD-enabled algorithm that can process {1,2,4,8,16,32,64} elements in parallel would only take ChunkCount(913,6) = 16 iterations. This pattern also arms you with the knowledge that the smaller algorithms are either going to run once or not at all. Meaning if you can process {1,2,4,8,16,32,64} elements in parallel, the 64-element one can run any number of times, while the {1,2,4,8,16,32} ones will each run either once or not at all, which you can determine simply by checking the individual bits of the array size. With this you can re-order your logic to test and shift bits for your high-granularity options, and then use the rest of the upper bits as a counter for the lower-granularity options (exhaust the serial and SSE options first, then finish things off with the more expensive AVX/AVX512 ones).
one can be ran any number of times, while the {1,2,4,8,16,32} ones would either be ran once, or not at all simply by checking the individual bits of the array size. With this you can re-order your logic to test and shift bits for your high granularity options, and then use the rest of the upper bits as a counter for the lower granularity options(exhaust serial and SSE options first, then finish things off with the more expensive AVX/AVX512 ones).