reciprocal.md

R. Alverson - Integer Division Using Reciprocals

For simplicity, this will just go over the unsigned case
Written for someone, such as myself, who isn't as quick to put together the same logical steps

Given $y \in \mathbb{Z}$ where $y > 0$
Find $a,sh \in \mathbb{Z}$ such that $\forall x \in \mathbb{Z} \ (\ \lfloor x/y \rfloor = \lfloor x \cdot a / 2^{sh} \rfloor)$
To motivate what we'll be doing, lets start by considering:

$$ \begin{align*} \frac{x \cdot a}{2^{sh}} &= \quad \frac{x}{y} \\ \frac{a}{2^{sh}} &= \quad \frac{1}{y} \\ a &= \quad \frac{2^{sh}}{y} \\ \end{align*} $$

What we want to show is that $a = \lfloor 2^{sh} / y \rfloor$ can still be used to compute $\lfloor x/y \rfloor$ exactly
So, let $sh = w + \lceil \log_2y \rceil$ where w is our 'effective wordsize' in bits

From an implementation standpoint, we'd like to be able to write:

Recip(uint64_t y){
  uint64_t a, sh;
  // compute a, sh from y
  return (a, sh);
}

which would seem to imply $w = 64$, however we musn't be so hasty.
Given that $\lceil \log_2y \rceil \in \mathbb{Z}$ we have:

$$ \begin{alignedat}{3} \log_2y \ &\leq & \ \lceil \log_2y \rceil &< \ \log_2y + 1 \\ 2^{\log_2y} \ &\leq & \ 2^{\lceil \log_2y \rceil} &< \ 2^{\log_2y + 1} \\ 2^{\log_2y} \ &\leq & \ 2^{\lceil \log_2y \rceil} &< \ 2^{\log_2y} \cdot 2 \\ y \ &\leq & \ 2^{\lceil \log_2y \rceil} &< \ 2y \\ y \ &\leq & \ 2^{\lceil \log_2y \rceil} &\leq \ 2y-1 &\quad \text{since } 2y,\ 2^{\lceil \log_2y \rceil} \in \mathbb{Z} \\ 2^w \cdot y \ &\leq & \ 2^w \cdot 2^{\lceil \log_2y \rceil} &\leq \ 2^w \cdot (2y-1) \\ 2^w \cdot y \ &\leq & \ 2^{w + \lceil \log_2y \rceil} &\leq \ 2^{w+1} \cdot y - 2^{w} \\ 2^w \cdot y \ &\leq & \ 2^{sh} &\leq \ 2^{w+1} \cdot y - 2^{w} &\quad \text{let } sh = w + \lceil \log_2y \rceil\\ 2^w \ &\leq & \ 2^{sh}/y &\leq \ 2^{w+1} - 2^w/y \\ 2^w \ &\leq & \ \lceil 2^{sh}/y \rceil &\leq \ \lceil 2^{w+1} - 2^w/y \rceil \\ 2^w \ &\leq & \ a &\leq \ 2^{w+1} - \lfloor 2^w/y \rfloor \\ 2^w \ &\leq & \ a &\leq \ 2^{w+1} - 1 &\quad \text{when } y \leq 2^w \\ \end{alignedat} $$

So what do we get from the bounds $2^w \leq a \leq \ 2^{w+1} - 1$?
Well, if we know $a$ fits in $w+1$ bits, and since we want to store $a$ in a uint64_t $\Rightarrow w+1 = 64 \Rightarrow w = 63$
Note that above we restricted $y \leq 2^w \equiv y \leq 2^{63}$ which is no longer true for all values uint64_t
If that's a deal breaker, then you'll have to implement the extended precision case for $a$ as an exercise for the reader ;)

Okay, so we know how to find $a,sh$, but how do we know that they compute $\lfloor x/y \rfloor$ exactly?
Well, we're going to have to be a bit more precise with what those $\lfloor … \rfloor$ and $\lceil … \rceil$ are doing

$$ \begin{alignedat}{3} \text{let } q = \left \lfloor \frac{x}{y} \right \rfloor &\Rightarrow \exists \ r_q \in [0,y) \quad q \cdot y + r_q = x \\ \text{let } a = \left \lceil \frac{2^{sh}}{y} \right \rceil &\Rightarrow \exists \ r_a \in [0,y) \quad a \cdot y - r_a = 2^{sh} \\ \end{alignedat} $$

Note, $\lfloor … \rfloor$ can make $q$ smaller, so $r_q$ is added on to compensate
while $\lceil … \rceil$ can make $a$ larger, so $r_a$ is taken away to compensate

Now, if we can show that $r_q$ and $r_a$ make no difference when calculating $q = \lfloor x/y \rfloor = \lfloor x \cdot a / 2^{sh} \rfloor$ then we're done :)

$$ \begin{alignedat}{3} \text{let } Q &= &\lfloor x \cdot a \ /\ 2^{sh} \rfloor \\ &= &\lfloor (q \cdot y + r_q) \cdot a \ /\ 2^{sh} \rfloor \\ &= &\lfloor (q \cdot a y + a\ r_q) \ /\ 2^{sh} \rfloor \\ &= &\lfloor (q \cdot (2^{sh} + r_a) + a\ r_q) \ /\ 2^{sh} \rfloor \\ &= &\lfloor (q \cdot 2^{sh} + q \cdot r_a + a\ r_q) \ /\ 2^{sh} \rfloor \\ &= \ q \ + &\lfloor (q\ r_a + a\ r_q) \ /\ 2^{sh} \rfloor \\ \end{alignedat} $$

To guarantee that $Q = q$ we need to show that $0 \leq q\ r_a + a\ r_q < 2^{sh}$
The non-negative lower bound should come as no surprise, so all that's left is the upper bound

$$ \begin{alignedat}{3} & r_a &< y \\ & r_a &\leq y-1 \\ & q \ r_a &\leq q\ y - q \\ & q \ r_a &\leq (x - r_q) - q \\ & q \ r_a &\leq (x - q) - r_q \\ \end{alignedat} $$

$$ \begin{align*} \text{For an upper bound of } \ (x-q) \ \text{ we'll maximize } x \text{ and minimize } q \\ \text{This resuts in } \ x-q \leq 2^w - 1 < 2^w \end{align*} $$

$$ \begin{alignedat}{3} & \ q \ r_a &\leq (x - q) - r_q \\ & \ q \ r_a &< (2^w) - r_q \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ a \ r_q - r_q \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ (a-1) \ r_q \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ (a-1) \ (y-1) \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ ay - a - y + 1 \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ (2^{sh}+r_a) - a - y + 1 \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ 2^{sh}+(y-1) - a - y + 1 \\ & \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ 2^{sh} - a \\ & \ q \ r_a \ + \ a \ r_q &< 2^{sh} \ + \ 2^w - a \\ & \ q \ r_a \ + \ a \ r_q &< 2^{sh} \quad \text{since } 2^w \leq a \quad \text{might have to scroll up a bit ;)} \\ \end{alignedat} $$

All right, that's enough math, time to get a bit gung-ho with writing the code for it

Recip(uint64_t y){
  assert(y <= (1ull << 63));    // Ensure y <= 2^63
  
  uint64_t sh = 0;
  while(y > (1ull << sh)){ sh++; }         // sh' = ceil(log2(y))
  sh += 63;                                // sh  = ceil(log2(y)) + w
  
  uint64_t a = ((1ull << sh) + y-1) / y    //  a  = ceil(2^sh / y) = floor((2^sh + y-1) / y)
  return (a, sh);
}

FastDivExamp(uint64_t[] x, uint64_t y){
  uint64_t a, sh;
  a, sh = Recip(y);
  for(i, 0, x.count){
    //x[i] = x[i] / y;        // q = floor(x/y)
    x[i] = (x[i]*a) >> sh;    // Q = (x*a) / 2^sh
  }
}

Ahhh, such an elegant mapping from theory to implementation.
Unfortunately, it fails the smell test for anyone paying attention :P

If we tried plugging in y = 8, we'd get sh = 3 + 63 = 66
So, pray tell, what do we expect to get from either (1ull << sh) or (x*a)?

Instead, we'll explicitly handle the pseudo uint128_t operations using a (uint64_t, uint64_t) tuple
This has the benefit where if we have (v1,v0) then returning v1 is equivalent to (v1,v0) >> 64 which we can bake into sh
So the only hurdle left is calculating ((1<<sh) + y-1) / y

We have a convenient mapping from (1<<sh) + (y-1) $\to$ (v1,v0)
v0 = y-1 when $y-1 < 2^{64}$
v1 = 1 << (sh-64) when $2^{64} \leq 2^{sh} < 2^{128}$
The condition for v1 is only true when $64 \leq sh \ \equiv \ 64 \leq 63 + \lceil \log_2y \rceil \ \Rightarrow \ 1 \leq \lceil \log_2y \rceil \ \Rightarrow \ 2 \leq y$
Now, for the case when $y=1$, I propose an alternative algorithm: $\ \forall x \in \mathbb{Z} \ (\ \lfloor x/1 \rfloor = x)$

Finally, we can take our (v1,v0) tuple and pass it to the x64 DIV
While this can raise an exception if the result won't fit within a uint64_t, we've already proven that $a \leq \ 2^{w+1} - 1$

The same idea can be used with x64 MUL and only using the high 64 bits to compute the aforementioned (v1,v0) >> 64

Recip(uint64_t y){
  assert(1 < y && y <= (1ull << 63));    // Ensure 1 < y <= 2^63
  
  uint64_t sh = 0;
  while(y > (1ull << sh)){ sh++; }         // sh' = ceil(log2(y))
  sh += 63 - 64;                           // sh  = ceil(log2(y)) + w - 64
  
  uint64_t a = Div128(1ull<<sh, y-1, y);   //  a  = ceil(2^sh / y) = floor((2^sh + y-1) / y)
  return (a, sh);
}

FastDivExamp(uint64_t[] x, uint64_t y){
  uint64_t a, sh;
  a, sh = Recip(y);
  for(i, 0, x.count){
    //x[i] = x[i] / y;                // q = floor(x/y)
    x[i] = MulHi64(x[i], a) >> sh;    // Q = (x*a) / 2^sh
  }
}

~~TODO~~ Excercises for the reader:

w=64: some people want/need the full range
signed-ints: we can have a sign-bit... as a treat
modulo by constant: hash-maps and divisibility test r=0
div by constant divisor: pointer arithmetic

B-Y-P/reciprocal.md

R. Alverson - Integer Division Using Reciprocals