For simplicity, this will just go over the unsigned case
Written for someone, such as myself, who isn't as quick to put together the same logical steps
Given $y \in \mathbb{Z}$ where $y > 0$
Find $a,sh \in \mathbb{Z}$ such that $\forall x \in \mathbb{Z} \ (\ \lfloor x/y \rfloor = \lfloor x \cdot a / 2^{sh} \rfloor)$
To motivate what we'll be doing, lets start by considering:
$$
\begin{align*}
\frac{x \cdot a}{2^{sh}} &= \quad \frac{x}{y} \\
\frac{a}{2^{sh}} &= \quad \frac{1}{y} \\
a &= \quad \frac{2^{sh}}{y} \\
\end{align*}
$$
What we want to show is that $a = \lfloor 2^{sh} / y \rfloor$ can still be used to compute $\lfloor x/y \rfloor$ exactly
So, let $sh = w + \lceil \log_2y \rceil$ where w is our 'effective wordsize' in bits
From an implementation standpoint, we'd like to be able to write:
Recip(uint64_t y){
uint64_t a, sh;
// compute a, sh from y
return (a, sh);
}
which would seem to imply $w = 64$, however we musn't be so hasty.
Given that $\lceil \log_2y \rceil \in \mathbb{Z}$ we have:
$$
\begin{alignedat}{3}
\log_2y \ &\leq & \ \lceil \log_2y \rceil &< \ \log_2y + 1 \\
2^{\log_2y} \ &\leq & \ 2^{\lceil \log_2y \rceil} &< \ 2^{\log_2y + 1} \\
2^{\log_2y} \ &\leq & \ 2^{\lceil \log_2y \rceil} &< \ 2^{\log_2y} \cdot 2 \\
y \ &\leq & \ 2^{\lceil \log_2y \rceil} &< \ 2y \\
y \ &\leq & \ 2^{\lceil \log_2y \rceil} &\leq \ 2y-1 &\quad \text{since } 2y,\ 2^{\lceil \log_2y \rceil} \in \mathbb{Z} \\
2^w \cdot y \ &\leq & \ 2^w \cdot 2^{\lceil \log_2y \rceil} &\leq \ 2^w \cdot (2y-1) \\
2^w \cdot y \ &\leq & \ 2^{w + \lceil \log_2y \rceil} &\leq \ 2^{w+1} \cdot y - 2^{w} \\
2^w \cdot y \ &\leq & \ 2^{sh} &\leq \ 2^{w+1} \cdot y - 2^{w} &\quad \text{let } sh = w + \lceil \log_2y \rceil\\
2^w \ &\leq & \ 2^{sh}/y &\leq \ 2^{w+1} - 2^w/y \\
2^w \ &\leq & \ \lceil 2^{sh}/y \rceil &\leq \ \lceil 2^{w+1} - 2^w/y \rceil \\
2^w \ &\leq & \ a &\leq \ 2^{w+1} - \lfloor 2^w/y \rfloor \\
2^w \ &\leq & \ a &\leq \ 2^{w+1} - 1 &\quad \text{when } y \leq 2^w \\
\end{alignedat}
$$
So what do we get from the bounds $2^w \leq a \leq \ 2^{w+1} - 1$?
Well, if we know $a$ fits in $w+1$ bits, and since we want to store $a$ in a uint64_t
$\Rightarrow w+1 = 64 \Rightarrow w = 63$
Note that above we restricted $y \leq 2^w \equiv y \leq 2^{63}$ which is no longer true for all values uint64_t
If that's a deal breaker, then you'll have to implement the extended precision case for $a$ as an exercise for the reader ;)
Okay, so we know how to find $a,sh$, but how do we know that they compute $\lfloor x/y \rfloor$ exactly?
Well, we're going to have to be a bit more precise with what those $\lfloor … \rfloor$ and $\lceil … \rceil$ are doing
$$
\begin{alignedat}{3}
\text{let } q = \left \lfloor \frac{x}{y} \right \rfloor &\Rightarrow \exists \ r_q \in [0,y) \quad q \cdot y + r_q = x \\
\text{let } a = \left \lceil \frac{2^{sh}}{y} \right \rceil &\Rightarrow \exists \ r_a \in [0,y) \quad a \cdot y - r_a = 2^{sh} \\
\end{alignedat}
$$
Note, $\lfloor … \rfloor$ can make $q$ smaller, so $r_q$ is added on to compensate
while $\lceil … \rceil$ can make $a$ larger, so $r_a$ is taken away to compensate
Now, if we can show that $r_q$ and $r_a$ make no difference when calculating $q = \lfloor x/y \rfloor = \lfloor x \cdot a / 2^{sh} \rfloor$ then we're done :)
$$
\begin{alignedat}{3}
\text{let } Q &= &\lfloor x \cdot a \ /\ 2^{sh} \rfloor \\
&= &\lfloor (q \cdot y + r_q) \cdot a \ /\ 2^{sh} \rfloor \\
&= &\lfloor (q \cdot a y + a\ r_q) \ /\ 2^{sh} \rfloor \\
&= &\lfloor (q \cdot (2^{sh} + r_a) + a\ r_q) \ /\ 2^{sh} \rfloor \\
&= &\lfloor (q \cdot 2^{sh} + q \cdot r_a + a\ r_q) \ /\ 2^{sh} \rfloor \\
&= \ q \ + &\lfloor (q\ r_a + a\ r_q) \ /\ 2^{sh} \rfloor \\
\end{alignedat}
$$
To guarantee that $Q = q$ we need to show that $0 \leq q\ r_a + a\ r_q < 2^{sh}$
The non-negative lower bound should come as no surprise, so all that's left is the upper bound
$$
\begin{alignedat}{3}
& r_a &< y \\
& r_a &\leq y-1 \\
& q \ r_a &\leq q\ y - q \\
& q \ r_a &\leq (x - r_q) - q \\
& q \ r_a &\leq (x - q) - r_q \\
\end{alignedat}
$$
$$
\begin{align*}
\text{For an upper bound of } \ (x-q) \ \text{ we'll maximize } x \text{ and minimize } q \\
\text{This resuts in } \ x-q \leq 2^w - 1 < 2^w
\end{align*}
$$
$$
\begin{alignedat}{3}
& \ q \ r_a &\leq (x - q) - r_q \\
& \ q \ r_a &< (2^w) - r_q \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ a \ r_q - r_q \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ (a-1) \ r_q \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ (a-1) \ (y-1) \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ ay - a - y + 1 \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ (2^{sh}+r_a) - a - y + 1 \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ 2^{sh}+(y-1) - a - y + 1 \\
& \ q \ r_a \ + \ a \ r_q &< 2^w \ + \ 2^{sh} - a \\
& \ q \ r_a \ + \ a \ r_q &< 2^{sh} \ + \ 2^w - a \\
& \ q \ r_a \ + \ a \ r_q &< 2^{sh} \quad \text{since } 2^w \leq a \quad \text{might have to scroll up a bit ;)} \\
\end{alignedat}
$$
All right, that's enough math, time to get a bit gung-ho with writing the code for it
Recip(uint64_t y){
assert(y <= (1ull << 63)); // Ensure y <= 2^63
uint64_t sh = 0;
while(y > (1ull << sh)){ sh++; } // sh' = ceil(log2(y))
sh += 63; // sh = ceil(log2(y)) + w
uint64_t a = ((1ull << sh) + y-1) / y // a = ceil(2^sh / y) = floor((2^sh + y-1) / y)
return (a, sh);
}
FastDivExamp(uint64_t[] x, uint64_t y){
uint64_t a, sh;
a, sh = Recip(y);
for(i, 0, x.count){
//x[i] = x[i] / y; // q = floor(x/y)
x[i] = (x[i]*a) >> sh; // Q = (x*a) / 2^sh
}
}
Ahhh, such an elegant mapping from theory to implementation.
Unfortunately, it fails the smell test for anyone paying attention :P
If we tried plugging in y = 8
, we'd get sh = 3 + 63 = 66
So, pray tell, what do we expect to get from either (1ull << sh)
or (x*a)
?
Instead, we'll explicitly handle the pseudo uint128_t
operations using a (uint64_t, uint64_t)
tuple
This has the benefit where if we have (v1,v0)
then returning v1
is equivalent to (v1,v0) >> 64
which we can bake into sh
So the only hurdle left is calculating ((1<<sh) + y-1) / y
We have a convenient mapping from (1<<sh) + (y-1)
$\to$ (v1,v0)
v0 = y-1
when $y-1 < 2^{64}$
v1 = 1 << (sh-64)
when $2^{64} \leq 2^{sh} < 2^{128}$
The condition for v1
is only true when $64 \leq sh \ \equiv \ 64 \leq 63 + \lceil \log_2y \rceil \ \Rightarrow \ 1 \leq \lceil \log_2y \rceil \ \Rightarrow \ 2 \leq y$
Now, for the case when $y=1$, I propose an alternative algorithm: $\ \forall x \in \mathbb{Z} \ (\ \lfloor x/1 \rfloor = x)$
Finally, we can take our (v1,v0)
tuple and pass it to the x64 DIV
While this can raise an exception if the result won't fit within a uint64_t
, we've already proven that $a \leq \ 2^{w+1} - 1$
The same idea can be used with x64 MUL and only using the high 64 bits to compute the aforementioned (v1,v0) >> 64
Recip(uint64_t y){
assert(1 < y && y <= (1ull << 63)); // Ensure 1 < y <= 2^63
uint64_t sh = 0;
while(y > (1ull << sh)){ sh++; } // sh' = ceil(log2(y))
sh += 63 - 64; // sh = ceil(log2(y)) + w - 64
uint64_t a = Div128(1ull<<sh, y-1, y); // a = ceil(2^sh / y) = floor((2^sh + y-1) / y)
return (a, sh);
}
FastDivExamp(uint64_t[] x, uint64_t y){
uint64_t a, sh;
a, sh = Recip(y);
for(i, 0, x.count){
//x[i] = x[i] / y; // q = floor(x/y)
x[i] = MulHi64(x[i], a) >> sh; // Q = (x*a) / 2^sh
}
}
TODO Excercises for the reader:
- w=64: some people want/need the full range
- signed-ints: we can have a sign-bit... as a treat
- modulo by constant: hash-maps and divisibility test r=0
- div by constant divisor: pointer arithmetic