Created
December 2, 2015 19:16
-
-
Save Wurstnase/a428edd40768428b0ad1 to your computer and use it in GitHub Desktop.
Wilco 3cycle/bit square root
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Source:http://www.verycomputer.com/24_e95a6e361498c566_1.htm | |
In the 1/sqrt() thread, I made the remark that a square root does take | |
only 3 cycles per bit. OK, here is the algorithm: | |
Goal: Calculate root = INT (SQRT (N)), for N = 0 .. 2^32 -1 | |
Idea: Successive approximation of the equation (root + delta) ^ 2 = N until | |
delta < 1. If delta < 1 we have the integer part of SQRT (N). | |
Use delta = 2^i for i = 15 .. 0. | |
Init: Start with root = 0 and delta = 2^15. | |
Step: If (root + delta) ^ 2 <= N then | |
root += delta | |
endif | |
delta = delta / 2 | |
This step is repeated until delta = 0. Now root = INT (SQRT (N)). | |
Idea: (root + delta) ^ 2 = root^2 + 2 * root * delta + delta^2 = | |
root^2 + delta * (2 * root + delta). | |
Now take M = N - root^2, and update M in each step of the algorithm: | |
Init: M = N, delta = 2^15 | |
Step: if delta * (2 * root + delta) <= M then | |
M -= delta * (2 * root + delta) | |
root += delta | |
endif | |
delta = delta / 2 | |
We need to calculate delta * (2 * root + delta) fast, without any | |
multiplications. This is possible because delta is a power of 2. | |
In ARM code we might optimize a bit by taking root' = root *2, and | |
using loop unrolling for delta = 2^i. | |
MOV M, N | |
[ unroll for i = 15 .. 0 | |
ADD t, root, #1 << i ; t = 2 * root + delta | |
CMP M, t, LSL #i ; if t * delta <= M then | |
SUBHS M, M, t, LSL #i ; M -= t * delta | |
ADDHS root, root, #2 << i ; root += delta | |
] | |
MOV root, root, LSR #1 | |
OK, now we have a 4-cycle per bit routine. For 3-cycle per bit we have to | |
leave one instruction out... The idea is to calculate 2 * root + delta | |
successively (so we don't need the ADD t, root, #1 << i instruction). | |
In binary the value 2 * root + delta looks like this: | |
xxxx01 ; the xxxx bits are the value of root, the zero is from 2 * root, | |
and the 1 is the delta bit. | |
Now look at successive values of 2 * root + delta: | |
xxxx010000 | |
xxxxa01000 ; a is the new bit of root (from root += delta) | |
xxxxab0100 ; b new bit of root | |
xxxxabc010 | |
It can be seen that the new bit of the new approximation of root is put on | |
the zero bit of the previous approximation, while the delta bit shifts one | |
position. We want to calculate this new approximation in only ONE instruction! | |
If you forget about the delta bit, and shift root to the right, you would get: | |
xxxx | |
xxxxa | |
xxxxab | |
This is an ADC root, root, root instruction. Now, we use a rotate right, | |
instead of a shift right: | |
010000xxxx | |
01000xxxxa | |
0100xxxxab | |
This can be done with an ADC root, offset, root, LSL #1 instruction. | |
Take offset = 3^30: | |
010000xxxx ; original root | |
10000xxxx0 ; root LSL #1 | |
10000xxxxa ; root LSL #1 + carry bit | |
01000xxxxa ; ADC root, offset, root, LSL #1 | |
Et voila! Now we have the following ARM routine: | |
; IN : n 32 bit unsigned integer | |
; OUT: root = INT (SQRT (n)) | |
; TMP: offset | |
MOV offset, #3 << 30 | |
MOV root, #1 << 30 | |
[ unroll for i = 0 .. 15 | |
CMP n, root, ROR #2 * i | |
SUBHS n, n, root, ROR #2 * i | |
ADC root, offset, root, LSL #1 | |
] | |
BIC root, root, #3 << 30 ; for rounding add: CMP n, root ADC root, #1 | |
This routine takes 3 * 16 + 3 = 51 cycles for a root of a 32 bit integer. | |
Of course, there are many other possibilities. The routine can be used with | |
some extra code which uses early termination if n is less than 32 bit. Also | |
fixed point versions are possible, or roots of 64 bit numbers. Even looped | |
versions are no problem: Shift in 8 or 16 bits at a time, | |
so you need only to unroll 4 or 8 times (overhead 6 cycles). | |
Wilco |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment