Created
August 6, 2015 22:50
-
-
Save briansmith/22342243c8521dee93f8 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Copyright 2015 Brian Smith. | |
One Way to do Multi-precision Multiplication | |
============================================ | |
Introduction | |
------------ | |
This is not the only, or even the best, way to do multiplication. However, this | |
is the way that the multiplication in GFp is currently done. | |
When implementing elliptic curve cryptography for n-bit curves, we frequently | |
have to multiply two n-bit numbers together. The numbers are much too large | |
to fit into a single word, so we cannot simply use the processor's mul | |
instruction to do the multiplication. Instead, we have to break the values into | |
words, do the multiplication word-by-word, and then sum things up. The words of | |
a multi-precision number are called "limbs." (The GMP project's documentation | |
explains why the term "limbs" is used.) We store the multi-precision numbers as | |
arrays of limbs, with the least significant limb first. For example, the P-384 | |
curve on a 32-bit processor requires 12 limbs per value. All the examples in | |
this document are for the P-256 curve on a 64-bit processor. We denote the | |
multiplication of two such values as: | |
wxyz | |
* dcba | |
-------- | |
ghijklmn | |
|w| and |d| are the most significant limbs of the operands, |z| and |a| are the | |
least significant limbs of the operands, |g| is the most significant limb of | |
the result, and |n| is the least significant limb of the result. The result has | |
twice as many limbs as the operands, but it will be reduced modulo |q| or |n| | |
so that it is the same size as the original operands. | |
Row-wise Schoolbook Multiplication | |
---------------------------------- | |
GFp multiplies using a variant of the row-wise schoolbook method. As the name | |
implies, this is the same method that children usually learn in school: | |
w x y z | |
* d c b a | |
-------------- | |
wa xa ya za | |
+ wb xb yb zb | |
+ wc xc yc zc | |
+ wd xd yd zd | |
First, the top operand is multiplied by the least significant digit of the | |
second operand. Then, the top operand is multiplied by the second-least | |
significant digit of the second operand, etc. All of these products are then | |
summed together. The diagram makes it look easy. However, the diagram is very | |
misleading because it makes all the products (|wd|, |yb|, |za|, etc.) look like | |
they fit into single words. In fact, each of those products is two words, and | |
that makes things considerably more complicated. | |
Accumulating a Carry Bit in the Low-order Limb of a Sum | |
------------------------------------------------------- | |
Here are the sums of two maximum-value limbs of size 64 bits, 32 bits, and 16 | |
bits, with the high-order and low-order limb of each sum separated with a | |
space: | |
0xFFFFFFFFFFFFFFFFFF 0xFFFFFFFF 0xFFFF | |
+ 0xFFFFFFFFFFFFFFFFFF + 0xFFFFFFFF + 0xFFFF | |
---------------------- ------------ -------- | |
0x1 FFFFFFFFFFFFFFFFFE 0x1 FFFFFFFE 0x1 FFFE | |
Notice that the sums are one bit larger than the inputs. This extra bit is the | |
"carry bit" and dealing with it is source of most of the complication in | |
implementing multi-precision multiplication efficiently. | |
Let's define some notation. A sum (a + b) that results in a carry is denoted: | |
a | |
+ b | |
c <---- | |
r | |
A sum (a + b) that is guaranteed to NOT result in a carry is denoted: | |
a | |
+ b | |
--- | |
r | |
In both cases, the low-order limb of the sum is stored into |r|. | |
Notice from the sums of maximum-value limbs above that the low-order limb of | |
each sum is always at least one less than the maximum value of a limb. This | |
allows us to add a single carry bit |c| to the low-order limb of a sum |s| | |
of two limbs without causing a carry: | |
c | |
+ s | |
--- | |
r | |
In particular, this means that we can always add a carry bit and two full-sized | |
limbs together, generating at most one carry bit: | |
c | |
+ a | |
+ b | |
c <---- | |
r | |
(This explains why add-with-carry instructions like the x86/x64 adc instruction | |
work when there is only one carry bit to set on output.) | |
Accumulating a Carry Bit in the High-order Limb of a Product | |
------------------------------------------------------------ | |
Here are the product values for the multiplication of two maximum-value limbs | |
of size 64 bits, 32 bits, and 16 bits, with the high-order and low-order limb | |
of each product separated with a space: | |
0xFFFFFFFFFFFFFFFFFF 0xFFFFFFFF 0xFFFF | |
* 0xFFFFFFFFFFFFFFFFFF * 0xFFFFFFFF * 0xFFFF | |
--------------------------------------- ------------------- ----------- | |
0xFFFFFFFFFFFFFFFFFE 000000000000000001 0xFFFFFFFE 00000001 0xFFFE 0001 | |
This shows that it is possible to add 1 to the value of the high order limb of | |
the double-precision product of two single-limb operands without causing a | |
carry. In particular, adding a carry bit to the high-order limb of a | |
multiplication will not cause a subsequent carry. | |
The First Row | |
------------- | |
Let's have |ab[lo]| and |ab[hi]| denote, respectively, the low-order limb and | |
high-order limb of the product of single-limb values |a| and |b|. | |
Now, let's consider the case of a multiplication of a multi-limb value |wxyz| | |
with with a single-word operand |a|, where |w| is the high-order limb and |z| | |
is the low-order limb. This is the first step of a full multi-precision | |
multiplication: | |
w x y z | |
* a | |
---------------------------------------------------- | |
za[lo] | |
za[hi] ------ | |
+ ya[lo] r0 | |
c <--------- | |
+ ya[hi] r1 | |
+ xa[lo] | |
c <--------- | |
+ xa[hi] r2 | |
+ wa[lo] | |
c <--------- | |
+ wa[hi] r3 | |
-------- | |
r4 | |
The diagram is a time series, with time starting at the top and ending at the | |
bottom. Thus, for example, the multiplication |z*a| that results in |za[hi]| | |
and |za[lo]| is done first and the addition that results in |r4| is done last. | |
The values |r4| through |r0| contain the result of the multiplication, with | |
|r4| being the high-order limb and |r0| being the low-order limb. | |
The above diagram is oversimplified though, because it does not show how the | |
multiply instructions are to be scheduled. On x86 and x64, at least, the normal | |
mul, add, and adc instructions all reset the carry bit of the flags register. | |
This means the exact ordering of all the operations is important. To account | |
for this, in the following diagrams, we also denote when each multiply should | |
be done, and precisely how to save the carry bit before it s destroyed. | |
The issuance of the multiplication instruction to multiply |w| and |a| is | |
denoted: | |
mul w*a | |
When the variable |wa[hi]| holds the high-order limb of the product of two | |
single limbs |w| and |a|, incrementing its value by the value of a carry bit | |
|c| is denoted by: | |
wa[hi] += c | |
The addition will not cause a carry for the reason described earlier. | |
This is the same multiplication of |wxyz| by |a| as in the previous diagram, | |
with the instruction scheduling made explicit: | |
w x y z | |
* a | |
------------------------------------------------------ | |
mul z*a | |
za[lo] | |
------ | |
r0 | |
+------------------------------+ | |
| | | |
| mul y*a | | |
| | | |
| za[hi] | | |
| + ya[lo] | | |
| c <----------- | | |
| r1 | | |
| ya[hi] += c | | |
| | | |
+------------------------------+ | |
+------------------------------+ | |
| | | |
| mul x*a | | |
| | | |
| ya[hi] | | |
| + xa[lo] | | |
| c <----------- | | |
| r2 | | |
| xa[hi] += c | | |
| | | |
+------------------------------+ | |
mul w*a | |
xa[hi] | |
+ wa[lo] | |
c <----------- | |
+ wa[hi] r3 | |
-------- | |
r4 | |
In the diagram, the steps in the two boxes are the same. The expansion of the | |
to logic to work with values of a larger number of limbs just requires | |
replicating those middle steps as necessary--i.e. the steps in the boxes are | |
the ones that would go into a loop. The steps preceding the repeating steps are | |
different only because they don't include a carry bit in the addition, while | |
the section after the repeating steps differs only because it doesn't include a | |
low-order limb in the final addition. | |
Subsequent Rows | |
--------------- | |
After the first row multiplication, subsequent multiplications will have their | |
product added to an accumulator. (Ignore the fact that some kind of shifting by | |
some number of place values needs to take place before the addition; it isn't | |
relevant for now, and in a good implementation of Montgomery multiplication, | |
the shifting is all implicit.) | |
r4 r3 r2 r1 r0 | |
+= ( w x y z | |
* b ) | |
---------------------------------------------------------------- | |
mul z*b | |
r0 | |
+ zb[lo] | |
c <--------- | |
r0 | |
zb[hi] += c | |
mul y*b | |
yb[lo] | |
+ zb[hi] | |
c <----------- | |
tmp | |
+---------------------------------+ | |
| | | |
| yb[hi] += c | | |
| | | |
| mul x*b | | |
| r1 | | |
| + tmp | | |
| c <----------- | | |
| + xb[lo] r1 | | |
| + yb[hi] | | |
| c <----------- | | |
| tmp | | |
| | | |
+---------------------------------+ | |
+---------------------------------+ | |
| | | |
| xb[hi] += c | | |
| | | |
| mul w*b | | |
| r2 | | |
| + tmp | | |
| c <----------- | | |
| + wb[lo] r2 | | |
| + xb[hi] | | |
| c <----------- | | |
| tmp | | |
| | | |
+---------------------------------+ | |
wb[hi] += c | |
r3 | |
+ tmp | |
c <----------- | |
+ r4 r3 | |
+ wb[hi] | |
c <----------- | |
---- r4 | |
rc | |
Again, the boxes indicate which steps would appear in a loop. The variable |rc| | |
accounts for the extra highest-order carry bits that accumulate as the sums are | |
done. | |
Further Considerations | |
---------------------- | |
The above diagrams cannot be translated directly into x86/x64 assembly language | |
because it uses three-operand arithmetic while x86/x64 assembly language uses | |
two-operand arithmetic. However, the translation from three-operand arithmetic | |
to two-operand arithmetic is pretty straightforward. | |
TODO: OpenSSL and others have gotten this wrong multiple times, by not | |
accounting for an almost-never-occurring carry. It seems likely that we | |
currently suffer from the same issue. We should figure out what it is and fix | |
it. | |
TODO: It may be possible to optimize such addition chains further by taking | |
advantage of the fact that certain combinations of the sums of the high-order | |
limbs of products will not generate multiple carry bits, as mentioned in this | |
Intel paper: | |
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment