briansmith · August 6, 2015 22:50
diff --git a/multiplication.txt b/multiplication.txt
 Copyright 2015 Brian Smith.




 One Way to do Multi-precision Multiplication
 ============================================




 Introduction
 ------------

 This is not the only, or even the best, way to do multiplication. However, this
 is the way that the multiplication in GFp is currently done.

 When implementing elliptic curve cryptography for n-bit curves, we frequently
 have to multiply two n-bit numbers together. The numbers are much too large
 to fit into a single word, so we cannot simply use the processor's mul
 instruction to do the multiplication. Instead, we have to break the values into
 words, do the multiplication word-by-word, and then sum things up. The words of
 a multi-precision number are called "limbs." (The GMP project's documentation
 explains why the term "limbs" is used.) We store the multi-precision numbers as
 arrays of limbs, with the least significant limb first. For example, the P-384
 curve on a 32-bit processor requires 12 limbs per value. All the examples in
 this document are for the P-256 curve on a 64-bit processor. We denote the
 multiplication of two such values as:

             wxyz
         *   dcba
         --------
         ghijklmn

 |w| and |d| are the most significant limbs of the operands, |z| and |a| are the
 least significant limbs of the operands, |g| is the most significant limb of
 the result, and |n| is the least significant limb of the result. The result has
 twice as many limbs as the operands, but it will be reduced modulo |q| or |n|
 so that it is the same size as the original operands.




 Row-wise Schoolbook Multiplication
 ----------------------------------

 GFp multiplies using a variant of the row-wise schoolbook method. As the name
 implies, this is the same method that children usually learn in school:

                                    w  x  y  z
                                *   d  c  b  a
                                --------------
                                   wa xa ya za
                        +       wb xb yb zb
                        +    wc xc yc zc
                        + wd xd yd zd

 First, the top operand is multiplied by the least significant digit of the
 second operand. Then, the top operand is multiplied by the second-least
 significant digit of the second operand, etc. All of these products are then
 summed together. The diagram makes it look easy. However, the diagram is very
 misleading because it makes all the products (|wd|, |yb|, |za|, etc.) look like
 they fit into single words. In fact, each of those products is two words, and
 that makes things considerably more complicated.




 Accumulating a Carry Bit in the Low-order Limb of a Sum
 -------------------------------------------------------

 Here are the sums of two maximum-value limbs of size 64 bits, 32 bits, and 16
 bits, with the high-order and low-order limb of each sum separated with a
 space:

      0xFFFFFFFFFFFFFFFFFF      0xFFFFFFFF      0xFFFF
    + 0xFFFFFFFFFFFFFFFFFF    + 0xFFFFFFFF    + 0xFFFF
    ----------------------    ------------    --------
    0x1 FFFFFFFFFFFFFFFFFE    0x1 FFFFFFFE    0x1 FFFE

 Notice that the sums are one bit larger than the inputs. This extra bit is the
 "carry bit" and dealing with it is source of most of the complication in
 implementing multi-precision multiplication efficiently.

 Let's define some notation. A sum (a + b) that results in a carry is denoted:

          a
        + b
    c <----
          r

 A sum (a + b) that is guaranteed to NOT result in a carry is denoted:

          a
        + b
        ---
          r

 In both cases, the low-order limb of the sum is stored into |r|.

 Notice from the sums of maximum-value limbs above that the low-order limb of
 each sum is always at least one less than the maximum value of a limb. This
 allows us to add a single carry bit |c| to the low-order limb of a sum |s|
 of two limbs without causing a carry:

          c
        + s
        ---
          r

 In particular, this means that we can always add a carry bit and two full-sized
 limbs together, generating at most one carry bit:

          c
        + a
        + b
    c <----
          r

 (This explains why add-with-carry instructions like the x86/x64 adc instruction
 work when there is only one carry bit to set on output.)




 Accumulating a Carry Bit in the High-order Limb of a Product
 ------------------------------------------------------------

 Here are the product values for the multiplication of two maximum-value limbs
 of size 64 bits, 32 bits, and 16 bits, with the high-order and low-order limb
 of each product separated with a space:

                      0xFFFFFFFFFFFFFFFFFF           0xFFFFFFFF       0xFFFF
   *                  0xFFFFFFFFFFFFFFFFFF  *        0xFFFFFFFF  *    0xFFFF
   ---------------------------------------  -------------------  -----------
   0xFFFFFFFFFFFFFFFFFE 000000000000000001  0xFFFFFFFE 00000001  0xFFFE 0001

 This shows that it is possible to add 1 to the value of the high order limb of
 the double-precision product of two single-limb operands without causing a
 carry. In particular, adding a carry bit to the high-order limb of a
 multiplication will not cause a subsequent carry.




 The First Row
 -------------

 Let's have |ab[lo]| and |ab[hi]| denote, respectively, the low-order limb and
 high-order limb of the product of single-limb values |a| and |b|.

 Now, let's consider the case of a multiplication of a multi-limb value |wxyz|
 with with a single-word operand |a|, where |w| is the high-order limb and |z|
 is the low-order limb. This is the first step of a full multi-precision
 multiplication:

                             w          x          y          z
           *                                                  a
           ----------------------------------------------------
                                                         za[lo]
                                              za[hi]     ------
                                            + ya[lo]         r0
                                        c <---------
                                 + ya[hi]         r1
                                 + xa[lo]
                             c <---------
                      + xa[hi]         r2
                      + wa[lo]
                  c <---------
           + wa[hi]         r3
           --------
                 r4

 The diagram is a time series, with time starting at the top and ending at the
 bottom. Thus, for example, the multiplication |z*a| that results in |za[hi]|
 and |za[lo]| is done first and the addition that results in |r4| is done last.
 The values |r4| through |r0| contain the result of the multiplication, with
 |r4| being the high-order limb and |r0| being the low-order limb.

 The above diagram is oversimplified though, because it does not show how the
 multiply instructions are to be scheduled. On x86 and x64, at least, the normal
 mul, add, and adc instructions all reset the carry bit of the flags register.
 This means the exact ordering of all the operations is important. To account
 for this, in the following diagrams, we also denote when each multiply should
 be done, and precisely how to save the carry bit before it s destroyed.

 The issuance of the multiplication instruction to multiply |w| and |a| is
 denoted:

   mul w*a

 When the variable |wa[hi]| holds the high-order limb of the product of two
 single limbs |w| and |a|, incrementing its value by the value of a carry bit
 |c| is denoted by:

   wa[hi] += c

 The addition will not cause a carry for the reason described earlier.

 This is the same multiplication of |wxyz| by |a| as in the previous diagram,
 with the instruction scheduling made explicit:

                             w            x            y          z
             *                                                    a
             ------------------------------------------------------
                                                            mul z*a

                                                             za[lo]
                                                             ------
                                                                 r0
                            +------------------------------+
                            |                              |
                            |                    mul y*a   |
                            |                              |
                            |                     za[hi]   |
                            |                   + ya[lo]   |
                            |             c <-----------   |
                            |                         r1   |
                            |   ya[hi] += c                |
                            |                              |
                            +------------------------------+

               +------------------------------+
               |                              |
               |                    mul x*a   |
               |                              |
               |                     ya[hi]   |
               |                   + xa[lo]   |
               |             c <-----------   |
               |                         r2   |
               |   xa[hi] += c                |
               |                              |
               +------------------------------+

                       mul w*a

                        xa[hi]
                      + wa[lo]
                c <-----------
         + wa[hi]           r3
         --------
               r4

 In the diagram, the steps in the two boxes are the same. The expansion of the
 to logic to work with values of a larger number of limbs just requires
 replicating those middle steps as necessary--i.e. the steps in the boxes are
 the ones that would go into a loop. The steps preceding the repeating steps are
 different only because they don't include a carry bit in the addition, while
 the section after the repeating steps differs only because it doesn't include a
 low-order limb in the final addition.




 Subsequent Rows
 ---------------

 After the first row multiplication, subsequent multiplications will have their
 product added to an accumulator. (Ignore the fact that some kind of shifting by
 some number of place values needs to take place before the addition; it isn't
 relevant for now, and in a good implementation of Montgomery multiplication,
 the shifting is all implicit.)

                   r4           r3           r2           r1         r0
                      +=      (  w            x            y          z
                               *                                      b )
       ----------------------------------------------------------------
                                                                mul z*b

                                                                     r0
                                                               + zb[lo]
                                                           c <---------
                                                                     r0
                                                 zb[hi] += c

                                                     mul y*b

                                                      yb[lo]
                                                    + zb[hi]
                                              c <-----------
                                                         tmp

                             +---------------------------------+
                             |                                 |
                             |      yb[hi] += c                |
                             |                                 |
                             |          mul x*b                |
                             |                            r1   |
                             |                      +    tmp   |
                             |                c <-----------   |
                             |         + xb[lo]           r1   |
                             |         + yb[hi]                |
                             |   c <-----------                |
                             |              tmp                |
                             |                                 |
                             +---------------------------------+

                +---------------------------------+
                |                                 |
                |      xb[hi] += c                |
                |                                 |
                |          mul w*b                |
                |                            r2   |
                |                      +    tmp   |
                |                c <-----------   |
                |         + wb[lo]           r2   |
                |         + xb[hi]                |
                |   c <-----------                |
                |              tmp                |
                |                                 |
                +---------------------------------+

          wb[hi] += c
                                r3
                          +    tmp
                    c <-----------
             +     r4           r3
             + wb[hi]
       c <-----------
    ----           r4
      rc

 Again, the boxes indicate which steps would appear in a loop. The variable |rc|
 accounts for the extra highest-order carry bits that accumulate as the sums are
 done.




 Further Considerations
 ----------------------

 The above diagrams cannot be translated directly into x86/x64 assembly language
 because it uses three-operand arithmetic while x86/x64 assembly language uses
 two-operand arithmetic. However, the translation from three-operand arithmetic
 to two-operand arithmetic is pretty straightforward.

 TODO: OpenSSL and others have gotten this wrong multiple times, by not
 accounting for an almost-never-occurring carry. It seems likely that we
 currently suffer from the same issue. We should figure out what it is and fix
 it.

 TODO: It may be possible to optimize such addition chains further by taking
 advantage of the fact that certain combinations of the sums of the high-order
 limbs of products will not generate multiple carry bits, as mentioned in this
 Intel paper:

    http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
	Copyright 2015 Brian Smith.




	One Way to do Multi-precision Multiplication
	============================================




	Introduction
	------------

	This is not the only, or even the best, way to do multiplication. However, this
	is the way that the multiplication in GFp is currently done.

	When implementing elliptic curve cryptography for n-bit curves, we frequently
	have to multiply two n-bit numbers together. The numbers are much too large
	to fit into a single word, so we cannot simply use the processor's mul
	instruction to do the multiplication. Instead, we have to break the values into
	words, do the multiplication word-by-word, and then sum things up. The words of
	a multi-precision number are called "limbs." (The GMP project's documentation
	explains why the term "limbs" is used.) We store the multi-precision numbers as
	arrays of limbs, with the least significant limb first. For example, the P-384
	curve on a 32-bit processor requires 12 limbs per value. All the examples in
	this document are for the P-256 curve on a 64-bit processor. We denote the
	multiplication of two such values as:

	wxyz
	* dcba
	--------
	ghijklmn

	\|w\| and \|d\| are the most significant limbs of the operands, \|z\| and \|a\| are the
	least significant limbs of the operands, \|g\| is the most significant limb of
	the result, and \|n\| is the least significant limb of the result. The result has
	twice as many limbs as the operands, but it will be reduced modulo \|q\| or \|n\|
	so that it is the same size as the original operands.




	Row-wise Schoolbook Multiplication
	----------------------------------

	GFp multiplies using a variant of the row-wise schoolbook method. As the name
	implies, this is the same method that children usually learn in school:

	w x y z
	* d c b a
	--------------
	wa xa ya za
	+ wb xb yb zb
	+ wc xc yc zc
	+ wd xd yd zd

	First, the top operand is multiplied by the least significant digit of the
	second operand. Then, the top operand is multiplied by the second-least
	significant digit of the second operand, etc. All of these products are then
	summed together. The diagram makes it look easy. However, the diagram is very
	misleading because it makes all the products (\|wd\|, \|yb\|, \|za\|, etc.) look like
	they fit into single words. In fact, each of those products is two words, and
	that makes things considerably more complicated.




	Accumulating a Carry Bit in the Low-order Limb of a Sum
	-------------------------------------------------------

	Here are the sums of two maximum-value limbs of size 64 bits, 32 bits, and 16
	bits, with the high-order and low-order limb of each sum separated with a
	space:

	0xFFFFFFFFFFFFFFFFFF 0xFFFFFFFF 0xFFFF
	+ 0xFFFFFFFFFFFFFFFFFF + 0xFFFFFFFF + 0xFFFF
	---------------------- ------------ --------
	0x1 FFFFFFFFFFFFFFFFFE 0x1 FFFFFFFE 0x1 FFFE

	Notice that the sums are one bit larger than the inputs. This extra bit is the
	"carry bit" and dealing with it is source of most of the complication in
	implementing multi-precision multiplication efficiently.

	Let's define some notation. A sum (a + b) that results in a carry is denoted:

	a
	+ b
	c <----
	r

	A sum (a + b) that is guaranteed to NOT result in a carry is denoted:

	a
	+ b
	---
	r

	In both cases, the low-order limb of the sum is stored into \|r\|.

	Notice from the sums of maximum-value limbs above that the low-order limb of
	each sum is always at least one less than the maximum value of a limb. This
	allows us to add a single carry bit \|c\| to the low-order limb of a sum \|s\|
	of two limbs without causing a carry:

	c
	+ s
	---
	r

	In particular, this means that we can always add a carry bit and two full-sized
	limbs together, generating at most one carry bit:

	c
	+ a
	+ b
	c <----
	r

	(This explains why add-with-carry instructions like the x86/x64 adc instruction
	work when there is only one carry bit to set on output.)




	Accumulating a Carry Bit in the High-order Limb of a Product
	------------------------------------------------------------

	Here are the product values for the multiplication of two maximum-value limbs
	of size 64 bits, 32 bits, and 16 bits, with the high-order and low-order limb
	of each product separated with a space:

	0xFFFFFFFFFFFFFFFFFF 0xFFFFFFFF 0xFFFF
	* 0xFFFFFFFFFFFFFFFFFF * 0xFFFFFFFF * 0xFFFF
	--------------------------------------- ------------------- -----------
	0xFFFFFFFFFFFFFFFFFE 000000000000000001 0xFFFFFFFE 00000001 0xFFFE 0001

	This shows that it is possible to add 1 to the value of the high order limb of
	the double-precision product of two single-limb operands without causing a
	carry. In particular, adding a carry bit to the high-order limb of a
	multiplication will not cause a subsequent carry.




	The First Row
	-------------

	Let's have \|ab[lo]\| and \|ab[hi]\| denote, respectively, the low-order limb and
	high-order limb of the product of single-limb values \|a\| and \|b\|.

	Now, let's consider the case of a multiplication of a multi-limb value \|wxyz\|
	with with a single-word operand \|a\|, where \|w\| is the high-order limb and \|z\|
	is the low-order limb. This is the first step of a full multi-precision
	multiplication:

	w x y z
	* a
	----------------------------------------------------
	za[lo]
	za[hi] ------
	+ ya[lo] r0
	c <---------
	+ ya[hi] r1
	+ xa[lo]
	c <---------
	+ xa[hi] r2
	+ wa[lo]
	c <---------
	+ wa[hi] r3
	--------
	r4

	The diagram is a time series, with time starting at the top and ending at the
	bottom. Thus, for example, the multiplication \|z*a\| that results in \|za[hi]\|
	and \|za[lo]\| is done first and the addition that results in \|r4\| is done last.
	The values \|r4\| through \|r0\| contain the result of the multiplication, with
	\|r4\| being the high-order limb and \|r0\| being the low-order limb.

	The above diagram is oversimplified though, because it does not show how the
	multiply instructions are to be scheduled. On x86 and x64, at least, the normal
	mul, add, and adc instructions all reset the carry bit of the flags register.
	This means the exact ordering of all the operations is important. To account
	for this, in the following diagrams, we also denote when each multiply should
	be done, and precisely how to save the carry bit before it s destroyed.

	The issuance of the multiplication instruction to multiply \|w\| and \|a\| is
	denoted:

	mul w*a

	When the variable \|wa[hi]\| holds the high-order limb of the product of two
	single limbs \|w\| and \|a\|, incrementing its value by the value of a carry bit
	\|c\| is denoted by:

	wa[hi] += c

	The addition will not cause a carry for the reason described earlier.

	This is the same multiplication of \|wxyz\| by \|a\| as in the previous diagram,
	with the instruction scheduling made explicit:

	w x y z
	* a
	------------------------------------------------------
	mul z*a

	za[lo]
	------
	r0
	+------------------------------+
	\| \|
	\| mul y*a \|
	\| \|
	\| za[hi] \|
	\| + ya[lo] \|
	\| c <----------- \|
	\| r1 \|
	\| ya[hi] += c \|
	\| \|
	+------------------------------+

	+------------------------------+
	\| \|
	\| mul x*a \|
	\| \|
	\| ya[hi] \|
	\| + xa[lo] \|
	\| c <----------- \|
	\| r2 \|
	\| xa[hi] += c \|
	\| \|
	+------------------------------+

	mul w*a

	xa[hi]
	+ wa[lo]
	c <-----------
	+ wa[hi] r3
	--------
	r4

	In the diagram, the steps in the two boxes are the same. The expansion of the
	to logic to work with values of a larger number of limbs just requires
	replicating those middle steps as necessary--i.e. the steps in the boxes are
	the ones that would go into a loop. The steps preceding the repeating steps are
	different only because they don't include a carry bit in the addition, while
	the section after the repeating steps differs only because it doesn't include a
	low-order limb in the final addition.




	Subsequent Rows
	---------------

	After the first row multiplication, subsequent multiplications will have their
	product added to an accumulator. (Ignore the fact that some kind of shifting by
	some number of place values needs to take place before the addition; it isn't
	relevant for now, and in a good implementation of Montgomery multiplication,
	the shifting is all implicit.)

	r4 r3 r2 r1 r0
	+= ( w x y z
	* b )
	----------------------------------------------------------------
	mul z*b

	r0
	+ zb[lo]
	c <---------
	r0
	zb[hi] += c

	mul y*b

	yb[lo]
	+ zb[hi]
	c <-----------
	tmp

	+---------------------------------+
	\| \|
	\| yb[hi] += c \|
	\| \|
	\| mul x*b \|
	\| r1 \|
	\| + tmp \|
	\| c <----------- \|
	\| + xb[lo] r1 \|
	\| + yb[hi] \|
	\| c <----------- \|
	\| tmp \|
	\| \|
	+---------------------------------+

	+---------------------------------+
	\| \|
	\| xb[hi] += c \|
	\| \|
	\| mul w*b \|
	\| r2 \|
	\| + tmp \|
	\| c <----------- \|
	\| + wb[lo] r2 \|
	\| + xb[hi] \|
	\| c <----------- \|
	\| tmp \|
	\| \|
	+---------------------------------+

	wb[hi] += c
	r3
	+ tmp
	c <-----------
	+ r4 r3
	+ wb[hi]
	c <-----------
	---- r4
	rc

	Again, the boxes indicate which steps would appear in a loop. The variable \|rc\|
	accounts for the extra highest-order carry bits that accumulate as the sums are
	done.




	Further Considerations
	----------------------

	The above diagrams cannot be translated directly into x86/x64 assembly language
	because it uses three-operand arithmetic while x86/x64 assembly language uses
	two-operand arithmetic. However, the translation from three-operand arithmetic
	to two-operand arithmetic is pretty straightforward.

	TODO: OpenSSL and others have gotten this wrong multiple times, by not
	accounting for an almost-never-occurring carry. It seems likely that we
	currently suffer from the same issue. We should figure out what it is and fix
	it.

	TODO: It may be possible to optimize such addition chains further by taking
	advantage of the fact that certain combinations of the sums of the high-order
	limbs of products will not generate multiple carry bits, as mentioned in this
	Intel paper:

	http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf