OpenCL Multiprecision
Eric Bainville - Jan 2010

Some multiplication algorithms
Introduction
The product of two large numbers is used in virtually all higher-level operations, so its performance is crucial. As we have seen in the previous pages, the CPU remains faster for numbers of up to one million 64-bit digits, so we must implement the asymptotically fast methods based on the FFT. The best optimized CPU multiplication code is probably in the MPIR and GMP libraries. It uses a mix of the Karatsuba, Toom-Cook, and Schönhage-Strassen algorithms depending on the size of the inputs.
Given a number p stored on N digits in base B: p=Σp_{i}*B^{i}, with i in 0..N-1, and p_{i} in -M..M. We associate to p the polynomial of degree N-1: P(X)=Σp_{i}*X^{i}, i=0..N-1. Then we have p=P(B).
The product of two numbers p and q is then computed by first multiplying the corresponding polynomials, R(X)=P(X)*Q(X), and then normalizing the coefficients of R(B), i.e. propagating carries in base B. For two numbers of N digits with max digit value M, the max value of the central coefficient (degree N-1) of R(X) is N*M^{2}. R has 2N-1 coefficients (degree 2N-2), and the product p*q has 2N digits when M≤B-1.
Example with B=100 and N=4
In base B=100, the numbers p=75978566 and q=15439875 are each represented with N=4 digits: p_{0..3}=(66,85,97,75) and q_{0..3}=(75,98,43,15). The corresponding polynomials are:
P(X) = 75*X^{3} + 97*X^{2} + 85*X + 66
Q(X) = 15*X^{3} + 43*X^{2} + 98*X + 75
To compute p*q using polynomials, we first compute R(X)=P(X)*Q(X):
R(X)=1125*X^{6}+4680*X^{5}+12796*X^{4}+19776*X^{3}+18443*X^{2}+12843*X+4950
And then normalize the coefficients from degree 0 to 6, propagating the "carry" in base B=100:
Each coefficient is written in base B=100 and shifted left by its degree; summing the columns (with carries) gives the digits of the product:

degree 0:                    49 50
degree 1:              01 28 43
degree 2:           01 84 43
degree 3:        01 97 76
degree 4:     01 27 96
degree 5:     46 80
degree 6:  11 25
           -----------------------
p*q     =  11 73 09 95 61 71 92 50
Finally we obtain the expected result: p*q=1173099561719250.
Representations of a polynomial
All usual methods to compute the product of two polynomials efficiently are based on the following fact: a polynomial of degree N-1 (i.e. with N coefficients) is uniquely determined by its values at N distinct points.
In other terms, we can equivalently represent a polynomial P of degree N-1 by the sequence of its N coefficients (p_{0},p_{1},...,p_{N-1}) or by the sequence of the N values (P(x_{0}),P(x_{1}),...,P(x_{N-1})) it takes at a set of distinct points x_{i}.
This representation is interesting to compute the product, because (P*Q)(x_{i})=P(x_{i})*Q(x_{i}): the value of the product at point x_{i} is the product of the two values at x_{i}.
Note that in this case, because the product has 2N-1 coefficients, the input polynomials must be evaluated at at least 2N-1 points, more than is needed to represent each input polynomial.
The product of two polynomials is then computed in three steps:
1. evaluate: get the value of both inputs for all points x_{i}
2. compute the output value at all points x_{i}: R(x_{i})=P(x_{i})*Q(x_{i})
3. interpolate: get the coefficients of the result (get the interpolating polynomial)
Provided steps 1 and 3 can be computed in less than O(N^{2}) operations, we get a method faster than the trivial polynomial product.
Another note: I didn't specify a domain for the coefficients of the polynomials. Each algorithm has a "good" choice of domain: integers, integers modulo n, complex numbers, polynomials, ...
Evaluation and interpolation schemes
Evaluating a polynomial P=Σp_{i}*X^{i} of degree N-1 at 2N-1 points x_{i} is linear in the polynomial coefficients p_{i}. The vector y, y_{i}=P(x_{i}), is the product of a (2N-1)×N matrix X and the coefficient vector p: y=X*p. Row i of X is (x_{i}^{0},x_{i}^{1},...,x_{i}^{N-1}). Some methods allow x_{i}=∞, with the convention P(∞)=p_{N-1}, i.e. the leading coefficient of P.
The objective is to evaluate this matrix-vector product at the lowest cost. We will see a few examples below, taken from the Wikipedia pages of the Toom-Cook and Schönhage-Strassen algorithms (links above), and the article What About Toom-Cook Matrices Optimality? by Marco Bodrato and Alberto Zanoni.
Example: 0,1,∞
This corresponds to the Karatsuba algorithm. The matrix X is:
p_{0}  p_{1}
  1      0      x_{0}=0
  1      1      x_{1}=1
  0      1      x_{2}=∞
Here we have no mystery: y_{0}=p_{0}, y_{1}=p_{0}+p_{1}, and y_{2}=p_{1}. With x_{1}=-1 instead of +1 we obtain the Knuth scheme, described in The Art of Computer Programming: Seminumerical Algorithms, section 4.3.3.
The coefficients are obtained as the image of the vector P(x_{i})*Q(x_{i}) by the inverse matrix (this time square, with all 2N-1 values):

1  0  0                      1  0  0
1  1  1    == inverse ==>   -1  1 -1
0  0  1                      0  0  1
Substituting everything, we get the Karatsuba algorithm, computing the product of P and Q with 3*mul+4*add:
r_{0} = p_{0} * q_{0}
r_{2} = p_{1} * q_{1}
r_{1} = (p_{0}+p_{1}) * (q_{0}+q_{1}) - r_{0} - r_{2}
Example: 0, +1, -1, -2, ∞
This is known as the Toom-3 algorithm, used to multiply two polynomials with N=3 coefficients (degree 2). The matrix X is the following. For each row, we give the natural evaluation cost.
p_{0}  p_{1}  p_{2}
  1      0      0      x_{0}=0,  no eval
  1      1      1      x_{1}=+1, 2*add
  1     -1      1      x_{2}=-1, 2*add
  1     -2      4      x_{3}=-2, 2*add+2*shift
  0      0      1      x_{4}=∞,  no eval
The following evaluation sequence is the fastest (from the paper of Bodrato and Zanoni), with 5*add+1*shift instead of 6*add+2*shift:
s_{1}   1   0   0    p_{0}, no eval
s_{2}   1   0   1    p_{0}+p_{2}, 1*add
s_{3}   1   1   1    s_{2}+p_{1}, 1*add
s_{4}   1  -1   1    s_{2}-p_{1}, 1*add
s_{5}   1  -2   4    (s_{4}+p_{2})*2-p_{0}, 2*add+1*shift
s_{6}   0   0   1    p_{2}, no eval
The interpolation matrix is:
   1     0     0     0     0
  1/2   1/3   -1    1/6   -2
  -1    1/2   1/2    0    -1
 -1/2   1/6   1/2  -1/6    2
   0     0     0     0     1
The fastest interpolation sequence (Bodrato and Zanoni) can be found on the Wikipedia page, and requires 8*add+3*shift+1*div.
Example: ω^{0},ω^{1},...,ω^{n-1}
This is known as the Discrete Fourier Transform (DFT). ω is a complex n-th root of 1, with n generally a power of two. For the sake of simplicity, let's consider in this example the specific case n=16, and ω=e^{2*I*π/16}. We have ω^{4}=I, ω^{8}=-1, ω^{12}=-I. It is used to multiply polynomials with N=8 coefficients. The X matrix is the following. To simplify the notation, each entry in the matrix is the exponent of ω: for example, 3 means ω^{3}. An entry '-' means a 0 in the matrix.
k:  y_{k} as a sum of ω^{X_{ki}}*p_{i}
 0:  0  0  0  0  0  0  0  0
 1:  0  1  2  3  4  5  6  7
 2:  0  2  4  6  8 10 12 14
 3:  0  3  6  9 12 15  2  5
 4:  0  4  8 12  0  4  8 12
 5:  0  5 10 15  4  9 14  3
 6:  0  6 12  2  8 14  4 10
 7:  0  7 14  5 12  3 10  1
 8:  0  8  0  8  0  8  0  8
 9:  0  9  2 11  4 13  6 15
10:  0 10  4 14  8  2 12  6
11:  0 11  6  1 12  7  2 13
12:  0 12  8  4  0 12  8  4
13:  0 13 10  7  4  1 14 11
14:  0 14 12 10  8  6  4  2
15:  0 15 14 13 12 11 10  9
The classical Cooley-Tukey algorithm computes these 16 values recursively. It is one of the Fast Fourier Transform (FFT) algorithms used to compute the DFT. Let's suppose we have managed to compute the following values s_{k} and t_{k}. This is done using exactly the same algorithm on the N/2 even-numbered (s) and N/2 odd-numbered (t) coefficients, with ω^{2} in place of ω:
k:  s_{k}
0:  0  -  0  -  0  -  0  -
1:  0  -  2  -  4  -  6  -
2:  0  -  4  -  8  - 12  -
3:  0  -  6  - 12  -  2  -
4:  0  -  8  -  0  -  8  -
5:  0  - 10  -  4  - 14  -
6:  0  - 12  -  8  -  4  -
7:  0  - 14  - 12  - 10  -

k:  t_{k}
0:  -  0  -  0  -  0  -  0
1:  -  0  -  2  -  4  -  6
2:  -  0  -  4  -  8  - 12
3:  -  0  -  6  - 12  -  2
4:  -  0  -  8  -  0  -  8
5:  -  0  - 10  -  4  - 14
6:  -  0  - 12  -  8  -  4
7:  -  0  - 14  - 12  - 10
Then we can assemble these two sequences to obtain y:
y_{k}   = s_{k} + ω^{k} t_{k}
y_{k+8} = s_{k} - ω^{k} t_{k}        (k = 0..7)
The next page shows in more detail the Schönhage-Strassen algorithm, which uses the same DFT algorithm, but on integers modulo 2^{m}+1.