OpenCL Multiprecision
Eric Bainville - Jan 2010

Some multiplication algorithms
Introduction
The product of two large numbers is used in virtually all higher-level operations, so its performance is crucial. As we have seen in the previous pages, the CPU remains faster for numbers of up to one million 64-bit digits, so we must implement the asymptotically fast methods based on the FFT. The best optimized CPU multiplication code is probably in the MPIR and GMP libraries. It uses a mix of the Karatsuba, Toom-Cook, and Schönhage-Strassen algorithms depending on the size of the inputs.
Given a number p stored on N digits in base B: p=Σp_{i}*B^{i}, with i in 0..N-1, and p_{i} in -M..M. We associate to p the polynomial of degree N-1: P(X)=Σp_{i}*X^{i}, i=0..N-1. Then we have p=P(B).
The product of two numbers p and q is then computed by first multiplying the corresponding polynomials, R(X)=P(X)*Q(X), and then normalizing the coefficients of R(B), i.e. propagating carries in base B. For two numbers of N digits with max digit value M, the max value of the central coefficient (degree N-1) of R(X) is N*M^{2}. R has 2N-1 coefficients (degree 2N-2), and the product p*q has 2N digits when M≤B-1.
Example with B=100 and N=4
In base B=100, the numbers p=75978566 and q=15439875 are each represented with N=4 digits: p_{0..3}=(66,85,97,75) and q_{0..3}=(75,98,43,15). The corresponding polynomials are:
P(X) = 75*X^{3} + 97*X^{2} + 85*X + 66
Q(X) = 15*X^{3} + 43*X^{2} + 98*X + 75
To compute p*q using polynomials, we first compute R(X)=P(X)*Q(X):
R(X)=1125*X^{6}+4680*X^{5}+12796*X^{4}+19776*X^{3}+18443*X^{2}+12843*X+4950
And then normalize the coefficients from degree 0 to 6, propagating the "carry" in base B=100:
Each coefficient is written in base B=100 and shifted left by its degree; summing the columns (with carries) gives the digits of the product:

degree 0:                    49 50
degree 1:              01 28 43
degree 2:           01 84 43
degree 3:        01 97 76
degree 4:     01 27 96
degree 5:     46 80
degree 6:  11 25
           -----------------------
p*q     =  11 73 09 95 61 71 92 50
Finally we obtain the expected result: p*q=1173099561719250.
Representations of a polynomial
All usual methods to compute the product of two polynomials efficiently are based on the following fact: a polynomial of degree N-1 (i.e. with N coefficients) is uniquely determined by its values at N distinct points.
In other terms, we can equivalently represent a polynomial P of degree N-1 by the sequence of its N coefficients (p_{0},p_{1},...,p_{N-1}) or by the sequence of the N values (P(x_{0}),P(x_{1}),...,P(x_{N-1})) it takes at a set of distinct points x_{i}.
This representation is interesting to compute the product, because (P*Q)(x_{i})=P(x_{i})*Q(x_{i}): the value of the product at point x_{i} is the product of the two values at x_{i}.
Note that in this case, because the product has 2N-1 coefficients, the input polynomials must be evaluated at at least 2N-1 points, more than is needed to represent each input polynomial.
The product of two polynomials is then computed in three steps:
1. evaluate: get the value of both inputs for all points x_{i}
2. compute the output value at all points x_{i}: R(x_{i})=P(x_{i})*Q(x_{i})
3. interpolate: get the coefficients of the result (get the interpolating polynomial)
Provided steps 1 and 3 can be computed in less than O(N^{2}) operations, we get a method faster than the trivial polynomial product.
Another note: I didn't specify a domain for the coefficients of the polynomials. Each algorithm has a "good" choice of domain: integers, integers modulo n, complex numbers, polynomials, ...
Evaluation and interpolation schemes
Evaluating a polynomial P=Σp_{i}*X^{i} of degree N-1 at 2N-1 points x_{i} is linear in the polynomial coefficients p_{i}. The vector y, y_{i}=P(x_{i}), is the product of a (2N-1)×N matrix X and the coefficient vector p: y=X*p. Row i of X is (x_{i}^{0},x_{i}^{1},...,x_{i}^{N-1}). Some methods allow x_{i}=∞, with the convention P(∞)=p_{N-1}, i.e. the leading coefficient of P.
The objective is to evaluate this matrix-vector product at the lowest cost. We will see a few examples below, taken from the Wikipedia pages of the Toom-Cook and Schönhage-Strassen algorithms (links above), and the article What About Toom-Cook Matrices Optimality? by Marco Bodrato and Alberto Zanoni.
Example: 0,1,∞
This corresponds to the Karatsuba algorithm. The matrix X is:
p_{0}  p_{1}
  1      0      x_{0}=0
  1      1      x_{1}=1
  0      1      x_{2}=∞
Here we have no mystery: y_{0}=p_{0}, y_{1}=p_{0}+p_{1}, and y_{2}=p_{1}. With x_{1}=-1 instead of +1 we obtain the Knuth scheme, described in The Art of Computer Programming: Seminumerical Algorithms, section 4.3.3.
The coefficients are obtained as the image of the vector P(x_{i})*Q(x_{i}) by the inverse matrix (this time square, with all 2N-1 values):

1  0  0                      1  0  0
1  1  1    == inverse ==>   -1  1 -1
0  0  1                      0  0  1
Substituting everything, we get the Karatsuba algorithm, computing the product of P and Q with 3*mul+4*add:
r_{0} = p_{0} * q_{0}
r_{2} = p_{1} * q_{1}
r_{1} = (p_{0}+p_{1}) * (q_{0}+q_{1}) - r_{0} - r_{2}
Example: 0, +1, -1, -2, ∞
This is known as the Toom-3 algorithm, used to multiply two polynomials with N=3 coefficients (degree 2). The matrix X is the following. For each row, we give the natural evaluation cost.
p_{0}  p_{1}  p_{2}
  1      0      0      x_{0}=0,  no eval
  1      1      1      x_{1}=+1, 2*add
  1     -1      1      x_{2}=-1, 2*add
  1     -2      4      x_{3}=-2, 2*add+2*shift
  0      0      1      x_{4}=∞,  no eval
The following evaluation sequence is the fastest (from the paper of Bodrato and Zanoni), with 5*add+1*shift instead of 6*add+2*shift:
s_{1}   1   0   0    p_{0}, no eval
s_{2}   1   0   1    p_{0}+p_{2}, 1*add
s_{3}   1   1   1    s_{2}+p_{1}, 1*add
s_{4}   1  -1   1    s_{2}-p_{1}, 1*add
s_{5}   1  -2   4    (s_{4}+p_{2})*2-p_{0}, 2*add+1*shift
s_{6}   0   0   1    p_{2}, no eval
The interpolation matrix is:
   1     0     0     0     0
  1/2   1/3   -1    1/6   -2
  -1    1/2   1/2    0    -1
 -1/2   1/6   1/2  -1/6    2
   0     0     0     0     1
The fastest interpolation sequence (Bodrato and Zanoni) can be found on the Wikipedia page, and requires 8*add+3*shift+1*div.
Example: ω^{0},ω^{1},...,ω^{n-1}
This is known as the Discrete Fourier Transform (DFT). ω is a complex n-th root of 1, with n generally a power of two. For the sake of simplicity, let's consider in this example the specific case n=16, and ω=e^{2*I*π/16}. We have ω^{4}=I, ω^{8}=-1, ω^{12}=-I. It is used to multiply polynomials with N=8 coefficients. The X matrix is the following. To simplify the notation, each entry in the matrix is the exponent of ω: for example, 3 means ω^{3}. An entry '-' means a 0 in the matrix.
k:  y_{k} as a sum of ω^{X_{ki}}*p_{i}
 0:  0  0  0  0  0  0  0  0
 1:  0  1  2  3  4  5  6  7
 2:  0  2  4  6  8 10 12 14
 3:  0  3  6  9 12 15  2  5
 4:  0  4  8 12  0  4  8 12
 5:  0  5 10 15  4  9 14  3
 6:  0  6 12  2  8 14  4 10
 7:  0  7 14  5 12  3 10  1
 8:  0  8  0  8  0  8  0  8
 9:  0  9  2 11  4 13  6 15
10:  0 10  4 14  8  2 12  6
11:  0 11  6  1 12  7  2 13
12:  0 12  8  4  0 12  8  4
13:  0 13 10  7  4  1 14 11
14:  0 14 12 10  8  6  4  2
15:  0 15 14 13 12 11 10  9
The classical Cooley-Tukey algorithm computes these 16 values recursively. It is one of the Fast Fourier Transform (FFT) algorithms used to compute the DFT. Let's suppose we have managed to compute the following values s_{k} and t_{k}. This is done using exactly the same algorithm on the N/2 even-numbered (s) and N/2 odd-numbered (t) coefficients, with ω^{2} in place of ω:
k:  s_{k}
0:  0  -  0  -  0  -  0  -
1:  0  -  2  -  4  -  6  -
2:  0  -  4  -  8  - 12  -
3:  0  -  6  - 12  -  2  -
4:  0  -  8  -  0  -  8  -
5:  0  - 10  -  4  - 14  -
6:  0  - 12  -  8  -  4  -
7:  0  - 14  - 12  - 10  -

k:  t_{k}
0:  -  0  -  0  -  0  -  0
1:  -  0  -  2  -  4  -  6
2:  -  0  -  4  -  8  - 12
3:  -  0  -  6  - 12  -  2
4:  -  0  -  8  -  0  -  8
5:  -  0  - 10  -  4  - 14
6:  -  0  - 12  -  8  -  4
7:  -  0  - 14  - 12  - 10
Then we can assemble these two sequences to obtain y:
y_{k}   = s_{k} + ω^{k} t_{k}
y_{k+8} = s_{k} - ω^{k} t_{k}        (k = 0..7)
The next page shows in more detail the Schönhage-Strassen algorithm, which uses the same DFT algorithm, but on integers modulo 2^{m}+1.