OpenCL Multiprecision
Eric Bainville - Jan 2010Implementing the SSA in OpenCL, part I
The algorithm described in the previous page transforms N coefficients of length P words in base B. Coefficients are processed modulo BP+1. N must be a power of 2, and P a multiple of N/2, so ω can be taken as a power of B. The input number to multiply can then use (N/2).(P/2-1) words (a little less than 1/4 of the words as we have seen in the previous page). The following table gives the size of the number in words for various values of N and P. Each word will store between 24 and 30 bits, depending on the normalization strategy we adopt for the redundant representation. In the table we assumed 24 bits per word: B=224.
N P X words X bits Buffer (MiB) 1024 512 130,560 3,133,440 2 1024 1024 261,632 6,279,168 4 1024 2048 523,776 12,570,624 8 2048 1024 523,264 12,558,336 8 2048 2048 1,047,552 25,141,248 16 2048 4096 2,096,128 50,307,072 32 4096 2048 2,095,104 50,282,496 32 4096 4096 4,192,256 100,614,144 64
The initial digits are in range -M..+M, then after p butterflies the bound is -2p.M..2p.M. Normalization of the numbers is not required at each step but only when the next step would exceed the capacity of an int, or after the final step to ensure the final digits are normalized.
One thread per butterfly
In this first implementation, we will compute all butterflies in parallel for each recursion level. The output will be stored in another buffer, and then the buffers swapped for the next step (similar to the ping-pong technique used in OpenGL GPGPU).
The combine operation (butterfly) is performed sequentially in a single thread:
// Combine two numbers X[P] and Y[P] into U[P] and V[P]. // The numbers are computed mod BASE^P+1. // U = X + Y * BASE^SHIFT // V = X - Y * BASE^SHIFT // SHIFT is given in words and shall be >=0. // The digits in the result are not range-normalized (returns the raw sums). void combine(int p,int shift, __global const int * x,__global const int * y, __global int * u,__global int * v) { int j = shift; int sign = -1; while (j >= p) { j -= p; sign = -sign; } j = p - j; // J = P - SHIFT % P is the index in Y matching X[0] for (int i=0;i<p;i++,j++) { if (j == p) { j = 0; sign = -sign; } // wrap int xx = x[i]; int yy = y[j]; u[i] = (sign>0)?(xx+yy):(xx-yy); v[i] = (sign<0)?(xx+yy):(xx-yy); } }
The following kernel must be executed by N/2 threads for each level of recursion. Figuring out the indices of the two input vectors and the correct shift to apply requires some bit twiddling.
// Compute N/2 butterflies in parallel, where N/2 is the number of threads, // P is the size of the N numbers stored in X and Y. // X[N*P] is the input buffer. // Y[N*P] is the output buffer. // STEP is the recursion level, from 0 to LOG2(N)-1. // Each butterly combines vectors I and I+(1<<STEP) for some I. __kernel void combine_kernel(int p,int step,__global const int * x,__global int * y) { // Get current thread id int i = get_global_id(0); // Insert a 0 bit in I at position STEP, to get the index of the first vector to combine int bit = (1<<step); int m = i & (bit-1); // M = lower STEP-1 bits of I i = ((i-m)<<1)+m; int j = i | bit; // index of the second vector to combine // Combine values. The power of ω corresponds to // P>>STEP, and shall be multiplied by the index M // of the butterfly in its group. combine(p,(p>>step)*m,x+i*p,x+j*p,y+i*p,y+j*p); }
The timings of a full FFT using this code on a NVidia GTX285 (64-bit Linux, driver 195.30) are given in the following table. It only includes the log2(N) calls to the kernel, and shows the min time reached for various sizes of the working group. Normalization steps are not included.
N P FFT (ms) 1024 512 10 1024 1024 17 1024 2048 32 2048 1024 29 2048 2048 61 2048 4096 174 4096 2048 125 4096 4096 357
One butterfly reads the entire input buffer from global memory, and writes the entire output buffer back
to global memory. For N=4096 and P=4096, we read+write 12 times (log2(N)=12) the 64 MiB buffers. 357 ms
correspond to 2.1 GB/s of write throughput.
The combine kernel is very likely memory bound (see the
One core of a Core Quad Q9600 CPU runs a 100 Mbit product in approximately 1.3 s (using MPIR 1.3.0-RC4 or GMP). It corresponds to 3 FFT and the N products of P-digit numbers, meaning the FFT on the CPU takes less than 433 ms.
There are two obvious directions to optimize the throughput: collapse two or more butterflies into one single step (less memory access), and subdivide each butterfly into several threads (less serial processing).
In the next page, we will test the second option, and use one thread per word.
OpenCL Multiprecision : Schönhage-Strassen | Top of Page | OpenCL Multiprecision : OpenCL SSA II |