OpenCL Multiprecision

Eric Bainville - Jan 2010

Implementing the SSA in OpenCL, part I

The algorithm described in the previous page transforms N coefficients of length P words in base B. Coefficients are processed modulo BP+1. N must be a power of 2, and P a multiple of N/2, so ω can be taken as a power of B. The input number to multiply can then use (N/2).(P/2-1) words (a little less than 1/4 of the words as we have seen in the previous page). The following table gives the size of the number in words for various values of N and P. Each word will store between 24 and 30 bits, depending on the normalization strategy we adopt for the redundant representation. In the table we assumed 24 bits per word: B=224.

 N	 P	 X words          X bits     Buffer (MiB)
1024	 512	  130,560	  3,133,440	 2
1024	1024	  261,632	  6,279,168	 4
1024	2048	  523,776	 12,570,624	 8
2048	1024	  523,264	 12,558,336	 8
2048	2048	1,047,552	 25,141,248	16
2048	4096	2,096,128	 50,307,072	32
4096	2048	2,095,104	 50,282,496	32
4096	4096	4,192,256	100,614,144	64

The initial digits are in range -M..+M, then after p butterflies the bound is -2p.M..2p.M. Normalization of the numbers is not required at each step but only when the next step would exceed the capacity of an int, or after the final step to ensure the final digits are normalized.

One thread per butterfly

In this first implementation, we will compute all butterflies in parallel for each recursion level. The output will be stored in another buffer, and then the buffers swapped for the next step (similar to the ping-pong technique used in OpenGL GPGPU).

The combine operation (butterfly) is performed sequentially in a single thread:

// Combine two numbers X[P] and Y[P] into U[P] and V[P].
// The numbers are computed mod BASE^P+1.
// U = X + Y * BASE^SHIFT
// V = X - Y * BASE^SHIFT
// SHIFT is given in words and shall be >=0.
// The digits in the result are not range-normalized (returns the raw sums).
void combine(int p,int shift,
	     __global const int * x,__global const int * y,
	     __global int * u,__global int * v)
{
  int j = shift;
  int sign = -1;
  while (j >= p) { j -= p; sign = -sign; }
  j = p - j; // J = P - SHIFT % P is the index in Y matching X[0]
  for (int i=0;i<p;i++,j++)
  {
    if (j == p) { j = 0; sign = -sign; } // wrap
    int xx = x[i];
    int yy = y[j];
    u[i] = (sign>0)?(xx+yy):(xx-yy);
    v[i] = (sign<0)?(xx+yy):(xx-yy);
  }
}

The following kernel must be executed by N/2 threads for each level of recursion. Figuring out the indices of the two input vectors and the correct shift to apply requires some bit twiddling.

// Compute N/2 butterflies in parallel, where N/2 is the number of threads,
// P is the size of the N numbers stored in X and Y.
// X[N*P] is the input buffer.
// Y[N*P] is the output buffer.
// STEP is the recursion level, from 0 to LOG2(N)-1.
// Each butterly combines vectors I and I+(1<<STEP) for some I.
__kernel void combine_kernel(int p,int step,__global const int * x,__global int * y)
{
  // Get current thread id
  int i = get_global_id(0);
  // Insert a 0 bit in I at position STEP, to get the index of the first vector to combine
  int bit = (1<<step);
  int m = i & (bit-1); // M = lower STEP-1 bits of I
  i = ((i-m)<<1)+m;
  int j = i | bit; // index of the second vector to combine
  // Combine values.  The power of ω corresponds to
  // P>>STEP, and shall be multiplied by the index M
  // of the butterfly in its group.
  combine(p,(p>>step)*m,x+i*p,x+j*p,y+i*p,y+j*p);
}

The timings of a full FFT using this code on a NVidia GTX285 (64-bit Linux, driver 195.30) are given in the following table. It only includes the log2(N) calls to the kernel, and shows the min time reached for various sizes of the working group. Normalization steps are not included.

  N       P      FFT (ms)
1024	 512       10
1024	1024       17
1024	2048       32
2048	1024       29
2048	2048       61
2048	4096      174
4096	2048      125
4096	4096      357

One butterfly reads the entire input buffer from global memory, and writes the entire output buffer back to global memory. For N=4096 and P=4096, we read+write 12 times (log2(N)=12) the 64 MiB buffers. 357 ms correspond to 2.1 GB/s of write throughput. The combine kernel is very likely memory bound (see the page), and we should be able to reach at least 40 GB/s on this GPU, or less than 17 ms for the full FFT in this case.

One core of a Core Quad Q9600 CPU runs a 100 Mbit product in approximately 1.3 s (using MPIR 1.3.0-RC4 or GMP). It corresponds to 3 FFT and the N products of P-digit numbers, meaning the FFT on the CPU takes less than 433 ms.

There are two obvious directions to optimize the throughput: collapse two or more butterflies into one single step (less memory access), and subdivide each butterfly into several threads (less serial processing).

In the next page, we will test the second option, and use one thread per word.