OpenCL Fast Fourier Transform

Eric Bainville - May 2010, updated March 2011

Higher radix kernels

Radix-4 kernel

We will now merge two consecutive iterations. Two iterations of the FFT radix-2 algorithm will combine 4 sub-sequences of length p into one sequence of length 4p, corresponding to a "radix-4" algorithm.

The radix-4 "butterfly" takes 4 input cells and produces 4 outputs. It computes 4 radix-2 butterflies.

The radix-4 butterfly simplifies to a length 4 DFT applied on scaled inputs (a+0.u,k)p, (a+1.u,k)p * ω1, (a+2.u,k)p * ω2, and (a+3.u,k)p * ω3, where ω=e-2*π*i/(2p).

First, we have to define the following new functions and macros:

// twiddle_P_Q(A) returns A * EXP(-P*PI*i/Q)
real2_t twiddle_1_2(real2_t a)
  // A * (-i)
  return (real2_t)(a.y,-a.x);

// In-place DFT-4, output is (a,c,b,d). Arguments must be variables.
#define DFT4(a,b,c,d) { DFT2(a,c); DFT2(b,d); d=twiddle_1_2(d); DFT2(a,b); DFT2(c,d); }

The radix-4 forward kernel is:

// Compute T x DFT-4.
// T is the number of threads.
// N = 4*T is the size of input vectors.
// X[N], Y[N]
// P is the length of input sub-sequences: 1,4,16,...,T.
// Each DFT-4 has input (X[I],X[I+T],X[I+2*T],X[I+3*T]), I=0..T-1,
// and output (Y[J],Y|J+P],Y[J+2*P],Y[J+3*P], J = I with two 0 bits inserted at postion P. */
__kernel void fftRadix4Kernel(__global const real2_t * x,__global real2_t * y,int p)
  int t = get_global_size(0); // thread count
  int i = get_global_id(0); // thread index
  int k = i&(p-1); // index in input sequence, in 0..P-1
  int j = ((i-k)<<2) + k; // output index
  real_t alpha = -FFT_PI*(real_t)k/(real_t)(2*p);

  // Read and twiddle input
  x += i;
  real2_t u0 = x[0];
  real2_t u1 = twiddle(x[t],1,alpha);
  real2_t u2 = twiddle(x[2*t],2,alpha);
  real2_t u3 = twiddle(x[3*t],3,alpha);

  // In-place DFT-4

  // Shuffle and write output
  y += j;
  y[0]   = u0;
  y[p]   = u2;
  y[2*p] = u1;
  y[3*p] = u3;

This kernel must be called for p=1, p=4,..., p=N/4. If N is not a power of 4, one radix-2 kernel shall be inserted.