OpenCL Fast Fourier Transform

Eric Bainville - May 2010

Reference implementations

We will use several reference implementations to check our results, and compare execution times: FFTW and Intel MKL on the CPU, and NVidia CUFFT on the GPU.

The code measuring execution time of FFTW is:

// Run complex to complex X[2*N] to Y[2*N].  Return total time (s).
double run_fftw(int n,const float * x,float * y)
{
  fftwf_plan p1 = fftwf_plan_dft_1d(n,(fftwf_complex *)x,(fftwf_complex *)y,
				    FFTW_FORWARD,FFTW_ESTIMATE);
  const int nops = 10;
  double t = cl::realTime();
  for (int op = 0;op < nops;op++)
    {
      fftwf_execute(p1);
    }
  t = (cl::realTime() - t)/(double)nops;
  fftwf_destroy_plan(p1);
  return t;
}

cl::realTime() returns wallclock time in seconds (using gettimeofday). The equivalent Intel MKL (DFTI) code is:

double run_mkl(int n,const float * x,float * y)
{
  DFTI_DESCRIPTOR_HANDLE h;
  DftiCreateDescriptor(&h,DFTI_SINGLE,DFTI_COMPLEX,1,n);
  DftiSetValue(h,DFTI_PLACEMENT,DFTI_NOT_INPLACE);
  DftiCommitDescriptor(h);

  const int nops = 10;
  double t = cl::realTime();
  for (int op = 0;op < nops;op++)
    {
      DftiComputeForward(h,(void *)x,y);
    }
  t = (cl::realTime() - t)/(double)nops;

  DftiFreeDescriptor(&h);
  return t;
}

The CUFFT code is the following:

double run_cufft(int n,const float * x,float * y)
{
  cufftHandle plan;
  cufftComplex * inData = 0;
  cufftComplex * outData = 0;
  size_t dataSize = sizeof(cufftComplex) * n;

  cufftPlan1d(&plan,n,CUFFT_C2C,1);
  cudaMalloc((void **)(&inData),dataSize);
  cudaMalloc((void **)(&outData),dataSize);

  const int nops = 100;
  double t = cl::realTime();
  for (int op = 0;op < nops;op++)
    {
      // Send X to device
      cudaMemcpy(inData,x,dataSize,cudaMemcpyHostToDevice);

      // Run the FFT
      cufftExecC2C(plan,inData,outData,CUFFT_FORWARD);

      // Get Y from device
      cudaMemcpy(y,outData,dataSize,cudaMemcpyDeviceToHost);

      cudaThreadSynchronize();
    }
  t = (cl::realTime() - t)/(double)nops;

  cufftDestroy(plan);
  cudaFree(inData);
  cudaFree(outData);

  return t;
}

For N=224, corresponding to 128 MiB buffers, we get the following running times:

PackageQ9550Core i7GTX285
FFTW single thread1.6 s (1.3 Gflop/s)--
MKL single thread570 ms (3.5 Gflop/s)--
MKL multi thread470 ms (4.28 Gflop/s)152 ms (13.2 Gflop/s)-
CUFFT--19 ms + 100 ms I/O (106 Gflop/s)

The Gflop/s count is computed assuming a total number of real add+mul operations of 5.N.log2(N).

A slightly modified version of the CUFFT code was used to measure only the cufftExecC2C time. The host to device and then device to host copy of 128 MiB buffers takes 100 ms and corresponds to 2.7 GB/s, matching the host/device transfert speeds measured on the same machine, see GPU Benchmarks.

Before starting experimenting OpenCL code, it is interesting to note that reading+writing 128 MiB buffer on the GTX285 takes at best 2 ms using a trivial OpenCL kernel (essentially y[i] = x[i]). This time corresponds to 127 GB/s of total memory bandwidth (read+write). At this speed, if we want to remain under the 19 ms CUFFT time, the number of passes is limited to 9, supposing each pass will read+write the entire buffers.

We are now ready to start experimenting OpenCL FFT kernels: Radix-2 kernel.