OpenCL Fast Fourier Transform

Eric Bainville - May 2010, updated March 2011

Reference implementations

We will use several reference implementations to check our results, and compare execution times: FFTW and Intel MKL on the CPU, and NVidia CUFFT on the GPU. The following code snippets show how FFT was computed using the various API. n is the size of the DFT to compute, x[2*n] the input array, and y[2*n] the output array.

// FFTW API:
fftwf_plan p1 = fftwf_plan_dft_1d(n,(fftwf_complex *)x,(fftwf_complex *)y,FFTW_FORWARD,FFTW_ESTIMATE);
fftwf_execute(p1);
fftwf_destroy_plan(p1);

// Intel MKL native API (DFTI):
DFTI_DESCRIPTOR_HANDLE h;
DftiCreateDescriptor(&h,DFTI_SINGLE,DFTI_COMPLEX,1,n);
DftiSetValue(h,DFTI_PLACEMENT,DFTI_NOT_INPLACE);
DftiCommitDescriptor(h);
DftiComputeForward(h,(void *)x,y);
DftiFreeDescriptor(&h);

// CUFFT API:
cufftHandle plan;
cufftComplex * inData = 0;
cufftComplex * outData = 0;
size_t dataSize = sizeof(cufftComplex) * n;
cufftPlan1d(&plan,n,CUFFT_C2C,1);
cudaMalloc((void **)(&inData),dataSize);
cudaMalloc((void **)(&outData),dataSize);
cudaMemcpy(inData,x,dataSize,cudaMemcpyHostToDevice);  // write X
cufftExecC2C(plan,inData,outData,CUFFT_FORWARD);       // compute
cudaMemcpy(y,outData,dataSize,cudaMemcpyDeviceToHost); // read Y
cudaDeviceSynchronize();
cufftDestroy(plan);
cudaFree(inData);
cudaFree(outData);

Before starting to write FFT kernels, we will see in the next page how to handle the single/double precision selection: Single and double precision.