OpenCL Fast Fourier Transform
Eric Bainville - May 2010Introduction
In this article, we will experiment various OpenCL implementations of one-dimensional Fast Fourier Transform (FFT) algorithms.
This article is becoming rather long now. It was written in several sessions, interrupted by other higher priority tasks (i.e. paid contracts). The objective is to reach an FFT implementation in OpenCL with performances at least matching NVidia CUFFT, and running on both ATI and NVidia GPU's.
This objective is reached in the Radix 4,8,16,32 kernels page, using several chained calls of simple kernels not using shared memory. The experiments on kernels using shared memory are in progress, and the last two pages of the articles will be updated.
Contents
- Introduction - Introduction.
- DFT - Discrete Fourier Transform.
- FFT - Fast Fourier Transform algorithms.
- Reference implementations - FFTW, Intel MKL, and NVidia CUFFT.
- Radix-2 kernel - Simple radix-2 OpenCL kernel.
- Radix 4,8,16,32 kernels - Extension to radix-4,8,16, and 32 kernels.
- Radix-r kernels benchmarks - Benchmarks of the radix-r kernels.
- One work-group per DFT (1) - One DFT2r per work-group of size r, values in local memory.
- One work-group per DFT (2) - One DFT2r per work-group of size r, values in registers, transferts using local memory.
References
- Discrete Fourier Transform Wikipedia entry.
- Fast Fourier Transform Wikipedia entry.
- Vasily Volkov's page provides links to several GPU optimized numerical algorithms, including the FFT.
- Apple OpenCL FFT sample code.
Benchmarks
Execution times are measured on the following hardware:
Q9550: Intel Core 2 Quad Q9550 (4 cores) @2.83 GHz (stock speed) Chipset Intel P45 12GB of DDR2 @800 MHz Linux 64-bit kernel-2.6.32 glibc-2.10.1 gcc-4.3.4 fftw-3.2.2 mkl-10.2.4.032 Core i7: Intel Core i7 920 (4 cores, 8 threads) @3.33 GHz (overclocked) Chipset Intel X58 6GB of DDR3 @1.33 GHz Windows 7 Ultimate 64-bit vs-2010 mkl-10.2.4.032 GTX285: NVidia GTX285 1GB Chipset Intel P45 Linux 64-bit kernel-2.6.32 NVidia driver 195.36.24 + CUDA toolkit 3.0 OpenCL compiler option -cl-fast-relaxed-math HD5870: ATI Radeon HD5870 1GB Chipset Intel X58 Windows 7 Ultimate 64-bit ATI Catalyst 10.5 + Stream SDK 2.1
OpenCL FFT | Top of Page | OpenCL FFT (OLD) : DFT |