CPU/GPU Multiprecision Mandelbrot Set

Eric Bainville - Dec 2009

Benchmarks

Update (May 2010). I updated the benchmarks to include new NVidia and AMD drivers.

I ran the tests on two machines:

Machine A:
CPU Intel Core i7 920 (4 cores, 8 threads) @3.33 GHz (overclocked)
Chipset Intel X58
6GB of DDR3 @1.33 GHz
GPU ATI Radeon HD5870 1GB

Machine B:
CPU Intel Core 2 Quad Q9550 (4 cores) @2.83 GHz (stock speed)
Chipset Intel P45
12GB of DDR2 @800 MHz
GPU NVidia GTX285 1GB

On each machine I ran the tests on two systems:

Linux 64-bit kernel-2.6.32 glibc-2.10.1 gcc-4.3.4
NVidia driver 195.36.24 + NVidia SDK 3.0
Catalyst 9.12 Hotfix + Stream SDK 2.0 (not updated yet)

Windows 7 64-bit vs2008-sp1 (/fp:fast)
NVidia dev driver 197.13 + CUDA toolkit 3.0
Catalyst 10.4 + Stream SDK 2.1
   
The "Mini Set 1" (400 iterations) and "Julia Island" (20000 iterations) benchmarks.

Each test computes a 2048x2048 image of a specific site. The code is either multithreaded C ("C x8" means 8 threads) or OpenCL. In the tables, DNF means "Did Not Finish" (either a system freeze or a driver timeout), and TBU means "To Be Updated" (I have not yet run the test with the latest drivers). For each site, the green cell highlights the best GPU time and the red cell the best CPU time.

Site       Mini Set 1
Center X   -0.15710375803
Center Y   +1.03258348530
Pixel step +0.0000375

Site       Julia Island (ref.: Mandelbrot zoom from the Wikipedia page)
Center X   -0.7436438870371587100514222
Center Y   +0.1318259042053126002256675
Pixel step +0.00000000000002
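
To make the role of these parameters concrete, here is a minimal C sketch in hardware double. It is not the benchmarked code: the function names are hypothetical, and I assume that the image is centered on (Center X, Center Y), that "Pixel step" is the distance between adjacent pixels, and that the row index grows with the imaginary part.

```c
/* Hypothetical helper: map pixel index p (0 .. size-1) to a coordinate,
   assuming the image is centered on `center` and adjacent pixels are
   `step` apart. */
static double pixel_coord(double center, double step, int p, int size)
{
    return center + step * (double)(p - size / 2);
}

/* Escape-time iteration z <- z^2 + c, starting from z = 0.
   Returns the number of iterations done before |z| exceeds 2,
   or max_iter if the point never escapes. */
static int mandel_iterations(double cx, double cy, int max_iter)
{
    double x = 0.0, y = 0.0;
    int i;
    for (i = 0; i < max_iter; i++) {
        double x2 = x * x, y2 = y * y;
        if (x2 + y2 > 4.0)
            break;
        y = 2.0 * x * y + cy;   /* imaginary part of z^2 + c */
        x = x2 - y2 + cx;       /* real part of z^2 + c */
    }
    return i;
}

/* Fill out[size * size] with the iteration count of each pixel. */
static void render(double center_x, double center_y, double step,
                   int size, int max_iter, int *out)
{
    int px, py;
    for (py = 0; py < size; py++)
        for (px = 0; px < size; px++)
            out[py * size + px] = mandel_iterations(
                pixel_coord(center_x, step, px, size),
                pixel_coord(center_y, step, py, size),
                max_iter);
}
```

Under these assumptions, Mini Set 1 would correspond to a call like render(-0.15710375803, 1.03258348530, 0.0000375, 2048, 400, buffer), with buffer holding 2048*2048 ints.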

Hardware float

                         Mini Set 1           Julia Island
Device, Code             Linux     Windows    Linux     Windows
Q9550, C x8              150ms     130ms      6.9s      5.5s
Core i7, C x8            78ms      97ms       3.1s      2.5s
Core i7, OpenCL          TBU       130ms      TBU       2.6s
GTX285, OpenCL           41ms      48ms       200ms     209ms
HD5870, OpenCL           TBU       52ms       TBU       200ms

Hardware double

                         Mini Set 1           Julia Island
Device, Code             Linux     Windows    Linux     Windows
Q9550, C x1              560ms     555ms      48s       38s
Q9550, C x8              156ms     129ms      12s       9.6s
Core i7, C x1            396ms     397ms      33s       31s
Core i7, C x8            70ms      95ms       4.8s      4.3s
Core i7, OpenCL          TBU       120ms      TBU       4.4s
GTX285, OpenCL           77ms      100ms      2.1s      2.7s
HD5870, OpenCL           TBU       58ms       TBU       1.0s

fp128

Here, the C code is the code described in Fixed Point Reals. This code is not optimized, and could probably be accelerated using SSE extensions. The OpenCL is described in fp128 for OpenCL.

                         Mini Set 1           Julia Island
Device, Code             Linux     Windows    Linux     Windows
Q9550, C x8              4.5s      3.8s       370s      370s
Core i7, C x8            3.5s      3.0s       300s      260s
Core i7, OpenCL          TBU       6.5s       TBU       274s
GTX285, OpenCL           4.2s      4.2s       DNF       160s
HD5870, OpenCL (wg=8)    TBU       1.8s       TBU       72s

Comments on the benchmarks

Floating point (32 and 64 bits) execution time (ms) for the "Julia Island" site, on Core i7, GTX285, and HD5870.

These results show the excellent performance of the GPU hardware float units compared to the CPU. For the most intensive computation (Julia Island), the GPUs are about 12x faster than the fastest CPU in single precision. In double precision, the HD5870 is about 4x faster than the CPU, and the GTX285 about 2x faster. For fp128, the HD5870 is again about 4x faster than the CPU, and the GTX285 about 2x faster.
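
The ratios quoted here can be rechecked from the tables with a few lines of C. This is only a sketch: I assume the Windows columns of the Julia Island results and take "Core i7, C x8" as the CPU reference; the struct and function names are made up for the example.

```c
#include <stdio.h>

/* Julia Island times in seconds, copied from the Windows columns of the
   benchmark tables. Using the Windows columns and "Core i7, C x8" as the
   CPU reference is an assumption made for this example. */
struct timing {
    const char *precision;
    double cpu_s;     /* Core i7, C x8 */
    double gtx285_s;  /* GTX285, OpenCL */
    double hd5870_s;  /* HD5870, OpenCL */
};

static const struct timing julia_windows[3] = {
    { "float",    2.5,   0.209,  0.200 },
    { "double",   4.3,   2.7,    1.0   },
    { "fp128",  260.0, 160.0,   72.0   },
};

/* How many times faster the GPU is than the CPU. */
static double speedup(double cpu_s, double gpu_s)
{
    return cpu_s / gpu_s;
}

/* Print one line per precision with the speedup of each GPU. */
static void print_speedups(void)
{
    int i;
    for (i = 0; i < 3; i++) {
        const struct timing *t = &julia_windows[i];
        printf("%-6s  GTX285 %.1fx  HD5870 %.1fx\n", t->precision,
               speedup(t->cpu_s, t->gtx285_s),
               speedup(t->cpu_s, t->hd5870_s));
    }
}
```

With these inputs the ratios come out near 12x for both GPUs in float, about 1.6x (GTX285) and 4.3x (HD5870) in double, and about 1.6x and 3.6x in fp128, so the figures in the text are rounded. The Linux columns give slightly different ratios.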

Note the nice performance of OpenCL running on the Core i7. OpenCL provides a credible alternative to more demanding programming options (threads, SSE).

I included single-thread C timings (in the hardware double table) to highlight the excellent performance of the 2 hardware threads per core (HyperThreading) in the Core i7. Both processors have 4 cores, but on Julia Island running 8 threads provides a 4x improvement on the Q9550 (48s down to 12s on Linux), and roughly a 7x improvement on the Core i7 (33s down to 4.8s on Linux).

The fp128 running times depend significantly on the choice of the subdivision parameters: number of kernels, number of threads, and workgroup size.

Let's conclude with some general remarks, keeping in mind that they are based only on a very small set of benchmarks and on a single application: take them with a grain of salt...