GPU Benchmarks

Eric Bainville - Nov 2009

Introduction

After the assembly experiments on the CPU, these pages show how to program a GPU using OpenCL to perform multiprecision arithmetic. This first part focuses on measuring the potential speed of multithreaded CPUs and GPUs for multiprecision computations.

A GPU provides a highly parallel architecture, initially dedicated to the fixed 3D rendering pipeline. Over the past few years, parts of the pipeline became more and more programmable, and rendering now runs on a set of "general purpose" processing cores. On modern GPUs, these cores can be used directly for general purpose computations.

This article intends to provide a fair comparison of recent CPUs and GPUs in realistic conditions (i.e. actually computing something). For each target, we will try to provide the fastest implementation of each algorithm, using multithreading and vector instructions when needed.

Update (May 2010). I have updated this article with tests using new versions of the drivers.

Memory operations - Basic memory operations: copy and set to 0.

Addition - How multiprecision addition can be implemented on highly parallel architectures.

Available Flops - Raw processing power of the CPU and the GPU.

Product by one digit - Multiply a multiprecision number and a single digit.

OpenCL, hardware, drivers, and software

GPU programming

Several APIs allow the execution of code on the GPU:

Stream (low level, ATI only),
CUDA (low level, NVIDIA only),
OpenGL using GLSL shaders,
DirectX using Compute (Windows only),
OpenCL (GPU and multi-core CPU).

OpenCL is the only dedicated API running on all systems and hardware, and we will use it in these pages.

Original post (Nov 2009). Today, OpenCL on the GPU is still mainly in Beta. NVidia has released a public Beta of their main driver series featuring OpenCL (Windows release 195.39, Linux release 195.17). AMD released a Beta version in mid-October (Stream 2.0 beta 4) running on both GPU and CPU.

Update (May 2010). OpenCL support by NVidia is now mature. Runtime support is released as part of the public driver series, and SDK support is integrated in the CUDA toolkit. On the AMD side, despite potentially superior hardware (at least until Fermi was released), OpenCL software support was below expectations, even though it has improved dramatically since Nov 2009. The 2.1 Stream SDK and the Catalyst 10.4 drivers released in May 2010 now provide many more features (image support, etc.) and better performance.

OpenCL is not only for GPUs: OpenCL for the CPU gives access to all the power of modern multicore processors without having to manage threads and SSE instructions explicitly. The AMD OpenCL drivers are the only ones to provide CPU support, and the performance is comparable to threads+SSE code that would take much longer to write (not to mention maintaining and porting it).

In the curves, I removed the old HD5870 Linux measurements, since performance has probably improved a lot with the new drivers (as it did on Windows). I still have to update them.

Test systems

I run the tests on two machines:

Machine A:
CPU Intel Core i7 920 (4 cores, 8 threads) @3.33 GHz (overclocked)
Chipset Intel X58
6GB of DDR3 @1.33 GHz
GPU ATI Radeon HD5870 1GB

Machine B:
CPU Intel Core 2 Quad Q9550 (4 cores) @2.83 GHz (stock speed)
Chipset Intel P45
12GB of DDR2 @800 MHz
GPU NVidia GTX285 1GB

On each machine I run the tests on two systems:

Linux 64-bit kernel-2.6.32 glibc-2.10.1 gcc-4.3.4
NVidia driver 195.36.24 + CUDA toolkit 3.0
ATI driver 2.0-beta4 + Stream SDK 2.0 beta 4 (not updated yet)

Windows 7 64-bit vs2008-sp1
NVidia dev driver 197.13 + CUDA toolkit 3.0
ATI Catalyst 10.4 + Stream SDK 2.1

To avoid ambiguity, we adopt the (standard) conventions:

1 KiB = 2¹⁰ B, 1 MiB = 2²⁰ B, 1 GiB = 2³⁰ B

1 KB = 10³ B, 1 MB = 10⁶ B, 1 GB = 10⁹ B
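The difference between the two families of units is not negligible when reporting bandwidth: 1 GiB is about 7% larger than 1 GB. As a small illustration, here is a minimal C sketch of these conventions. The macro and function names (KIB_BYTES, bytes_to_mib, throughput_gb_per_s, etc.) are hypothetical helpers introduced for this example only; they are not taken from the benchmark source code.

```c
#include <stdint.h>

/* Binary units: 1 KiB = 2^10 B, 1 MiB = 2^20 B, 1 GiB = 2^30 B. */
#define KIB_BYTES (UINT64_C(1) << 10)
#define MIB_BYTES (UINT64_C(1) << 20)
#define GIB_BYTES (UINT64_C(1) << 30)

/* Decimal units: 1 KB = 10^3 B, 1 MB = 10^6 B, 1 GB = 10^9 B. */
#define KB_BYTES UINT64_C(1000)
#define MB_BYTES UINT64_C(1000000)
#define GB_BYTES UINT64_C(1000000000)

/* Size of a buffer expressed in MiB (binary). */
static double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (double)MIB_BYTES;
}

/* Size of a buffer expressed in MB (decimal). */
static double bytes_to_mb(uint64_t bytes)
{
    return (double)bytes / (double)MB_BYTES;
}

/* Throughput in GB/s (decimal) for `bytes` bytes processed in `seconds`. */
static double throughput_gb_per_s(uint64_t bytes, double seconds)
{
    return (double)bytes / seconds / (double)GB_BYTES;
}
```

With these helpers, a 1 GiB buffer is reported as bytes_to_mib(GIB_BYTES) = 1024 MiB, but as roughly 1073.74 MB by bytes_to_mb(GIB_BYTES).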

We measure the effective wallclock time (not the device execution time reported by event profiling), because it is what matters to the user sitting in front of the machine.
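To make the idea concrete, here is a minimal C sketch of a wallclock measurement around a host memory copy. It is an illustration under assumptions, not the code used for the benchmarks: it assumes a POSIX system providing clock_gettime with CLOCK_MONOTONIC (a Windows build would need another timer), the function names wallclock_seconds and time_copy are hypothetical, and keeping the best of several runs is one possible choice among others.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* request clock_gettime on strict compilers */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Current wallclock time in seconds, read from a monotonic clock. */
static double wallclock_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Copy `bytes` bytes from src to dst `repeats` times, and return the
 * smallest elapsed wallclock time (in seconds) over all runs.
 * Returns -1.0 if repeats <= 0.
 * Assumption: for an OpenCL device, the timed region would also have to
 * wait for completion (e.g. with clFinish) before reading the end time. */
static double time_copy(void *dst, const void *src, size_t bytes, int repeats)
{
    double best = -1.0;
    for (int i = 0; i < repeats; i++) {
        double t0 = wallclock_seconds();
        memcpy(dst, src, bytes);
        double t1 = wallclock_seconds();
        double dt = t1 - t0;
        if (best < 0.0 || dt < best) {
            best = dt;  /* keep the fastest run */
        }
    }
    return best;
}
```

Dividing the number of bytes copied by the value returned by time_copy gives an effective bandwidth as seen by the user, including any launch or synchronization overhead that falls inside the timed region.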

Before getting to the subject itself and actually operating on large integers, we will evaluate the memory and computational power of both the GPU and the CPU. The next page is devoted to memory copy and zero operations.

Source code