OpenCL Training Course

Objectives

- Learn CPU/GPU programming with OpenCL.
- Know what (not) to expect from a CPU or GPU.
- Understand heavy multithreading and how it is mapped to the hardware.
- Measure OpenCL code performance, and locate and resolve bottlenecks.
- Write efficient OpenCL code.

Audience

The target audience is the advanced C/C++ developer with little or no knowledge of OpenCL and some notions of multithreading.

Organization

Training is on-site, for groups of up to 10 people.

In the classroom, each participant must have access to a computer with an OpenCL SDK installed (ATI or NVidia) and an OpenCL device. The classroom must be equipped with a video projector and a whiteboard.

The course lasts 3 to 5 (or more) consecutive days, 7 hours per day. There are approximately 5 hours of presentation in total. The remaining time is devoted to programming, together, a series of classic problems similar to the OpenCL GEMV page of this site. Each problem is used to introduce new notions and to experiment with efficient and inefficient variants of the code.

After the first 3 days, the remaining days are devoted to activities specific to your application field. If you are working on a project involving OpenCL, we can work together to define its architecture and start developing a first prototype.

Universities. A lot of traffic on this page comes from universities. I am available and open to collaborations. For example, I can teach a one-week intensive course on parallel programming on GPU (theory and practice). In Europe, I am available for regular courses (say 4 hours a week).

Cost

I charge 700 EUR/day, plus travel and accommodation expenses. Contact me for further details.

Program

Most of the training is spent programming, with a few additional slides used when needed, mainly at the beginning of the course.

Slides

Architecture of some recent CPUs and GPUs

- Intel Nehalem
- NVidia GT200
- NVidia Fermi
- AMD Evergreen

How can an NVidia GTX285 (1.4 B transistors, running at 1.3 GHz) be 50x faster than a Core i7 (731 M transistors, running at 3 GHz) on some problems? We will see how the architecture of each chip is balanced between memory and computation, and how instruction latency and throughput are managed.

Introduction to OpenCL

- Terminology
- Host / Device
- Memory model
- Execution Model

Host-side OpenCL

- Connecting to a device, platforms
- Host objects: device, command queue, program, kernel, buffer, image

Device-side OpenCL

- The OpenCL C programming language
- How code is executed on hardware

Efficient OpenCL

- When (not) to use a GPU
- Memory latency and access patterns
- ALU latency
- Using local memory
- Synchronizing threads
- Warps/Wavefronts, work groups, and GPU cores
- Profiling
- Code design guidelines

Activities

Some activities can be skipped, and we may start experimenting on other subjects, depending on how things evolve during the course.

For each problem, I provide a code skeleton as a starting point, and you only have to focus on the interesting part.

Depending on your needs, programming can be on Linux or Windows, in C or C++.

Hello OpenCL

- List platforms and devices
- Connect to a device
- Query platform and device properties

Buffers

- Manipulating OpenCL buffers
- Moving data around: CPU to/from GPU, GPU to GPU

Kernels

- Compiling and running code on the GPU
- OpenCL compute model: work-items, work-groups

Sum

- Sum all values of an array
- Synchronize threads and share data inside a group
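
To give an idea of what this activity covers, here is a minimal sketch in plain C (not OpenCL C) that emulates a tree reduction inside a work-group on the CPU. The names group_sum, array_sum and GROUP_SIZE are hypothetical and chosen for illustration only. The scratch array stands in for local memory, and the comments mark where a real kernel would need a barrier between steps. The code written during the course may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8 /* hypothetical work-group size, must be a power of two */

/* Emulates one work-group summing up to GROUP_SIZE values.
   Each value of lid stands in for one work-item of the group. */
static float group_sum(const float *in, size_t count)
{
    float scratch[GROUP_SIZE]; /* plays the role of local memory */
    assert(count <= GROUP_SIZE);

    /* Step 1: each work-item loads one value, padding with zeros. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        scratch[lid] = (lid < count) ? in[lid] : 0.0f;
    /* A real kernel needs a barrier here. */

    /* Step 2: halve the number of active work-items at each pass. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* A real kernel needs a barrier here as well. */
    }
    return scratch[0];
}

/* Sums a whole array: one partial sum per group, then a final
   accumulation of the partial sums on the host side. */
static float array_sum(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t base = 0; base < n; base += GROUP_SIZE) {
        size_t count = (n - base < GROUP_SIZE) ? (n - base) : GROUP_SIZE;
        total += group_sum(data + base, count);
    }
    return total;
}
```

Calling array_sum on an array holding the integers 1 to 20 yields 210. On a device, the inner loop over lid disappears: all work-items of the group run it at once, which is why the barriers matter.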

Matrix-vector product

- An example of a memory-bound task
- Efficient/inefficient memory access patterns
- See OpenCL GEMV
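
As an illustration of the access pattern question, the sketch below computes y = A*x in plain C for two storage layouts of A. It assumes a mapping where one work-item computes one output row, so the outer loop over r stands in for the work-items; this mapping and the function names are assumptions made for the example, not necessarily the variants studied in the course. Both functions return the same result, only the addresses read by neighboring work-items differ.

```c
#include <assert.h>
#include <stddef.h>

/* A stored row by row: at a given step c, work-items r and r+1
   read addresses that are cols floats apart (strided access). */
static void gemv_row_major(const float *a, const float *x, float *y,
                           size_t rows, size_t cols)
{
    assert(a && x && y);
    for (size_t r = 0; r < rows; ++r) {      /* one iteration = one work-item */
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c)
            acc += a[r * cols + c] * x[c];
        y[r] = acc;
    }
}

/* A stored column by column: at a given step c, work-items r and r+1
   read adjacent floats (contiguous access). */
static void gemv_col_major(const float *a, const float *x, float *y,
                           size_t rows, size_t cols)
{
    assert(a && x && y);
    for (size_t r = 0; r < rows; ++r) {      /* one iteration = one work-item */
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c)
            acc += a[c * rows + r] * x[c];
        y[r] = acc;
    }
}
```

On the CPU both loops are simple references. On a GPU, with this one-row-per-work-item mapping, the contiguous layout usually lets the hardware combine the reads of neighboring work-items into fewer memory transactions.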

Mandelbrot set

- An example of a compute-bound task
- Influence of flow control instructions
- See GPU Mandelbrot Set
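
The per-pixel computation behind this activity can be sketched in plain C as follows. The function name mandel_iterations and its parameters are hypothetical. The point to notice is that the loop count depends on the input point, so work-items executing together on neighboring pixels may need different numbers of iterations; this is where flow control starts to influence performance.

```c
#include <assert.h>

/* Escape-time iteration for the point c = (cx, cy).
   Returns the number of iterations performed before |z| exceeds 2,
   capped at max_iter for points that never escape. */
static int mandel_iterations(float cx, float cy, int max_iter)
{
    float x = 0.0f, y = 0.0f; /* z starts at 0 */
    int i = 0;
    assert(max_iter >= 0);

    /* Data-dependent loop: the trip count varies from pixel to pixel. */
    while (i < max_iter && x * x + y * y <= 4.0f) {
        float xt = x * x - y * y + cx; /* real part of z*z + c */
        y = 2.0f * x * y + cy;         /* imaginary part of z*z + c */
        x = xt;
        ++i;
    }
    return i;
}
```

For example, the point (0, 0) never escapes and runs for the full max_iter iterations, while the point (2, 2) escapes after a single iteration.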

Matrix-matrix product

- Compute bound or memory bound?
- Shared memory and registers

Convolution

- Another classic problem
- Shared memory and data access patterns

1D FFT

- How fast can we make it?
- The influence of memory access patterns
- Small independent threads or work-groups saturating all resources?
- See OpenCL FFT

Heat equation (NEW)

- Another cool problem featuring Joseph Fourier :-)
- Using 2D images
- OpenCL/OpenGL interoperability
- Qt interface

Video processing (NEW)

- Grab frames from a webcam, process and display
- Using 2D images
- OpenCL/OpenGL interoperability
- Qt interface