OpenCL Training Course
Objectives
- Learn CPU/GPU programming with OpenCL.
- Know what (not) to expect from a CPU or GPU.
- Understand heavy multithreading and how it is mapped to the hardware.
- Measure OpenCL code performance, locate and solve bottlenecks.
- Write efficient OpenCL code.
Audience
The course targets advanced C/C++ developers with little or no knowledge of OpenCL and some notions of multithreading.
Organization
Training is on-site, for groups of up to 10 people.
In the classroom, each participant must have access to a computer with an OpenCL SDK installed (ATI or NVidia) and an OpenCL device. The classroom must provide a video projector and a whiteboard.
The course runs for 3 to 5 (or more) consecutive days, 7 hours per day. Presentations take approximately 5 hours in total. The remaining time is devoted to working together through a series of classic programming problems, similar to the one on the OpenCL GEMV page of this site. Each problem introduces new notions and lets us experiment with efficient and inefficient variants of the code.
After the first 3 days, the remaining days are devoted to activities specific to your application field. If you are working on a project involving OpenCL, we can work together to define its architecture and start developing a first prototype.
Universities. A lot of traffic on this page comes from universities. I am available and open to collaborations. For example, I can teach a one-week intensive course on parallel programming on GPUs (theory and practice). In Europe, I am available for regular courses (say 4 hours a week).
Cost
I charge 700 EUR/day, plus travel and accommodation expenses. Contact me for further details.
Program
Most of the training is spent programming, with a few additional slides used when needed, mainly at the beginning of the course.
Slides
Architecture of some recent CPUs and GPUs
- Intel Nehalem
- NVidia GT200
- NVidia Fermi
- AMD Evergreen
How can an NVidia GTX285 (1.4 B transistors, running at 1.3 GHz) be 50x faster than a Core i7 (731 M transistors, running at 3 GHz) on some problems? We will see how the architecture of each chip is balanced between memory and computation, and how instruction latency and throughput are managed.
Introduction to OpenCL
- Terminology
- Host / Device
- Memory model
- Execution Model
Host-side OpenCL
- Connecting to a device, platforms
- Host objects: device, command queue, program, kernel, buffer, image
Device-side OpenCL
- The OpenCL C programming language
- How code is executed on hardware
Efficient OpenCL
- When (not) to use a GPU
- Memory latency and access patterns
- ALU latency
- Using local memory
- Synchronizing threads
- Warps/Wavefronts, work groups, and GPU cores
- Profiling
- Code design guidelines
Activities
Some activities can be skipped, and we may start experimenting on other subjects, depending on how things evolve during the course.
For each problem, I provide a code skeleton as a starting point, and you only have to focus on the interesting part.
Depending on your needs, programming can be on Linux or Windows, in C or C++.
Hello OpenCL
- List platforms and devices
- Connect to a device
- Query platform and device properties
Buffers
- Manipulating OpenCL buffers
- Moving data around: CPU to/from GPU, GPU to GPU
Kernels
- Compiling and running code on the GPU
- OpenCL compute model: items, groups
Sum
- Sum all values of an array
- Synchronize threads and share data inside a group
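As an illustration of the kind of pattern this activity explores, here is a minimal sketch of a tree reduction, written in plain C as a CPU emulation rather than as an OpenCL kernel. The function name group_sum and the scratch array (standing in for local memory) are hypothetical, and the sketch assumes the group size is a power of two. It is not the course's own code.

```c
#include <assert.h>
#include <stddef.h>

/* CPU emulation of a tree reduction inside one work-group.
   'scratch' stands in for local memory; each iteration of the inner
   loop stands in for one work-item. n must be a power of two
   (an assumption of this sketch). */
float group_sum(float *scratch, size_t n)
{
    assert(scratch != NULL);
    assert(n > 0 && (n & (n - 1)) == 0);

    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* On a device, the work-items 0..stride-1 would run this
           step concurrently. */
        for (size_t i = 0; i < stride; i++)
            scratch[i] += scratch[i + stride];
        /* On a device, a barrier would be required here so that every
           partial sum is written before the next step reads it. */
    }
    return scratch[0];
}
```

Each step halves the number of active work-items, so n values are summed in log2(n) synchronized steps. Calling group_sum on an array holding 1 to 8 leaves the total, 36, in scratch[0].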
Matrix-vector product
- An example of a memory-bound task
- Efficient/inefficient memory access patterns
- See OpenCL GEMV
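To make the access-pattern question concrete, here is a minimal sketch of the matrix-vector product in plain C, assuming one work-item per output row and a column-major matrix. The function name gemv_colmajor and the storage layout are assumptions chosen for illustration, not taken from the OpenCL GEMV page.

```c
#include <assert.h>
#include <stddef.h>

/* y = A*x for an m x n matrix stored column-major: A(i,j) = a[i + j*m].
   The outer loop over i stands in for the work-items: one per row. */
void gemv_colmajor(const float *a, const float *x, float *y,
                   size_t m, size_t n)
{
    assert(a != NULL && x != NULL && y != NULL);

    for (size_t i = 0; i < m; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++) {
            /* At a given j, neighbouring rows i and i+1 read
               neighbouring addresses of 'a'. */
            acc += a[i + j * m] * x[j];
        }
        y[i] = acc;
    }
}
```

Every element of A is read exactly once and used for one multiply and one add, which is why the task tends to be limited by memory bandwidth rather than by arithmetic. With this layout, adjacent work-items read adjacent addresses at each step. A row-major layout under the same one-work-item-per-row mapping would make them read addresses n elements apart.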
Mandelbrot set
- An example of a compute-bound task
- Influence of flow control instructions
- See GPU Mandelbrot Set
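The per-pixel computation behind this activity can be sketched in plain C as follows. The function name mandelbrot_iterations and its parameters are hypothetical. This is the standard escape-time iteration, shown as an illustration rather than as the code on the GPU Mandelbrot Set page.

```c
#include <assert.h>

/* Number of iterations of z = z*z + c (starting from z = 0) before
   |z| exceeds 2, capped at max_iter. (cx, cy) is the point c. */
int mandelbrot_iterations(float cx, float cy, int max_iter)
{
    assert(max_iter >= 0);

    float x = 0.0f, y = 0.0f;
    int i = 0;
    /* The loop exit depends on the data: different pixels leave
       after a different number of iterations. */
    while (i < max_iter && x * x + y * y <= 4.0f) {
        float t = x * x - y * y + cx;
        y = 2.0f * x * y + cy;
        x = t;
        i++;
    }
    return i;
}
```

The loop touches almost no memory, so the work is dominated by arithmetic. Because the trip count varies from pixel to pixel, work-items executing in the same warp or wavefront can leave the loop at different times, and the group stays busy until its slowest pixel finishes.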
Matrix-matrix product
- Compute bound or memory bound?
- Shared memory and registers
Convolution
- Another classic problem
- Shared memory and data access patterns
1D FFT
- How fast can we make it?
- The influence of memory access pattern
- Small independent threads or work-groups saturating all resources?
- See OpenCL FFT
Heat equation (NEW)
- Another cool problem featuring Joseph Fourier :-)
- Using 2D images
- OpenCL/OpenGL interoperability
- Qt interface
Video processing (NEW)
- Grab frames from a webcam, process and display
- Using 2D images
- OpenCL/OpenGL interoperability
- Qt interface
Eric Bainville
