# GPU matrix-vector product (gemv)

Eric Bainville - Feb 2010

## Introduction

In this article, we will see various OpenCL implementations of the general matrix-vector product. This function is known in the BLAS standard library as sgemv (single precision) and dgemv (double precision).

Given an input m×n matrix A (A has m rows and n columns) and an input vector X of dimension n,
we compute a vector Y of dimension m, defined by y_i = Σ_{k=0}^{n-1} A_{ik} X_k. In other words, component
i of Y is the dot product of row i of A (a vector of dimension n) with vector X.

Matrix A will be stored in column-major order. In memory, the matrix is represented by an array of m*n scalars (float or double). The first m values are the first column of A, the next m values are the second column, etc. Vectors X and Y are represented by arrays of n and m contiguous values respectively.
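With column-major storage, element A(i,j) lives at offset i + j*m in the array. As a minimal sketch of this layout, here is a plain-C reference implementation of the product (the helper name `gemv_ref` is ours, not from the article); traversing column by column keeps the reads of A contiguous:

```c
#include <stddef.h>

/* Reference gemv: y = A*x, with A stored column-major (m rows, n columns).
   Element A(i,j) is at a[i + j*m]. Illustrative helper, not the GPU kernel. */
static void gemv_ref(size_t m, size_t n, const float *a,
                     const float *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0f;
    /* Column by column: each column of A is contiguous in memory. */
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += a[i + j * m] * x[j];
}
```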

### Contents

- One thread per dot product - First version, one thread per component of Y.
- P threads per dot product - Second version, p threads per component of Y, local memory receives output.
- Two kernels - Third version, two kernels, p threads per component of Y, local memory caches input.
- Conclusions - Summary of benchmarks, and conclusions.

### Performance metrics

The total memory bandwidth M (GB/s) will be computed as M = 10^{-9}·(m·n + m + n)·sizeof(scalar type)/T,
where m·n + m + n is the input+output scalar count, and T is the execution time (wall-clock seconds). The total compute load
C (GFlops) will be computed as C = 10^{-9}·m·(2n-1)/T, corresponding to m dot products of dimension n
(each costing n products and n-1 additions).
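The two metrics can be sketched as small helper functions (the function names are ours, assuming a measured wall-clock time t in seconds):

```c
#include <stddef.h>

/* Memory bandwidth in GB/s: (m*n + m + n) scalars moved, elem_size bytes each. */
static double bandwidth_gbs(double m, double n, size_t elem_size, double t)
{
    return 1e-9 * (m * n + m + n) * (double)elem_size / t;
}

/* Compute load in GFlops: m dot products of dimension n,
   each costing n multiplies and n-1 additions. */
static double gflops(double m, double n, double t)
{
    return 1e-9 * m * (2.0 * n - 1.0) / t;
}
```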

We will measure the execution time for m=n=4096. In this case, the matrix A occupies 64 MB in single precision and 128 MB in double precision. We run the tests on the following OpenCL devices:

- GTX285: NVidia GTX285, Linux 2.6.31 x86_64, Driver 195.30, Cuda SDK 3.0 beta1
- HD5870: ATI Radeon HD5870, Linux 2.6.31 x86_64, Driver CAL 1.4.519 (Catalyst 10.1), Stream SDK 2.0

CUBLAS (3.0 beta1) timings are the following:

- GTX285, cublas, float: M = 47 GB/s, C = 23 GFlops
- GTX285, cublas, double: M = 90 GB/s, C = 22 GFlops

### Double vs float

We will run the tests on double and float scalars. In the host (C) and device (OpenCL) code, we define a USE_DOUBLE flag set to 0 (float) or 1 (double). This flag is passed to the OpenCL code through the clBuildProgram options string, as -DUSE_DOUBLE=0 or -DUSE_DOUBLE=1. In the OpenCL code, we must enable the fp64 extension when the double type is used:

```c
#ifndef USE_DOUBLE
#define USE_DOUBLE 0
#endif

#if USE_DOUBLE
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double scalar_t;
#else
typedef float scalar_t;
#endif
```

The host side code is identical, except the #pragma line is omitted.
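On the host side, the option string is built once and handed to clBuildProgram. A minimal sketch (the helper name `gemv_build_options` is ours; the article specifies only the -DUSE_DOUBLE option itself):

```c
/* Returns the clBuildProgram options string selecting the scalar type.
   Illustrative helper; error handling and program setup omitted. */
static const char *gemv_build_options(int use_double)
{
    return use_double ? "-DUSE_DOUBLE=1" : "-DUSE_DOUBLE=0";
}

/* Usage sketch, assuming 'program' and 'device' already exist:
   clBuildProgram(program, 1, &device, gemv_build_options(1), NULL, NULL); */
```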

The next page presents a simple implementation of the product, where each thread computes one dot product.
