# GPU matrix-vector product (gemv)

Eric Bainville - Feb 2010

## Introduction

In this article, we will see various OpenCL implementations of the general matrix-vector product. This function is known in the BLAS standard library as sgemv (single precision) and dgemv (double precision).

Given an input m×n matrix A (A has m rows and n columns) and an input vector X of dimension n,
we compute a vector Y of dimension m, defined by y_i = Σ_{k=0}^{n-1} A_{ik} X_k. In other words, component
i of Y is the dot product of row i of A (a vector of dimension n) with vector X.

Matrix A will be stored in column-major order. In memory, the matrix is represented by an array of m*n scalars (float or double). The first m values are the first column of A, the next m values are the second column, etc. Vectors X and Y are represented by arrays of n and m contiguous values respectively.
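With column-major storage, element A(i,j) lives at offset i + j*m in the array. As a minimal sketch of this layout, here is a plain-C reference implementation of the product (the helper name `gemv_ref` is ours, not from the article); traversing column by column keeps the reads of A contiguous:

```c
#include <stddef.h>

/* Reference gemv: y = A*x, with A stored column-major (m rows, n columns).
   Element A(i,j) is at a[i + j*m]. Illustrative helper, not the GPU kernel. */
static void gemv_ref(size_t m, size_t n, const float *a,
                     const float *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0f;
    /* Column by column: each column of A is contiguous in memory. */
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += a[i + j * m] * x[j];
}
```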

### Contents

- One thread per dot product - First version, one thread per component of Y.
- P threads per dot product - Second version, p threads per component of Y, local memory receives output.
- Two kernels - Third version, two kernels, p threads per component of Y, local memory caches input.
- Conclusions - Summary of benchmarks, and conclusions.

### Performance metrics

The total memory bandwidth M (GB/s) will be computed as M = 10^{-9}·(m·n + m + n)·sizeof(scalar type)/T,
where m·n + m + n is the input+output scalar count, and T is the execution time (wall-clock seconds). The total compute load
C (GFlops) will be computed as C = 10^{-9}·m·(2n-1)/T, corresponding to m dot products of dimension n
(each costing n products and n-1 additions).
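The two metrics can be sketched as small helper functions (the function names are ours, assuming a measured wall-clock time t in seconds):

```c
#include <stddef.h>

/* Memory bandwidth in GB/s: (m*n + m + n) scalars moved, elem_size bytes each. */
static double bandwidth_gbs(double m, double n, size_t elem_size, double t)
{
    return 1e-9 * (m * n + m + n) * (double)elem_size / t;
}

/* Compute load in GFlops: m dot products of dimension n,
   each costing n multiplies and n-1 additions. */
static double gflops(double m, double n, double t)
{
    return 1e-9 * m * (2.0 * n - 1.0) / t;
}
```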

We will measure the execution time for m=n=4096. In this case, the matrix A occupies 64 MB in single precision and 128 MB in double precision. We run the tests on the following OpenCL devices:

- GTX285: NVidia GTX285, Linux 2.6.31 x86_64, Driver 195.30, Cuda SDK 3.0 beta1
- HD5870: ATI Radeon HD5870, Linux 2.6.31 x86_64, Driver CAL 1.4.519 (Catalyst 10.1), Stream SDK 2.0

CUBLAS (3.0 beta1) timings are the following:

- GTX285, cublas, float: M = 47 GB/s, C = 23 GFlops
- GTX285, cublas, double: M = 90 GB/s, C = 22 GFlops

### Double vs float

We will run the tests on double and float scalars. In the host (C) and device (OpenCL) code, we define a USE_DOUBLE flag set to 0 (float) or 1 (double). This flag is passed to the OpenCL code through the clBuildProgram options string, as -DUSE_DOUBLE=0 or -DUSE_DOUBLE=1. In the OpenCL code, we must enable the fp64 extension when the double type is used:

```c
#ifndef USE_DOUBLE
#define USE_DOUBLE 0
#endif

#if USE_DOUBLE
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double scalar_t;
#else
typedef float scalar_t;
#endif
```

The host side code is identical, except the #pragma line is omitted.
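On the host side, the option string is built once and handed to clBuildProgram. A minimal sketch (the helper name `gemv_build_options` is ours; the article specifies only the -DUSE_DOUBLE option itself):

```c
/* Returns the clBuildProgram options string selecting the scalar type.
   Illustrative helper; error handling and program setup omitted. */
static const char *gemv_build_options(int use_double)
{
    return use_double ? "-DUSE_DOUBLE=1" : "-DUSE_DOUBLE=0";
}

/* Usage sketch, assuming 'program' and 'device' already exist:
   clBuildProgram(program, 1, &device, gemv_build_options(1), NULL, NULL); */
```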

The next page presents a simple implementation of the product, where each thread computes one dot product.
