# GPU matrix-vector product (gemv)

Eric Bainville - Feb 2010

## Introduction

In this article, we will see various OpenCL implementations of the general matrix-vector product. This function is known in the BLAS standard library as sgemv (single precision) and dgemv (double precision).

Given an input m×n matrix A (A has m rows and n columns) and an input vector X of dimension n,
we compute a vector Y of dimension m, defined by y_i = Σ_{k=0}^{n-1} A_{ik} X_k. In other words, component
i of Y is the dot product of row i of A (a vector of dimension n) with vector X.

Matrix A will be stored in column-major order. In memory, the matrix is represented by an array of m*n scalars (float or double). The first m values are the first column of A, the next m values are the second column, etc. Vectors X and Y are represented by arrays of n and m contiguous values respectively.
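With column-major storage, element A(i,j) lives at offset i + j*m in the array. As a minimal sketch of this layout, here is a plain-C reference implementation of the product (the helper name `gemv_ref` is ours, not from the article); traversing column by column keeps the reads of A contiguous:

```c
#include <stddef.h>

/* Reference gemv: y = A*x, with A stored column-major (m rows, n columns).
   Element A(i,j) is at a[i + j*m]. Illustrative helper, not the GPU kernel. */
static void gemv_ref(size_t m, size_t n, const float *a,
                     const float *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0f;
    /* Column by column: each column of A is contiguous in memory. */
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += a[i + j * m] * x[j];
}
```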

### Contents

- One thread per dot product - First version, one thread per component of Y.
- P threads per dot product - Second version, p threads per component of Y, local memory receives output.
- Two kernels - Third version, two kernels, p threads per component of Y, local memory caches input.
- Conclusions - Summary of benchmarks, and conclusions.

### Performance metrics

The total memory bandwidth M (GB/s) will be computed as M = 10^{-9}·(m·n + m + n)·sizeof(scalar type)/T,
where m·n + m + n is the input+output scalar count, and T is the execution time (wall-clock seconds). The total compute load
C (GFlops) will be computed as C = 10^{-9}·m·(2n-1)/T, corresponding to m dot products of dimension n
(each costing n products and n-1 additions).
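The two metrics can be sketched as small helper functions (the function names are ours, assuming a measured wall-clock time t in seconds):

```c
#include <stddef.h>

/* Memory bandwidth in GB/s: (m*n + m + n) scalars moved, elem_size bytes each. */
static double bandwidth_gbs(double m, double n, size_t elem_size, double t)
{
    return 1e-9 * (m * n + m + n) * (double)elem_size / t;
}

/* Compute load in GFlops: m dot products of dimension n,
   each costing n multiplies and n-1 additions. */
static double gflops(double m, double n, double t)
{
    return 1e-9 * m * (2.0 * n - 1.0) / t;
}
```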

We will measure the execution time for m=n=4096. In this case, the matrix A occupies 64 MB in single precision and 128 MB in double precision. We run the tests on the following OpenCL devices:

- GTX285: NVidia GTX285, Linux 2.6.31 x86_64, Driver 195.30, Cuda SDK 3.0 beta1
- HD5870: ATI Radeon HD5870, Linux 2.6.31 x86_64, Driver CAL 1.4.519 (Catalyst 10.1), Stream SDK 2.0

CUBLAS (3.0 beta1) timings are the following:

- GTX285, cublas, float: M = 47 GB/s, C = 23 GFlops
- GTX285, cublas, double: M = 90 GB/s, C = 22 GFlops

### Double vs float

We will run the tests on double and float scalars. In the host (C) and device (OpenCL) code, we define a USE_DOUBLE flag set to 0 (float) or 1 (double). This flag is passed to the OpenCL code through the clBuildProgram options string, as -DUSE_DOUBLE=0 or -DUSE_DOUBLE=1. In the OpenCL code, we must enable the fp64 extension when the double type is used:

```c
#ifndef USE_DOUBLE
#define USE_DOUBLE 0
#endif

#if USE_DOUBLE
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double scalar_t;
#else
typedef float scalar_t;
#endif
```

The host side code is identical, except the #pragma line is omitted.
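On the host side, the option string is built once and handed to clBuildProgram. A minimal sketch (the helper name `gemv_build_options` is ours; the article specifies only the -DUSE_DOUBLE option itself):

```c
/* Returns the clBuildProgram options string selecting the scalar type.
   Illustrative helper; error handling and program setup omitted. */
static const char *gemv_build_options(int use_double)
{
    return use_double ? "-DUSE_DOUBLE=1" : "-DUSE_DOUBLE=0";
}

/* Usage sketch, assuming 'program' and 'device' already exist:
   clBuildProgram(program, 1, &device, gemv_build_options(1), NULL, NULL); */
```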

The next page presents a simple implementation of the product, where each thread computes one dot product.
