How fast can we compute a 1D gradient?

Eric Bainville - Oct 2009

On this page, I relate my quest for the fastest code to compute the 1D gradient of a float vector with the simple kernel [-1,0,+1].

The target CPU is an Intel Core i7 920 (Bloomfield core). Timings are measured in CPU cycles per float. The code is compiled with Microsoft Visual C++ 2008 SP1 using option /O2, combined with various code generation options: no SSE (noted /arch:none here), /arch:SSE, or /arch:SSE2, and either /fp:precise or /fp:fast.

The input is an array of floats x[n], and the output is an array of floats y[n] defined by:

y[0]   = 0
y[i]   = x[i+1] - x[i-1] for i=1,2,...,n-2
y[n-1] = 0

n is assumed to be a multiple of 4, to avoid special cases later in the SSE code.
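To make the definition concrete, here is a minimal C sketch that applies it literally. It is only a reference for checking results, not one of the implementations timed in this article; the function name gradient_ref and the guard for n = 0 are assumptions added for illustration.

```c
#include <stddef.h>

/* Reference 1D gradient with kernel [-1,0,+1].
   Hypothetical helper: the name and the n == 0 guard are illustrative only.
   x and y must not overlap, and each must hold n floats. */
void gradient_ref(const float *x, float *y, size_t n)
{
    size_t i;

    if (n == 0) {
        return;                      /* nothing to compute */
    }

    y[0] = 0.0f;                     /* left border */
    for (i = 1; i + 1 < n; i++) {    /* interior: i = 1 .. n-2 */
        y[i] = x[i + 1] - x[i - 1];
    }
    y[n - 1] = 0.0f;                 /* right border */
}
```

Each interior output needs two loads, one subtraction and one store, so the work per float is tiny and any loop or memory overhead shows up directly in the cycle count.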

In the following pages, we will first test a few C implementations.