How fast can we compute 1D gradient?
Eric Bainville - Oct 2009
On this page, I describe my search for the fastest code to compute the 1D gradient of a float vector with the simple kernel [-1,0,+1].
The target CPU is an Intel Core i7 920 (Bloomfield core). Timings are measured in CPU cycles per float. The code is compiled with Microsoft Visual C++ 2008 SP1 using option /O2, combined with different code generation options: no SSE (noted /arch:none here), /arch:SSE, or /arch:SSE2, and /fp:precise or /fp:fast.
On input, we have an array of float x[n] and the output is an array float y[n] defined by:
  y[0]   = 0
  y[i]   = x[i+1] - x[i-1]   for i=1,2,...,n-2
  y[n-1] = 0
We assume n is a multiple of 4, to avoid special cases later in the SSE code.
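To make the definition concrete, here is a minimal scalar sketch in C. It is only a direct transcription of the formulas, not one of the implementations benchmarked on the following pages. The function name gradient_ref is hypothetical, and the early return for an empty array is an added assumption.

```c
#include <stddef.h>

/* Hypothetical reference version: a direct transcription of the definition.
   x is the input array, y the output array, both of length n. */
void gradient_ref(const float *x, float *y, size_t n)
{
    size_t i;

    if (n == 0)
        return;                      /* assumption: nothing to do for an empty array */

    y[0] = 0.0f;                     /* left border */
    for (i = 1; i + 1 < n; i++)
        y[i] = x[i + 1] - x[i - 1];  /* kernel [-1,0,+1] applied at position i */
    y[n - 1] = 0.0f;                 /* right border */
}
```

For example, with x = {1, 2, 4, 7, 11, 16, 22, 29}, the interior outputs are y[1] = 4 - 1 = 3 and y[6] = 29 - 16 = 13, and both borders are 0.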
In the following pages, we will first test a few C implementations.