How fast can we compute the dot product?

Eric Bainville - Dec 2007

C code

A straightforward C implementation of the dot product is the following:

Real dot(int n,const Real * x,const Real * y)
{
  Real s = 0;
  for (int i=0;i<n;i++) s += x[i] * y[i];
  return s;
}

Here Real is either float or double. This code runs at 3.00 cycles per component both for float and double, because gcc generates the same non-SIMD SSE code. We can improve the running time by unrolling the loop, and separating the even and odd components in the sum:

Real dot(int n,const Real * x,const Real * y)
{
  int i;
  Real s0,s1;
  s0 = s1 = 0;
  for (i=n>>1;i>0;i--)
    {
      s0 += x[0] * y[0];
      s1 += x[1] * y[1];
      x += 2;
      y += 2;
    }
  return s0 + s1;
}

This code runs at 2.00 cycles per component for float and double. Unrolling further the loop to 4 components per iteration does not change the running time.