How fast can we compute the dot product?

Eric Bainville - Dec 2007

SSE code

Let's consider first the case of float (32-bit) coordinates. We can read 4 components per cycle using movaps. Once we have two blocks of 4 components in two SSE registers, we multiply them together using mulps, and then accumulate the product into the result using addps. The result will be a 4-component vector. After the end of the loop we compute the final sum by adding together the 4 components of this vector.

From Agner Fog's excellent documents, we know the scheduling constraints of the instructions we use:
- movaps on port P2, result after 2 cycles, one call per cycle,
- mulps on port P0, result after 4 cycles, one call per cycle,
- addps on port P1, result after 3 cycles, one call per cycle,
- lea on port P0,
- jnz on port P5.

Based on these values, we should be limited by memory access. Reading 4 components for each vector requires at least 2 cycles, giving a theoretical time of 0.50 cycles per component.

Let's try the simplest pattern:

	shr	N, 2
	pxor	S0, S0
        align   16
.a:
	movaps	xmm0, [X]
	mulps	xmm0, [Y]
	addps	S0, xmm0
	lea	X, [X + 16]
	lea	Y, [Y + 16]
	dec	N
        jnz	.a

	; Compute result: sum of the 4 components of S0

It runs at 1.00 cycles/component. Unrolling the loop, with the sum in two different registers S0 and S1, we get 0.66 cycles/component with the following pattern:

	movaps	xmm0, [X]
	mulps	xmm0, [Y]
	addps	S0, xmm0
	movaps	xmm1, [X + 16]
	mulps	xmm1, [Y + 16]
	addps	S1, xmm1
	lea	X, [X + 32]
	lea	Y, [Y + 32]

We can reach 0.63 cycles/component by permuting the instructions:

	movaps	xmm0, [X]
	movaps	xmm1, [X + 16]
	mulps	xmm0, [Y]
	mulps	xmm1, [Y + 16]
	addps	S0, xmm0
	addps	S1, xmm1
	lea	X, [X + 32]
	lea	Y, [Y + 32]

The minimal timing of 0.50 cycles/component requires further unrolling (16 components per iteration) and "pipelined" computations to help the processor schedule the instructions. In this case, the first and last iterations are slighty different, and can't be part of the main loop. Note the cyclic pattern of register numbering, and only two sum registers are needed (S0 and S1). This is the pattern of the main loop:

	movaps	xmm0, [X]
	mulps	xmm3, [Y - 16]
	addps	S0, xmm2
	
	movaps	xmm1, [X + 16]
	mulps	xmm0, [Y]
	addps	S1, xmm3
	
	movaps	xmm2, [X + 32]
	mulps	xmm1, [Y + 16]
	addps	S0, xmm0
	
	movaps	xmm3, [X + 48]
	mulps	xmm2, [Y + 32]
	addps	S1, xmm1

	lea	X, [X + 64]
	lea	Y, [Y + 64]

In double precision, the same pattern (with -pd functions instead of -ps) reaches the optimal time of 1.00 cycles/component.