AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Memory Zero

This function sets to 0 all words of a vector (Z,N).

A simple implementation, setting one word at each iteration of the loop, would be:

1 word per iteration:
	xor     ZERO, ZERO	; ZERO is any 64-bit register
	lea	Z, [Z + 8*N]	; Loop from the end
	neg	N
	; Loop body should start on a 16 byte boundary
        align   16
.a:
        mov     [Z + 8*N], ZERO
	inc	N
        jnz     .a

Here ZERO, is an auxiliary register, and N, Z are the two input registers. The execution time here is 2 cycles/word. The limitation comes from the jnz: there must be at least two cycles between two taken jumps.

The CPU provides 3 integer evaluation units, each one with an integer ALU and an AGU (address generation unit), but only two 64-bit memory operations (read or write) can occur at the same time. So we can clear at most 2 words per cycle: the minimal timing imposed by the memory access is 0.5 cycles/word.

Since the loop body must be at least 2 cycles, to reach the optimum, we must clear at least 4 words per iteration, as in the following:

4 words per iteration:
	shr	N, 2		; One iteration is 4 words
	xor     ZERO, ZERO
        align   16
.a:
        mov     [Z     ], ZERO
        mov     [Z +  8], ZERO
        mov     [Z + 16], ZERO
        mov     [Z + 24], ZERO
	lea	Z, [Z + 32]
	dec	N
        jnz     .a

One iteration requires 3 cycles, or 0.65 cycles/word. The lea updating Z may introduce a delay in the address computation. Moving it between the 2nd and 3rd mov, or replacing it with an add does not change the timing. We have to unroll the loop further, and use two alternating address registers Z and X to reach the optimal timing:

8 words per iteration, two address registers:
	shr     N, 3
        xor     ZERO, ZERO
        align   16
.a:
        mov     [Z     ], ZERO
        mov     [Z +  8], ZERO
	lea	X, [Z + 32]
        mov     [Z + 16], ZERO
        mov     [Z + 24], ZERO
        dec     N
        mov     [X     ], ZERO
        mov     [X +  8], ZERO
	lea	Z, [X + 32]
        mov     [X + 16], ZERO
        mov     [X + 24], ZERO
        jnz     .a

One iteration now takes 4 cycles, for an optimal timing of 0.50 cycles/word.