# AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

## Memory Zero

This function sets to 0 all words of a vector (Z,N).

A simple implementation, setting one word at each iteration of the loop, would be:

```1 word per iteration:
xor     ZERO, ZERO	; ZERO is any 64-bit register
lea	Z, [Z + 8*N]	; Loop from the end
neg	N
; Loop body should start on a 16 byte boundary
align   16
.a:
mov     [Z + 8*N], ZERO
inc	N
jnz     .a
```

Here ZERO, is an auxiliary register, and N, Z are the two input registers. The execution time here is 2 cycles/word. The limitation comes from the jnz: there must be at least two cycles between two taken jumps.

The CPU provides 3 integer evaluation units, each one with an integer ALU and an AGU (address generation unit), but only two 64-bit memory operations (read or write) can occur at the same time. So we can clear at most 2 words per cycle: the minimal timing imposed by the memory access is 0.5 cycles/word.

Since the loop body must be at least 2 cycles, to reach the optimum, we must clear at least 4 words per iteration, as in the following:

```4 words per iteration:
shr	N, 2		; One iteration is 4 words
xor     ZERO, ZERO
align   16
.a:
mov     [Z     ], ZERO
mov     [Z +  8], ZERO
mov     [Z + 16], ZERO
mov     [Z + 24], ZERO
lea	Z, [Z + 32]
dec	N
jnz     .a
```

One iteration requires 3 cycles, or 0.65 cycles/word. The lea updating Z may introduce a delay in the address computation. Moving it between the 2nd and 3rd mov, or replacing it with an add does not change the timing. We have to unroll the loop further, and use two alternating address registers Z and X to reach the optimal timing:

```8 words per iteration, two address registers:
shr     N, 3
xor     ZERO, ZERO
align   16
.a:
mov     [Z     ], ZERO
mov     [Z +  8], ZERO
lea	X, [Z + 32]
mov     [Z + 16], ZERO
mov     [Z + 24], ZERO
dec     N
mov     [X     ], ZERO
mov     [X +  8], ZERO
lea	Z, [X + 32]
mov     [X + 16], ZERO
mov     [X + 24], ZERO
jnz     .a
```

One iteration now takes 4 cycles, for an optimal timing of 0.50 cycles/word.