Intel64 Multi-Precision Arithmetic

Eric Bainville - Dec 2006

Memory Zero

This function sets all words of a vector (Z,n) to 0.

The Core 2 Duo architecture can write one 128-bit word per clock cycle (the Athlon 64 could write two 64-bit words per cycle). Doing this requires using a 128-bit XMM register, as in the following code:

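        shr     N, 1            ; N words -> N/2 128-bit stores (N assumed even, >= 2)
        pxor    xmm0, xmm0      ; xmm0 = 0
        align   16
.a:
        movdqa  [Z], xmm0       ; aligned 128-bit store (Z must be 16-byte aligned)
        lea     Z, [Z + 16]     ; advance the pointer without modifying the flags
        dec     N
        jnz     .a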

The lea and dec both execute in the same cycle as the movdqa, so the loop runs at 1 cycle/iteration. Each iteration clears two 64-bit words, giving 0.50 cycle/word.
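
The loop as written relies on a few preconditions: N must be even and at least 2 (shr drops an odd last word, and N = 0 would make dec wrap around), and Z must be 16-byte aligned, since movdqa faults on an unaligned address. As a minimal sketch of how a caller might lift these assumptions, here is a C version using SSE2 intrinsics. The name zero_words is hypothetical and not part of the original code. C is used here only to make the edge cases easy to check. The compiler is free to schedule or unroll this loop differently, so the 1 cycle/iteration figure is not guaranteed for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper (not from the article): zero n 64-bit words at z.
   z only needs the natural 8-byte alignment of uint64_t. */
static void zero_words(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    size_t i = 0;
#if defined(__SSE2__)
    /* Peel one word if z is not 16-byte aligned, so that the
       aligned 128-bit store can be used for the bulk of the vector. */
    if (n > 0 && ((uintptr_t)z & 15) != 0) {
        z[i++] = 0;
    }
    const __m128i zero = _mm_setzero_si128();       /* pxor xmm0, xmm0 */
    for (; i + 2 <= n; i += 2) {
        _mm_store_si128((__m128i *)(z + i), zero);  /* movdqa [Z], xmm0 */
    }
#endif
    /* Remaining words: an odd last word, or everything when SSE2 is absent. */
    for (; i < n; i++) {
        z[i] = 0;
    }
}
```

Calling zero_words(Z, n) with any n, including 0 or an odd value, clears exactly n words. Only the peeled first word and a possible odd last word go through the scalar path.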