Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
Memory Zero
This function sets to 0 all words of a vector (Z,n).
Conroe
The Core 2 Duo architecture can write one 128-bit word per clock cycle (the Athlon 64 could write two 64-bit words per cycle). Doing this requires using a 128-bit XMM register, as in the following code:
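    shr     N, 1              ; N = number of 128-bit stores (two words each)
    pxor    xmm0, xmm0        ; xmm0 = 0
    align 16
.a:
    movdqa  [Z], xmm0         ; aligned 128-bit store of zero
    lea     Z, [Z + 16]       ; advance Z by 16 bytes
    dec     N
    jnz     .a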
The lea and dec are both computed in the same cycle as the movdqa, so the loop runs at 1 cycle/iteration. Each iteration writes two 64-bit words, which gives 0.50 cycle/word.
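The loop above carries some implicit preconditions: movdqa needs Z aligned on 16 bytes, shr N, 1 drops the last word when n is odd, and the loop is entered before N is tested, so n must be at least 2. The following C sketch with SSE2 intrinsics makes these explicit. The function name zero_words_sse2 and the assert checks are illustrative assumptions, not part of the original code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C analogue of the assembly loop: zero n 64-bit words at z.
   Assumes z is 16-byte aligned and n is even and nonzero. */
static void zero_words_sse2(uint64_t *z, size_t n)
{
    assert(((uintptr_t)z & 15) == 0);          /* movdqa needs alignment */
    assert(n != 0 && (n & 1) == 0);            /* shr would drop an odd word */

    const __m128i zero = _mm_setzero_si128();  /* pxor xmm0, xmm0 */
    size_t count = n >> 1;                     /* shr N, 1 */
    do {
        _mm_store_si128((__m128i *)z, zero);   /* movdqa [Z], xmm0 */
        z += 2;                                /* lea Z, [Z + 16] */
    } while (--count != 0);                    /* dec N / jnz .a */
}
```

An optimizing compiler is free to restructure this loop, so the sketch documents the logic and its preconditions rather than the cycle counts quoted above.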
Bloomfield
The above code runs slower on a Core i7. We have to unroll the loop, with two stores per iteration, to reach 0.50 cycle/word:
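    shr     N, 2              ; N = number of iterations (four words each)
    pxor    xmm0, xmm0        ; xmm0 = 0
    align 16
.a:
    movdqa  [Z], xmm0         ; zero words 0 and 1
    movdqa  [Z+16], xmm0      ; zero words 2 and 3
    add     Z, 32             ; advance Z by 32 bytes
    sub     N, 1
    jnz     .a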
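Each iteration of the unrolled loop writes four 64-bit words, so 0.50 cycle/word corresponds to 2 cycles/iteration. Because of shr N, 2, the word count must now be a nonzero multiple of 4. A C sketch of the same structure, under the same assumptions as the assembly (16-byte aligned Z), could look like this; the function name zero_words_sse2_x2 and the assert checks are illustrative, not from the original.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C analogue of the unrolled loop: zero n 64-bit words at z.
   Assumes z is 16-byte aligned and n is a nonzero multiple of 4. */
static void zero_words_sse2_x2(uint64_t *z, size_t n)
{
    assert(((uintptr_t)z & 15) == 0);               /* movdqa needs alignment */
    assert(n != 0 && (n & 3) == 0);                 /* shr N, 2 drops the rest */

    const __m128i zero = _mm_setzero_si128();       /* pxor xmm0, xmm0 */
    size_t count = n >> 2;                          /* shr N, 2 */
    do {
        _mm_store_si128((__m128i *)z, zero);        /* movdqa [Z], xmm0 */
        _mm_store_si128((__m128i *)(z + 2), zero);  /* movdqa [Z+16], xmm0 */
        z += 4;                                     /* add Z, 32 */
    } while (--count != 0);                         /* sub N, 1 / jnz .a */
}
```

A caller with an arbitrary word count would need to handle the remaining n mod 4 words separately, for instance with plain 64-bit stores before or after the loop.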