Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
This function sets all n words of a vector (Z,n) to 0.
The Core 2 Duo architecture can write one 128-bit word per clock cycle (the Athlon 64 could write two 64-bit words per cycle). Doing this requires using a 128-bit XMM register, as in the following code:
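        shr     N, 1            ; N = number of 128-bit stores (two words each)
        pxor    xmm0, xmm0      ; xmm0 = 0
        align   16
.a:     movdqa  [Z], xmm0       ; aligned 128-bit store: Z must be 16-byte aligned
        lea     Z, [Z + 16]     ; advance Z by 16 bytes
        dec     N
        jnz     .a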
The lea and dec are both computed in the same cycle as the movdqa, so the loop runs at 1 cycle/iteration. Each iteration writes two 64-bit words, leading to 0.50 cycle/word.
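For readers more comfortable with C, the loop above can be sketched with SSE2 intrinsics. This is only an illustration under stated assumptions: the function name `zero_words_sse2` is hypothetical, words are taken to be 64-bit, and the preconditions implied by the assembly (n even and nonzero, Z aligned on 16 bytes) are made explicit with assertions. A compiler is free to reschedule or unroll this loop, so the cycle counts quoted for the assembly do not necessarily carry over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (not from the article): zero n 64-bit words at z.
   Assumes, like the assembly loop, that n is even and nonzero and that
   z is 16-byte aligned. */
void zero_words_sse2(uint64_t *z, size_t n)
{
    assert(n != 0 && n % 2 == 0);
    assert(((uintptr_t)z & 15) == 0);

    const __m128i zero = _mm_setzero_si128();   /* pxor xmm0, xmm0 */
    size_t iters = n / 2;                       /* shr N, 1 */
    do {
        _mm_store_si128((__m128i *)z, zero);    /* movdqa [Z], xmm0 */
        z += 2;                                 /* lea Z, [Z + 16] */
    } while (--iters != 0);                     /* dec N / jnz .a */
}
```

The do/while form mirrors the dec/jnz pair: the body always runs at least once, which is why n = 0 has to be excluded.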
The above code runs slower on a Core i7. We have to unroll the loop further to reach 0.50 cycle/word:
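        shr     N, 2            ; N = number of iterations (four words each)
        pxor    xmm0, xmm0      ; xmm0 = 0
        align   16
.a:     movdqa  [Z], xmm0       ; first aligned 128-bit store
        movdqa  [Z+16], xmm0    ; second aligned 128-bit store
        add     Z, 32           ; advance Z by 32 bytes
        sub     N, 1
        jnz     .a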
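The unrolled loop processes four words per iteration, so as written it only covers vectors whose length is a nonzero multiple of 4. A minimal C sketch of the same idea, extended with a scalar tail for the remaining 0 to 3 words, might look as follows. The function name `zero_words_unrolled` and the tail handling are assumptions added for illustration and are not part of the original code; 64-bit words and 16-byte alignment of Z are assumed as before.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (not from the article): zero n 64-bit words at z,
   for any n including 0. z is assumed to be 16-byte aligned. */
void zero_words_unrolled(uint64_t *z, size_t n)
{
    assert(((uintptr_t)z & 15) == 0);

    const __m128i zero = _mm_setzero_si128();       /* pxor xmm0, xmm0 */
    size_t blocks = n / 4;                          /* shr N, 2 */
    while (blocks-- != 0) {
        _mm_store_si128((__m128i *)z, zero);        /* movdqa [Z], xmm0 */
        _mm_store_si128((__m128i *)(z + 2), zero);  /* movdqa [Z+16], xmm0 */
        z += 4;                                     /* add Z, 32 */
    }

    /* Tail not present in the assembly: 0 to 3 leftover words. */
    for (size_t i = 0; i < (n & 3); i++)
        z[i] = 0;
}
```

Whether a compiled version of this loop reaches 0.50 cycle/word depends on the compiler and the target processor; the figure in the text applies to the hand-written assembly.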