Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
This function copies vector (X,N) to (Z,N). The vectors shall not overlap.
Since we can simultaneously load and store one 128-bit value (two 64-bit words) per cycle, the minimal timing of a memory copy should be 0.50 cycles/word. In practice, we seem to be limited to 1.50 cycles/word, i.e. 3 cycles per loop iteration, as for example in the following trivial code:
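    shr     N, 1            ; each iteration copies two 64-bit words
    align   16
.a:
    movapd  xmm0, [X]       ; aligned 128-bit load from X
    movapd  [Z], xmm0       ; aligned 128-bit store to Z
    lea     X, [X + 16]     ; advance both pointers by 16 bytes
    lea     Z, [Z + 16]
    dec     N
    jnz     .a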
Despite numerous attempts, I could not find a faster pattern.
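For readers who prefer to experiment from C, the same loop can be sketched with SSE2 intrinsics. This is not the code timed above, only an illustrative equivalent: the function name `copy_words` is hypothetical, and the sketch assumes both pointers are 16-byte aligned, the word count is even, and the vectors do not overlap. It uses `_mm_load_si128`/`_mm_store_si128`, which typically compile to `movdqa` rather than `movapd`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n 64-bit words from x to z.
   Assumptions: x and z are 16-byte aligned, n is even,
   and the two vectors do not overlap. */
void copy_words(uint64_t *z, const uint64_t *x, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        /* one aligned 128-bit load, then one aligned 128-bit store */
        __m128i v = _mm_load_si128((const __m128i *)(x + i));
        _mm_store_si128((__m128i *)(z + i), v);
    }
}
```

The compiler is free to unroll or reschedule this loop, so the generated code and its timing may differ from the hand-written assembly.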