Intel64 Multi-Precision Arithmetic
Eric Bainville - Dec 2006
Memory Copy
This function copies vector (X,N) to (Z,N). The vectors shall not overlap.
Since one 128-bit load and one 128-bit store can be issued in the same cycle, and each 128-bit transfer carries two 64-bit words, the theoretical minimum for a memory copy is 0.50 cycles/word. In practice, we seem to be limited to 1.50 cycles/word, as in the following trivial code:
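shr N, 1               ; N = number of 128-bit blocks (N is assumed even and >= 2)
align 16               ; align the loop entry on a 16-byte boundary
.a:
movapd xmm0, [X]       ; aligned 128-bit load (X must be 16-byte aligned)
movapd [Z], xmm0       ; aligned 128-bit store (Z must be 16-byte aligned)
lea X, [X + 16]        ; advance source pointer, flags untouched
lea Z, [Z + 16]        ; advance destination pointer, flags untouched
dec N                  ; one block done
jnz .a                 ; loop until N reaches zero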
Despite numerous attempts, I could not find a faster pattern.
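For readers who prefer C, the loop above can be approximated with SSE2 intrinsics. This is only a sketch under stated assumptions, not the code measured in this article: the function name `copy_words` is hypothetical, both vectors are assumed to be 16-byte aligned and non-overlapping, and, unlike the assembly loop, a trailing odd word is copied separately. The instructions a compiler emits for it may differ from the hand-written loop, so the 1.50 cycles/word figure should not be assumed to carry over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the article): copies n 64-bit words
   from x to z. The vectors must not overlap. */
static void copy_words(uint64_t *z, const uint64_t *x, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Assumption: both vectors are 16-byte aligned, as movapd requires. */
    assert(((uintptr_t)x & 15) == 0);
    assert(((uintptr_t)z & 15) == 0);
    /* Two 64-bit words per aligned 128-bit load/store pair. */
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_load_si128((const __m128i *)(x + i));
        _mm_store_si128((__m128i *)(z + i), v);
    }
#endif
    /* Trailing odd word, or the whole vector when SSE2 is unavailable. */
    for (; i < n; i++)
        z[i] = x[i];
}
```

Calling `copy_words(z, x, n)` with `n` counted in 64-bit words corresponds to the (X,N) to (Z,N) convention used here. The scalar tail loop is a design choice of this sketch, made so that odd word counts are handled. The assembly version simply halves N and would drop the last word.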




