Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Memory Copy

This function copies the vector (X,N) to (Z,N). The two vectors must not overlap.

Since we can issue one 128-bit load and one 128-bit store in the same cycle, and each 128-bit access moves two 64-bit words, the minimal timing of a memory copy should be 0.50 cycles/word. In practice, we seem to be limited to 1.50 cycles/word, as for example in the following trivial code:

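        ; N = number of 64-bit words (assumed even and at least 2)
        ; X and Z must be 16-byte aligned for movapd
        shr     N, 1                ; N/2 iterations, 2 words each
        align   16
.a:
        movapd  xmm0, [X]           ; load 16 bytes (2 words)
        movapd  [Z], xmm0           ; store them
        lea     X, [X + 16]         ; advance source pointer
        lea     Z, [Z + 16]         ; advance destination pointer
        dec     N
        jnz     .a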

Despite numerous attempts, I could not find a faster pattern.
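
For readers who want to experiment with the same load/store pattern from C, here is a minimal sketch using SSE2 intrinsics. It is not the code that was timed above: the function name copy_words and its argument order are hypothetical, and the compiler is free to pick a different aligned move (movdqa or movaps rather than movapd) and to unroll or reschedule the loop. It assumes, like the assembly loop, that N is even, that both pointers are 16-byte aligned, and that the vectors do not overlap.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n 64-bit words from x to z,
   two words (16 bytes) per iteration.
   Assumptions: n is even, x and z are 16-byte aligned,
   and the two vectors do not overlap. */
static void copy_words(uint64_t *z, const uint64_t *x, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        /* aligned 128-bit load of words i and i+1 */
        __m128i v = _mm_load_si128((const __m128i *)(x + i));
        /* aligned 128-bit store to the destination */
        _mm_store_si128((__m128i *)(z + i), v);
    }
}
```

Unlike the assembly loop, this version also behaves correctly for N = 0, since the loop condition is checked before the first iteration.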