AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Memory Copy

This function copies the N-word vector (X,N) to (Z,N), where each word is 64 bits. The two vectors must not overlap.

The AMD64 Optimization Manual (section 5.13) provides the following code:

4 words per iteration
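        shr     N, 2                ; N = iteration count; this excerpt assumes N >= 4 and N mod 4 = 0
        align   16
.a:
        mov     AUX0, [X     ]      ; load words 0 and 1
        mov     AUX1, [X +  8]
        lea     X, [X + 32]         ; advance source; lea leaves the flags untouched
        mov     [Z     ], AUX0      ; store words 0 and 1
        mov     [Z +  8], AUX1
        lea     Z, [Z + 32]         ; advance destination
        mov     AUX0, [X - 16]      ; load words 2 and 3 (X is already advanced)
        mov     AUX1, [X -  8]
        dec     N                   ; sets ZF for the jnz below; mov does not alter flags
        mov     [Z - 16], AUX0      ; store words 2 and 3
        mov     [Z -  8], AUX1
        jnz     .a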

This loop runs in 6 cycles/iteration, or 1.50 cycles/word. Each iteration performs 4 loads and 4 stores, so with the limit of two memory reads/writes per cycle, the optimal speed would be 4 cycles/iteration, or 1.00 cycle/word. Despite a lot of experiments, I could not find a faster sequence...
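
For readers who want to check the data movement, here is a minimal C sketch that models the loop above step by step. It is only a reference model, not a tuned implementation, and the name copy4_model is hypothetical. It assumes 64-bit words, N a multiple of 4, and non-overlapping vectors. Unlike the assembly listing, which enters the loop body unconditionally, the sketch also guards the N = 0 case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of the 4-words-per-iteration loop.
   Assumptions: n is a multiple of 4; x and z do not overlap. */
void copy4_model(uint64_t *z, const uint64_t *x, size_t n)
{
    assert(n % 4 == 0);              /* remainder words would be left uncopied */

    size_t iters = n >> 2;           /* shr N, 2 */
    while (iters != 0) {             /* guard added here; the listing has none */
        uint64_t aux0 = x[0];        /* mov AUX0, [X     ] */
        uint64_t aux1 = x[1];        /* mov AUX1, [X +  8] */
        x += 4;                      /* lea X, [X + 32]    */
        z[0] = aux0;                 /* mov [Z     ], AUX0 */
        z[1] = aux1;                 /* mov [Z +  8], AUX1 */
        z += 4;                      /* lea Z, [Z + 32]    */
        aux0 = x[-2];                /* mov AUX0, [X - 16] */
        aux1 = x[-1];                /* mov AUX1, [X -  8] */
        iters--;                     /* dec N              */
        z[-2] = aux0;                /* mov [Z - 16], AUX0 */
        z[-1] = aux1;                /* mov [Z -  8], AUX1 */
    }
}
```

Calling copy4_model(dst, src, 8) on two separate 8-word arrays leaves dst equal to src. The negative offsets after the pointer increment mirror the [X - 16] and [X - 8] addressing of the listing.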