AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
This function copies vector (X,N) to (Z,N). The vectors shall not overlap.
The AMD64 Optimization Manual (section 5.13) provides the following code:
        ; 4 words per iteration
        shr     N, 2
        align   16
.a:
        mov     AUX0, [X]
        mov     AUX1, [X + 8]
        lea     X, [X + 32]
        mov     [Z], AUX0
        mov     [Z + 8], AUX1
        lea     Z, [Z + 32]
        mov     AUX0, [X - 16]
        mov     AUX1, [X - 8]
        dec     N
        mov     [Z - 16], AUX0
        mov     [Z - 8], AUX1
        jnz     .a
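This loop runs in 6 cycles/iteration, or 1.50 cycles/word. Each iteration copies 4 words using 4 loads and 4 stores, i.e. 8 memory operations. With the limit of two memory reads/writes per cycle, that is at best 4 cycles/iteration, so the optimal speed would be 1.00 cycle/word. Despite a lot of experiments, I could not find a faster sequence...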
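For readers who want to check what the loop computes, here is a minimal C sketch of the same copy. It is meant as a reference for the semantics, not for the timings above. The function name copy_words, the zero-length guard and the tail loop are assumptions of this sketch and do not come from the original listing. As written, the assembly appears to expect N to be a nonzero multiple of 4, since shr N, 2 discards the low two bits and the dec/jnz pair tests the counter only after the first pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n 64-bit words from x to z. The vectors must not overlap.
   The main loop follows the instruction order of the assembly listing.
   The loop guard and the tail loop are additions of this sketch so that
   any n (including 0 and non-multiples of 4) is handled. */
void copy_words(uint64_t *restrict z, const uint64_t *restrict x, size_t n)
{
    size_t blocks = n >> 2;            /* shr N, 2 : 4 words per iteration */

    while (blocks != 0) {
        uint64_t aux0 = x[0];          /* mov AUX0, [X]       */
        uint64_t aux1 = x[1];          /* mov AUX1, [X + 8]   */
        x += 4;                        /* lea X, [X + 32]     */
        z[0] = aux0;                   /* mov [Z], AUX0       */
        z[1] = aux1;                   /* mov [Z + 8], AUX1   */
        z += 4;                        /* lea Z, [Z + 32]     */
        aux0 = x[-2];                  /* mov AUX0, [X - 16]  */
        aux1 = x[-1];                  /* mov AUX1, [X - 8]   */
        --blocks;                      /* dec N               */
        z[-2] = aux0;                  /* mov [Z - 16], AUX0  */
        z[-1] = aux1;                  /* mov [Z - 8], AUX1   */
    }

    /* Tail: the 0 to 3 words left over when n is not a multiple of 4. */
    for (size_t i = 0; i < (n & 3); ++i)
        z[i] = x[i];
}
```

A compiler is free to reorder or vectorize this loop, so the generated code will not necessarily match the hand-scheduled sequence; the sketch only pins down which words end up where.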