AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
This function applies a unary operator OP (neg or not) to every word of a vector (Z,N).
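As a point of reference, the effect of the function can be sketched in portable C. This is only an illustration: the names vec_neg and vec_not are hypothetical, and the sketch assumes that Z is the address of the first 64-bit word and N the number of words, as the assembly listings suggest. Note that neg is applied to each word independently, with no borrow carried from one word to the next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference versions; names are illustrative only. */

/* Replace each 64-bit word of z[0..n-1] by its two's-complement negation.
   Each word is negated on its own: nothing propagates between words. */
void vec_neg(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = 0 - z[i];   /* same result as neg on a 64-bit register */
}

/* Replace each 64-bit word of z[0..n-1] by its bitwise complement. */
void vec_not(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];      /* same result as not on a 64-bit register */
}
```

A compiler will not necessarily produce the load, operate, store sequence discussed here; the sketch only pins down what result the hand-written loops are expected to leave in memory.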
These operators are applied to a register here. For each word we therefore need a memory read, the operation itself, and a memory write, for a total of 3 slots in the execution units. Since 3 slots are available per cycle, the asymptotic optimal timing is 1.00 cycle/word. But each iteration must also update the loop counter and the address register, which takes an additional cycle. If P is the number of words processed per loop iteration, the optimal timing is (P+1)/P cycles/word: 2.00 for P=1, 1.50 for P=2, 1.25 for P=4, and 1.125 for P=8.
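The estimate above can be written as a one-line function. The sketch below is only a restatement of the (P+1)/P formula under the same simplifying assumptions (P cycles of load/operate/store work plus 1 cycle of loop overhead per iteration); the function name is hypothetical.

```c
#include <assert.h>

/* Predicted cycles per word when p words are processed per loop iteration:
   p cycles for the read/operate/write triples, plus 1 cycle to update
   the loop counter and the address register. */
double predicted_cycles_per_word(unsigned p)
{
    assert(p > 0);
    return (double)(p + 1) / (double)p;
}
```

For p = 1, 2, 4 and 8 this returns 2.0, 1.5, 1.25 and 1.125, matching the figures quoted above. The gain from each further doubling of P shrinks quickly, so unrolling much beyond 8 words buys very little.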
Note that this estimate ignores instruction latencies and other CPU limitations. Usually, thanks to out-of-order execution and register renaming, the CPU is able to schedule the instructions from one loop iteration to the next automatically.
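2 words per iteration

    shr     N, 1                ; N = number of iterations (N assumed even and >= 2)
    align   16
.a:
    mov     AUX, [Z]            ; read word 0
    OP      AUX                 ; apply neg or not
    mov     [Z], AUX            ; write word 0
    mov     AUX, [Z + 8]        ; same for word 1
    OP      AUX
    mov     [Z + 8], AUX
    lea     Z, [Z + 16]         ; advance by 2 words without touching the flags
    dec     N
    jnz     .a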
As expected, each iteration requires 3 cycles, yielding 1.50 cycles/word. Unrolling to our fixed limit of 8 words per iteration, we get:
8 words per iteration

    shr     N, 3
    align   16
.a:
    mov     AUX, [Z]
    OP      AUX
    mov     [Z], AUX
    mov     AUX, [Z + 8]
    OP      AUX
    mov     [Z + 8], AUX
    mov     AUX, [Z + 16]
    OP      AUX
    mov     [Z + 16], AUX
    mov     AUX, [Z + 24]
    OP      AUX
    mov     [Z + 24], AUX
    mov     AUX, [Z + 32]
    OP      AUX
    mov     [Z + 32], AUX
    mov     AUX, [Z + 40]
    OP      AUX
    mov     [Z + 40], AUX
    mov     AUX, [Z + 48]
    OP      AUX
    mov     [Z + 48], AUX
    mov     AUX, [Z + 56]
    OP      AUX
    mov     [Z + 56], AUX
    lea     Z, [Z + 64]
    dec     N
    jnz     .a
This code runs at the predicted optimal speed of 1.125 cycles/word.
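The unrolled loop, as written, only covers word counts that are a nonzero multiple of 8. One possible way to handle an arbitrary count is to run the unrolled body over the full 8-word blocks and finish the remaining 0 to 7 words one at a time. The C sketch below illustrates that structure for the not operator; it is an assumption about how a caller might wrap the loop, not part of the original code, and the name vec_not_unrolled8 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bitwise complement of z[0..n-1],
   processing 8 words per iteration and then the leftover words. */
void vec_not_unrolled8(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    size_t blocks = n >> 3;            /* like "shr N, 3": full 8-word blocks */
    while (blocks != 0) {              /* also safe when n < 8 */
        z[0] = ~z[0];
        z[1] = ~z[1];
        z[2] = ~z[2];
        z[3] = ~z[3];
        z[4] = ~z[4];
        z[5] = ~z[5];
        z[6] = ~z[6];
        z[7] = ~z[7];
        z += 8;                        /* like "lea Z, [Z + 64]" */
        blocks--;                      /* like "dec N" */
    }
    for (size_t i = 0; i < (n & 7); i++)   /* remaining 0..7 words */
        z[i] = ~z[i];
}
```

The tail loop runs at the slower P=1 rate, but it touches at most 7 words, so for long vectors the overall cost stays close to that of the unrolled body.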