AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006

Binary OP NOT
The next function will combine two vectors (X,N) and (Z,N) using binary operator op (and, or, xor), with the X operand bits inverted. The result is put back in (Z,N): Zi ⇐ Zi op not Xi.
The memory requirements are the same as for the previous function. One additional execution slot is needed to invert each word, giving 4P+3 slots for one iteration processing P words. We are still limited by the memory accesses, but need to process 8 words per iteration to reach that limit. Since the processor had a hard time scheduling the code, I replaced the "mov memory then not register" sequence with "mov register then xor memory", which has a smaller latency and therefore makes scheduling easier.
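As a point of comparison, the operation and the xor-based inversion described above can be sketched in portable C. This is only a reference sketch, not the author's code: the names `vec_op_not`, `binop_t` and `OP_AND`/`OP_OR`/`OP_XOR` are hypothetical, and the operator is selected at run time here, whereas the assembly routine fixes it through the OP placeholder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operator selector; the assembly version fixes OP instead. */
typedef enum { OP_AND, OP_OR, OP_XOR } binop_t;

/* Reference version of Z[i] <= Z[i] op (not X[i]) for i in [0, n). */
static void vec_op_not(uint64_t *z, const uint64_t *x, size_t n, binop_t op)
{
    const uint64_t full = ~(uint64_t)0;   /* all 1's, playing the role of FULL */
    for (size_t i = 0; i < n; i++) {
        /* "mov register then xor memory": xor with all 1's inverts X[i]. */
        uint64_t aux = full ^ x[i];
        switch (op) {
        case OP_AND: z[i] &= aux; break;
        case OP_OR:  z[i] |= aux; break;
        case OP_XOR: z[i] ^= aux; break;
        }
    }
}
```

A compiler is free to emit a plain `not` for the inversion, so this sketch documents the semantics rather than the instruction scheduling.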
8 words per iteration:

    shr     N, 3
    xor     FULL, FULL
    dec     FULL            ; FULL is all 1's
    mov     AUX0, FULL
    mov     AUX1, FULL
    align   16
.a:
    xor     AUX0, [X]
    xor     AUX1, [X + 8]
    lea     X, [X + 64]
    OP      AUX0, [Z]
    OP      AUX1, [Z + 8]
    mov     AUX2, FULL
    mov     [Z], AUX0
    mov     [Z + 8], AUX1
    mov     AUX3, FULL
    xor     AUX2, [X - 48]
    xor     AUX3, [X - 40]
    mov     AUX0, FULL
    OP      AUX2, [Z + 16]
    OP      AUX3, [Z + 24]
    mov     AUX1, FULL
    mov     [Z + 16], AUX2
    mov     [Z + 24], AUX3
    lea     Z, [Z + 64]
    xor     AUX0, [X - 32]
    xor     AUX1, [X - 24]
    mov     AUX2, FULL
    OP      AUX0, [Z - 32]
    OP      AUX1, [Z - 24]
    mov     AUX3, FULL
    mov     [Z - 32], AUX0
    mov     [Z - 24], AUX1
    mov     AUX0, FULL
    xor     AUX2, [X - 16]
    xor     AUX3, [X - 8]
    mov     AUX1, FULL
    OP      AUX2, [Z - 16]
    OP      AUX3, [Z - 8]
    dec     N
    mov     [Z - 16], AUX2
    mov     [Z - 8], AUX3
    jnz     .a
This code runs at 12 cycles/iteration, or 1.50 cycles/word. There is only one unused execution slot in each iteration!
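The slot accounting behind these figures can be checked with a few lines of C. This is a rough sketch under one assumption that this section does not state explicitly: the processor issues 3 execution slots per cycle, which is inferred from 12 cycles holding the 35 used slots plus one unused slot. The function names are hypothetical, and the model only counts execution slots; it ignores the memory access limit.

```c
/* Execution slots used by one iteration processing p words: 4P+3 (from the text). */
static unsigned slots_per_iteration(unsigned p)
{
    return 4u * p + 3u;
}

/* Smallest whole number of cycles that can hold those slots.
   slots_per_cycle is an assumption (3 is inferred from the measured figures). */
static unsigned min_cycles(unsigned p, unsigned slots_per_cycle)
{
    unsigned s = slots_per_iteration(p);
    return (s + slots_per_cycle - 1u) / slots_per_cycle;   /* ceiling division */
}

/* Slots left empty in an iteration that takes min_cycles() cycles. */
static unsigned unused_slots(unsigned p, unsigned slots_per_cycle)
{
    return min_cycles(p, slots_per_cycle) * slots_per_cycle
         - slots_per_iteration(p);
}
```

With P = 8 and 3 slots per cycle this gives 35 slots, 12 cycles and 1 unused slot, matching the measurement. With P = 4 it gives 19 slots and 7 cycles, that is 1.75 cycles/word, which is consistent with the need for a longer unrolled loop.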