AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006Left and right shifts
These functions left/right shift the 64N-bit long vector (Z,N) by K=0..63 bits.
Shifting by only one bit can be done in 1.50 cycle/word using the Binary OP pattern. Longer shifts require more operations. Here is a first version using only general purpose registers:
xor CARRY, CARRY ; CARRY is the input "carry" (k lsb bits), can be passed as argument mov rcx, K mov MASK, 1 shl MASK, cl dec MASK ; Has k ones as low significant bits shr N, 1 align 16 .a: mov AUX1, [Z] mov AUX2, [Z + 8] mov AUX3, MASK mov AUX4, MASK rol AUX1, cl rol AUX2, cl and AUX3, AUX1 and AUX4, AUX2 xor AUX1, CARRY xor AUX2, AUX4 xor AUX1, AUX3 xor AUX2, AUX3 mov CARRY, AUX4 mov [Z], AUX1 mov [Z + 8], AUX2 lea Z, [Z + 16] dec N jnz .a ; CARRY is the output carry, can be returned
This code runs in 6 cycles/iteration, or 3.00 cycle/word. Here again the processor does a good scheduling job, because we have 17 slots among the 18 slots available in 6 cycles. This code first rotates the words to put bits in place, then moves the K least significant bits using binary operations. Another approach using two shifts for each word requires the same running time.
gmp 4.2.1 code runs at 2.50 cycles/word, making use of the 64-bit multimedia extensions (MMX). The gmp code is similar to the following:
movd mm6, K ; mm6 is k mov AUX, K xor AUX, 63 inc AUX movd mm7, AUX ; mm7 is 64-k pxor mm2, mm2 ; Input carry in mm2 shr N, 1 align 16 .a: movq mm0, [Z] movq mm1, [Z] psllq mm0, mm6 psrlq mm1, mm7 por mm0, mm2 movq [Z], mm0 movq mm0, [Z + 8] movq mm2, [Z + 8] psllq mm0, mm6 psrlq mm2, mm7 por mm0, mm1 movq [Z + 8], mm0 lea Z, [Z + 16] dec N jnz .a ; Output carry in mm2 emms
In this loop we have 3 kinds of instructions: 2 memory read movq requiring one slot on any of the three (FADD, FMUL, FMISC) floating point execution units, 2 memory write movq requiring one slot on the FMISC unit, and the other 2 movq, 2 psllq, 2 psrlq, 2 por instructions requiring one slot on the FADD or FMUL units.
It appears than each word processing block requires 2 cycles, and there is an extra cycle for the lea and dec. Actually, unrolling the loop to 4 words per iteration gives a 2.25 cycles/word code:
movd mm6, K ; mm6 is k mov AUX, K xor AUX, 63 inc AUX movd mm7, AUX ; mm7 is 64-k pxor mm2, mm2 ; Input carry in mm2 shr N, 2 align 16 .a: movq mm0, [Z] movq mm1, [Z] psllq mm0, mm6 psrlq mm1, mm7 por mm0, mm2 movq [Z], mm0 movq mm0, [Z + 8] movq mm2, [Z + 8] psllq mm0, mm6 psrlq mm2, mm7 por mm0, mm1 movq [Z + 8], mm0 movq mm0, [Z + 16] movq mm1, [Z + 16] psllq mm0, mm6 psrlq mm1, mm7 por mm0, mm2 movq [Z + 16], mm0 movq mm0, [Z + 24] movq mm2, [Z + 24] psllq mm0, mm6 psrlq mm2, mm7 por mm0, mm1 movq [Z + 24], mm0 lea Z, [Z + 32] dec N jnz .a ; Output carry in mm2 emms
However, further unrolling to 8 words per iteration goes back to 2.50 cycles/word.
Right shift is very similar, with the difference that access to Z must be reversed.
AMD64 Multiprecision : Multiply and accumulate | Top of Page | AMD64 Multiprecision : Division by a scalar |