AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Left and right shifts

These functions left/right shift the 64N-bit long vector (Z,N) by K=0..63 bits.

Shifting by only one bit can be done in 1.50 cycle/word using the Binary OP pattern. Longer shifts require more operations. Here is a first version using only general purpose registers:

	xor	CARRY, CARRY
	; CARRY is the input "carry" (k lsb bits), can be passed as argument
	mov	rcx, K
	mov	MASK, 1
	shl	MASK, cl
	dec	MASK	; Has k ones as low significant bits
	shr	N, 1
        align   16
.a:
	mov	AUX1, [Z]
	mov	AUX2, [Z + 8]
	mov	AUX3, MASK
	mov	AUX4, MASK
	rol	AUX1, cl
	rol	AUX2, cl
	and	AUX3, AUX1
	and	AUX4, AUX2
	xor	AUX1, CARRY
	xor	AUX2, AUX4
	xor	AUX1, AUX3
	xor	AUX2, AUX3
	mov	CARRY, AUX4
	mov	[Z], AUX1
	mov	[Z + 8], AUX2
	lea	Z, [Z + 16]
	dec	N
	jnz	.a
	; CARRY is the output carry, can be returned

This code runs in 6 cycles/iteration, or 3.00 cycle/word. Here again the processor does a good scheduling job, because we have 17 slots among the 18 slots available in 6 cycles. This code first rotates the words to put bits in place, then moves the K least significant bits using binary operations. Another approach using two shifts for each word requires the same running time.

gmp 4.2.1 code runs at 2.50 cycles/word, making use of the 64-bit multimedia extensions (MMX). The gmp code is similar to the following:

	movd    mm6, K   ; mm6 is k
	mov	AUX, K
	xor     AUX, 63
	inc     AUX
	movd    mm7, AUX ; mm7 is 64-k
	pxor    mm2, mm2 ; Input carry in mm2

	shr	N, 1
        align   16
.a:
        movq    mm0, [Z]
        movq    mm1, [Z]
        psllq   mm0, mm6
        psrlq   mm1, mm7
        por     mm0, mm2
        movq    [Z], mm0
        
        movq    mm0, [Z + 8]
        movq    mm2, [Z + 8]
        psllq   mm0, mm6
        psrlq   mm2, mm7
        por     mm0, mm1
        movq    [Z + 8], mm0

	lea	Z, [Z + 16]
	dec	N
	jnz	.a
	; Output carry in mm2
	emms

In this loop we have 3 kinds of instructions: 2 memory read movq requiring one slot on any of the three (FADD, FMUL, FMISC) floating point execution units, 2 memory write movq requiring one slot on the FMISC unit, and the other 2 movq, 2 psllq, 2 psrlq, 2 por instructions requiring one slot on the FADD or FMUL units.

It appears than each word processing block requires 2 cycles, and there is an extra cycle for the lea and dec. Actually, unrolling the loop to 4 words per iteration gives a 2.25 cycles/word code:

	movd    mm6, K   ; mm6 is k
	mov	AUX, K
	xor     AUX, 63
	inc     AUX
	movd    mm7, AUX ; mm7 is 64-k
	pxor    mm2, mm2 ; Input carry in mm2

	shr	N, 2
        align   16
.a:
        movq    mm0, [Z]
        movq    mm1, [Z]
        psllq   mm0, mm6
        psrlq   mm1, mm7
        por     mm0, mm2
        movq    [Z], mm0
        
        movq    mm0, [Z + 8]
        movq    mm2, [Z + 8]
        psllq   mm0, mm6
        psrlq   mm2, mm7
        por     mm0, mm1
        movq    [Z + 8], mm0
        
        movq    mm0, [Z + 16]
        movq    mm1, [Z + 16]
        psllq   mm0, mm6
        psrlq   mm1, mm7
        por     mm0, mm2
        movq    [Z + 16], mm0
        
        movq    mm0, [Z + 24]
        movq    mm2, [Z + 24]
        psllq   mm0, mm6
        psrlq   mm2, mm7
        por     mm0, mm1
        movq    [Z + 24], mm0

	lea	Z, [Z + 32]
	dec	N
	jnz	.a
	; Output carry in mm2
	emms

However, further unrolling to 8 words per iteration goes back to 2.50 cycles/word.

Right shift is very similar, with the difference that access to Z must be reversed.