AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Scaling

This function multiplies the number represented by vector (Z,N) by a constant K.

Processing each word requires at least 6 execution slots: loading RAX (1 slot), mul (2 slots), unloading/accumulating RAX and RDX (2 slots), storing the result (1 slot). At least one additional slot is required to handle correctly the carry propagation. Consequently, we can expect timings near 3.00 cycles/word. Actually, 3.00 itself is quite hard to reach. It can be done with the following pattern, processing two words per iteration:

        shr     N, 1
        xor     AUX0, AUX0
        ; AUX0: input carry in 0..K-1
        align   16
.a:
        mov     RAX, [Z]
        mul     K
        xor     AUX1, AUX1
	lea	Z, [Z + 16]
        add     AUX0, RAX
        adc     AUX1, RDX
        mov     [Z - 16], AUX0
        
        mov     RAX, [Z - 8]
        mul     K
        xor     AUX0, AUX0
        add     AUX1, RAX
        adc     AUX0, RDX
        mov     [Z - 8], AUX1
        
	dec	N
	jnz	.a
	; AUX0: output carry in 0..K-1

It runs in 6 cycles/iteration, or 3.00 cycles/word. All 14 instructions occupy 16 slots in the integer execution units, and are scheduled in 18 slots (6 cycles).

The input carry can be optionally passed as argument, and in this case must be in 0..K-1. The output carry can be returned too, and will be in the same range.

I could not find a pattern under 3.00 cycles/word, by unrolling the loop further, or "pipelining" the pattern.