AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Division by a scalar

This function divides the integer represented by (Z,N) by a single word k. Let's start with a naive implementation, using the 64-bit div instruction:

        ; Loop from msb to lsb
        lea     Z, [Z + 8*N]
        ; rem is the input remainder, in 0..k-1
        xor     REM, REM
        align   16
.a
        lea     Z, [Z - 8]
        mov     RDX, REM
        mov     RAX, [Z]
        div     K
        mov     [Z], RAX
        mov     REM, RDX
        dec     N
        jnz     .a
        ; rem is the output remainder, in 0..k-1

The div instruction is very slow, and we get near 74 cycles/word! Efficient alternatives require using mul instead. The trick to do this is to compute once a pseudo inverse k'=2⁶⁴/k, and multiply by k' instead of divide by k. At each step, we multiply a_i+2⁶⁴r (the current word combined with the incoming remainder r in 0..k-1) by k' to get a close (and always lower) guess of the effective quotient. Then we multiply it by k to get the next remainder, and make corrections until the remainder is less than k. Actually, we shift all values to maximize the number of exact bits obtained by a single product by the pseudo inverse. The exact quotient and remainder are then determined by additions iteratively.

        ; Loop from msb to lsb
        lea     Z, [Z + 8*N]
        ; rem is the input remainder, in 0..k-1
        xor     REM, REM
        ; Build the pseudo-inverse of k: 2^(127-)/k
        ; First determine the shift in rcx to get the max number of bits in kinv
        xor     RCX, RCX
        mov     RAX, K
        mov     RDX, 1
.kinv1:
        inc     RCX
        ror     RDX, 1
        shl     RAX, 1
        jnc     .kinv1
        dec     RCX
        ; Here, rcx is a left shift moving the msb of k to bit 63
        
        mov     MASK, 1
        shl     MASK, cl
        dec     MASK
        ror     MASK, cl ; rcx bits at msb
        
        ; Then divide 2^(64+cx) by k (rdx already ok)
        xor     RAX, RAX
        div     K
        mov     KINV, RAX
        ; Here kinv has 64 bits
        align   16
.a:
        ; Get 64 bits of quotient approx, multiplying
        ; most significant word of (rem*2^64+input)
        mov     RAX, MASK
        and     RAX, [Z - 8]
        or      RAX, REM
        rol     RAX, cl
        mul     KINV
        rcl     RAX, 1
        rcl     RDX, 1
        mov     QUOT, RDX
        
        ; Multiply by k and subtract to get remainder
        ; Subtraction must be done on two words
        mov     HREM, REM
        mov     REM, [Z - 8]
        mov     RAX, QUOT
        mul     K
        sub     REM, RAX
        sbb     HREM, RDX
        jz      .b  ; high word is 0, goto adjust on single word
        ; Adjust quotient and remainder on two words
.d:     inc     QUOT
        sub     REM, K
        sbb     HREM, 0
        jnz     .d
        
        ; Adjust quotient and remainder on single word
.b:     cmp     REM, K
        jb      .c ; rem in 0..k-1, OK
        sub     REM, K
        inc     QUOT
        jmp     .b
        
        ; Store result
.c:     mov     [Z - 8], QUOT
        lea     Z, [Z - 8]
        
	dec	N
	jnz	.a

This code runs at 17 cycles/word.