AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006Division by a scalar
This function divides the integer represented by (Z,N) by a single word k. Let's start with a naive implementation, using the 64-bit div instruction:
; Loop from msb to lsb lea Z, [Z + 8*N] ; rem is the input remainder, in 0..k-1 xor REM, REM align 16 .a lea Z, [Z - 8] mov RDX, REM mov RAX, [Z] div K mov [Z], RAX mov REM, RDX dec N jnz .a ; rem is the output remainder, in 0..k-1
The div instruction is very slow, and we get near 74 cycles/word! Efficient alternatives require using mul instead. The trick to do this is to compute once a pseudo inverse k'=264/k, and multiply by k' instead of divide by k. At each step, we multiply ai+264r (the current word combined with the incoming remainder r in 0..k-1) by k' to get a close (and always lower) guess of the effective quotient. Then we multiply it by k to get the next remainder, and make corrections until the remainder is less than k. Actually, we shift all values to maximize the number of exact bits obtained by a single product by the pseudo inverse. The exact quotient and remainder are then determined by additions iteratively.
; Loop from msb to lsb lea Z, [Z + 8*N] ; rem is the input remainder, in 0..k-1 xor REM, REM ; Build the pseudo-inverse of k: 2^(127-)/k ; First determine the shift in rcx to get the max number of bits in kinv xor RCX, RCX mov RAX, K mov RDX, 1 .kinv1: inc RCX ror RDX, 1 shl RAX, 1 jnc .kinv1 dec RCX ; Here, rcx is a left shift moving the msb of k to bit 63 mov MASK, 1 shl MASK, cl dec MASK ror MASK, cl ; rcx bits at msb ; Then divide 2^(64+cx) by k (rdx already ok) xor RAX, RAX div K mov KINV, RAX ; Here kinv has 64 bits align 16 .a: ; Get 64 bits of quotient approx, multiplying ; most significant word of (rem*2^64+input) mov RAX, MASK and RAX, [Z - 8] or RAX, REM rol RAX, cl mul KINV rcl RAX, 1 rcl RDX, 1 mov QUOT, RDX ; Multiply by k and subtract to get remainder ; Subtraction must be done on two words mov HREM, REM mov REM, [Z - 8] mov RAX, QUOT mul K sub REM, RAX sbb HREM, RDX jz .b ; high word is 0, goto adjust on single word ; Adjust quotient and remainder on two words .d: inc QUOT sub REM, K sbb HREM, 0 jnz .d ; Adjust quotient and remainder on single word .b: cmp REM, K jb .c ; rem in 0..k-1, OK sub REM, K inc QUOT jmp .b ; Store result .c: mov [Z - 8], QUOT lea Z, [Z - 8] dec N jnz .a
This code runs at 17 cycles/word.
AMD64 Multiprecision : Left and right shifts | Top of Page | Intel64 Multiprecision |