Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Binary OP

This function combines two vectors (X,N) and (Z,N) using binary operator op (add, sub, and, or, xor). The result is put back in (Z,N).

Obviously, the timing can't be better than Memory Copy. The following code can get us quite close to it:

	shr     N, 2
        align   16
.a:
        movdqa  xmm0, [Z]
        movdqa  xmm1, [X]
        lea     X, [X + 32]
        OP      xmm0, xmm1
        movdqa  [Z], xmm0

        movdqa  xmm0, [Z + 16]
        movdqa  xmm1, [X - 16]
        lea     Z, [Z + 32]
        OP      xmm0, xmm1
        movdqa  [Z - 16], xmm0

        dec     N
        jnz     .a

One iteration requires 6.50 cycles, so we are running at 1.64 cycles/word. OP is an SSE2 128-bit integer instruction: pand, por, pxor, paddq, psubq, pandn.

The case of addition/subtraction with carry is different for several reasons:

There is no SSE2 intruction handling the carry propagation.
The carry propagation implementation (adc, sbb) on Intel processors is really slow.
Even worse: we need to explicitely save/restore the carry between loop iterations, even when using lea and dec to update counters (these instructions don't touch the C flag).

Jason W. Martin proposes a pattern similar to the following, processing 4 words per iteration:

        shr     N, 2
	; Initial carry bit in RAX
	xor	rax, rax
        align   16
.a:
        bt      rax, 0

        mov     AUX0, [Z]
        mov     AUX1, [X]
        OP      AUX0, AUX1
        mov     [Z], AUX0

        mov     AUX0, [Z + 8]
        mov     AUX1, [X + 8]
        OP      AUX0, AUX1
        mov     [Z + 8], AUX0

        mov     AUX0, [Z + 16]
        mov     AUX1, [X + 16]
        OP      AUX0, AUX1
        mov     [Z + 16], AUX0

        mov     AUX0, [Z + 24]
        mov     AUX1, [X + 24]
        OP      AUX0, AUX1
        mov     [Z + 24], AUX0

        setc    al

        lea     X, [X + 32]
        lea     Z, [Z + 32]
        dec     N
        jnz	.a
	; Final carry bit in RAX

It runs at 2.63 cycles/word. The main speed gain comes from the bt and setc instructions, used to restore and save the carry flag from one iteration to the next. By further unrolling the loop to 8 words per iteration, we can reach 2.53 cycles/word.