# Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

## Binary OP

This function combines two vectors (X,N) and (Z,N) using binary operator op (add, sub, and, or, xor). The result is put back in (Z,N).

Obviously, the timing can't be better than Memory Copy. The following code can get us quite close to it:

```	shr     N, 2
align   16
.a:
movdqa  xmm0, [Z]
movdqa  xmm1, [X]
lea     X, [X + 32]
OP      xmm0, xmm1
movdqa  [Z], xmm0

movdqa  xmm0, [Z + 16]
movdqa  xmm1, [X - 16]
lea     Z, [Z + 32]
OP      xmm0, xmm1
movdqa  [Z - 16], xmm0

dec     N
jnz     .a
```

One iteration requires 6.50 cycles, so we are running at 1.64 cycles/word. OP is an SSE2 128-bit integer instruction: pand, por, pxor, paddq, psubq, pandn.

The case of addition/subtraction with carry is different for several reasons:

• There is no SSE2 intruction handling the carry propagation.
• The carry propagation implementation (adc, sbb) on Intel processors is really slow.
• Even worse: we need to explicitely save/restore the carry between loop iterations, even when using lea and dec to update counters (these instructions don't touch the C flag).

Jason W. Martin proposes a pattern similar to the following, processing 4 words per iteration:

```        shr     N, 2
; Initial carry bit in RAX
xor	rax, rax
align   16
.a:
bt      rax, 0

mov     AUX0, [Z]
mov     AUX1, [X]
OP      AUX0, AUX1
mov     [Z], AUX0

mov     AUX0, [Z + 8]
mov     AUX1, [X + 8]
OP      AUX0, AUX1
mov     [Z + 8], AUX0

mov     AUX0, [Z + 16]
mov     AUX1, [X + 16]
OP      AUX0, AUX1
mov     [Z + 16], AUX0

mov     AUX0, [Z + 24]
mov     AUX1, [X + 24]
OP      AUX0, AUX1
mov     [Z + 24], AUX0

setc    al

lea     X, [X + 32]
lea     Z, [Z + 32]
dec     N
jnz	.a
; Final carry bit in RAX
```

It runs at 2.63 cycles/word. The main speed gain comes from the bt and setc instructions, used to restore and save the carry flag from one iteration to the next. By further unrolling the loop to 8 words per iteration, we can reach 2.53 cycles/word.