Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Unary OP

This function applies the unary operator OP (neg or not) in place to each word of a vector (Z,n), i.e. n 64-bit words starting at address Z.
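As a point of reference, the operation can be written in portable C. This is only a sketch: the names `vec_not` and `vec_neg` are hypothetical (they do not come from the original code), and a word is assumed to be a 64-bit unsigned integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: apply NOT to each of the n words of z, in place. */
void vec_not(uint64_t *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
}

/* Hypothetical reference: apply NEG (two's complement negation) to each
   word independently, with no carry propagated between words. */
void vec_neg(uint64_t *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = 0 - z[i];
}
```

Both loops perform one load, one operation, and one store per word, which is the pattern the assembly versions unroll.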

Conroe

Using 64-bit general purpose registers, we can reach the maximum 64-bit memory bandwidth after unrolling the trivial load+op+store loop to 4 words/iteration:

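```
        shr     N, 2                ; N = n/4 iterations (assumes n is a nonzero multiple of 4)
        align   16
.a:
        mov     AUX, [Z]            ; word 0: load, OP, store
        OP      AUX
        mov     [Z], AUX
        mov     AUX, [Z + 8]        ; word 1
        OP      AUX
        mov     [Z + 8], AUX

        lea     Z, [Z + 32]         ; advance Z by 4 words

        mov     AUX, [Z - 16]       ; word 2, addressed from the new Z
        OP      AUX
        mov     [Z - 16], AUX
        mov     AUX, [Z - 8]        ; word 3
        OP      AUX
        mov     [Z - 8], AUX

        dec     N
        jnz     .a
```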

This loop runs in 4 cycles/iteration, or 1.00 cycle/word. Using 128-bit XMM registers reduces this further: independently of other limiting factors, the memory bandwidth then corresponds to 0.50 cycles/word.

The following code is an optimal version of the SSE2 variant for the case OP=not, running at 0.50 cycles/word:

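```
        ; Load 0xFFFFFFFF.FFFFFFFF.FFFFFFFF.FFFFFFFF to xmm4
        xor     AUX, AUX
        not     AUX                 ; AUX = all ones
        movd    xmm4, AUX           ; low quadword of xmm4 = AUX
        punpcklqdq  xmm4, xmm4      ; duplicate it into the high quadword
        shr     N, 3                ; N = n/8 iterations (assumes n is a nonzero multiple of 8)
        movdqa  xmm0, xmm4          ; xmm0 = mask for the first block
        align   16
.a:
        ; Z must be 16-byte aligned (pxor memory operand and movdqa)
        pxor    xmm0, [Z     ]      ; xmm0 = not(words 0-1)
        movdqa  xmm1, xmm4          ; reload the mask for the next block
        movdqa  [Z     ], xmm0

        pxor    xmm1, [Z + 16]      ; words 2-3
        movdqa  xmm0, xmm4
        movdqa  [Z + 16], xmm1

        pxor    xmm0, [Z + 32]      ; words 4-5
        movdqa  xmm1, xmm4
        movdqa  [Z + 32], xmm0

        lea     Z, [Z + 64]         ; advance Z by 8 words

        pxor    xmm1, [Z - 16]      ; words 6-7, addressed from the new Z
        movdqa  xmm0, xmm4          ; mask ready for the next iteration
        movdqa  [Z - 16], xmm1

        dec     N
        jnz     .a
```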
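The same idea (XOR each 128-bit block with an all-ones mask) can be sketched with SSE2 intrinsics in C. This is an illustration, not the code measured above: the name `vec_not_sse2` is hypothetical, the buffer is assumed to be 16-byte aligned, and n is assumed to be a multiple of 2. No timing claim is made for the compiled result.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */

/* Hypothetical helper: complement n 64-bit words in place.
   Assumes z is 16-byte aligned and n is a multiple of 2. */
void vec_not_sse2(uint64_t *z, size_t n)
{
    /* All-ones mask, playing the role of xmm4 in the assembly version. */
    const __m128i ones = _mm_set1_epi32(-1);

    for (size_t i = 0; i < n; i += 2) {
        /* Aligned 128-bit load of two words. */
        __m128i v = _mm_load_si128((const __m128i *)(z + i));
        v = _mm_xor_si128(v, ones);                /* not(x) = x xor all-ones */
        _mm_store_si128((__m128i *)(z + i), v);    /* aligned 128-bit store */
    }
}
```

The assembly loop handles four such 16-byte blocks per iteration; whether a compiler unrolls this C loop in a comparable way depends on the compiler and its options.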