# AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

## Unary OP

This function applies a unary operator OP (neg or not) in place to every word of a vector (Z,N).
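As a reference for what the routine computes, here is a minimal C sketch. The names `vec_not` and `vec_neg` are hypothetical (they do not appear in the original code), and 64-bit words are assumed. Each word is processed independently, with no carry propagated between words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper: bitwise NOT of every word of (z, n), in place. */
static void vec_not(uint64_t *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
}

/* Hypothetical reference helper: two's complement negation of every word,
   taken independently (the result wraps modulo 2^64, no carry between words). */
static void vec_neg(uint64_t *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = (uint64_t)0 - z[i];
}
```

A reference like this is mainly useful for checking the output of the assembly loops on random inputs.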

Here the operator is applied to a register, so each word needs a memory read, the operation itself, and a memory write: 3 slots in the execution units in total. Since 3 slots are available per cycle, the asymptotic optimal timing is 1.00 cycle/word. But each iteration must also update the loop counter and the address register and then branch, which takes one additional cycle. If P is the number of words processed per loop iteration, the optimal timing is (P+1)/P cycles/word: 2.00 for P=1, 1.50 for P=2, 1.25 for P=4, and 1.125 for P=8.
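The timing model above can be written as a one-line function. This is only a sketch of the estimate in the text; `cycles_per_word` is a hypothetical name, and the model assumes exactly one cycle per word plus one cycle of loop overhead per iteration.

```c
/* Hypothetical helper: predicted cycles per word when p words are
   processed per loop iteration (p must be nonzero).
   p cycles for the words (3 slots each, 3 slots per cycle),
   plus 1 cycle of loop overhead per iteration. */
static double cycles_per_word(unsigned p)
{
    return (double)(p + 1) / (double)p;
}
```

For P = 1, 2, 4 and 8 this gives 2.00, 1.50, 1.25 and 1.125, so the overhead is halved each time the unrolling factor doubles.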

Note that this estimate ignores instruction latencies and other CPU limitations. Thanks to out-of-order execution and register renaming, the CPU is usually able to overlap the instructions of one loop iteration with those of the next automatically.

```2 words per iteration
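shr	N, 1		; N = iteration count (word count must be even and nonzero)
align	16
.a:
mov	AUX, [Z    ]	; load word 0
OP	AUX		; apply the operator in the register
mov	[Z    ], AUX	; store word 0
mov	AUX, [Z + 8]	; load word 1
OP	AUX
mov	[Z + 8], AUX	; store word 1
lea	Z, [Z + 16]	; advance the pointer by 2 words, flags untouched
dec	N		; one iteration done
jnz	.a		; loop while N is nonzero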
```

As expected, each iteration requires 3 cycles, yielding 1.50 cycles/word. Unrolling to our fixed limit of 8 words per iteration, we get:

```8 words per iteration
shr	N, 3
align   16
.a:
mov	AUX, [Z     ]
OP	AUX
mov	[Z     ], AUX
mov	AUX, [Z +  8]
OP	AUX
mov	[Z +  8], AUX
mov	AUX, [Z + 16]
OP	AUX
mov	[Z + 16], AUX
mov	AUX, [Z + 24]
OP	AUX
mov	[Z + 24], AUX
mov	AUX, [Z + 32]
OP	AUX
mov	[Z + 32], AUX
mov	AUX, [Z + 40]
OP	AUX
mov	[Z + 40], AUX
mov	AUX, [Z + 48]
OP	AUX
mov	[Z + 48], AUX
mov	AUX, [Z + 56]
OP	AUX
mov	[Z + 56], AUX
lea	Z, [Z + 64]
dec	N
jnz	.a
```

This code runs at the predicted optimal speed of 1.125 cycles/word.
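Both loops assume that the word count is a nonzero multiple of the unrolling factor: after `shr N, 3` any leftover words are skipped, and a zero iteration count makes `dec N` wrap around. The following C sketch mirrors the 8-word loop and adds a tail loop for the remaining words. The name `vec_not_unrolled8` is hypothetical, OP is fixed to `not` for illustration, and the tail handling is an addition that is not part of the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: NOT of every word of (z, n), 8 words per iteration,
   followed by a tail loop for the last n mod 8 words. */
static void vec_not_unrolled8(uint64_t *z, size_t n)
{
    size_t blocks = n >> 3;              /* same role as "shr N, 3" */

    for (; blocks != 0; blocks--) {      /* same role as "dec N / jnz .a" */
        z[0] = ~z[0];
        z[1] = ~z[1];
        z[2] = ~z[2];
        z[3] = ~z[3];
        z[4] = ~z[4];
        z[5] = ~z[5];
        z[6] = ~z[6];
        z[7] = ~z[7];
        z += 8;                          /* same role as "lea Z, [Z + 64]" */
    }

    /* Tail: the 0 to 7 words not covered by the unrolled loop. */
    for (size_t i = 0; i < (n & 7); i++)
        z[i] = ~z[i];
}
```

The tail loop runs at the slower one-word-per-iteration rate, but it handles at most 7 words, so its cost does not affect the asymptotic timing.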