AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Binary OP NOT

The next function will combine two vectors (X,N) and (Z,N) using binary operator op (and, or, xor), with the X operand bits inverted. The result is put back in (Z,N): Z_i ⇐ Z_i op not X_i.
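As a reference for what the assembly below computes, here is a minimal C sketch of the same operation. The names `vec_op_not` and `binop_t` are hypothetical and do not appear in the original code; the sketch only illustrates the semantics Z_i ⇐ Z_i op not X_i on 64-bit words and makes no attempt at performance.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector for the binary operator (not in the original). */
typedef enum { OP_AND, OP_OR, OP_XOR } binop_t;

/* Reference semantics: z[i] = z[i] op ~x[i] for every i in [0, n). */
static void vec_op_not(uint64_t *z, const uint64_t *x, size_t n, binop_t op)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t nx = ~x[i];            /* inverted X operand */
        switch (op) {
        case OP_AND: z[i] &= nx; break;
        case OP_OR:  z[i] |= nx; break;
        case OP_XOR: z[i] ^= nx; break;
        }
    }
}
```

A plain loop like this can serve as a check when testing an unrolled assembly version against known inputs.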

The memory requirements are the same as for the previous function. One additional execution slot is needed to invert each word, giving 4P+3 slots for one iteration processing P words. We are still limited by the memory accesses, but need to process 8 words per iteration to reach that limit. Since the processor had a hard time scheduling the code, I replaced the "mov memory then not register" sequence with "mov register then xor memory", which has a smaller latency and therefore makes scheduling easier.
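The replacement works because xoring a word with all 1's gives the same result as inverting it. The C sketch below shows the two sequences side by side; the function names are hypothetical, and `FULL` simply mirrors the register of the same name in the listing.

```c
#include <stdint.h>

#define FULL UINT64_MAX   /* all 1's, like the FULL register in the listing */

/* "mov memory then not register": load the word, then invert it. */
static uint64_t invert_load_not(const uint64_t *p)
{
    uint64_t aux = *p;
    return ~aux;
}

/* "mov register then xor memory": start from all 1's, then xor the word in. */
static uint64_t invert_full_xor(const uint64_t *p)
{
    uint64_t aux = FULL;
    aux ^= *p;
    return aux;
}
```

In the second form the register copy does not depend on memory, so it can be issued early, well before the load it will be combined with.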

8 words per iteration
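	shr	N, 3		; iterations = N/8 (the loop assumes N is a nonzero multiple of 8)
	xor	FULL, FULL
	dec	FULL		; FULL is all 1's
	mov	AUX0, FULL
	mov	AUX1, FULL
	align	16
.a:
	xor	AUX0, [X     ]	; AUX = not X (xor with all 1's)
	xor	AUX1, [X +  8]
	lea	X, [X + 64]	; advance X early; later X offsets are negative

	OP	AUX0, [Z     ]	; AUX = (not X) op Z
	OP	AUX1, [Z +  8]
	mov	AUX2, FULL	; preload all 1's for the next pair

	mov	[Z     ], AUX0	; words 0 and 1
	mov	[Z +  8], AUX1
	mov	AUX3, FULL

	xor	AUX2, [X - 48]
	xor	AUX3, [X - 40]
	mov	AUX0, FULL

	OP	AUX2, [Z + 16]
	OP	AUX3, [Z + 24]
	mov	AUX1, FULL

	mov	[Z + 16], AUX2	; words 2 and 3
	mov	[Z + 24], AUX3
	lea	Z, [Z + 64]	; advance Z; later Z offsets are negative

	xor	AUX0, [X - 32]
	xor	AUX1, [X - 24]
	mov	AUX2, FULL

	OP	AUX0, [Z - 32]
	OP	AUX1, [Z - 24]
	mov	AUX3, FULL

	mov	[Z - 32], AUX0	; words 4 and 5
	mov	[Z - 24], AUX1
	mov	AUX0, FULL	; AUX0 is ready for the next iteration

	xor	AUX2, [X - 16]
	xor	AUX3, [X -  8]
	mov	AUX1, FULL	; AUX1 is ready for the next iteration

	OP	AUX2, [Z - 16]
	OP	AUX3, [Z -  8]
	dec	N		; sets ZF; the mov's below leave the flags untouched

	mov	[Z - 16], AUX2	; words 6 and 7
	mov	[Z -  8], AUX3

	jnz	.a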

This code runs at 12 cycles/iteration, or 1.50 cycles/word. There is only one unused execution slot in each iteration!
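The measured figure can be checked against a simple budget. The sketch below is only a back-of-the-envelope model: the 4P+3 slot count comes from the text above, while the two constants (3 execution slots and 2 memory accesses per cycle) are assumptions that are not stated in this section. They are consistent with the reported 12 cycles and the single unused slot. The function names are hypothetical.

```c
/* Assumptions (not stated in this section): 3 execution slots and
   2 memory accesses (load or store) per cycle. */
enum { SLOTS_PER_CYCLE = 3, MEM_PER_CYCLE = 2 };

/* Execution slots for one iteration of p words: 4P+3, as given in the text. */
static int slots_needed(int p) { return 4 * p + 3; }

/* Memory accesses per iteration: load X, load Z and store Z for each word. */
static int mem_accesses(int p) { return 3 * p; }

/* Smallest whole number of cycles that satisfies both limits. */
static int min_cycles(int p)
{
    int by_slots = (slots_needed(p) + SLOTS_PER_CYCLE - 1) / SLOTS_PER_CYCLE;
    int by_mem   = (mem_accesses(p) + MEM_PER_CYCLE - 1) / MEM_PER_CYCLE;
    return by_slots > by_mem ? by_slots : by_mem;
}
```

Under these assumptions, P = 8 needs 35 slots and 24 memory accesses, and both limits give 12 cycles, leaving 36 - 35 = 1 free slot. With P = 4 the slot count (19 slots, 7 cycles) would exceed the memory bound (12 accesses, 6 cycles), which is why 8 words per iteration are needed to reach the memory limit.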