AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

These pages relate my experiments on 64-bit multiprecision arithmetic running on Athlon 64 processors.

I describe a set of basic memory, arithmetic, and logic operations on (long) vectors of 64-bit words. The vector (A,N) is a vector of N consecutive 64-bit words A₀, A₁,..., A_N-1. Such a vector represents an unsigned integer A₀+A₁.2⁶⁴+...+A_N-1.2^64(N-1). All integers in the interval 0..2^64N-1 are representable, and have an unique representation.

To simplify the functions, I impose N to be a multiple of 8: the size of a vector is a multiple of 64 bytes (cache line). Vectors must be 8-byte aligned on the AMD64, and 16-byte aligned on the Intel64, where 128-bit SSE instructions are used.

I used information from the following sources:

AMD Manuals,
Intel Manuals,
Agner Fog's Software Optimization Resources page, containing a lot of precious and easy to find information.

The GMP library already provides a very fast and comprehensive set of low-level multiprecision functions (mpn). I will often use it as a starting point for further optimizations, along with some optimized memory function of the GNU C Library.

Execution time for each function is evaluated for 8KB vectors (N=1024 words), using the rdtsc instruction surrounded by two cpuid instructions to ensure all instructions are counted (because rdtsc may be executed out-of-order, but cpuid can't be). With this size we can get a very precise measure of the number of required cycles per word. Each function is called a large number of times, and we take the average number of cycles.

Here are the best timings I could get so far:

Function	glibc	gmp	this page
Z ← 0	0.50	-	0.50
Z ← op Z (not neg)	-	-	1.13
Z ← X	1.50	-	1.50
Z ← Z op X (+ - and or xor)	-	2.00	1.50
Z ← Z op not X (and or xor)	-	-	1.50
Z ← Z * k	-	3.00	3.00
Z ← Z * k op X (+ -)	-	3.13	3.13
Z ← Z << k	-	2.50	2.25
Z ← Z / k	-	22.00	17.00

glibc refers to glibc 2.4, and gmp to gmp 4.2.1 with P. Gaudry's patch.