The Core i7 CPU offers the Advanced Vector Extensions (AVX), which expand on the capabilities of the Streaming SIMD Extensions (SSE) instructions. SSE added floating-point and integer capabilities to the CPU; in particular, the registers used by SSE (xmm0, xmm1, ...) are 128 bits each, and those bits can hold multiple integers of various sizes or multiple floating-point values. Instructions operating on these multiple values are referred to as SIMD (single-instruction, multiple-data) instructions. With the AVX instructions the registers are extended to 256 bits, though the original AVX integer instructions still operate only on 128-bit quantities.
Here I will present some simple code and then extend it step by step to improve performance. The keys to success are loop unrolling, using independent registers within the unrolled loop, and using specialized instructions. Along the way there will be several versions of the basic function, each performing better than the basic C code optimized with loop unrolling by the gcc compiler.
Basic C Code
The basic C function computes the sum of the double precision numbers in an array.
double add_array ( double *a, long size )
{
    long i;
    double sum = 0.0;
    for ( i = 0; i < size; i++ ) sum += a[i];
    return sum;
}
Compiling this code with -O3 and testing with 100,000 calls on the same array of 100,000 doubles requires 8.091 elapsed seconds. These calls perform 10 billion additions in 8.091 seconds, which is approximately 1.2 GFLOPS.
Compiling the C code with -O3 and -funroll-all-loops and running the same test requires 8.076 elapsed seconds, essentially the same performance at about 1.2 GFLOPS.
Here is the generated assembly code for the main portion of the loop:
.L3:
addsd (%rdi,%rax,8), %xmm0
addsd 8(%rdi,%rax,8), %xmm0
addsd 16(%rdi,%rax,8), %xmm0
addsd 24(%rdi,%rax,8), %xmm0
addsd 32(%rdi,%rax,8), %xmm0
addsd 40(%rdi,%rax,8), %xmm0
addsd 48(%rdi,%rax,8), %xmm0
addsd 56(%rdi,%rax,8), %xmm0
addq $8, %rax
cmpq %rsi, %rax
jne .L3
The compiler unrolled the loop 8 times, but each unrolled addition accumulates into xmm0, which means the instructions must execute serially. In this case unrolling therefore did not improve performance.
This makes one wonder whether there is enough work to perform per data value, or whether the computation is limited by main-memory performance. The array occupies 800,000 bytes and the total cache size of the CPU is 8 MB, so the array should fit nicely in cache; there should be no memory-bandwidth limitation. Testing with much larger arrays would expose memory bandwidth as a limiting factor.
The problem seems to be that the compiler uses only xmm0 to accumulate the values of the array.
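The same dependency-breaking idea can be expressed at the C level by accumulating into several independent partial sums. This is a sketch, not the article's code, and there is no guarantee gcc keeps the four accumulators in separate xmm registers; note also that it reassociates the additions, which gcc will not do on its own without flags like -ffast-math because it can change the rounding of the result.

```c
/* Sum with 4 independent accumulators to break the serial dependency
   on a single register. The main loop handles size rounded down to a
   multiple of 4; the tail loop finishes the remainder. */
double add_array4(double *a, long size)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 4 <= size; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < size; i++) s0 += a[i];   /* remainder */
    return (s0 + s1) + (s2 + s3);
}
```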
Simple Assembly Version
The simplest assembly code is a loop operating on 1 value per loop iteration:
segment .text
        global  add_array
add_array:
        xor     r8, r8          ; r8 = byte offset into the array
        vzeroall                ; clear the ymm registers, and so xmm0
.more:
        addsd   xmm0, [rdi+r8]  ; sum += a[i]
        add     r8, 8           ; advance 8 bytes (1 double)
        sub     rsi, 1          ; size--
        jnz     .more
        ret
Performing the same test with the simple assembly version requires 8.026 seconds, which is again about 1.2 GFLOPS. This matches the C version with or without loop unrolling.
Unrolled Assembly Version using 8 Different Registers
Adapting the assembly code to unroll the loop 8 times and compute 8 subsums in the loop gives the following code:
segment .text
        global  add_array
add_array:
        xor     r8, r8          ; r8 = byte offset into the array
        shr     rsi, 3          ; iteration count = size/8 (assumes size is a multiple of 8)
        vzeroall                ; clear xmm0-xmm7, the 8 subsums
.more:
        addsd   xmm0, [rdi+r8]      ; 8 independent subsums,
        addsd   xmm1, [rdi+r8+8]    ; one per register
        addsd   xmm2, [rdi+r8+16]
        addsd   xmm3, [rdi+r8+24]
        addsd   xmm4, [rdi+r8+32]
        addsd   xmm5, [rdi+r8+40]
        addsd   xmm6, [rdi+r8+48]
        addsd   xmm7, [rdi+r8+56]
        add     r8, 64          ; advance 64 bytes (8 doubles)
        sub     rsi, 1
        jnz     .more
        addsd   xmm0, xmm1      ; reduce the 8 subsums
        addsd   xmm2, xmm3      ; to a single sum in xmm0
        addsd   xmm0, xmm2
        addsd   xmm4, xmm5
        addsd   xmm6, xmm7
        addsd   xmm4, xmm6
        addsd   xmm0, xmm4
        ret
You can see that registers xmm0-xmm7 are used to store 8 subsums. After the loop these 8 registers are added together to produce the final sum. This eliminates the register dependency within the loop, which gives the CPU flexibility in scheduling instructions across its different computational units (all within the same core). This code requires 3.313 seconds to execute, which is about 3 GFLOPS and about 2.4 times as fast as the C code with maximum optimization and loop unrolling. That's a nice result showing the value of eliminating register dependencies within loops.
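One caveat: the shr rsi, 3 divides the count by 8, so this version silently assumes size is a multiple of 8. A hedged sketch of how a wrapper could handle arbitrary sizes follows; `add_array_unrolled8` here is a pure-C stand-in for the 8-way unrolled assembly routine, and the wrapper name is my own invention.

```c
/* Stand-in for the 8-way unrolled assembly routine: requires size % 8 == 0. */
static double add_array_unrolled8(double *a, long size)
{
    double s[8] = {0};
    for (long i = 0; i < size; i += 8)
        for (int k = 0; k < 8; k++) s[k] += a[i + k];  /* 8 subsums */
    return ((s[0] + s[1]) + (s[2] + s[3])) + ((s[4] + s[5]) + (s[6] + s[7]));
}

/* Wrapper: run the fast path on the multiple-of-8 prefix, then finish
   the remaining 0-7 elements with a scalar loop. */
double add_array_any(double *a, long size)
{
    long head = size & ~7L;            /* largest multiple of 8 <= size */
    double sum = head ? add_array_unrolled8(a, head) : 0.0;
    for (long i = head; i < size; i++) sum += a[i];
    return sum;
}
```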
AVX Version
The final version to be considered is an AVX version of the same code. In this case the loop is unrolled 4 times, but each vaddpd instruction adds 4 doubles from the array to one of 4 AVX registers (ymm0-ymm3):
segment .text
        global  add_array
add_array:
        xor     r8, r8          ; r8 = byte offset into the array
        shr     rsi, 4          ; iteration count = size/16 (assumes size is a multiple of 16)
        vzeroall                ; clear ymm0-ymm3, the subsums
.more:
        vaddpd  ymm0, ymm0, [rdi+r8]    ; 4 packed additions per instruction,
        vaddpd  ymm1, ymm1, [rdi+r8+32] ; 16 subsums in all
        vaddpd  ymm2, ymm2, [rdi+r8+64]
        vaddpd  ymm3, ymm3, [rdi+r8+96]
        add     r8, 128         ; advance 128 bytes (16 doubles)
        sub     rsi, 1
        jnz     .more
        vaddpd  ymm0, ymm0, ymm1        ; reduce to 4 subsums in ymm0
        vaddpd  ymm2, ymm2, ymm3
        vaddpd  ymm0, ymm0, ymm2
        vhaddpd ymm0, ymm0, ymm0        ; pairwise sums within each 128-bit lane
        vextractf128 xmm1, ymm0, 1      ; upper lane of ymm0 into xmm1
        addsd   xmm0, xmm1              ; final sum in xmm0
        ret
Each of the 4 registers, ymm0-ymm3, holds 4 subsums, so this time there are 16 subsums being computed. After the loop, vaddpd is used 3 times to add ymm1, ymm2 and ymm3 into ymm0, leaving 4 subsums in ymm0. The vhaddpd instruction adds the first two values of ymm0 and the last two values of ymm0, giving 2 subsums, one per 128-bit lane. The vextractf128 instruction copies the upper 128 bits of ymm0 into xmm1. Finally, addsd adds the low doubles of xmm0 and xmm1 to produce the sum of all the numbers.
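To check one's understanding of that reduction sequence, the effect of vhaddpd, vextractf128 and addsd on the 4 double lanes of ymm0 can be simulated in plain C. This models the instruction semantics as documented; it is not the actual vector code.

```c
/* Simulate the final AVX reduction on the 4 double lanes of ymm0.
   vhaddpd ymm0, ymm0, ymm0 works per 128-bit lane, producing
     { y[0]+y[1], y[0]+y[1], y[2]+y[3], y[2]+y[3] }.
   vextractf128 xmm1, ymm0, 1 copies the upper 128-bit lane into xmm1,
   and addsd adds the low elements of xmm0 and xmm1. */
double reduce_ymm(const double y[4])
{
    double h[4] = { y[0] + y[1], y[0] + y[1], y[2] + y[3], y[2] + y[3] };
    double xmm0 = h[0];   /* low double of ymm0 after vhaddpd */
    double xmm1 = h[2];   /* low double of the extracted upper lane */
    return xmm0 + xmm1;   /* addsd xmm0, xmm1 */
}
```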
The AVX version of the code requires 2.381 seconds to perform the same computation as the other versions, which is about 4.2 GFLOPS. This is about 3.4 times as fast as the C code optimized with loop unrolling. It is only a little faster than the unrolled non-AVX assembly code, which is due to the relatively small amount of computation performed per array element.
A further test was performed using 8 YMM registers, but there was no real gain over the 4-register version.
It is moderately surprising to see this sort of performance improvement with a trivial function. In the author's assembly language textbook, similar techniques are used to compute the correlation between 2 arrays, achieving 20.5 GFLOPS using 1 core of the same CPU.