Tuesday, October 25, 2011

High Performance Assembly Language

The Core i7 CPU offers the Advanced Vector Extensions (AVX), which expand on the capabilities of the Streaming SIMD Extensions (SSE) instructions.  SSE added floating point and integer capabilities to the CPU; in particular, the SSE registers (xmm0, xmm1, ...) are 128 bits each, and those bits can hold multiple integers of various widths or multiple floating point values.  Instructions that operate on these multiple values at once are referred to as SIMD (single-instruction, multiple-data) instructions.  With the AVX instructions the registers are extended to 256 bits, though the integer instructions still operate only on 128 bit quantities.

Here I will present a simple function and then extend it step by step to improve performance.  The keys to success are loop unrolling, using independent registers for the unrolled iterations, and using specialized instructions.  In the process there will be several versions of the basic function which perform better than the basic C code optimized with loop unrolling by the gcc compiler.

Basic C Code

The basic C function computes the sum of the double precision numbers in an array.

    double add_array ( double *a, long size )
    {
        int i;
        double sum = 0.0;

        for ( i = 0; i < size; i++ ) sum += a[i];

        return sum;
    }

Compiling this code with -O3 and testing with 100000 calls on the same array of 100000 doubles requires 8.091 elapsed seconds.  These calls perform 10 billion additions in 8.091 seconds, for approximately 1.2 GFLOPs.

Compiling the C code with -O3 and -funroll-all-loops and performing the same test requires 8.076 elapsed seconds, which is basically the same performance, about 1.2 GFLOPs.
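
The timing harness itself is not shown here; a minimal sketch of the kind of test described above (the use of clock_gettime and the initialization values are assumptions, not the original harness) could look like this, compiled with gcc -O3 and, on older glibc, linked with -lrt:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    double add_array ( double *a, long size );   /* the function being tested */

    int main ( void )
    {
        long n = 100000, calls = 100000;         /* 100000 doubles, 100000 calls */
        double *a = malloc(n*sizeof(double));
        double sum = 0.0;
        struct timespec t0, t1;
        long i;

        for ( i = 0; i < n; i++ ) a[i] = 1.0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for ( i = 0; i < calls; i++ ) sum += add_array(a, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)*1e-9;
        printf("sum %g, %.3f seconds, %.2f GFLOPs\n",
               sum, secs, (double)n*calls/secs/1e9);
        free(a);
        return 0;
    }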

Here is the generated assembly code for the main portion of the loop:

.L3:
        addsd   (%rdi,%rax,8), %xmm0
        addsd   8(%rdi,%rax,8), %xmm0
        addsd   16(%rdi,%rax,8), %xmm0
        addsd   24(%rdi,%rax,8), %xmm0
        addsd   32(%rdi,%rax,8), %xmm0
        addsd   40(%rdi,%rax,8), %xmm0
        addsd   48(%rdi,%rax,8), %xmm0
        addsd   56(%rdi,%rax,8), %xmm0
        addq    $8, %rax
        cmpq    %rsi, %rax
        jne     .L3

The compiler unrolled the loop 8 times, but every unrolled addition accumulates into xmm0, so each addsd depends on the result of the previous one and the instructions must execute serially.  As a result, unrolling did not improve performance.

This raises the question of whether there is enough work to perform for each data value, or whether the computation is limited by main memory performance.  The array occupies 800,000 bytes and the total cache size of the CPU is 8 MB, so the array should fit nicely in cache and there should be no memory bandwidth limitation.  Testing with much larger array sizes could show whether memory bandwidth becomes the limiting factor.

The problem seems to be that the compiler uses only xmm0 to add up the values of the array, creating a serial dependency chain.


Simple Assembly Version

The simplest assembly code is a loop operating on 1 value per loop iteration:

        segment .text
        global add_array
add_array:
        xor     r8, r8          ; r8 is the byte offset into the array
        vzeroall                ; zero all the ymm/xmm registers, so xmm0 = 0.0
.more:
        addsd   xmm0, [rdi+r8]  ; sum += a[i]
        add     r8, 8           ; advance one double (8 bytes)
        sub     rsi, 1          ; one fewer element remaining
        jnz     .more
        ret                     ; the result is returned in xmm0

Performing the same test with the simple assembly version requires 8.026 seconds, which is again about 1.2 GFLOPs.  This matches the C version with or without loop unrolling.
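
For reference, the assembly routine follows the System V AMD64 calling convention, so the pointer arrives in rdi, the element count in rsi, and the double result is returned in xmm0.  A declaration and build sketch (the file names are assumptions) looks like this:

    /* assumed build:  nasm -f elf64 add_array.asm
                       gcc -O3 main.c add_array.o -o bench   */

    extern double add_array ( double *a, long size );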

Unrolled Assembly Version using 8 Different Registers

Adapting the assembly code to unroll the loop 8 times and compute 8 subsums in the loop gives the following code:

        segment .text
        global add_array
add_array:
        xor     r8, r8          ; r8 is the byte offset into the array
        shr     rsi, 3          ; iteration count = size/8 (assumes size is a multiple of 8)
        vzeroall                ; zero all the registers: the 8 subsums start at 0.0
.more:
        addsd   xmm0, [rdi+r8]  ; 8 independent subsums, one per register
        addsd   xmm1, [rdi+r8+8]
        addsd   xmm2, [rdi+r8+16]
        addsd   xmm3, [rdi+r8+24]
        addsd   xmm4, [rdi+r8+32]
        addsd   xmm5, [rdi+r8+40]
        addsd   xmm6, [rdi+r8+48]
        addsd   xmm7, [rdi+r8+56]
        add     r8, 64          ; advance 8 doubles (64 bytes)
        sub     rsi, 1
        jnz     .more
        addsd   xmm0, xmm1      ; combine the 8 subsums into xmm0
        addsd   xmm2, xmm3
        addsd   xmm0, xmm2
        addsd   xmm4, xmm5
        addsd   xmm6, xmm7
        addsd   xmm4, xmm6
        addsd   xmm0, xmm4
        ret

You can see that registers xmm0-xmm7 are used to store 8 subsums.  After the loop these 8 registers are added together to produce the final sum.  This eliminates the register dependency within the loop, which gives the CPU the flexibility to schedule the independent additions across its computational units (all within the same core).  This code requires 3.313 seconds to execute, which is about 3 GFLOPs.  That is about 2.4 times as fast as the C code with maximum optimization and loop unrolling, a nice result showing the value of eliminating register dependencies within loops.
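
For comparison, the same multiple-subsum idea can be written in plain C by giving the compiler eight independent accumulators.  This is only a sketch (the name add_array8 is invented, and like the assembly it assumes size is a multiple of 8); whether a given compiler turns it into code like the hand-written version is not guaranteed:

    double add_array8 ( double *a, long size )
    {
        long i;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;

        for ( i = 0; i < size; i += 8 )       /* assumes size % 8 == 0 */
        {
            s0 += a[i];   s1 += a[i+1];
            s2 += a[i+2]; s3 += a[i+3];
            s4 += a[i+4]; s5 += a[i+5];
            s6 += a[i+6]; s7 += a[i+7];
        }

        return ((s0+s1)+(s2+s3)) + ((s4+s5)+(s6+s7));
    }

Note that reassociating the additions this way can change the result in the last few bits, which is one reason gcc will not restructure the original loop like this on its own without -ffast-math.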

AVX Version

The final version to be considered is an AVX version of the same code.  In this case the loop is unrolled 4 times, but each vaddpd instruction adds 4 doubles from the array to one of 4 AVX registers (ymm0-ymm3), so each iteration processes 16 doubles:

        segment .text
        global add_array
add_array:
        xor     r8, r8          ; r8 is the byte offset into the array
        shr     rsi, 4          ; iteration count = size/16 (assumes size is a multiple of 16)
        vzeroall                ; zero all the ymm registers: the 16 subsums start at 0.0
.more:
        vaddpd  ymm0, ymm0, [rdi+r8]       ; each vaddpd adds 4 doubles to 4 subsums
        vaddpd  ymm1, ymm1, [rdi+r8+32]
        vaddpd  ymm2, ymm2, [rdi+r8+64]
        vaddpd  ymm3, ymm3, [rdi+r8+96]
        add     r8, 128         ; advance 16 doubles (128 bytes)
        sub     rsi, 1
        jnz     .more
        vaddpd  ymm0, ymm0, ymm1           ; reduce the 16 subsums to 4 in ymm0
        vaddpd  ymm2, ymm2, ymm3
        vaddpd  ymm0, ymm0, ymm2
        vhaddpd ymm0, ymm0, ymm0           ; pairwise adds within each 128-bit half
        vextractf128 xmm1, ymm0, 1         ; copy the upper half of ymm0 into xmm1
        addsd   xmm0, xmm1                 ; final scalar sum returned in xmm0
        ret

Each of the 4 registers ymm0-ymm3 holds 4 subsums, so this time there are 16 subsums being computed.  After the loop, vaddpd is used 3 times to add ymm1, ymm2 and ymm3 into ymm0, leaving 4 subsums in ymm0.  The vhaddpd instruction adds the first two values of ymm0 together and the last two values together, leaving 2 subsums.  The vextractf128 instruction copies the upper 128 bits of ymm0 into xmm1.  Finally, adding xmm0 and xmm1 with addsd produces the sum of all the numbers in xmm0.
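
The same loop and reduction can also be written with compiler intrinsics rather than hand-written assembly.  Here is a sketch (the name add_array_avx and the use of unaligned loads are assumptions; like the assembly it handles only sizes that are a multiple of 16, and it would be compiled with gcc -O3 -mavx):

    #include <immintrin.h>

    double add_array_avx ( double *a, long size )
    {
        __m256d v0 = _mm256_setzero_pd(), v1 = _mm256_setzero_pd();
        __m256d v2 = _mm256_setzero_pd(), v3 = _mm256_setzero_pd();
        long i;

        for ( i = 0; i < size; i += 16 )      /* assumes size % 16 == 0 */
        {
            v0 = _mm256_add_pd(v0, _mm256_loadu_pd(&a[i]));
            v1 = _mm256_add_pd(v1, _mm256_loadu_pd(&a[i+4]));
            v2 = _mm256_add_pd(v2, _mm256_loadu_pd(&a[i+8]));
            v3 = _mm256_add_pd(v3, _mm256_loadu_pd(&a[i+12]));
        }

        v0 = _mm256_add_pd(_mm256_add_pd(v0, v1), _mm256_add_pd(v2, v3));
        v0 = _mm256_hadd_pd(v0, v0);                  /* pairwise adds within each 128-bit half */
        __m128d lo = _mm256_castpd256_pd128(v0);      /* low half of the 4 remaining subsums    */
        __m128d hi = _mm256_extractf128_pd(v0, 1);    /* high half of the 4 remaining subsums   */
        return _mm_cvtsd_f64(_mm_add_sd(lo, hi));     /* final scalar sum */
    }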

The AVX version of the code requires 2.381 seconds to perform the same computation as the other versions, which is about 4.2 GFLOPs.  This code is about 3.4 times as fast as the C code optimized with loop unrolling.  It is only a little faster than the unrolled scalar assembly code, which is due to the relatively small amount of computation being performed for each array element.

A further test was performed using 8 ymm registers, but there was no real gain over 4 ymm registers, presumably because 4 independent accumulators are already enough to hide the latency of vaddpd.

It is moderately surprising to see this sort of performance improvement with a trivial function.  In the author's assembly language textbook, similar techniques are used to compute the correlation between 2 arrays, achieving 20.5 GFLOPs using 1 core of the same CPU.
