SSE2 Instructions

For the past several weeks I have been studying X86 assembly language, mainly because I wanted to update my assembly knowledge to match the latest CPU technology. I had previously taken an X86 assembly language course at NCSU roughly a year ago, but that course only covered the 8086 instruction set and used MASM version 6.0 as the assembler, which is only good for writing MS-DOS applications. I wanted to at least learn how to do floating-point calculations in assembly, and to do it in GNU assembly so that my apps would run on Linux.

There are quite a few extensions to the core X86 instruction set, such as the x87 FPU, MMX, SSE, and SSE2. The FPU handles ordinary scalar floating-point calculations, MMX performs multiple integer operations with a single instruction, SSE does the same for multiple single-precision floating-point values, and SSE2 for multiple double-precision values (again, with a single instruction). Since software these days, and OO.o in particular, seems to do almost all of its floating-point calculations in double precision, I decided to give SSE2 a little benchmark test.
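
For illustration, the "multiple double-precision values" part can also be expressed from C++ with the SSE2 compiler intrinsics in <emmintrin.h>; the little sketch below (separate from the benchmark that follows) shows a single packed multiply operating on two pairs of doubles at once:

#include <emmintrin.h> // SSE2 intrinsics; compile with -msse2
#include <cstdio>

int main()
{
    double a[2] = {1.5, 2.5};
    double b[2] = {4.0, 8.0};
    double r[2];

    __m128d va = _mm_loadu_pd(a);    // two doubles packed into one 128-bit register
    __m128d vb = _mm_loadu_pd(b);
    __m128d vr = _mm_mul_pd(va, vb); // one instruction, two multiplications
    _mm_storeu_pd(r, vr);

    std::printf("%g %g\n", r[0], r[1]); // prints: 6 20
    return 0;
}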

Here is how I did it. I wrote a simple mathematical routine in C and compiled it normally with gcc at -O1 optimization. Then I had gcc generate the assembly code for that routine, cleaned it up a bit, replaced several instructions with SSE2 instructions, and reassembled it into an executable to run the benchmark.

Here is the original C code for the routine:

void array_multiply(double* array1, double* array2, unsigned int size)
{
    // Make nloop a multiple of 2.
    unsigned int nloop = size/2;
    nloop += nloop;
 
    unsigned int i = 0;
    for (; i < nloop; i += 2)
    {
        array1[i]   *= array2[i] * array2[i] * array2[i];
        array1[i+1] *= array2[i+1] * array2[i+1] * array2[i+1];
    }
}

and this is the assembly code that gcc generated (with -O1), with comments added to describe what each part does:

    .text
    .align 2
.globl array_multiply
    .type   array_multiply, @function
array_multiply:
.LFB13:
    pushl   %ebp
.LCFI0:
    movl    %esp, %ebp
.LCFI1:
    pushl   %ebx
.LCFI2:
    movl    8(%ebp), %edx           # pointer to array1
    movl    12(%ebp), %ebx          # pointer to array2
    movl    16(%ebp), %ecx          # size
    andl    $-2, %ecx               # make the size a multiple of 2
    je      .L5
    movl    $0, %eax                # i = 0
.L4:
    fldl    (%ebx,%eax,8)           # push array2[i] onto the FPU stack
    fld     %st(0)                  # duplicate it
    fmul    %st(1), %st             # st(0) = array2[i]^2
    fmulp   %st, %st(1)             # st(0) = array2[i]^3 after the pop
    fmull   (%edx,%eax,8)           # multiply by array1[i]
    fstpl   (%edx,%eax,8)           # store the result back into array1[i]
    fldl    8(%ebx,%eax,8)          # repeat for the i+1 element
    fld     %st(0)
    fmul    %st(1), %st
    fmulp   %st, %st(1)
    fmull   8(%edx,%eax,8)
    fstpl   8(%edx,%eax,8)
    addl    $2, %eax                # i += 2
    cmpl    %eax, %ecx
    ja      .L4
.L5:
    popl    %ebx
    popl    %ebp
    ret

It does all the calculations using FPU instructions. And here is the assembly code after I replaced the FPU instructions with SSE2 ones:

.section .text
.align 16
.globl array_multiply
    .type   array_multiply, @function
array_multiply:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %ebx
    movl    8(%ebp), %edx           # pointer to array1
    movl    12(%ebp), %ebx          # pointer to array2
    movl    16(%ebp), %ecx          # size
    andl    $-2, %ecx               # make the size a multiple of 2
    je  .L5
    movl    $0, %eax                # i = 0
.L4:
    movapd  (%edx,%eax,8), %xmm0    # SSE2: load array1[i], array1[i+1] (needs 16-byte alignment)
    movupd  (%ebx,%eax,8), %xmm1    # SSE2: load array2[i], array2[i+1] (unaligned load)
    mulpd   %xmm1, %xmm0            # SSE2: xmm0 *= xmm1
    mulpd   %xmm1, %xmm0            # SSE2: xmm0 *= xmm1
    mulpd   %xmm1, %xmm0            # SSE2: xmm0 *= xmm1 (now array1[] * array2[]^3)
    movapd  %xmm0, (%edx,%eax,8)    # SSE2: store both results back to array1
    addl    $2, %eax                # i += 2
    cmpl    %eax, %ecx
    ja  .L4
.L5:
    popl    %ebx
    popl    %ebp
    ret
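
A couple of notes on the loads: movapd assumes its memory operand is 16-byte aligned and will fault if it is not, while movupd also accepts unaligned addresses at some cost in speed. Incidentally, the same packed operations can be reached from C without hand-editing assembly, via the SSE2 intrinsics in <emmintrin.h>. Here is a rough sketch of the routine in that style (array_multiply_sse2 is just an illustrative name, not part of the benchmark); it uses the unaligned load/store variants so it does not depend on how the arrays happen to be aligned:

#include <emmintrin.h> // SSE2 intrinsics; compile with -msse2

void array_multiply_sse2(double* array1, double* array2, unsigned int size)
{
    // Round size down to a multiple of 2, as in the C version.
    unsigned int nloop = size/2;
    nloop += nloop;

    for (unsigned int i = 0; i < nloop; i += 2)
    {
        __m128d a = _mm_loadu_pd(&array1[i]); // array1[i], array1[i+1]
        __m128d b = _mm_loadu_pd(&array2[i]); // array2[i], array2[i+1]
        a = _mm_mul_pd(a, b);                 // *= array2[]
        a = _mm_mul_pd(a, b);                 // *= array2[] (squared)
        a = _mm_mul_pd(a, b);                 // *= array2[] (cubed)
        _mm_storeu_pd(&array1[i], a);         // write both results back
    }
}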

I then used the following main C++ code

#include <iostream>

using namespace std;

// array_multiply is defined separately (in the C file or the SSE2 assembly
// file), so it is declared here with C linkage to match the unmangled symbol.
extern "C" void array_multiply(double* array1, double* array2, unsigned int size);

void print_array(double* array, unsigned int size)
{
    cout << "{ ";
    for (unsigned int i = 0; i < size; ++i)
    {
        if (i)
            cout << ", ";
        cout << array[i];
    }
    cout << " }" << endl;
}
 
int main()
{
    double myarray1[] = {10.5, 50.0, 25.0, 10.0, 2345.4848, 594.23, 0.4, 87.2};
    double myarray2[] = {1.2, 50.0, 1.5, 10.0, 120.9, 44.09, 874.234, 233.333};
    const unsigned int array_size = 8;
 
    cout << "myarray1 = ";
    print_array(myarray1, array_size);
    cout << "myarray2 = ";
    print_array(myarray2, array_size);
 
    double myarray[array_size];
 
    for (long counter = 0; counter < 99999999; ++counter)
    {
        for (unsigned int i = 0; i < array_size; ++i)
            // Perturb the input on each iteration so the compiler cannot
            // reuse a previously computed result.
            myarray[i] = myarray1[i] + 0.000000001*counter;
        array_multiply(myarray, myarray2, array_size);
    }
 
    for (unsigned int i = 0; i < array_size; ++i)
        cout << i << "\t" << myarray[i] << endl;
}

to call both the C version and the SSE2 assembly version and compare their performance. The executables built from the original C code and from the SSE2 assembly are named test_c and test_sse, respectively. Here is the result (on my machine):

$ time ./test_c
myarray1 = { 10.5, 50, 25, 10, 2345.48, 594.23, 0.4, 87.2 }
myarray2 = { 1.2, 50, 1.5, 10, 120.9, 44.09, 874.234, 233.333 }
0        18.3168
1        6.2625e+06
2        84.7125
3        10100
4        4.14505e+09
5        5.09387e+07
6        3.34082e+08
7        1.10903e+09
 
real    0m3.308s
user    0m3.292s
sys     0m0.012s
 
$ time ./test_sse
myarray1 = { 10.5, 50, 25, 10, 2345.48, 594.23, 0.4, 87.2 }
myarray2 = { 1.2, 50, 1.5, 10, 120.9, 44.09, 874.234, 233.333 }
0        18.3168
1        6.2625e+06
2        84.7125
3        10100
4        4.14505e+09
5        5.09387e+07
6        3.34082e+08
7        1.10903e+09
 
real    0m2.451s
user    0m2.436s
sys     0m0.000s
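
As a quick sanity check on the numbers: on the last pass through the outer loop (counter = 99999998), myarray[0] is set to 10.5 + 0.000000001 * 99999998, or roughly 10.6, and 10.6 * 1.2^3 = 10.6 * 1.728 = 18.3168, which matches the first value printed by both versions.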

Indeed, the SSE2 version seems to perform better! I also compared my SSE2 version against the -O3 optimized C code, but for this particular algorithm there was not much difference from the -O1 optimized version. But of course, YMMV.

Does this mean we should start writing assembly code for performance optimization? Probably not. I still think it's much better to write a better algorithm in the first place than to rewrite code in assembly, because rewriting in assembly is by itself no guarantee of better performance. For serious performance work, however, knowing what kind of assembly code the compiler generates from C/C++ code (gcc -S will show you), under various optimization flags, will undoubtedly be beneficial. And in a few extreme cases it may even make sense to write parts of the application code in assembly, even if that means writing that part separately for every platform the app needs to support. OO.o has some parts of the UNO bridge written in assembly, and I've seen some assembly code in the Firefox codebase as well.

Oh, by the way, for anyone looking for a good study guide on GNU assembly for the X86 family of chips, “Professional Assembly Language” by Richard Blum (Wiley Publishing) is a pretty good place to start.

Solver binary (take 2)

Eike informed me that the previous binary I posted was not deployable with a regular OO.o build due to binary incompatibility with gcc (or rather g++). I compiled my last binary using gcc 4.0, which apparently introduces the symbol versions GLIBCXX_3.4.4 and GLIBCXX_3.4.6, and these are not understood by a build made with gcc 3.4.1 (such as a regular OO.o build). See this blog entry of Eike’s for more details.

So, I downloaded gcc 3.4.1 and rebuilt my solver package. I ran objdump -T scsolver.uno.so to make sure there were no GLIBCXX_3.4.x symbols left (there weren’t). So, if you were having problems running my solver with the previous binary, please try this one. Thanks.