For the past several weeks I have been studying X86 assembly language, mainly to bring my knowledge of assembly up to date with current CPU technology. I had previously taken an X86 assembly course at NCSU roughly a year ago, but it covered only the 8086 instruction set and used MASM version 6.0 as the assembler, which is really only good for writing MS-DOS applications. I wanted to at least learn how to do floating-point calculations in assembly, and to do it in GNU assembly so that my apps would run on Linux.
There are quite a few extensions to the core X86 instruction set, such as the x87 FPU, MMX, SSE, and SSE2. The FPU handles ordinary scalar floating-point calculations (it has been integrated on-die since the 486DX; before that it was a separate coprocessor such as the 80387). MMX performs multiple integer operations with a single instruction, SSE does the same for multiple single-precision floating-point values, and SSE2 for multiple double-precision values. Since software these days, and OO.o in particular, seems to do almost all of its floating-point calculations in double precision, I decided to give SSE2 a little benchmark test.
Here is how I did it. I wrote a simple mathematical routine in C and compiled it normally with gcc at -O1 optimization. Then I had gcc generate assembly code for that routine, cleaned it up a bit, replaced several instructions with SSE2 instructions, and reassembled it into an executable to benchmark.
Here is the original C code for the routine:
void array_multiply(double* array1, double* array2, unsigned int size)
{
    // Make nloop a multiple of 2.
    unsigned int nloop = size/2;
    nloop += nloop;

    unsigned int i = 0;
    for (; i < nloop; i += 2)
    {
        array1[i]   *= array2[i]   * array2[i]   * array2[i];
        array1[i+1] *= array2[i+1] * array2[i+1] * array2[i+1];
    }
}
and this is the assembly instructions that gcc generated (with -O1):
        .text
        .align 2
        .globl array_multiply
        .type array_multiply, @function
array_multiply:
.LFB13:
        pushl   %ebp
.LCFI0:
        movl    %esp, %ebp
.LCFI1:
        pushl   %ebx
.LCFI2:
        movl    8(%ebp), %edx       # pointer to array1
        movl    12(%ebp), %ebx      # pointer to array2
        movl    16(%ebp), %ecx      # size
        andl    $-2, %ecx           # make the size a multiple of 2
        je      .L5
        movl    $0, %eax            # i = 0
.L4:
        fldl    (%ebx,%eax,8)       # st(0) = b
        fld     %st(0)              # st(0) = b, st(1) = b
        fmul    %st(1), %st         # st(0) = b*b
        fmulp   %st, %st(1)         # st(0) = b*b*b
        fmull   (%edx,%eax,8)       # st(0) *= a
        fstpl   (%edx,%eax,8)       # store result into array1[i]
        fldl    8(%ebx,%eax,8)      # same sequence for array1[i+1]
        fld     %st(0)
        fmul    %st(1), %st
        fmulp   %st, %st(1)
        fmull   8(%edx,%eax,8)
        fstpl   8(%edx,%eax,8)
        addl    $2, %eax            # i += 2
        cmpl    %eax, %ecx
        ja      .L4
.L5:
        popl    %ebx
        popl    %ebp
        ret
It does all the calculations using x87 FPU instructions. Here is the assembly code after I replaced the FPU instructions with SSE2 ones (note that movapd requires its memory operand to be 16-byte aligned, while movupd also accepts unaligned addresses):
        .section .text
        .align 16
        .globl array_multiply
        .type array_multiply, @function
array_multiply:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        movl    8(%ebp), %edx       # pointer to array1
        movl    12(%ebp), %ebx      # pointer to array2
        movl    16(%ebp), %ecx      # size
        andl    $-2, %ecx           # make the size a multiple of 2
        je      .L5
        movl    $0, %eax            # i = 0
.L4:
        movapd  (%edx,%eax,8), %xmm0    # SSE2
        movupd  (%ebx,%eax,8), %xmm1    # SSE2
        mulpd   %xmm1, %xmm0            # SSE2
        mulpd   %xmm1, %xmm0            # SSE2
        mulpd   %xmm1, %xmm0            # SSE2
        movapd  %xmm0, (%edx,%eax,8)    # SSE2
        addl    $2, %eax                # i += 2
        cmpl    %eax, %ecx
        ja      .L4
.L5:
        popl    %ebx
        popl    %ebp
        ret
I then used the following main C++ code
#include <iostream>

using std::cout;
using std::endl;

extern "C" void array_multiply(double* array1, double* array2, unsigned int size);

void print_array(double* array, unsigned int size)
{
    cout << "{ ";
    for (unsigned int i = 0; i < size; ++i)
    {
        if (i)
            cout << ", ";
        cout << array[i];
    }
    cout << " }" << endl;
}

int main()
{
    double myarray1[] = {10.5, 50.0, 25.0, 10.0, 2345.4848, 594.23, 0.4, 87.2};
    double myarray2[] = {1.2, 50.0, 1.5, 10.0, 120.9, 44.09, 874.234, 233.333};
    const unsigned int array_size = 8;

    cout << "myarray1 = ";
    print_array(myarray1, array_size);
    cout << "myarray2 = ";
    print_array(myarray2, array_size);

    double myarray[array_size];
    for (long counter = 0; counter < 99999999; ++counter)
    {
        // To prevent calculation results from being cached.
        for (unsigned int i = 0; i < array_size; ++i)
            myarray[i] = myarray1[i] + 0.000000001*counter;

        array_multiply(myarray, myarray2, array_size);
    }

    for (unsigned int i = 0; i < array_size; ++i)
        cout << i << "\t" << myarray[i] << endl;
}
to call both the original C version and the SSE2 assembly version and compare their performance. The executables built from the C version and the SSE2 version are named test_c and test_sse, respectively. Here is the result (on my machine):
$ time ./test_c
myarray1 = { 10.5, 50, 25, 10, 2345.48, 594.23, 0.4, 87.2 }
myarray2 = { 1.2, 50, 1.5, 10, 120.9, 44.09, 874.234, 233.333 }
0       18.3168
1       6.2625e+06
2       84.7125
3       10100
4       4.14505e+09
5       5.09387e+07
6       3.34082e+08
7       1.10903e+09

real    0m3.308s
user    0m3.292s
sys     0m0.012s

$ time ./test_sse
myarray1 = { 10.5, 50, 25, 10, 2345.48, 594.23, 0.4, 87.2 }
myarray2 = { 1.2, 50, 1.5, 10, 120.9, 44.09, 874.234, 233.333 }
0       18.3168
1       6.2625e+06
2       84.7125
3       10100
4       4.14505e+09
5       5.09387e+07
6       3.34082e+08
7       1.10903e+09

real    0m2.451s
user    0m2.436s
sys     0m0.000s
Indeed, the SSE2 version performs better! I also compared my SSE2 version against -O3 optimized C code, but for this particular algorithm there was not much difference from the -O1 output. But of course, YMMV.
Does this mean we should start writing assembly code for performance optimization? Probably not. I still think it's much better to write a better algorithm in the first place than to rewrite code in assembly, because rewriting in assembly is itself no guarantee of better performance. For serious performance work, however, knowing what kind of assembly the compiler generates from C/C++ code under various optimization flags is undoubtedly beneficial. You never know: in a few extreme cases it may even make sense to write parts of the application in assembly, even if that means writing that part separately for every platform the app needs to support. OO.o has parts of its UNO bridge written in assembly, and I've seen some assembly code in the Firefox codebase as well.
Oh, and by the way: for anyone looking for a good study guide on GNU assembly for the X86 family of chips, Professional Assembly Language by Richard Blum (Wiley Publishing) is a pretty good place to start.