make my way: április 2010

In my software, because of the data flows in the system and image processing operations, large segments of memory can be allocated/deallocated/copied. It is a straightforward direction to find ways to optimize these operations in the system. Earlier, I integrated the Google's TCMalloc (malloc replacement) to speed up the memory allocations and it was efficient to reduce the run-time by 10 %.
Agner Fog's e-books were found by me on the internet and I got excited about his asmlib library where he implemented memory optimizations in assembly. A table suggests in his books that his implementations are very efficient. I am a sceptic-type person, so I did some measurements under various circumstances. My simple program, what I executed, did random memory copies into two character arrays. In the aligned tests, the program copied 10000 bytes at a time and in the unaligned tests, 9999 bytes. The program below was compiled with varied gcc optimization flags and executed with the time program to measure the run-time:

int main(int, char**)
{
char a1[1100000];
char a2[10000];

srand(time(NULL));

for (int i = 0; i < 1000000; i++)
memcpy(&a1[rand() % 1000000], a2, 10000);
}

In the columns of the measurements:

1. Without asmlib, without -fno-builtin flag to gcc.
2. Without asmlib, with -fno-builtin flag to gcc.
3. With asmlib, without -fno-builtin flag to gcc.
4. With asmlib, with -fno-builtin flag to gcc.

The -fno-builtin flag means to disable the gcc built-in version of the memory operations. The built-in version can be optimized by the gcc in higher optimization levels if possible.

A) CPU: 32 bit, Operating system: 32 bit
Details: AMD Sempron 2800+ (Thoroughbred-B, 2000 Mhz), Windows XP Home (MinGW, gcc 4.4.1).

Aligned tests:

gcc -O0: 2.7 s, 2.7 s; 2.7 s, 2.7 s
gcc -O1: 2.7 s, 2.7 s; 2.7 s, 2.7 s
gcc -O2: 0.03 s, 2.7 s; 0.03 s, 2.7 s
gcc -O3: 0.03 s, 2.7 s; 0.03 s, 2.7 s
gcc -Os: 0.03 s, 2.7 s; 0.03 s, 2.7 s

Unaligned tests:

gcc -O0: 2.8 s, 2.8 s; 2.8 s, 2.7 s
gcc -O1: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -O2: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -O3: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -Os: 8.2 s, 2.8 s; 8.2 s, 2.7 s

For this kind of CPU, the recommendation:
- If a program has many aligned memory copies, the -O2 and -O3 are adviced without -fno-builtin.
- If a program has many unaligned memory copies, the -fno-builtin flag is advised.
- asmlib does not give more performance for memcpy.

B) CPU: 32 bit, Operating system: 32 bit
Details: Intel Core Duo (U2500, 1200 Mhz), andLinux (coLinux with Ubuntu Jaunty, gcc 4.3.3, libc6 2.9) under Windows XP Home.

Aligned tests:
gcc -O0: 6.5 s, 6.0 s; 4.2 s, 4.1 s
gcc -O1: 6.7 s, 6.7 s; 4.1 s, 4.1 s
gcc -O2: 6.8 s, 6.7 s; 4.1 s, 4.1 s
gcc -O3: 6.7 s, 6.7 s; 4.1 s, 4.1 s
gcc -Os: 6.0 s, 6.0 s; 6.0 s, 6.0 s

Unaligned tests:

gcc -O0: 6.5 s, 6.5 s; 4.2 s, 4.1 s
gcc -O1: 6.8 s, 6.8 s; 4.1 s, 4.1 s
gcc -O2: 6.8 s, 6.8 s; 4.1 s, 4.1 s
gcc -O3: 6.8 s, 6.8 s; 4.1 s, 4.1 s
gcc -Os: 18.0 s, 18.0 s; 18.0 s, 18.0 s

The coLinux is a "native" Linux environment under Windows. The gcc can not optimize the built-in memory copy operator and the asmlib does a decent job, only the -Os option should be avoided for unaligned copies. The asmlib can reduce the run-time by ~38 %, it is recommended to use with or without -fno-builtin.

C) CPU: 32 bit, Operating system: 32 bit
Details: Intel Core Duo (U2500, 1200 Mhz), Windows XP Home (MinGW, gcc 4.4.1).

Aligned tests:

gcc -O0: 3.6 s, 5.0 s; 3.6 s, 3.5 s
gcc -O1: 5.0 s, 5.7 s; 5.0 s, 3.5 s
gcc -O2: 0.03 s, 5.7 s; 0.03 s, 3.5 s
gcc -O3: 0.03 s, 5.7 s; 0.03 s, 3.5 s
gcc -Os: 0.03 s, 5.7 s; 0.03 s, 3.5 s

Unaligned tests:

gcc -O0: 5.2 s, 5.3 s; 3.6 s, 3.6 s
gcc -O1: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -O2: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -O3: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -Os: 15.0 s, 5.8 s; 15.0 s, 3.6 s

The optimized aligned copies are so fast as AMD Sempron, but the latter is much more better in other cases. It can be the effect of the raw extra frequency. If the case is other than the aligned memory copies with higher optimization flags, the linking with asmlib is adviced with -fno-builtin operator.

D) CPU: 64 bit, Operating system: 64 bit
Details: Intel Core2 Duo (P9300, 2260 Mhz), Ubuntu Karmic, gcc 4.4.1, libc6 2.10.

Aligned tests:

gcc -O0: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O1: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O2: 0.01 s, 0.01 s; 0.01 s, 0.01 s
gcc -O3: 0.01 s, 0.01 s; 0.01 s, 0.01 s
gcc -Os: 0.01 s, 0.01 s; 0.01 s, 0.01 s

Unaligned tests:

gcc -O0: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O1: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O2: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O3: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -Os: 5.0 s, 5.0 s; 5.0 s, 5.0 s

Well, the results are impressive. It is interesting to see that the gcc forced the internal memory operations (?) with -fno-builtin option with -O2,-O3 and the aligned copies were highly optimized under every circumstances. The asmlib optimizes more than 50 % in the unaligned tests and aligned tests with lower optimization. It is highly adviced to link with asmlib for this processor.

Conclusion

Many conditions effect the possible optimizations of the code: CPU, operating system, compiler. There are no universal solutions and the situation can be complicated by varying the optimization flags for each source files or override the memcpy operator only in some source files for faster unaligned copies. The right code starts from the choice of the right algorithm and the optimization should be just a fine tuning for the final program.

Well, I realized after onehalf month of trials that it is quite hard to get involved with Ubuntu. I think I am a professional and I understand that I should prove this to get some access to the community maintained repositories, but I only would like to maintain/provide fixes to some software. The way to provide fixes, just does not work for me. Everybody is very busy there, it is hard to push fixes forward. At least, this is my experience and I have more important things than struggle with these issues now. Seems to me the best way for the required backports and fixes, if I maintain that software stack in my PPA.

A good example is the google-perftools+libunwind. The google-perftools is broken for amd64 architecture since (at least) Jaunty. Debian integrated newer packages and they come into Lucid, but

https://bugs.launchpad.net/ubuntu/+source/libunwind/+bug/522106

If somebody reads the bug with the linked bugs from other sources it is clear that it is easy to fix this issue: disable the setjmp library completely or only for amd64 where it does not compile. The bug is untouched more than one month ago. I have two choices:
1. Implement this quick fix and drop the packages in my PPA for Lucid and older versions (Karmic, Jaunty...) -> A few hours maximum.
2. Implement this quick fix and start the process to get it in Lucid and start an other process to get the packages into backports. -> Who knows how long time...?

A quick fix and advertising my PPA to the interested people seems to me much more easier. And I am not talking about putting new (already packaged) software to Ubuntu what is a nightmare compared to this.

Note that my thoughts can not be applied to crucial and complex packages (e.g. OpenOffice or GTK).

make my way

2010. április 21., szerda

Agner Fog's asmlib optimization of memcpy - A short study

2010. április 9., péntek

Ubuntu experiences

Előkerült a bicaj.

Blog Archive

About Me

Traffic Feed