Agner Fog's e-books were found by me on the internet and I got excited about his asmlib library where he implemented memory optimizations in assembly. A table suggests in his books that his implementations are very efficient. I am a sceptic-type person, so I did some measurements under various circumstances. My simple program, what I executed, did random memory copies into two character arrays. In the aligned tests, the program copied 10000 bytes at a time and in the unaligned tests, 9999 bytes. The program below was compiled with varied gcc optimization flags and executed with the time program to measure the run-time:
{
char a1[1100000];
char a2[10000];
srand(time(NULL));
for (int i = 0; i < 1000000; i++)
memcpy(&a1[rand() % 1000000], a2, 10000);
}
In the columns of the measurements:
1. Without asmlib, without -fno-builtin flag to gcc.
3. With asmlib, without -fno-builtin flag to gcc.
4. With asmlib, with -fno-builtin flag to gcc.
The -fno-builtin flag means to disable the gcc built-in version of the memory operations. The built-in version can be optimized by the gcc in higher optimization levels if possible.
A) CPU: 32 bit, Operating system: 32 bit
Details: AMD Sempron 2800+ (Thoroughbred-B, 2000 Mhz), Windows XP Home (MinGW, gcc 4.4.1).
Aligned tests:
gcc -O0: 2.7 s, 2.7 s; 2.7 s, 2.7 s
gcc -O1: 2.7 s, 2.7 s; 2.7 s, 2.7 s
gcc -O2: 0.03 s, 2.7 s; 0.03 s, 2.7 s
gcc -O3: 0.03 s, 2.7 s; 0.03 s, 2.7 s
gcc -Os: 0.03 s, 2.7 s; 0.03 s, 2.7 s
Unaligned tests:
gcc -O0: 2.8 s, 2.8 s; 2.8 s, 2.7 s
gcc -O1: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -O2: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -O3: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -Os: 8.2 s, 2.8 s; 8.2 s, 2.7 s
For this kind of CPU, the recommendation:
- If a program has many aligned memory copies, the -O2 and -O3 are adviced without -fno-builtin.
- If a program has many unaligned memory copies, the -fno-builtin flag is advised.
- asmlib does not give more performance for memcpy.
Details: Intel Core Duo (U2500, 1200 Mhz), andLinux (coLinux with Ubuntu Jaunty, gcc 4.3.3, libc6 2.9) under Windows XP Home.
Aligned tests:
gcc -O0: 6.5 s, 6.0 s; 4.2 s, 4.1 s
gcc -O1: 6.7 s, 6.7 s; 4.1 s, 4.1 s
gcc -O2: 6.8 s, 6.7 s; 4.1 s, 4.1 s
gcc -O3: 6.7 s, 6.7 s; 4.1 s, 4.1 s
gcc -O0: 6.5 s
gcc -Os: 18.0 s, 18.0 s; 18.0 s, 18.0 s
The coLinux is a "native" Linux environment under Windows. The gcc can not optimize the built-in memory copy operator and the asmlib does a decent job, only the -Os option should be avoided for unaligned copies.
Details: Intel Core Duo (U2500, 1200 Mhz), Windows XP Home (MinGW, gcc 4.4.1).
Aligned tests:
gcc -O0: 3.6 s, 5.0 s; 3.6 s, 3.5 s
gcc -O1: 5.0 s, 5.7 s; 5.0 s, 3.5 s
gcc -O2: 0.03 s, 5.7 s; 0.03 s, 3.5 s
gcc -O3: 0.03 s, 5.7 s; 0.03 s, 3.5 s
gcc -Os: 0.03 s, 5.7 s; 0.03 s, 3.5 s
Unaligned tests:
gcc -O0: 5.2 s, 5.3 s; 3.6 s, 3.6 s
gcc -O1: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -O2: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -O3: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -Os: 15.0 s, 5.8 s; 15.0 s, 3.6 s
The optimized aligned copies are so fast as AMD Sempron, but the latter is much more better in other cases. It can be the effect of the raw extra frequency. If the case is other than the aligned memory copies with higher optimization flags, the linking with asmlib is adviced with -fno-builtin operator.
D) CPU: 64 bit, Operating system: 64 bit
Details: Intel Core2 Duo (P9300, 2260 Mhz), Ubuntu Karmic, gcc 4.4.1, libc6 2.10.
Aligned tests:
gcc -O0: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O1: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O2: 0.01 s, 0.01 s; 0.01 s, 0.01 s
gcc -O3: 0.01 s, 0.01 s; 0.01 s, 0.01 s
gcc -Os: 0.01 s, 0.01 s; 0.01 s, 0.01 s
Unaligned tests:
gcc -O0: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O1: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O2: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O3: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -Os: 5.0 s, 5.0 s; 5.0 s, 5.0 s
Well, the results are impressive. It is interesting to see that the gcc forced the internal memory operations (?) with -fno-builtin option with -O2,-O3 and the aligned copies were highly optimized under every circumstances. The asmlib optimizes more than 50 % in the unaligned tests and aligned tests with lower optimization. It is highly adviced to link with asmlib for this processor.
Conclusion
Many conditions effect the possible optimizations of the code: CPU, operating system, compiler. There are no universal solutions and the situation can be complicated by varying the optimization flags for each source files or override the memcpy operator only in some source files for faster unaligned copies. The right code starts from the choice of the right algorithm and the optimization should be just a fine tuning for the final program.