2010. április 21., szerda

Agner Fog's asmlib optimization of memcpy - A short study

In my software, because of the data flows in the system and image processing operations, large segments of memory can be allocated/deallocated/copied. It is a straightforward direction to find ways to optimize these operations in the system. Earlier, I integrated the Google's TCMalloc (malloc replacement) to speed up the memory allocations and it was efficient to reduce the run-time by 10 %.
Agner Fog's e-books were found by me on the internet and I got excited about his asmlib library where he implemented memory optimizations in assembly. A table suggests in his books that his implementations are very efficient. I am a sceptic-type person, so I did some measurements under various circumstances. My simple program, what I executed, did random memory copies into two character arrays. In the aligned tests, the program copied 10000 bytes at a time and in the unaligned tests, 9999 bytes. The program below was compiled with varied gcc optimization flags and executed with the time program to measure the run-time:

int main(int, char**)
{
char a1[1100000];
char a2[10000];

srand(time(NULL));

for (int i = 0; i < 1000000; i++)
memcpy(&a1[rand() % 1000000], a2, 10000);
}

In the columns of the measurements:

1. Without asmlib, without -fno-builtin flag to gcc.
2. Without asmlib, with -fno-builtin flag to gcc.
3. With asmlib, without -fno-builtin flag to gcc.
4. With asmlib, with -fno-builtin flag to gcc.

The -fno-builtin flag means to disable the gcc built-in version of the memory operations. The built-in version can be optimized by the gcc in higher optimization levels if possible.

A) CPU: 32 bit, Operating system: 32 bit
Details: AMD Sempron 2800+ (Thoroughbred-B, 2000 Mhz), Windows XP Home (MinGW, gcc 4.4.1).

Aligned tests:

gcc -O0: 2.7 s, 2.7 s; 2.7 s, 2.7 s
gcc -O1: 2.7 s, 2.7 s; 2.7 s, 2.7 s
gcc -O2: 0.03 s, 2.7 s; 0.03 s, 2.7 s
gcc -O3: 0.03 s, 2.7 s; 0.03 s, 2.7 s
gcc -Os: 0.03 s, 2.7 s; 0.03 s, 2.7 s

Unaligned tests:

gcc -O0: 2.8 s, 2.8 s; 2.8 s, 2.7 s
gcc -O1: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -O2: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -O3: 8.2 s, 2.8 s; 8.2 s, 2.7 s
gcc -Os: 8.2 s, 2.8 s; 8.2 s, 2.7 s

For this kind of CPU, the recommendation:
- If a program has many aligned memory copies, the -O2 and -O3 are adviced without -fno-builtin.
- If a program has many unaligned memory copies, the -fno-builtin flag is advised.
- asmlib does not give more performance for memcpy.

B) CPU: 32 bit, Operating system: 32 bit
Details: Intel Core Duo (U2500, 1200 Mhz), andLinux (coLinux with Ubuntu Jaunty, gcc 4.3.3, libc6 2.9) under Windows XP Home.

Aligned tests:
gcc -O0: 6.5 s, 6.0 s; 4.2 s, 4.1 s
gcc -O1: 6.7 s, 6.7 s; 4.1 s, 4.1 s
gcc -O2: 6.8 s, 6.7 s; 4.1 s, 4.1 s
gcc -O3: 6.7 s, 6.7 s; 4.1 s, 4.1 s
gcc -Os: 6.0 s, 6.0 s; 6.0 s, 6.0 s

Unaligned tests:

gcc -O0: 6.5 s
, 6.5 s; 4.2 s, 4.1 s
gcc -O1: 6.8 s, 6.8 s; 4.1 s, 4.1 s
gcc -O2: 6.8 s, 6.8 s; 4.1 s, 4.1 s
gcc -O3: 6.8 s, 6.8 s; 4.1 s, 4.1 s
gcc -Os: 18.0 s, 18.0 s; 18.0 s, 18.0 s

The coLinux is a "native" Linux environment under Windows. The gcc can not optimize the built-in memory copy operator and the asmlib does a decent job, only the -Os option should be avoided for unaligned copies. The asmlib can reduce the run-time by ~38 %, it is recommended to use with or without -fno-builtin.

C) CPU: 32 bit, Operating system: 32 bit
Details: Intel Core Duo (U2500, 1200 Mhz), Windows XP Home (MinGW, gcc 4.4.1).

Aligned tests:

gcc -O0: 3.6 s, 5.0 s; 3.6 s, 3.5 s
gcc -O1: 5.0 s, 5.7 s; 5.0 s, 3.5 s
gcc -O2: 0.03 s, 5.7 s; 0.03 s, 3.5 s
gcc -O3: 0.03 s, 5.7 s; 0.03 s, 3.5 s
gcc -Os: 0.03 s, 5.7 s; 0.03 s, 3.5 s

Unaligned tests:

gcc -O0: 5.2 s, 5.3 s; 3.6 s, 3.6 s
gcc -O1: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -O2: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -O3: 15.0 s, 5.8 s; 15.0 s, 3.6 s
gcc -Os: 15.0 s, 5.8 s; 15.0 s, 3.6 s

The optimized aligned copies are so fast as AMD Sempron, but the latter is much more better in other cases. It can be the effect of the raw extra frequency. If the case is other than the aligned memory copies with higher optimization flags, the linking with asmlib is adviced with -fno-builtin operator.

D) CPU: 64 bit, Operating system: 64 bit
Details: Intel Core2 Duo (P9300, 2260 Mhz), Ubuntu Karmic, gcc 4.4.1, libc6 2.10.

Aligned tests:

gcc -O0: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O1: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O2: 0.01 s, 0.01 s; 0.01 s, 0.01 s
gcc -O3: 0.01 s, 0.01 s; 0.01 s, 0.01 s
gcc -Os: 0.01 s, 0.01 s; 0.01 s, 0.01 s

Unaligned tests:

gcc -O0: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O1: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O2: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -O3: 1.8 s, 1.8 s; 0.8 s, 0.8 s
gcc -Os: 5.0 s, 5.0 s; 5.0 s, 5.0 s

Well, the results are impressive. It is interesting to see that the gcc forced the internal memory operations (?) with -fno-builtin option with -O2,-O3 and the aligned copies were highly optimized under every circumstances. The asmlib optimizes more than 50 % in the unaligned tests and aligned tests with lower optimization. It is highly adviced to link with asmlib for this processor.

Conclusion

Many conditions effect the possible optimizations of the code: CPU, operating system, compiler. There are no universal solutions and the situation can be complicated by varying the optimization flags for each source files or override the memcpy operator only in some source files for faster unaligned copies. The right code starts from the choice of the right algorithm and the optimization should be just a fine tuning for the final program.

Nincsenek megjegyzések: