Thursday, November 10, 2016

When you can use folly::ThreadLocalPtr instead of boost::thread_specific_ptr

In this post, I compare two solutions for thread local storage:

1. boost::thread_specific_ptr from the Boost library, which provides a cross-platform solution for storing data per thread.
URL: http://www.boost.org/

2. Facebook's folly provides folly::ThreadLocalPtr, which has a very similar interface to boost::thread_specific_ptr for easy adoption and is claimed to be 4x faster (see the interface sketch after the links below). The disadvantage of folly is that it only supports 64-bit Mac OS X and Linux distributions (so no Windows), and it requires a modern C++ compiler with C++11 support.

URL: https://github.com/facebook/folly
URL: https://github.com/facebook/folly/blob/master/folly/docs/ThreadLocal.md
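To show how close the two interfaces are, here is a minimal sketch, assuming Boost.Thread and folly are available. Counter, bumpBoost and bumpFolly are made-up names for illustration, not code from either library; only the reset()/get()/operator-> calls that both classes provide are used. Switching between the two is essentially a find-and-replace of the class name.

#include <boost/thread/tss.hpp>
#include <folly/ThreadLocal.h>

struct Counter { int value = 0; };  // hypothetical per-thread payload

boost::thread_specific_ptr<Counter> boostTlp;
folly::ThreadLocalPtr<Counter> follyTlp;

void bumpBoost() {
  if (!boostTlp.get()) {
    boostTlp.reset(new Counter());   // lazily create this thread's instance
  }
  ++boostTlp->value;
}

void bumpFolly() {
  if (!follyTlp.get()) {
    follyTlp.reset(new Counter());   // same call shape as the Boost version
  }
  ++follyTlp->value;
}

Both classes delete the stored object when the owning thread exits, so the calling code does not have to manage per-thread cleanup.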

I did not write a synthetic benchmark to verify how much faster folly is; instead, I used a real-world use case from my research project. I develop an artificial intelligence for the old Sony ERS-7 robot dogs, and as part of this work, I can simulate the AI in a thread on my host Linux machine for multithreaded testing. When I replaced Boost with folly in the tests, I was surprised that it really is faster. Obviously not 4x, but still noticeably faster. I also saw a significant difference in folly's performance between compilers, which is what prompted this blog post.

Test case: I ran one AI simulation test case in a single thread 100 times; the averaged runtimes are shown in the diagrams below. In one place, my code was inefficient: it accessed the thread local pointer (TLP) inside a frequently called loop. Because of this performance bug, most of the long execution time came from this TLP bottleneck. After I moved the TLP access outside the loop, folly's advantage mostly disappeared. I still think it is worth publishing these numbers, since they show how much folly can help when TLP is heavily used and how little it matters when it is not.
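The fix was simply to hoist the thread-local lookup out of the hot loop. The sketch below shows the shape of the change under made-up names (State, tlpState, localState, processSlow, processFast); it is not my actual robot-dog code, just the pattern behind Case 1 and Case 2.

#include <folly/ThreadLocal.h>
#include <vector>

struct State { int counter = 0; };
folly::ThreadLocalPtr<State> tlpState;

State* localState() {
  if (!tlpState.get()) {
    tlpState.reset(new State());   // lazily create this thread's instance
  }
  return tlpState.get();
}

// Case 1 shape: the thread-local lookup happens on every iteration.
void processSlow(const std::vector<int>& items) {
  for (int item : items) {
    localState()->counter += item; // TLP access inside the hot loop
  }
}

// Case 2 shape: the lookup is hoisted, the loop touches a plain pointer.
void processFast(const std::vector<int>& items) {
  State* state = localState();     // one TLP access per call
  for (int item : items) {
    state->counter += item;
  }
}

The library still matters in the processSlow shape because every iteration pays for the thread-local lookup; in the processFast shape, that cost is paid once per call and the choice of library barely shows up.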

Compiler setup

Important compiler switches: -O3 -std=gnu++14
Probably irrelevant compiler switches: -mfpmath=sse -march=core2 -fPIC

Gcc 5.4 came from the official Ubuntu Xenial repositories
Gcc 6.2 was installed from a toolchain testing PPA for Ubuntu (https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/test)
Clang 3.9 and 4.0 (4.0~svn286079) were installed from LLVM repositories (http://apt.llvm.org).

Case 1

In this case, the TLP was used inefficiently and accounted for a large part of the runtime. As we can see, the Boost results are almost identical across compilers, with gcc a bit faster than clang. With folly, however, not only is gcc faster, but clang delivers much better performance than gcc. Although gcc 6 improved a bit over gcc 5, clang 4.0 is still 17.7% faster than gcc 6.2.


Left axis is execution time in milliseconds. Lower is better.


Case 2

After the TLP usage was fixed in my code, the compilers delivered almost identical results, since TLP access time no longer played such a major role as in Case 1. The Boost and folly results are quite similar, with clang slightly faster with folly. gcc 5.4 was unexpectedly faster with folly, but I assume this is a coincidence of some optimization, since that compiler was the slowest with folly in Case 1.


Left axis is execution time in milliseconds. Lower is better.


Verdict

No silver bullets here. If you have a program that heavily uses thread local pointers under Linux or Mac OS X, it is worth trying folly to gain some speed. Otherwise, when TLP is not used all over the place, Boost provides a perfectly good cross-platform solution.