2016. november 10., csütörtök

When you can use folly::ThreadLocalPtr instead of boost::thread_specific_ptr

In this post, I compare two solutions for thread local storage:

1. boost::thread_specific_ptr in Boost library which provides a cross-platform solution to store data per thread.
URL: http://www.boost.org/

2. Facebook's folly has folly::ThreadLocalPtr which has a very similar interface to boost::thread_specific_ptr for easy adaption and it is claimed to be 4x faster. The disadvantage of folly is that it supports only 64 bit Mac OS X+Linux distributions (e.g no Windows) and it requires a modern C++ compiler with C++11 support.

URL: https://github.com/facebook/folly
URL: https://github.com/facebook/folly/blob/master/folly/docs/ThreadLocal.md

I did not write a synthetic case for performance testing which may justify how faster folly is, but I come up with a real world use case from my research project. I develop an artificial intelligence for the old Sony ERS-7 robot dogs, and as part of my efforts, I can simulate the AI in a thread on my host Linux machine to use multithreaded testing. When I replaced Boost with folly in the testing, I was surprised that it is really faster. Obviously not 4x, but it is still much faster. On the other hand, I saw a significant difference in the folly performance when different compilers were used and decided to create this blog post.

Test case: I run one AI simulation test case in a single thread 100 times and the averaged runtime is shown in the diagrams below. In one place, my code was inefficient, using the thread local pointer (TLP) inside a frequently called loop. Because of this performance bug, the long execution time was mainly caused by this TLP bottleneck. After I moved the TLP usage outside the loop, the performance gains were not relevant with folly anymore. I still think that it has some value to publish these numbers how folly can improve a situation when TLP is heavily used or not.

Compiler setup

Important compiler switches: -O3 -std=gnu++14
I-don't-think-so-relevant compiler switches: -mfpmath=sse -march=core2 -fPIC

Gcc 5.4 came the official Ubuntu Xenial repositories
Gcc 6.2 was installed from a toolchain testing PPA for Ubuntu (https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/test)
Clang 3.9 and 4.0 (4.0~svn286079) were installed from LLVM repositories (http://apt.llvm.org).

Case 1

So in this case, the TLP was used inefficiently, causing a large part of the runtime. As we can see, the the Boost results are almost identical, gcc is a bit faster than clang. However, Folly is not only faster with gcc, but Clang provides a much better performance than gcc. Although gcc 6 improved the performance a bit over gcc 5, but clang 4.0 is still 17.7 % faster than gcc 6.2.


Left axis is execution time in milliseconds. Lower is better.


Case 2

After the TLP usage was fixed in my codes, the compilers delivered almost identical results since the TLP access time did not play such main role like in Case 1. Boost and folly results are quite similar, but clang was a bit faster with Folly by a small margin. gcc 5.4 was unexpectedly faster with Folly, but I would assume it was a coincidence of some optimizations since that compiler was the slowest with folly in Case 1.


Left axis is execution time in milliseconds. Lower is better.


Verdict

No silver bullets here. If you have a program which heavily use thread local pointers under Linux or Mac OS X, it is recommended to try folly to gain some speed. Otherwise Boost provides a generic cross-platform solution when TLP is not used all over the place.

2016. március 11., péntek

Unity Build macro for CMake

A half year ago, I came across some techniques to speed up the C++ compilation for bigger projects. As my evergrowing AiBO+ project is around 87000 C++ source lines without 3rd party libraries now, it really became important to keep the compilation time low. Apart from the ccache integration in my CMake build system, I decided to try out the Unity Build method. When I looked around the internet, I found that the Unity Build would shorten the build duration significantly and some people crafted some small CMake scripts, but these solutions were incomplete. Either they did not handle the Qt's moc files at all or all Unity files were regenerated each time when a CMake reconfigure was initiated. An other example is Cotire which does not handle the dependencies correctly (https://github.com/sakra/cotire/issues/77).

My script is based on Christoph Heindl's work although heavily modified. So here it is:

https://sourceforge.net/p/aiboplus/code/ci/master/tree/aiboplus/UnityBuild.cmake

The features of my Unity Build script for CMake:
- Easy to add to existing projects
- Configurable extension for the generated Unity Build files (c, cpp, cc etc.)
- Easy to exclude certain problematic sources from the Unity Build file generation
- Source file count limit for the Unity Build files
- Working out-of-source builds
- Qt support (handling moc files correctly)
- Track the source file changes with md5 hashes
- Regenerate the Unity Build files only when really needed

Limitation:
- The Unity Build files are not removed with "make clean".

Note:
- The UNITY_GENERATE_MOC() macro is optional. I wrote this lightweight moc file generation for Qt instead of the default slower moc invocations in CMake.

Basic usage of my script:

- Copy UnityBuild.cmake to your project's root directory.
- Include in the root CMakeLists.txt:

INCLUDE(UnityBuild.cmake)

- In any source directory, use like this:

SET(LIBEXAMPLE_SRC
    first.c ;
    second.c ;
    third.c ;
    )

# Parameters:
# 1. Name prefix for the generated Unity Build files
# 2. CMake variable which contains the source files
# 3. Unit size (source file count) per Unity Build file
# 4. Extension for the Unity Build files to invoke the correct compiler
ENABLE_UNITY_BUILD(libexample LIBEXAMPLE_SRC 10 c)

# Optional: Add any problematic source file which are not suitable for Unity Build and
# you are lazy to fix it.
SET(LIBEXAMPLE_SRC ${LIBEXAMPLE_SRC} lazy.c)

ADD_LIBRARY(libexample STATIC ${LIBEXAMPLE_SRC})