Monday, May 8, 2017

Zapcc Compiler

Some time ago I noticed that a small startup company (Ceemple) had introduced a modified Clang compiler (Zapcc) to reduce compilation time. Instead of a disk-based cache like ccache, they cache some intermediate compilation pass results in memory (at least that is my bet) and reuse them for later compilation units. I modified the CMake build system of my robotics research project (AiBO+) and measured their claims. I found that they don't overpromise. My project is about 90 kLOC now, and zapcc was 30% faster than vanilla Clang 4.0.

Gcc 5.4: 6:40 min
Clang 4.0: 6 min
Zapcc: 4 min
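
To make the comparison concrete, here is a small sketch that turns the build times above into a percentage of time saved and a speedup factor. The function names are made up for this illustration, and the inputs are the times above converted to seconds, so the results are only as precise as those rounded figures.

```cpp
#include <cassert>

// Percentage of build time saved when the build time drops from
// baseline_s seconds to improved_s seconds.
double time_saved_percent(double baseline_s, double improved_s) {
    assert(baseline_s > 0.0);
    return (baseline_s - improved_s) / baseline_s * 100.0;
}

// How many times faster the improved build is compared to the baseline.
double speedup(double baseline_s, double improved_s) {
    assert(improved_s > 0.0);
    return baseline_s / improved_s;
}
```

With the times listed, time_saved_percent(360, 240) gives about 33% for Clang 4.0 vs. Zapcc (a 1.5x speedup), and time_saved_percent(400, 240) gives 40% for Gcc 5.4 vs. Zapcc.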

I highly recommend their product. They don't just promise, they deliver. And it really is a drop-in replacement for gcc/Clang in a 64-bit Linux environment.

Thursday, November 10, 2016

When you can use folly::ThreadLocalPtr instead of boost::thread_specific_ptr

In this post, I compare two solutions for thread local storage:

1. boost::thread_specific_ptr in the Boost library, which provides a cross-platform way to store data per thread.
URL: http://www.boost.org/

2. Facebook's folly has folly::ThreadLocalPtr, which has a very similar interface to boost::thread_specific_ptr for easy adoption and is claimed to be 4x faster. The disadvantage of folly is that it supports only 64-bit Mac OS X and Linux distributions (i.e. no Windows) and requires a modern C++ compiler with C++11 support.

URL: https://github.com/facebook/folly
URL: https://github.com/facebook/folly/blob/master/folly/docs/ThreadLocal.md

I did not write a synthetic performance test to show how much faster folly is; instead, I came up with a real-world use case from my research project. I develop an artificial intelligence for the old Sony ERS-7 robot dogs, and as part of that effort I can simulate the AI in a thread on my host Linux machine, which allows multithreaded testing. When I replaced Boost with folly in the testing, I was surprised that it really is faster. Obviously not 4x, but still much faster. On the other hand, I saw a significant difference in folly's performance when different compilers were used, so I decided to write this blog post.

Test case: I run one AI simulation test case in a single thread 100 times, and the averaged runtime is shown in the diagrams below. In one place, my code was inefficient: it used the thread local pointer (TLP) inside a frequently called loop. Because of this performance bug, the long execution time was mainly caused by the TLP bottleneck. After I moved the TLP usage outside the loop, the gain from folly was no longer significant. I still think it is worth publishing these numbers, since they show how much folly helps both when the TLP is used heavily and when it is not.
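
The kind of fix described above can be sketched as follows. This is not the AiBO+ code: CountingTlp is a hypothetical, non-owning stand-in that only mimics the get()/reset() shape shared by boost::thread_specific_ptr and folly::ThreadLocalPtr, built on the standard thread_local keyword so the example has no dependencies. It counts lookups so the difference between the two loops is visible; SimState and the two run_* functions are likewise made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a thread-local pointer wrapper. It keeps one
// slot per type and thread, does not own the pointee, and counts every
// get() call made on the current thread.
template <typename T>
class CountingTlp {
public:
    void reset(T* p) { slot() = p; }
    T* get() const { ++lookups(); return slot(); }
    static std::size_t& lookups() {
        thread_local std::size_t count = 0;
        return count;
    }
private:
    static T*& slot() {
        thread_local T* ptr = nullptr;
        return ptr;
    }
};

// Made-up per-thread simulation state.
struct SimState { long ticks = 0; };

// Inefficient pattern: one thread-local lookup in every iteration.
long run_lookup_in_loop(CountingTlp<SimState>& tlp, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        tlp.get()->ticks += 1;
    }
    return tlp.get()->ticks;
}

// Fixed pattern: look the pointer up once and reuse it inside the loop.
long run_lookup_hoisted(CountingTlp<SimState>& tlp, int iterations) {
    SimState* state = tlp.get();
    assert(state != nullptr);
    for (int i = 0; i < iterations; ++i) {
        state->ticks += 1;
    }
    return state->ticks;
}
```

For N iterations, the first function performs N + 1 lookups and the second performs exactly one. With a real TLP, each lookup costs more than a plain pointer dereference, which is why the choice of TLP implementation matters far more in the first pattern than in the second.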

Compiler setup

Important compiler switches: -O3 -std=gnu++14
Probably less relevant compiler switches: -mfpmath=sse -march=core2 -fPIC

Gcc 5.4 came from the official Ubuntu Xenial repositories.
Gcc 6.2 was installed from a toolchain testing PPA for Ubuntu (https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/test).
Clang 3.9 and 4.0 (4.0~svn286079) were installed from LLVM repositories (http://apt.llvm.org).

Case 1

In this case, the TLP was used inefficiently and accounted for a large part of the runtime. As we can see, the Boost results are almost identical across compilers, with gcc a bit faster than Clang. With folly, however, the picture changes: folly is faster than Boost even with gcc, and Clang performs much better than gcc. Although gcc 6 improved the performance a bit over gcc 5, Clang 4.0 is still 17.7% faster than gcc 6.2.


Left axis is execution time in milliseconds. Lower is better.


Case 2

After the TLP usage was fixed in my code, the compilers delivered almost identical results, since the TLP access time no longer played as large a role as in Case 1. The Boost and folly results are quite similar, though Clang was faster with folly by a small margin. gcc 5.4 was unexpectedly faster with folly, but I assume this was a coincidence of some optimizations, since that compiler was the slowest with folly in Case 1.


Left axis is execution time in milliseconds. Lower is better.


Verdict

No silver bullets here. If you have a program that uses thread local pointers heavily under Linux or Mac OS X, it is worth trying folly to gain some speed. Otherwise, when TLP is not used all over the place, Boost provides a generic cross-platform solution.

Friday, March 11, 2016

Unity Build macro for CMake

Half a year ago, I came across some techniques to speed up C++ compilation for bigger projects. As my ever-growing AiBO+ project is now around 87,000 C++ source lines without 3rd party libraries, it has become really important to keep the compilation time low. Apart from the ccache integration in my CMake build system, I decided to try out the Unity Build method. When I looked around the internet, I found that a Unity Build can shorten the build duration significantly and that some people had crafted small CMake scripts for it, but these solutions were incomplete. Either they did not handle Qt's moc files at all, or all Unity files were regenerated every time a CMake reconfigure was initiated. Another example is Cotire, which does not handle the dependencies correctly (https://github.com/sakra/cotire/issues/77).

My script is based on Christoph Heindl's work, although heavily modified. Here it is:

https://sourceforge.net/p/aiboplus/code/ci/master/tree/aiboplus/UnityBuild.cmake

The features of my Unity Build script for CMake:
- Easy to add to existing projects
- Configurable extension for the generated Unity Build files (c, cpp, cc etc.)
- Easy to exclude certain problematic sources from the Unity Build file generation
- Source file count limit for the Unity Build files
- Working out-of-source builds
- Qt support (handling moc files correctly)
- Track the source file changes with md5 hashes
- Regenerate the Unity Build files only when really needed

Limitation:
- The Unity Build files are not removed with "make clean".

Note:
- The UNITY_GENERATE_MOC() macro is optional. I wrote this lightweight moc file generation for Qt instead of the default slower moc invocations in CMake.

Basic usage of my script:

- Copy UnityBuild.cmake to your project's root directory.
- Include in the root CMakeLists.txt:

INCLUDE(UnityBuild.cmake)

- In any source directory, use like this:

SET(LIBEXAMPLE_SRC
    first.c
    second.c
    third.c
    )

# Parameters:
# 1. Name prefix for the generated Unity Build files
# 2. CMake variable which contains the source files
# 3. Unit size (source file count) per Unity Build file
# 4. Extension for the Unity Build files to invoke the correct compiler
ENABLE_UNITY_BUILD(libexample LIBEXAMPLE_SRC 10 c)

# Optional: Add any problematic source files that are not suitable for Unity Build and
# that you are too lazy to fix.
SET(LIBEXAMPLE_SRC ${LIBEXAMPLE_SRC} lazy.c)

ADD_LIBRARY(libexample STATIC ${LIBEXAMPLE_SRC})
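
For readers new to the technique, the core idea behind the unit size parameter can be sketched as below: the source list is split into chunks, and each chunk becomes one generated file that simply #includes its sources. This is only an illustration, not a transcription of UnityBuild.cmake. The function name and the "<prefix>_unity_<n>.<ext>" naming scheme are assumptions made for the example, and the md5 tracking and moc handling of the real script are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits the source list into chunks of at most unit_size files and returns
// one (file name, file content) pair per chunk. The naming scheme is a
// made-up convention for this sketch.
std::vector<std::pair<std::string, std::string>> make_unity_files(
    const std::string& prefix,
    const std::vector<std::string>& sources,
    std::size_t unit_size,
    const std::string& ext) {
    assert(unit_size > 0);
    std::vector<std::pair<std::string, std::string>> result;
    for (std::size_t begin = 0; begin < sources.size(); begin += unit_size) {
        const std::size_t end = std::min(begin + unit_size, sources.size());
        std::string content;
        for (std::size_t i = begin; i < end; ++i) {
            // Each original source is pulled in by the preprocessor.
            content += "#include \"" + sources[i] + "\"\n";
        }
        const std::string name =
            prefix + "_unity_" + std::to_string(result.size()) + "." + ext;
        result.emplace_back(name, content);
    }
    return result;
}
```

Calling make_unity_files("libexample", {"first.c", "second.c", "third.c"}, 2, "c") yields two generated files: one including first.c and second.c, and one including third.c. The compiler then parses the shared headers once per generated file instead of once per source, which is where the build time savings come from.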

Thursday, August 13, 2015

Shishi odoshi vs. rabbit

So, today a rabbit thought it would gobble up our proud bush in the garden. Then the shishi odoshi put it in its place.

Thursday, August 6, 2015

A new update of my artificial intelligence for Sony ERS-7 robots

Here comes the new update of my AiBO+ AIBOWare for Sony ERS-7: AiBO+ 3.1 - Let's Play!

New changes in this release:

- New sound profiles with voice acting (text-to-speech, Marissa, Gillian)
- A game can be played with AIBO
- Reworked pick-up mode/fall over/poking detection
- Configurable adaptive volume control
- Quicker responses for petting
- Less stress on the motors
- Many small fixes

The AiBO+ AIBOWare must be installed on a PMS stick, and an AiBO+ Client application can be run under Windows, Ubuntu Linux and Android.

The installation instructions and a detailed user guide are available on this page:

http://aiboplus.sourceforge.net/userguide.html

Monday, August 3, 2015

Second astrophoto - Shapley 1

Shapley 1 (planetary nebula)

Captured with telescope T32 of the iTelescope network, Siding Spring Observatory, Australia.

Capture parameters: LRGB, 14x120, 4x120, 4x120, 4x120.

This is my second attempt at astroimaging. I was a bit surprised that I got better results when I worked with the uncalibrated images. I tried various filters, but none of them gave reasonable results with a shutter speed of 300 seconds.

 

Sunday, May 24, 2015

Conversion script for FujiFilm FinePix W3 3D camera videos (avi) to SBS mp4 under Linux

When the following script is run, it looks for avi files in the current folder whose names start with DSC... and whose format is Motion JPEG.
The left/right channels of the original W3 video are extracted and a final side-by-side video is created. All new videos will be in mp4 format.

Here is the script:

#!/bin/bash
# Converts FujiFilm FinePix W3 3D videos (Motion JPEG avi) to half SBS mp4.
# Files can be passed as arguments; the default is DSC*.avi and DSC*.AVI.

# Patterns without a match expand to nothing instead of staying literal.
shopt -s nullglob

if [ "$#" -gt 0 ]; then
  files=("$@")
else
  files=(DSC*.avi DSC*.AVI)
fi

for file in "${files[@]}"
do
  echo "Checking $file"
  # Only process files that the file utility reports as Motion JPEG video.
  if [[ "$(file -- "$file")" == *"video: Motion JPEG"* ]]; then
    echo "Processing $file"
    filenamebase="${file%.*}"
    echo "Create left video: ${filenamebase}_l.mp4"
    ffmpeg -loglevel error -i "$file" -f mp4 -acodec mp3 -map 0:v:1 -map 0:a:0 "${filenamebase}_l.mp4"
    echo "Create right video: ${filenamebase}_r.mp4"
    ffmpeg -loglevel error -i "$file" -f mp4 -acodec mp3 -map 0:v:0 -map 0:a:0 "${filenamebase}_r.mp4"
    echo "Join left/right videos to a half SBS video: ${filenamebase}_sbs.mp4"
    # Pad the left video to double width, overlay the right video on the
    # right half, then scale the width back down for half SBS.
    ffmpeg -loglevel error -i "${filenamebase}_l.mp4" -i "${filenamebase}_r.mp4" \
           -filter_complex "pad=in_w*2:in_h, overlay=main_w/2:0, scale=in_w/2:in_h" \
           -f mp4 -b:v 6000k -acodec copy -map 0:a:0 "${filenamebase}_sbs.mp4"
    echo "Rename left video to 2d version: ${filenamebase}_l.mp4 to ${filenamebase}_2d.mp4"
    mv -- "${filenamebase}_l.mp4" "${filenamebase}_2d.mp4"
  fi
done