Building Binaries for Bioinformatics

tl;dr there is a recipe for compiling binaries at the end

When it comes to working in bioinformatics, it is the best of times and it is the worst of times. As developers we have tons of modern tools, polished libraries, GitHub and Twitter, and a sane extension of C++ called C++11. On the other hand, we have clusters. Worse, users have clusters that they don’t control: they can’t install modern compilers (and by modern, I mean about two years old), they can’t install libraries, and compiling without root is a pain.

I run a cluster which runs on the Rocks cluster system. The newest version at the time was based on CentOS 6, and its compiler was gcc 4.4, which was old even then. Distributions are reluctant to ship the latest compilers, so users are stuck with old technology.

How bad can it be? Well, you lose out on C++11. To compile C++11 code you need gcc 4.8.1 or later for everything to work. Earlier versions of gcc can handle some of the features, but it’s not worth using them. I mean, as the name implies, C++11 was ready four years ago!

With C++11 I find myself much more productive as a programmer, with nice things built in and good abstractions that compile to very fast code. I’m never going back. All of this comes at the cost of requiring a decent compiler. This means you lose a lot of users because they can’t compile your code, and the only way out is to release a binary version.

Binaries on Linux

Linux is wonderful, but it’s not the happy wonderland we would like to believe, because distributions don’t always ship the same versions of libraries, mostly libc, pthreads and friends. So when you create a binary to distribute, you have to statically link the libraries you include, because you can’t rely on them being installed on the target system.

I ran into this problem when trying to create a binary for kallisto. The HDF5 library had to be statically linked, but the pthreads library had to be dynamically linked. I spent a few nights on it, but nothing worked.
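For readers who have not met this kind of mixed linking before, here is a minimal sketch of how it is usually requested from GNU ld: everything after -Wl,-Bstatic is resolved against static archives, and everything after -Wl,-Bdynamic against shared libraries. The helper name mixed_link_flags and the library lists are made up for illustration. This is not the command I tried, and it does nothing about the libc version problem.

```shell
#!/bin/bash
# Sketch: build a GNU ld flag string that links some libraries statically
# and the rest dynamically. mixed_link_flags is a hypothetical helper.
mixed_link_flags() {
  static_libs=$1    # space-separated names, e.g. "hdf5 z"
  dynamic_libs=$2   # space-separated names, e.g. "pthread"
  flags="-Wl,-Bstatic"
  for lib in $static_libs; do
    flags="$flags -l$lib"
  done
  # Switch back to dynamic linking for the remaining libraries.
  flags="$flags -Wl,-Bdynamic"
  for lib in $dynamic_libs; do
    flags="$flags -l$lib"
  done
  printf '%s\n' "$flags"
}

# Print the flags only; a real link line would add object files and -o.
mixed_link_flags "hdf5 z" "pthread"
```

The printed string would go at the end of a g++ link command. Whether the result runs elsewhere still depends on which libc it was built against.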

Holy Build Box

Then, on Twitter, Heng Li tweeted about the Holy Build Box, which promised a way to build binaries on Linux that would work on any system. Long story short, it worked, and here’s the recipe.

First, the solution is built on top of Docker, so you need a machine where you are root. I had an Ubuntu 14 desktop lying around, so I installed the latest Docker version and started reading the manual. If you are trying to build your own binaries, I would recommend starting with the script below, modifying it as appropriate and reading through the manual.

Building kallisto

To build the binary we simply use a build script that takes care of everything. I only assume that the source files and the hdf5 library files are in the current folder.

The first script is comp_kallisto.sh

docker run -t -i --rm \
  -v `pwd`:/io \
  -e CC='ccache gcc' \
  -e CXX='ccache g++' \
  -e CCACHE_DIR='/io/cache' \
  phusion/holy-build-box-64:latest \
  bash /io/build.sh "$1"

It takes the version as its argument and passes it on to the build script, build.sh. The -t -i arguments attach an interactive terminal so we see the output of the docker run, and --rm removes the container when we’re done. -v `pwd`:/io mounts the current directory as /io so we can move files between the host machine and the container. The -e parameters set environment variables that route compilation through ccache with its cache in /io/cache; this is useful if you need to compile hdf5 over and over again until everything works. phusion/holy-build-box-64:latest is the name of the Docker image; if you don’t have it, Docker will download it, which should only take a few minutes.

What this image contains is an old CentOS 5 system with a recent compiler (well, gcc 4.8.2 anyway), libstdc++ available only as a static library, and fairly old libc and pthreads libraries. This means that if we link against those, the binary will work on any newer system.

The build.sh script does all the work

#!/bin/bash
set -e

VER=$1
if [[ -z $VER ]]; then
  # Report the problem on stderr and exit with a failure status
  echo "specify version" >&2
  exit 1
fi

# Activate Holy Build Box environment.
source /hbb_exe/activate

set -x

# Install static hdf5
tar xfz /io/hdf5-1.8.*
cd hdf5-1.8.*
env CFLAGS="$STATICLIB_CFLAGS" CXXFLAGS="$STATICLIB_CXXFLAGS" \
  ./configure --prefix=/hbb_exe --disable-parallel \
  --without-szlib --without-pthread --disable-shared
make
make install
cd ..

# Extract and enter source
tar xzf /io/v${VER}.tar.gz
cd kallisto-${VER}

# compile and install kallisto
mkdir build
cd build
cmake ..
make
make install

# copy the binary back
cp /usr/local/bin/kallisto /io/kallisto_bin_${VER}

Since we’ve mounted the current directory as /io, we simply extract the hdf5 code, build a static library with the correct parameters and install it under /hbb_exe. We don’t have to worry about cleaning up the mess because this “system” is going down in a minute. Plus, inside Docker we are root, so we can do whatever we feel like. Once the hdf5 library is ready, all the dependencies of kallisto are satisfied, and we build it from source and install the binary. Finally, we copy the new binary back to the /io folder. Once the script finishes, the container exits and all changes made inside it are discarded. The end result is a binary that works on any Linux system.
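One way to sanity-check that last claim is to look at which glibc symbol versions the binary asks for. The sketch below assumes binutils (for objdump) and GNU sort (for -V) are available on the host. The function name newest_glibc_version is made up for illustration.

```shell
#!/bin/bash
# Sketch: print the newest GLIBC symbol version mentioned in `objdump -T`
# output read from standard input. newest_glibc_version is a hypothetical
# helper; it relies on GNU sort for version ordering (-V).
newest_glibc_version() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sort -u -V | tail -n 1
}

# Usage on the host, with <version> replaced by the version passed to
# comp_kallisto.sh (build.sh copies the binary back as kallisto_bin_<version>):
#   objdump -T kallisto_bin_<version> | newest_glibc_version
```

The older the version this reports, the older the systems the binary should run on. Running ldd on the binary is a complementary check that shows which shared libraries it still expects to find.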

The Holy Build Box comes with some static libraries such as libz, but anything else you’ll have to build yourself. You can also run Docker directly and get a shell by running bash instead of bash /io/build.sh, and try things out manually.
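As a small sketch of that interactive route, the helper below only prints the docker command rather than running it, so you can inspect it first. The function name hbb_shell_command is made up for illustration. The image name and the /io mount are the same ones used in comp_kallisto.sh, and the ccache variables are left out for brevity.

```shell
#!/bin/bash
# Sketch: print the docker command that opens an interactive shell in the
# Holy Build Box image, with a host directory mounted at /io.
# hbb_shell_command is a hypothetical helper. It does not quote the
# directory, so use a path without spaces.
hbb_shell_command() {
  host_dir=${1:-$(pwd)}   # default to the current directory
  printf 'docker run -t -i --rm -v %s:/io phusion/holy-build-box-64:latest bash\n' "$host_dir"
}

# Show the command for the current directory.
hbb_shell_command
```

Copy the printed line into a terminal to start the shell. Once inside, source /hbb_exe/activate as build.sh does, then run the remaining build steps by hand.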
