BamHash and some insights from information theory

We have a new preprint out for a tool called BamHash. I’ll get to what it does and how it works in a bit, but first the why.

This tool was built because we needed it at deCODE and couldn’t find anything else that could do it. The basic problem is that we have DNA sequenced on an Illumina machine producing FASTQ files, which are then pushed into the pipeline, processed, cuddled, fed locally grown produce and after some bioinformatics magic we get a shiny BAM file.

FASTQ files storing raw reads are big and ugly, whereas BAM files are small, sorted, indexed and just better. For a 30x coverage human genome the BAM file will be about 70G, and the raw FASTQ files are even bigger. Multiply that by the number of individuals sequenced and you get, uh, a lot.

Obviously we want to get rid of the FASTQ file, and it is perfectly safe to do that because the BAM contains all the information in the FASTQ file. Worst case scenario: we could always recreate the FASTQ file from the BAM file if needed.

To illustrate the question, consider the following conversation:

paranoid me: is it safe to delete that FASTQ file?
me: what?! I just explained why it was safe, why are you asking again?
paranoid me: well, something could have happened in your pipeline
me: is there something wrong with my pipeline?
paranoid me: or the disk system
me: I haven’t heard of anything, do you know something I don’t?
paranoid me: maybe there were hiccups and you dropped a few reads
me: no, you don’t understand, we have a pipeline
paranoid me: can you really be sure that all the reads in the FASTQ file made it to the BAM file?
me: now you are just being paranoid
paranoid me: “Yes I’m paranoid – but am I paranoid enough?” [source]

So how can we know if the reads in the FASTQ file are exactly the same as the reads in the BAM file? A simple solution is to just compare the two sets, but that would take too much time; plus we would have to keep everything in memory or write to disk, and it's really overkill.
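For concreteness, the naive comparison would look something like this (a sketch with a hypothetical `same_reads` helper; it requires holding both sets of reads in memory, which is exactly what we want to avoid for files this size):

```python
from collections import Counter

def same_reads(fastq_reads, bam_reads):
    # Multiset comparison: order doesn't matter, but duplicate
    # reads must appear the same number of times in both files.
    return Counter(fastq_reads) == Counter(bam_reads)
```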

A better way is to construct fingerprints for the FASTQ and BAM files based only on the set of reads. The exact same reads appear in both files, just in a different order. So the basic idea is to compute a hash value for each read and reduce the set of hash values to a single fingerprint, where reducing means applying some reduction operation such as a sum or XOR. As long as the binary operation is commutative, the final result is independent of the order. Thus we settled on something simple: take the MD5 hash of each read and add up all the values mod 2^64. Change one byte in your reads and you'll get a different value.
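A minimal sketch of the idea in Python (an illustration only, not the actual BamHash implementation, which is C++ and parses FASTQ/BAM directly; the exact fields hashed here are my own choice):

```python
import hashlib

def read_hash(name, seq, qual):
    # MD5 of the read's fields; keep the first 8 bytes of the
    # digest as a 64-bit integer.
    digest = hashlib.md5(f"{name}\0{seq}\0{qual}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def fingerprint(reads):
    # Sum of per-read hashes mod 2^64: addition is commutative,
    # so the fingerprint is independent of read order.
    total = 0
    for name, seq, qual in reads:
        total = (total + read_hash(name, seq, qual)) % (1 << 64)
    return total

fastq_reads = [("r1", "ACGT", "IIII"), ("r2", "TTGA", "IIHH")]
bam_reads = list(reversed(fastq_reads))  # same reads, different order
assert fingerprint(fastq_reads) == fingerprint(bam_reads)
```

The commutativity of the sum is what lets us stream through a sorted BAM and an unsorted FASTQ and still get the same answer.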

The program is very efficient, on par with reading a BAM file with samtools and counting the lines. Now just stick this at the front and end of your pipeline and check that the input FASTQ and output BAM have the same fingerprint. If they match, delete that sucker. If they don’t, you have issues.

But wait, there’s more

I was describing this result to Harold Pimentel, who asked a natural follow-up question: “since we like to throw away reads, is it possible to find out if the FASTQ reads are a subset of the BAM reads?” The easy answer is yes: just do the naive thing and compare the sets directly. So the hard question is, can we find a fingerprinting method that solves this problem?

I had no idea whether this was possible until I reframed the question. Instead of thinking about this as a computational problem let’s think about this as a communication problem from information theory.

Alice and Bob start doing bioinformatics

In this setting we have Alice holding a FASTQ file with n reads, A, and Bob holding a BAM file with m reads, B. They want to find out if B\subseteq A using minimal communication. This is equivalent to finding out if B\cap\bar{A} = \emptyset. Any fingerprinting method can be turned into a protocol where Alice sends the fingerprint and Bob compares the fingerprints and answers yes or no. The general setting, where Alice and Bob can talk interactively, allows for more complicated methods. [edit: the direction was wrong, BAMs are supposed to be subsets of FASTQs in a practical setting; for the lower bound you just reverse the roles of Alice and Bob, they won’t mind]

The good news (if you are a theorist; otherwise this is bad news) is that this is simply the set disjointness problem, and it turns out that the lower bound for the amount of communication needed is \Omega(n), even if we allow randomness and some small probability of failure.

In other words, no, there is no way any general fingerprinting method can find out if the BAM reads are a subset of the FASTQ reads. The fingerprint would have to be the size of the input, which defeats the purpose. I added the word general, since there might be something clever you can do if you were told that only a handful of reads are missing, or if you knew something more about the input, but let’s not go there.

BamHash ~ EQ

If we go back to BamHash, the fingerprinting method is answering the question of whether A=B, which is called the EQ problem. The EQ problem has a lower bound of \Omega(n) for deterministic communication, i.e. no randomness and zero probability of failure. However, the EQ problem has an O(\log(n))-bit solution when we allow randomness, and this is pretty much what BamHash is doing.
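The textbook randomized protocol for EQ is polynomial fingerprinting: Alice treats her input as the coefficients of a polynomial, evaluates it at a random point modulo a prime, and sends only the point and the value, both O(\log n) bits when the prime is polynomial in n. A sketch (my own illustration of the classic protocol, not code from the preprint):

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; in general, pick a prime much larger than n

def poly_fingerprint(bits, r):
    # Treat the input bits as polynomial coefficients and evaluate
    # at the random point r mod P, using Horner's rule.
    acc = 0
    for b in bits:
        acc = (acc * r + b) % P
    return acc

def randomized_eq(a_bits, b_bits):
    # If a != b, the difference polynomial has degree < n and hence
    # at most n-1 roots mod P, so the two fingerprints collide with
    # probability at most (n-1)/P over the random choice of r.
    r = random.randrange(1, P)
    return poly_fingerprint(a_bits, r) == poly_fingerprint(b_bits, r)
```

BamHash’s sum-of-MD5 fingerprint plays the same role in practice: a short random-looking value that equal inputs always share and unequal inputs share only with negligible probability.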