Edge cases in de Bruijn graphs

In the previous post I showed a simple implementation of a de Bruijn graph assembler. This implementation is mostly correct, for some definition of mostly.

One of the key differences between the original combinatorial definition of a de Bruijn graph and what we use for assembly is that a) the de Bruijn graph contains every k-mer and what we use for assembly is just a subgraph of the big one (this is just a technicality) but more importantly b) we make no distinction between a k-mer and its reverse complement, frequently referred to as its twin in assembly-lingo.

This gluing of a k-mer and its twin means that we can do one of two things. Either we can keep both copies and store the directed graph.

kmers1

This is much easier to work with and think about since we can see that the edges go in opposite directions. So if we have an x->y edge then we must have a twin(y)->twin(y) edge as well (shown in red). As a convenient notation I’ll refer to the reverse complement of x by ~x from now on. Also if k is an odd number then every k-mer is distinct from its twin (to see why just think about the middle character).

The downside is the memory usage is doubled. Since memory is a huge issue in assembly we generally want to keep only track of one of the two k-mers. This is done by appointing one of them to be the representative k-mer, usually the lexicographically smaller of the two (a fancy way of saying whichever comes first in alphabetical order), but as long as it’s consistent it doesn’t matter how it’s done.

Although you don’t wan’t to represent it this way in code it is much easier to think about both copies of the k-mers, like above, rather than only the representative.

Edge cases

So when does the previous code not quite work. It has to do with the our non-existent, and therefore not rigorous, definition of a contig. A contig is a maximal linear path in the de Bruijn graph. One nice observation is that every k-mer, x, is contained in exactly one contig, and the same goes for the twin of x and it should be in the same contig. Note that a contig might just have one k-mer in it. The linear condition says that each internal vertex on the path has out-degree 1 and in-degree 1. To put this another way, x->y is in a contig if there are no other x->b or a->y edges.

This condition covers most cases, namely the ones where x,y,~x,~y are all distinct. Since k is odd we’re guaranteed that x and ~x are distinct. So what happens if x and y are the same, can that really happen? This would imply that the last k-1 bases of x are equal to first k-1 bases of x. The only way this can happen is if we have x="AAA...A" or any other repetition.

The other case is if x and ~y are the same so x-> ~x. In this case the last k-1 bases of x are equal to the first k-1 bases of ~x. So the first base of x doesn’t matter, since it’s the complement of the last base of ~x.

However we have that x[1:] = (~x)[:k-1] = ~(x[1:]), I’m using python string slice notation where x[1:] is from position 1 (we start at 0) onwards and x[:k-1] is the first k-1 bases. This shows that the (k-1)-mer x[1:] is its own twin, which was exactly the possibility we wanted to avoid by choosing k to be odd. This also means that the (k+1)-mer of x+y[k] which represents the edge x->y is a self-twin as well. As an example we can have x="CAATT" so then ~x="AATTG" and then we can see that x->~x.

For even k we can characterize the k-mers that are self-twins by specifying the first k/2 bases and the remaining k/2 will be the reverse complement of the first k/2, this means that there are 4^{k/2} such k-mers. So the number of k-mers is 4^k, and for k odd we have 4^{k}/2 distinct representatives, but for k even we have (4^{k-1}-4^{k/2})/2+4^{k/2} = (4^{k}+2^k)/2 distinct representatives. [edit 4/30/15, thanks to Lior Pachter for noticing an arithmetic bug]

But wait, there’s more

When we said that we made no distinction between x and ~x and treat them equally we have to be careful when defining edges and degrees and actually start using bidirected edges, but that I’ll leave that as an exercise for the reader or a later post.

Next time we’ll see how to fix the code to handle these edge cases properly. There is an additional bug in the code, although you could say it’s a feature, it has to do with circular contigs and hats off to you if you can find it. [read the follow up]

Advertisements

2 thoughts on “Edge cases in de Bruijn graphs

  1. Pingback: deBuGging de Bruijn Graphs | Bits of Bioinformatics

  2. Pingback: Near-optimal RNA-Seq quantification with kallisto | Bits of DNA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s