Edge cases in de Bruijn graphs

In the previous post I showed a simple implementation of a de Bruijn graph assembler. This implementation is mostly correct, for some definition of mostly.

One of the key differences between the original combinatorial definition of a de Bruijn graph and what we use for assembly is that a) the de Bruijn graph contains every k-mer and what we use for assembly is just a subgraph of the big one (this is just a technicality) but more importantly b) we make no distinction between a k-mer and its reverse complement, frequently referred to as its twin in assembly-lingo.

This gluing of a k-mer and its twin means that we can do one of two things. Either we can keep both copies and store the directed graph.

kmers1

This is much easier to work with and think about since we can see that the edges go in opposite directions. So if we have an x->y edge then we must have a twin(y)->twin(y) edge as well (shown in red). As a convenient notation I’ll refer to the reverse complement of x by ~x from now on. Also if k is an odd number then every k-mer is distinct from its twin (to see why just think about the middle character).

The downside is the memory usage is doubled. Since memory is a huge issue in assembly we generally want to keep only track of one of the two k-mers. This is done by appointing one of them to be the representative k-mer, usually the lexicographically smaller of the two (a fancy way of saying whichever comes first in alphabetical order), but as long as it’s consistent it doesn’t matter how it’s done.

Although you don’t wan’t to represent it this way in code it is much easier to think about both copies of the k-mers, like above, rather than only the representative.

Edge cases

So when does the previous code not quite work. It has to do with the our non-existent, and therefore not rigorous, definition of a contig. A contig is a maximal linear path in the de Bruijn graph. One nice observation is that every k-mer, x, is contained in exactly one contig, and the same goes for the twin of x and it should be in the same contig. Note that a contig might just have one k-mer in it. The linear condition says that each internal vertex on the path has out-degree 1 and in-degree 1. To put this another way, x->y is in a contig if there are no other x->b or a->y edges.

This condition covers most cases, namely the ones where x,y,~x,~y are all distinct. Since k is odd we’re guaranteed that x and ~x are distinct. So what happens if x and y are the same, can that really happen? This would imply that the last k-1 bases of x are equal to first k-1 bases of x. The only way this can happen is if we have x="AAA...A" or any other repetition.

The other case is if x and ~y are the same so x-> ~x. In this case the last k-1 bases of x are equal to the first k-1 bases of ~x. So the first base of x doesn’t matter, since it’s the complement of the last base of ~x.

However we have that x[1:] = (~x)[:k-1] = ~(x[1:]),¬†I’m using python string slice notation where x[1:] is from position 1 (we start at 0) onwards and x[:k-1] is the first k-1 bases. This shows that the (k-1)-mer x[1:] is its own twin, which was exactly the possibility we wanted to avoid by choosing k to be odd. This also means that the (k+1)-mer of x+y[k] which represents the edge x->y is a self-twin as well. As an example we can have x="CAATT" so then ~x="AATTG" and then we can see that x->~x.

For even k we can characterize the k-mers that are self-twins by specifying the first k/2 bases and the remaining k/2 will be the reverse complement of the first k/2, this means that there are 4^{k/2} such k-mers. So the number of k-mers is 4^k, and for k odd we have 4^{k}/2 distinct representatives, but for k even we have (4^{k-1}-4^{k/2})/2+4^{k/2} = (4^{k}+2^k)/2 distinct representatives. [edit 4/30/15, thanks to Lior Pachter for noticing an arithmetic bug]

But wait, there’s more

When we said that we made no distinction between x and ~x and treat them equally we have to be careful when defining edges and degrees and actually start using bidirected edges, but that I’ll leave that as an exercise for the reader or a later post.

Next time we’ll see how to fix the code to handle these edge cases properly. There is an additional bug in the code, although you could say it’s a feature, it has to do with circular contigs and hats off to you if you can find it. [read the follow up]

Advertisements

Naive python implementation of a de Bruijn Graph

De Bruijn graphs are widely used in assembly algorithms and I guess writing your own assembler is a rite of passage. Hoping to get some answers I got this question from Mick Watson. Since I had a script lying around that I used for validation I thought I would share it.

So let’s see how this is done. First we’ll need to work with k-mers, which are substrings of a fixed length k. Since we’re keeping this as simple as possible we’ll use python strings for k-mers. Some functions to work with those

import collections, sys
from Bio import Seq, SeqIO, SeqRecord

def twin(km):
    return Seq.reverse_complement(km)

def kmers(seq,k):
    for i in xrange(len(seq)-k+1):
        yield seq[i:i+k]

def fw(km):
    for x in 'ACGT':
        yield km[1:]+x

def bw(km):
    for x in 'ACGT':
        yield x + km[:-1]

The yield statement in python gives us an iterator object, so we can write stuff like

for x in fw("ATAT"):
    print x

which will print TATA, TATC, TATG and TATT, i.e. all forward neighbors of the k-mer “ATAT”. If we need to convert the iterator to a list the easiest thing to do is list(fw("ATAT")).

To keep track of all the k-mers we store them in a dictionary that keeps track of the coverage. For each k-mer we’ll add the reverse complement as well.

def build(fn,k=31,limit=1):
    d = collections.defaultdict(int)

    for f in fn:
        reads = SeqIO.parse(f,'fastq')
        for read in reads:
            seq_s = str(read.seq)
            seq_l = seq_s.split('N')
            for seq in seq_l:
                for km in kmers(seq,k):
                    d[km] +=1
                seq = twin(seq)
                for km in kmers(seq,k):
                    d[km] += 1

    d1 = [x for x in d if d[x] <= limit]
    for x in d1:
        del d[x]

    return d

Line 2 defines a default dictionary of integers. This means that if the k-mer is not in the dictionary when we try to ask for the value, it is inserted with a default value for int, which is 0. If you would need a different default value you could write defaultdict(lambda : 42).

In lines 5-6 we read our input, a FASTQ file presumably and iterate over the reads. The reads might contains N’s so we’ll skip over those by breaking up the reads on N’s. In line 10 we iterate over all k-mers in the read and increase the coverage by 1. Note that we don’t have to check if it is in the dictionary, this is the beauty of the defaultdict. We’ll repeat this for the reverse complement of the read as well.

The last lines throw out low coverage k-mers, if you don’t want this set limit to 0 or delete the lines.

def get_contig_forward(d,km):
    c_fw = [km]
    
    while True:
        if sum(x in d for x in fw(c_fw[-1])) != 1:
            break
        
        cand = [x for x in fw(c_fw[-1]) if x in d][0]
        if cand in c_fw:
            break
        
        if sum(x in d for x in bw(cand)) != 1:
            break

        c_fw.append(cand)

    return c_fw

This function walks along the de Bruijn graph in the forward direction along a contig. The list c_fw is a list of k-mers in the contig, initialized with km. Line 5 is doing a lot of things at once, but here it goes. First fw(c_fw[-1]) gives all the forward neighbors of the last k-mer in the contig so far, x in d for x in ... gives a list (actually an iterator) of True, False values and sum(...) gives the number of True values. This is really the forward-degree of the last k-mer. If it is not equal to one, then this is the end of the contig.

Next cand = [x for x in fw(c_fw[-1]) if x in d][0] picks the single k-mer that is the forward neighbor. Now to avoid going in cycles we need to check if this k-mer is already in the contig which we do in lines 9-10, we could just check the first k-mer, but I prefer to cross the t’s and dot the i’s at least 3-4 times just to be sure. Lines 12-13 just check if the forward candidate cand has a single backwards neighbor, namely the last element of the list. If this is the case, then we can extend our contig by one more k-mer, which is done by appending it to the list.

This only goes in the forward direction, we could write the same code to go backwards, but I’m feeling lazy. So we’ll just go forward from the reverse complement (or twin) of the k-mer which is the same thing.

def get_contig(d,km):
    c_fw = get_contig_forward(d,km)
    c_bw = get_contig_forward(d,twin(km))

    c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw
    s = c[0] + ''.join(x[-1] for x in c[1:])
    return s,c

This function finds the contig that the k-mer km belongs to. We return the string and a list of k-mer in the contig. To get the k-mers in c_bw in a reverse order we use python’s slice notation c_bw[-1:0:-1] which reads, start at the end (-1), go to the beginning (0) and decrease the index by 1 (-1). This code [twin(x) for x in c_bw[-1:0:-1]] + c_fw then gives you all the k-mers in the contig. To turn them into text we simply concatenate the first k-mer and the last character of all the other k-mers. Remember to use ''.join(...) for efficient string concatenation in python.

def all_contigs(d):
    done = set()
    r = []
    for x in d:
        if x not in done:
            s,c = get_contig(d,x)
            for y in c:
                done.add(y)
                done.add(twin(y))
            r.append(s)
    return r

This code simply iterates over all the k-mers in the dictionary and keeps track of which k-mers have been dealt with in the set done. For each k-mer we get the contig it belongs to and add all the k-mers in that contig to the done set. Repeat until we’re done.

def print_dbg(cs):
    for i,x in enumerate(cs):
        print('>contig%d\n%s\n'%(i,x))

The last piece of code just takes the contigs and writes them to a FASTA file.

Tie this all together with a “main” function in python


if __name__ == "__main__":
    k = int(sys.argv[1])
    d = build(sys.argv[2:],k,0)
    cs = all_contigs(d)
    print_dbg(cs)

Run this with python dbg.py 31 read1.fastq read2.fastq ...

For the full source code as well as updates once people complain or suggest improvements look at the github repository https://github.com/pmelsted/dbg