deBuGging de Bruijn Graphs

In the previous posts I showed a toy implementation of a dBG assembler and then found that something was wrong with the edge cases. I’ll show how to fix these edge cases by testing this on circular contigs and something similar.

It turns out to be much easier to work with short reads and small values of k. In this case I’ll be working with k=5. Our test case will be one read for the entire contig

TTCAGCGATAACATCTTTCA

When we feed this to the assembler we get (I put cyclic.fq twice since by default I’m filtering k-mers with a coverage of 1)

$ python dbg.py 5 cyclic.fq cyclic.fq
>contig0
TCTTTCAGCGATAACATCTTTCAGCGATAACATCT

What?!? Ok, this is not what we wanted. The good news is that our contig seems to be inside the result.

TCTTTCAGCGATAACATCTTTCAGCGATAACATCT
   ||||||||||||||||||||
   TTCAGCGATAACATCTTTCA
                   ||||||||||||||||
                   TTCAGCGATAACATCTTTCA
|||||||
TCTTTCA

Only it’s there twice wrapped around the result. There are two places where this could have gone wrong
a) in the function get_contig_forward(d,km), b) in the function get_contig(d,km). We should first check that get_contig_forward is doing the right thing. We can run the python script and get an interactive shell afterwards by doing

$ python -i dbg.py 5 cyclic.fq cyclic.fq
>contig0
TCTTTCAGCGATAACATCTTTCAGCGATAACATCT

>>>

The -i keeps the python session open with all the variables and functions in place. We can check the function by doing

>>> get_contig_forward(d,'TTCAG')
['TTCAG', 'TCAGC', 'CAGCG', 'AGCGA', 'GCGAT', 'CGATA', 'GATAA', 'ATAAC', 'TAACA', 'AACAT', 'ACATC', 'CATCT', 'ATCTT', 'TCTTT', 'CTTTC', 'TTTCA']

Which are the k-mers in "TTCAGCGATAACATCTTTCA", ok, so nothing odd here.

Now get_contig(d,km) has two function calls to get_contig_forward. We assume that km is in the middle of some contig, so we get the forward extension with

c_fw = get_contig_forward(d,km)

which should be the part of the contig from km to the end. We get the backward extension with

c_bw = get_contig_forward(d,twin(km))

we reuse the forward function and call it with twin(km) so that we get the reverse complement of the first part of the contig. Now we see why this fails since,

>>> get_contig_forward(d,twin('TTCAG'))
['CTGAA', 'TGAAA', 'GAAAG', 'AAAGA', 'AAGAT', 'AGATG', 'GATGT', 'ATGTT', 'TGTTA', 'GTTAT', 'TTATC', 'TATCG', 'ATCGC', 'TCGCT', 'CGCTG', 'GCTGA']

which is the contig for "CTGAAAGATGTTATCGCTGA", the reverse complement of the contig we started with. So we need to fix this by checking if this is a circular contig.

But before we fix this we should fix a nastier error.

Our test case is a tiny contig "ACGTTGCA" and when we run our program the output is "TTGCAACGTTGCAACGTTG"

Which is really wrong. Checking get_contig_forward as before the output is

>>> get_contig_forward(d,'TTGCA')
['TTGCA', 'TGCAA', 'GCAAC', 'CAACG', 'AACGT', 'ACGTT', 'CGTTG', 'GTTGC']

which corresponds to the contig "TGCAATGCAACGT". But the result should only have been 4 k-mers, but we got 8? In fact the dictionary for all k-mers has only 8 k-mers and that’s counting both forward and reverse. If you look at the original contig "ACGTTGCA", you’ll see that 'TTGCA' is the last k-mer and the next k-mer in the output is 'TGCAA' which is the reverse complement. If you’ll remember from the last post the 6-mer 'TTGCAA' is its own reverse complement, so what we have here is really a hairpin. But it’s not just a hairpin because the first k-mer 'ACGTT' has 'AACGT' as a neighbor so there are two hairpins at both ends of the k-mer.

My head hurts, lets draw this

kmers2

ok, so now we see what just happened. We started in the top from left to right and went backwards in the reverse complement and round again. The check for loops in get_contig_forward was to see if the next k-mer to add was already in the contig, we should have checked for the twin as well.

Now if we redraw the figure by dropping the twins of each k-mer, but keeping the edges shown in red we get.

So this is not a circular contig in the regular sense, but sort of “first you go one round on one side and then another on the other side, only to come back at the beginning”. Hah! I know lets call it a Möbius contig.

Fixing things.

We update the get_contig_forward definition from

def get_contig_forward(d,km):
    c_fw = [km]
    
    while True:
        if sum(x in d for x in fw(c_fw[-1])) != 1:
            break
        
        cand = [x for x in fw(c_fw[-1]) if x in d][0]
        if cand in c_fw:
            break
        
        if sum(x in d for x in bw(cand)) != 1:
            break

        c_fw.append(cand)

    return c_fw

to

def get_contig_forward(d,km):
    c_fw = [km]
    
    while True:
        if sum(x in d for x in fw(c_fw[-1])) != 1:
            break
        
        cand = [x for x in fw(c_fw[-1]) if x in d][0]
        if cand == km or cand == twin(km):
            break # break out of cycles or mobius contigs
        if cand == twin(c_fw[-1]):
            break # break out of hairpins
        
        if sum(x in d for x in bw(cand)) != 1:
            break

        c_fw.append(cand)

    return c_fw

plus it is a bit less inefficient since we only compare cand to the last or first k-mer to the candidate extension.

We also fix get_contig by checking if the last k-mer in the forwards extension is a neighbor of the first one.

From

def get_contig(d,km):
    c_fw = get_contig_forward(d,km)
    c_bw = get_contig_forward(d,twin(km))

    c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw
    s = c[0] + ''.join(x[-1] for x in c[1:])
    return s,c

To the slightly more readable

def contig_to_string(c):
    return c[0] + ''.join(x[-1] for x in c[1:])

def get_contig(d,km):
    c_fw = get_contig_forward(d,km)
    
    c_bw = get_contig_forward(d,twin(km))

    if km in fw(c_fw[-1]):
        c = c_fw
    else:
        c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw
    return contig_to_string(c),c

The source code has been updated on github as well as the test cases above. Now for next time we’ll, uh, I have no idea, just tell my on twitter: @pmelsted

Advertisements