In the previous posts I showed a toy implementation of a dBG assembler and then found that something was wrong with the edge cases. I’ll show how to fix these edge cases by testing this on circular contigs and something similar.
It turns out to be much easier to work with short reads and small values of k. In this case I’ll be working with k=5. Our test case will be one read for the entire contig
TTCAGCGATAACATCTTTCA
When we feed this to the assembler we get (I put cyclic.fq twice since by default I’m filtering k-mers with a coverage of 1)
$ python dbg.py 5 cyclic.fq cyclic.fq >contig0 TCTTTCAGCGATAACATCTTTCAGCGATAACATCT
What?!? Ok, this is not what we wanted. The good news is that our contig seems to be inside the result.
TCTTTCAGCGATAACATCTTTCAGCGATAACATCT |||||||||||||||||||| TTCAGCGATAACATCTTTCA |||||||||||||||| TTCAGCGATAACATCTTTCA ||||||| TCTTTCA
Only it’s there twice wrapped around the result. There are two places where this could have gone wrong
a) in the function get_contig_forward(d,km)
, b) in the function get_contig(d,km)
. We should first check that get_contig_forward is doing the right thing. We can run the python script and get an interactive shell afterwards by doing
$ python -i dbg.py 5 cyclic.fq cyclic.fq >contig0 TCTTTCAGCGATAACATCTTTCAGCGATAACATCT >>>
The -i keeps the python session open with all the variables and functions in place. We can check the function by doing
>>> get_contig_forward(d,'TTCAG') ['TTCAG', 'TCAGC', 'CAGCG', 'AGCGA', 'GCGAT', 'CGATA', 'GATAA', 'ATAAC', 'TAACA', 'AACAT', 'ACATC', 'CATCT', 'ATCTT', 'TCTTT', 'CTTTC', 'TTTCA']
Which are the k-mers in "TTCAGCGATAACATCTTTCA"
, ok, so nothing odd here.
Now get_contig(d,km)
has two function calls to get_contig_forward
. We assume that km
is in the middle of some contig, so we get the forward extension with
c_fw = get_contig_forward(d,km)
which should be the part of the contig from km to the end. We get the backward extension with
c_bw = get_contig_forward(d,twin(km))
we reuse the forward function and call it with twin(km) so that we get the reverse complement of the first part of the contig. Now we see why this fails since,
>>> get_contig_forward(d,twin('TTCAG')) ['CTGAA', 'TGAAA', 'GAAAG', 'AAAGA', 'AAGAT', 'AGATG', 'GATGT', 'ATGTT', 'TGTTA', 'GTTAT', 'TTATC', 'TATCG', 'ATCGC', 'TCGCT', 'CGCTG', 'GCTGA']
which is the contig for "CTGAAAGATGTTATCGCTGA"
, the reverse complement of the contig we started with. So we need to fix this by checking if this is a circular contig.
But before we fix this we should fix a nastier error.
Our test case is a tiny contig "ACGTTGCA"
and when we run our program the output is "TTGCAACGTTGCAACGTTG"
Which is really wrong. Checking get_contig_forward
as before the output is
>>> get_contig_forward(d,'TTGCA') ['TTGCA', 'TGCAA', 'GCAAC', 'CAACG', 'AACGT', 'ACGTT', 'CGTTG', 'GTTGC']
which corresponds to the contig "TGCAATGCAACGT"
. But the result should only have been 4 k-mers, but we got 8? In fact the dictionary for all k-mers has only 8 k-mers and that’s counting both forward and reverse. If you look at the original contig "ACGTTGCA"
, you’ll see that 'TTGCA'
is the last k-mer and the next k-mer in the output is 'TGCAA'
which is the reverse complement. If you’ll remember from the last post the 6-mer 'TTGCAA'
is its own reverse complement, so what we have here is really a hairpin. But it’s not just a hairpin because the first k-mer 'ACGTT'
has 'AACGT'
as a neighbor so there are two hairpins at both ends of the k-mer.
My head hurts, lets draw this
ok, so now we see what just happened. We started in the top from left to right and went backwards in the reverse complement and round again. The check for loops in get_contig_forward
was to see if the next k-mer to add was already in the contig, we should have checked for the twin as well.
Now if we redraw the figure by dropping the twins of each k-mer, but keeping the edges shown in red we get.
So this is not a circular contig in the regular sense, but sort of “first you go one round on one side and then another on the other side, only to come back at the beginning”. Hah! I know lets call it a Möbius contig.
Fixing things.
We update the get_contig_forward
definition from
def get_contig_forward(d,km): c_fw = [km] while True: if sum(x in d for x in fw(c_fw[-1])) != 1: break cand = [x for x in fw(c_fw[-1]) if x in d][0] if cand in c_fw: break if sum(x in d for x in bw(cand)) != 1: break c_fw.append(cand) return c_fw
to
def get_contig_forward(d,km): c_fw = [km] while True: if sum(x in d for x in fw(c_fw[-1])) != 1: break cand = [x for x in fw(c_fw[-1]) if x in d][0] if cand == km or cand == twin(km): break # break out of cycles or mobius contigs if cand == twin(c_fw[-1]): break # break out of hairpins if sum(x in d for x in bw(cand)) != 1: break c_fw.append(cand) return c_fw
plus it is a bit less inefficient since we only compare cand to the last or first k-mer to the candidate extension.
We also fix get_contig
by checking if the last k-mer in the forwards extension is a neighbor of the first one.
From
def get_contig(d,km): c_fw = get_contig_forward(d,km) c_bw = get_contig_forward(d,twin(km)) c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw s = c[0] + ''.join(x[-1] for x in c[1:]) return s,c
To the slightly more readable
def contig_to_string(c): return c[0] + ''.join(x[-1] for x in c[1:]) def get_contig(d,km): c_fw = get_contig_forward(d,km) c_bw = get_contig_forward(d,twin(km)) if km in fw(c_fw[-1]): c = c_fw else: c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw return contig_to_string(c),c
The source code has been updated on github as well as the test cases above. Now for next time we’ll, uh, I have no idea, just tell my on twitter: @pmelsted
Pingback: Edge cases in de Bruijn graphs | Bits of Bioinformatics