De Bruijn graphs are widely used in assembly algorithms and I guess writing your own assembler is a rite of passage. Hoping to get some answers I got this question from Mick Watson. Since I had a script lying around that I used for validation I thought I would share it.
For contig/string graphs, what is the advantage of having the sequence in edges (like FASTG) rather than the vertices (de Bruijn style) ?—
Pall Melsted (@pmelsted) November 23, 2013
So let’s see how this is done. First we’ll need to work with k-mers, which are substrings of a fixed length k. Since we’re keeping this as simple as possible we’ll use python strings for k-mers. Some functions to work with those
import collections, sys from Bio import Seq, SeqIO, SeqRecord def twin(km): return Seq.reverse_complement(km) def kmers(seq,k): for i in xrange(len(seq)-k+1): yield seq[i:i+k] def fw(km): for x in 'ACGT': yield km[1:]+x def bw(km): for x in 'ACGT': yield x + km[:-1]
The yield statement in python gives us an iterator object, so we can write stuff like
for x in fw("ATAT"): print x
which will print TATA, TATC, TATG and TATT, i.e. all forward neighbors of the k-mer “ATAT”. If we need to convert the iterator to a list the easiest thing to do is
To keep track of all the k-mers we store them in a dictionary that keeps track of the coverage. For each k-mer we’ll add the reverse complement as well.
def build(fn,k=31,limit=1): d = collections.defaultdict(int) for f in fn: reads = SeqIO.parse(f,'fastq') for read in reads: seq_s = str(read.seq) seq_l = seq_s.split('N') for seq in seq_l: for km in kmers(seq,k): d[km] +=1 seq = twin(seq) for km in kmers(seq,k): d[km] += 1 d1 = [x for x in d if d[x] <= limit] for x in d1: del d[x] return d
Line 2 defines a default dictionary of integers. This means that if the k-mer is not in the dictionary when we try to ask for the value, it is inserted with a default value for int, which is 0. If you would need a different default value you could write
defaultdict(lambda : 42).
In lines 5-6 we read our input, a FASTQ file presumably and iterate over the reads. The reads might contains N’s so we’ll skip over those by breaking up the reads on N’s. In line 10 we iterate over all k-mers in the read and increase the coverage by 1. Note that we don’t have to check if it is in the dictionary, this is the beauty of the defaultdict. We’ll repeat this for the reverse complement of the read as well.
The last lines throw out low coverage k-mers, if you don’t want this set limit to 0 or delete the lines.
def get_contig_forward(d,km): c_fw = [km] while True: if sum(x in d for x in fw(c_fw[-1])) != 1: break cand = [x for x in fw(c_fw[-1]) if x in d] if cand in c_fw: break if sum(x in d for x in bw(cand)) != 1: break c_fw.append(cand) return c_fw
This function walks along the de Bruijn graph in the forward direction along a contig. The list
c_fw is a list of k-mers in the contig, initialized with
km. Line 5 is doing a lot of things at once, but here it goes. First
fw(c_fw[-1]) gives all the forward neighbors of the last k-mer in the contig so far,
x in d for x in ... gives a list (actually an iterator) of
True, False values and
sum(...) gives the number of
True values. This is really the forward-degree of the last k-mer. If it is not equal to one, then this is the end of the contig.
cand = [x for x in fw(c_fw[-1]) if x in d] picks the single k-mer that is the forward neighbor. Now to avoid going in cycles we need to check if this k-mer is already in the contig which we do in lines 9-10, we could just check the first k-mer, but I prefer to cross the t’s and dot the i’s at least 3-4 times just to be sure. Lines 12-13 just check if the forward candidate
cand has a single backwards neighbor, namely the last element of the list. If this is the case, then we can extend our contig by one more k-mer, which is done by appending it to the list.
This only goes in the forward direction, we could write the same code to go backwards, but I’m feeling lazy. So we’ll just go forward from the reverse complement (or twin) of the k-mer which is the same thing.
def get_contig(d,km): c_fw = get_contig_forward(d,km) c_bw = get_contig_forward(d,twin(km)) c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw s = c + ''.join(x[-1] for x in c[1:]) return s,c
This function finds the contig that the k-mer
km belongs to. We return the string and a list of k-mer in the contig. To get the k-mers in c_bw in a reverse order we use python’s slice notation
c_bw[-1:0:-1] which reads, start at the end (-1), go to the beginning (0) and decrease the index by 1 (-1). This code
[twin(x) for x in c_bw[-1:0:-1]] + c_fw then gives you all the k-mers in the contig. To turn them into text we simply concatenate the first k-mer and the last character of all the other k-mers. Remember to use
''.join(...) for efficient string concatenation in python.
def all_contigs(d): done = set() r =  for x in d: if x not in done: s,c = get_contig(d,x) for y in c: done.add(y) done.add(twin(y)) r.append(s) return r
This code simply iterates over all the k-mers in the dictionary and keeps track of which k-mers have been dealt with in the set
done. For each k-mer we get the contig it belongs to and add all the k-mers in that contig to the
done set. Repeat until we’re done.
def print_dbg(cs): for i,x in enumerate(cs): print('>contig%d\n%s\n'%(i,x))
The last piece of code just takes the contigs and writes them to a FASTA file.
Tie this all together with a “main” function in python
if __name__ == "__main__": k = int(sys.argv) d = build(sys.argv[2:],k,0) cs = all_contigs(d) print_dbg(cs)
Run this with
python dbg.py 31 read1.fastq read2.fastq ...
For the full source code as well as updates once people complain or suggest improvements look at the github repository https://github.com/pmelsted/dbg