# Naive python implementation of a de Bruijn Graph

De Bruijn graphs are widely used in assembly algorithms and I guess writing your own assembler is a rite of passage. Hoping to get some answers I got this question from Mick Watson. Since I had a script lying around that I used for validation I thought I would share it.

So let’s see how this is done. First we’ll need to work with k-mers, which are substrings of a fixed length k. Since we’re keeping this as simple as possible we’ll use python strings for k-mers. Some functions to work with those

```import collections, sys
from Bio import Seq, SeqIO, SeqRecord

def twin(km):
return Seq.reverse_complement(km)

def kmers(seq,k):
for i in xrange(len(seq)-k+1):
yield seq[i:i+k]

def fw(km):
for x in 'ACGT':
yield km[1:]+x

def bw(km):
for x in 'ACGT':
yield x + km[:-1]
```

The yield statement in python gives us an iterator object, so we can write stuff like

```for x in fw("ATAT"):
print x
```

which will print TATA, TATC, TATG and TATT, i.e. all forward neighbors of the k-mer “ATAT”. If we need to convert the iterator to a list the easiest thing to do is `list(fw("ATAT"))`.

To keep track of all the k-mers we store them in a dictionary that keeps track of the coverage. For each k-mer we’ll add the reverse complement as well.

```def build(fn,k=31,limit=1):
d = collections.defaultdict(int)

for f in fn:
reads = SeqIO.parse(f,'fastq')
for read in reads:
seq_s = str(read.seq)
seq_l = seq_s.split('N')
for seq in seq_l:
for km in kmers(seq,k):
d[km] +=1
seq = twin(seq)
for km in kmers(seq,k):
d[km] += 1

d1 = [x for x in d if d[x] <= limit]
for x in d1:
del d[x]

return d
```

Line 2 defines a default dictionary of integers. This means that if the k-mer is not in the dictionary when we try to ask for the value, it is inserted with a default value for int, which is 0. If you would need a different default value you could write `defaultdict(lambda : 42)`.

In lines 5-6 we read our input, a FASTQ file presumably and iterate over the reads. The reads might contains N’s so we’ll skip over those by breaking up the reads on N’s. In line 10 we iterate over all k-mers in the read and increase the coverage by 1. Note that we don’t have to check if it is in the dictionary, this is the beauty of the defaultdict. We’ll repeat this for the reverse complement of the read as well.

The last lines throw out low coverage k-mers, if you don’t want this set limit to 0 or delete the lines.

```def get_contig_forward(d,km):
c_fw = [km]

while True:
if sum(x in d for x in fw(c_fw[-1])) != 1:
break

cand = [x for x in fw(c_fw[-1]) if x in d]
if cand in c_fw:
break

if sum(x in d for x in bw(cand)) != 1:
break

c_fw.append(cand)

return c_fw
```

This function walks along the de Bruijn graph in the forward direction along a contig. The list `c_fw` is a list of k-mers in the contig, initialized with `km`. Line 5 is doing a lot of things at once, but here it goes. First `fw(c_fw[-1])` gives all the forward neighbors of the last k-mer in the contig so far, `x in d for x in ...` gives a list (actually an iterator) of `True, False ` values and `sum(...)` gives the number of `True` values. This is really the forward-degree of the last k-mer. If it is not equal to one, then this is the end of the contig.

Next `cand = [x for x in fw(c_fw[-1]) if x in d]` picks the single k-mer that is the forward neighbor. Now to avoid going in cycles we need to check if this k-mer is already in the contig which we do in lines 9-10, we could just check the first k-mer, but I prefer to cross the t’s and dot the i’s at least 3-4 times just to be sure. Lines 12-13 just check if the forward candidate `cand` has a single backwards neighbor, namely the last element of the list. If this is the case, then we can extend our contig by one more k-mer, which is done by appending it to the list.

This only goes in the forward direction, we could write the same code to go backwards, but I’m feeling lazy. So we’ll just go forward from the reverse complement (or twin) of the k-mer which is the same thing.

```def get_contig(d,km):
c_fw = get_contig_forward(d,km)
c_bw = get_contig_forward(d,twin(km))

c = [twin(x) for x in c_bw[-1:0:-1]] + c_fw
s = c + ''.join(x[-1] for x in c[1:])
return s,c
```

This function finds the contig that the k-mer `km` belongs to. We return the string and a list of k-mer in the contig. To get the k-mers in c_bw in a reverse order we use python’s slice notation `c_bw[-1:0:-1]` which reads, start at the end (-1), go to the beginning (0) and decrease the index by 1 (-1). This code `[twin(x) for x in c_bw[-1:0:-1]] + c_fw` then gives you all the k-mers in the contig. To turn them into text we simply concatenate the first k-mer and the last character of all the other k-mers. Remember to use `''.join(...)` for efficient string concatenation in python.

```def all_contigs(d):
done = set()
r = []
for x in d:
if x not in done:
s,c = get_contig(d,x)
for y in c:
done.add(y)
done.add(twin(y))
r.append(s)
return r
```

This code simply iterates over all the k-mers in the dictionary and keeps track of which k-mers have been dealt with in the set `done`. For each k-mer we get the contig it belongs to and add all the k-mers in that contig to the `done` set. Repeat until we’re done.

```def print_dbg(cs):
for i,x in enumerate(cs):
print('>contig%d\n%s\n'%(i,x))
```

The last piece of code just takes the contigs and writes them to a FASTA file.

Tie this all together with a “main” function in python

```
if __name__ == "__main__":
k = int(sys.argv)
d = build(sys.argv[2:],k,0)
cs = all_contigs(d)
print_dbg(cs)
```

Run this with `python dbg.py 31 read1.fastq read2.fastq ...`

For the full source code as well as updates once people complain or suggest improvements look at the github repository https://github.com/pmelsted/dbg

Advertisements

## 3 thoughts on “Naive python implementation of a de Bruijn Graph”

1. Reblogged this on BioInfoExpert and commented:
Neat implementation of simple de Bruijn assembler in Python