Neat implementation of a simple de Bruijn assembler in Python ]]>

1. The authors claim that there is some fundamental read length needed for assembly. This depends on the genome being assembled. If we knew the genome being assembled, there would be no point in assembling it :). If the paper had at least talked about estimating the required read length from data, I’d retract my claim about its pointlessness.

2. The authors seem to restrict themselves to the case where assembly into a single contig is possible. They seem to claim that their algorithm is “optimal” because it can reconstruct sequences whenever they can be completely reconstructed from the data. This is a laughable notion of optimality. As a consumer of assembly software, in a job that has often required me to assemble organisms from high-throughput data over the last few years, I can say that complete reconstruction is a very rare event (especially for short reads, and for anything beyond small bacteria from long reads). The question I’d want the theorists to answer is “What should I do to get the most out of the data I already have?” This approach does not answer that question at all.

LW is a relevant statistic, as I can get an estimate of the genome length and know how much I must sequence to have any hope of getting it. Of course it’s not exact, but it captures some intuition. The calculations in this paper not only assume a uniform sampling model but are also fairly useless to me, as they are genome dependent.

]]>Since those assumptions are violated, it is tempting to dismiss this as pointless math.

“All models are wrong, but some are useful” (paraphrasing G.E. Box) applies here. If we assume independent reads, the LW bound on the coverage necessary to observe every nucleotide in the genome becomes a lower bound, in the sense that, as you vary the sampling distribution, the minimum coverage required is the one given by LW.
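To make the coverage argument concrete, here is a rough sketch of the usual Poisson approximation behind an LW-style calculation: with N reads of length L on a genome of length G, a base is missed with probability about exp(-NL/G), so the expected number of uncovered bases is about G·exp(-NL/G). The function names and the tolerance parameter `epsilon` are my own, and edge effects are ignored.

```python
import math


def lw_coverage_depth(genome_length, epsilon):
    """Coverage depth c = N*L/G at which the expected number of
    uncovered bases, G * exp(-c), drops to epsilon (Poisson approximation)."""
    return math.log(genome_length / epsilon)


def reads_needed(genome_length, read_length, epsilon):
    """Number of reads N giving the coverage depth above."""
    depth = lw_coverage_depth(genome_length, epsilon)
    return math.ceil(genome_length / read_length * depth)


def expected_uncovered_bases(genome_length, read_length, num_reads):
    """Expected number of bases hit by no read, under the same approximation."""
    depth = num_reads * read_length / genome_length
    return genome_length * math.exp(-depth)


if __name__ == "__main__":
    # Hypothetical numbers: a 1 Mb genome sequenced with 100 bp reads.
    n = reads_needed(1_000_000, 100, epsilon=1.0)
    print(n, expected_uncovered_bases(1_000_000, 100, n))
```

Note that the only genome-specific input is its length, which is why this kind of estimate can be used before the genome is known.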

In the previous comment you mentioned that assemblies were repeat limited, which is precisely the point of the paper linked here. If you consider Figure 2 http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S5-S18#Fig2 you’ll see that they consider not just coverage but also the read length necessary to beat the repeat regions. They give not only lower bounds but also the performance of a few algorithms; notably, the Multibridging algorithm can almost match the lower bounds presented.
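As a crude illustration of why the read-length requirement is genome dependent: a read can only bridge a repeat if it is longer than the repeat, so the length of the longest exact repeat in a sequence gives a rough feel for the read length involved. The sketch below is my own simplification, not the paper's criterion (which is more refined), and the function names are hypothetical.

```python
def has_repeat_of_length(seq, k):
    """True if some length-k substring occurs at two different positions."""
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def longest_repeat_length(seq):
    """Length of the longest substring occurring at least twice
    (occurrences may overlap). Binary search works because a repeat
    of length k implies a repeat of length k - 1."""
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat_of_length(seq, mid):
            lo = mid
        else:
            hi = mid - 1
    return max(lo, 0)


if __name__ == "__main__":
    # Toy sequence: "ACG" occurs twice, so reads of length <= 3 cannot bridge it.
    print(longest_repeat_length("ACGTACGA"))
```

This takes time roughly proportional to the sequence length times the repeat length times log of the sequence length, so it is only meant for small examples; a suffix array would be the usual choice at genome scale.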

Even in this model they assume that the reads are error free (they are not) and pretty long (2.5Kb). Both of these issues were considered in follow-up papers where the error rate is taken into account and similar results were obtained for paired reads.

I fail to see how this is pointless math.

]]>In practice, a large fraction of datasets are repeat limited. In this setting, I do not see the point of the coverage depth calculations (which are based on a Poisson sampling assumption anyway, which makes their relevance questionable). Isn’t the right question to ask: given a dataset, what is the best assembly that one can get? This leads us back to algorithm design. (Of course, formally posing that problem can be daunting.)

]]>Whether the proof goes through with the paths relaxed to walks remains to be seen.

The key point about expressivity in this case is that we are willing to consider solutions without any seed to start the alignment, i.e. with no matching k-mer between the graph and the read, and that the structure of the graph is such that, even if you start at a fixed k-mer (a seed), there are exponentially many paths close to this starting point to consider.
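A toy example of the exponential blow-up, assuming a graph made of a chain of SNP-like bubbles (my own construction for illustration, not taken from the paper): each bubble offers two alternative nodes, so n bubbles give 2^n distinct paths from a fixed starting node.

```python
from functools import lru_cache


def bubble_chain(num_bubbles):
    """Adjacency lists for a DAG that is a chain of two-way bubbles:
    s0 -> {ref0, alt0} -> s1 -> {ref1, alt1} -> s2 -> ..."""
    graph = {}
    for i in range(num_bubbles):
        graph[f"s{i}"] = [f"ref{i}", f"alt{i}"]
        graph[f"ref{i}"] = [f"s{i + 1}"]
        graph[f"alt{i}"] = [f"s{i + 1}"]
    graph[f"s{num_bubbles}"] = []
    return graph


def count_paths(graph, source, sink):
    """Count distinct source-to-sink paths in a DAG by memoised recursion."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == sink:
            return 1
        return sum(paths_from(nxt) for nxt in graph[node])

    return paths_from(source)


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, count_paths(bubble_chain(n), "s0", f"s{n}"))
```

Counting the paths is cheap here because the graph is acyclic; the difficulty discussed above is in searching among them for the best alignment.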

]]>Indexing Variation Graphs

J Sirén – arXiv preprint arXiv:1604.06605, 2016 – arxiv.org

So, no, it’s not a Hamiltonian path vs. Eulerian cycle problem; it’s just an over-restricted formulation that brings NP-completeness, exactly what you mentioned: “The issue of expressiveness is useful because it tells us if we are not careful we can accidentally make our problems too hard”.

]]>