Dear assemblers, we need to talk … together

This post was co-authored by Michael R. Crusoe (@biocrusoe). Thanks to Shaun Jackman (@sjackman) for discussions and for reading the draft.

tl;dr we need a common file format for contig graphs and other stuff too

The state of assembly software right now is essentially dominated by a few end-to-end solutions. These include SPAdes, ALLPATHS, ABySS, and SOAPdenovo (if your tool is not here, don't take it the wrong way). This integration is good for consumers who want good methods that get them from raw FASTQ files to an assembled genome.

On the other hand, this tight integration is not a good situation if you want to get into the assembly business. There are two scenarios that make entry into the field difficult. First, if you have some good ideas about how to improve some part of the assembly process, you have two choices: a) write an assembler from scratch, or b) try to modify some of the programs mentioned above or work your way around them. The second scenario arises when you have a solution along the lines of "oh, I'll just do an assembly with some minor modifications and then tweak the output a bit".

The solution to these problems is to break up the "assembly" problem into a few well-defined pieces. The analogy we should be thinking about is what is done in variant calling. There the process goes from raw reads to aligned reads in BAM files, and then every variant caller reads the BAM file and spits out a VCF (say what you will about VCF, but at least it is a file format). This is exactly what the field of assembly needs right now: a few file formats that capture the common tasks that every assembler goes through.

The common tasks are:

  1. Create a graph (de Bruijn/overlap/contig/string … )
  2. Simplify the graph
  3. Map reads to graphs
  4. Produce an assembly

Some of the assemblers do these steps with several programs; SOAPdenovo and ABySS, for instance, have a separate program for each part. Modularity at the software level is great, but it doesn't allow practitioners to mix and match across different software platforms.

The best way to open up the field is to embrace the Unix mantra that everything is a file. A corollary of this is that the best API is a file, and that is the key point here. How you do 1, 2, 3, and 4 shouldn't matter and shouldn't be standardized; what should be standardized are the intermediate files, between 1 and 2, between 2 and 3, and so on. Right now we have good file formats before step 1, namely FASTQ files, and after step 4, either FASTA or FASTG.

What is missing is a set of file formats, and an agreement across the field to use them for the intermediate results. Many of the programs have these intermediate files, but they are often tied to a particular representation.

We need a common file format for the contig graph and for read alignments to a contig graph. The graph format would take care of the interfaces between steps 1-2 and 2-3, since the output is a graph in both cases. Similarly, the read alignment format would let you go from 3 to 4. Once we have common file formats, we are free to choose whatever program we like for any of these steps.

Perhaps you want to write a new algorithm for resolving repeats using next-next-next-generation sequencing; you could take as input a contig graph from the old days and combine it with your new 74.5% error rate long-long read stuff. End users could then use your method with whichever assembler they think is doing best at the moment.

Let's take a more concrete example and say you want to build an RNA-seq assembler. The first step is to take reads and generate contigs. All you would have to do is pick a contig assembler and tell it not to worry about aggressive error correction, assumptions about uniform coverage, or any other fancy stuff, and just give you the graph; you can take it over from there.

This brings me to the next issue, which is graph read alignment. This entails taking reads and mapping them against the graph. Instead of mapping to a linear sequence, like a chromosome in a well-defined assembly or just a contig/scaffold, we are looking for an alignment to a path in the graph. Of course, paths in the graph should correspond to contiguous stretches of the underlying genome or transcriptome. The alignment will necessarily be fragmented when you have reads that span multiple contigs.
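To make this concrete, here is a minimal sketch (in Python, with entirely hypothetical field names, since no such format exists yet) of what a single graph-alignment record might carry. The point is that instead of one reference name and one position, the record stores an oriented path through the graph:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GraphAlignment:
    """One read aligned to a path in a contig graph (hypothetical layout)."""
    read_name: str
    # The path as (contig, orientation) pairs, e.g. [("X", "+"), ("Y", "-")],
    # meaning the read runs along the forward strand of X and onto the
    # reverse complement of Y.
    path: List[Tuple[str, str]]
    start: int   # offset of the alignment within the first contig of the path
    cigar: str   # CIGAR-style string along the concatenated path sequence
    mapq: int    # mapping quality, as in SAM/BAM

Whatever the final format looks like, something equivalent to these fields has to be in it; that is the part worth standardizing.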

A tool that can take a graph and raw reads as input and map the reads to the graph would be very valuable. People are working on this, both within and outside assemblers, but what is missing is interoperability, in terms of (again) the graph file format and, more importantly, the alignment format.

So what is the solution? I have no idea.

OK, that's not right: I have a few ideas, but I'm willing to change my mind.

First things first: the graph file format. There is already a graph file format for sequences, FASTG, but I have a few issues with it. It was meant as a format for representing variability in the final output of an assembly (SNPs, indels, tandem repeats and the like) that doesn't really break up a contig. Both SPAdes and ALLPATHS output FASTG files, but from what I can tell they don't use it for intermediate results. There is also no notion of coverage in FASTG, although you can store that as a property of an edge. Finally, the graph stores sequences on the edges instead of the nodes. I don't like it, but if you can convince me I'm wrong I'll reverse my position.

Unfortunately there is very little library support for FASTG. Writing a full parser for it is not impossible, but if you look at the standard you'll see that there's plenty of work to do. One idea would be to restrict the contig graph to a smaller subset of the FASTG standard. Essentially this would mean no funny business in the sequence (read the standard and you'll be amazed at what is possible), so that it is still a valid FASTA file (sort of), and encoding the coverage and other information in the comment. So a simple example would be

>A:B:cov=10;
GATTA
>B:C':cov=9;
AC
>C::cov=19;
T
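As a sanity check that such a restricted subset would be easy to work with, here is a minimal sketch of a parser in Python, assuming the header layout used above (>FROM:TO:key=value; where TO may be empty, meaning no outgoing edge, and may carry a trailing ' for reverse complement):

def parse_restricted_fastg(path):
    """Parse the restricted FASTG-like example above (assumed layout)."""
    seqs, props, edges = {}, {}, []
    name = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumed header layout: >FROM:TO:key=value;
                fields = line[1:].rstrip(";").split(":")
                name, target = fields[0], fields[1]
                props[name] = dict(p.split("=") for p in fields[2:] if p)
                seqs[name] = ""
                if target:  # an empty target means no outgoing edge
                    edges.append((name, target))
            else:
                seqs[name] += line
    return seqs, props, edges

On the example above this gives the three sequences, a cov value for each record, and the edges (A, B) and (B, C'). The semantics, for instance whether cov belongs to the node or the edge, are exactly the sort of thing a standard would have to pin down.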

Another way to represent the graph is the way ABySS does it. Hat tip to Shaun Jackman for telling me about this (but he works on ABySS and is part of the establishment, so take this with a grain of salt). The idea is to keep the contigs in a FASTA file and have a separate graph file in the Graphviz dot format, where you can see the nitty-gritty details. The key here is to use a subset of the dot file format popularized by Graphviz. This buys you a whole lot of library wonderfulness and plenty of programs that read the format.

Essentially you keep track of the edges of the graph by listing them in a bidirected format: "A+" -> "B+" means the forward strand of A is followed by the forward strand of B, whereas "B+" -> "C-" means the forward strand of B is followed by the reverse complement of C. If you don't know why the bidirected business is needed, you have to read Gene Myers' string graph paper. We'll wait.

What is really cute is that you can specify properties of the edges, such as the overlap, which allows you to store more context in the contig sequences. If the above example had come from a de Bruijn graph with k=4, the resulting contigs might have been

>A
GATTA
>B
TTAC
>C
TGTA

and the dot file would be

digraph g {
"A+" -> "B+" [overlap=3]
"B+" -> "C-" [overlap=3]
}
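To see how the +/- orientations and the overlap attribute combine, here is a minimal sketch (plain Python, nothing assembler-specific) that spells out the sequence along an oriented path; on the example above it reconstructs GATTACA from the three contigs:

def revcomp(s):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))

def spell_path(contigs, path, overlaps):
    """Spell the sequence along an oriented path, merging the overlaps.

    contigs:  dict mapping name -> sequence
    path:     list of (name, orientation) pairs
    overlaps: overlap length for each edge on the path
    """
    name, orient = path[0]
    seq = contigs[name] if orient == "+" else revcomp(contigs[name])
    for (name, orient), olap in zip(path[1:], overlaps):
        nxt = contigs[name] if orient == "+" else revcomp(contigs[name])
        assert seq[-olap:] == nxt[:olap], "inconsistent overlap"
        seq += nxt[olap:]
    return seq

contigs = {"A": "GATTA", "B": "TTAC", "C": "TGTA"}
print(spell_path(contigs, [("A", "+"), ("B", "+"), ("C", "-")], [3, 3]))
# GATTACA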

The advantage of this approach is that both dot files and FASTA files are standard, with libraries that can parse them. Also, since the sequence and the graph are separated, you can work with the graph directly without loading the sequence. For instance, if you want to trim down the graph or only work with a particular subgraph, you only need to load the graph file, do your graph theory business, and then work with the small subgraph. If you need a subset of the sequences, you can have your FASTA file indexed and load only the sequences you need.
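Here is a minimal sketch of that workflow, assuming the dot file above is saved as contigs.dot and the contigs as a samtools-indexed contigs.fa (networkx with pygraphviz for the graph, pysam for the indexed FASTA; the file names are made up):

import networkx as nx   # needs pygraphviz installed for read_dot
import pysam

# Load only the graph; the sequences stay on disk.
g = nx.nx_agraph.read_dot("contigs.dot")

# Do your graph theory business without touching any sequence,
# e.g. restrict to a subgraph of interest.
sub = g.subgraph(["A+", "B+", "C-"])
for u, v, attrs in sub.edges(data=True):
    print(u, v, attrs.get("overlap"))

# Then pull in just the sequences you need from the indexed FASTA.
fa = pysam.FastaFile("contigs.fa")
for node in sub.nodes():
    print(node, fa.fetch(node[:-1]))  # strip the +/- suffix to get the name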

I'll leave the issue of graph alignment and how to represent its output for later. For now I just wanted to throw the idea out there, because I think this is something that needs to happen. Outputting the graph files is not a lot of work if you have an assembly program; what's holding assemblers back is that there haven't been any requests for it and there is no common file format.

Let’s talk about this, either in comments, blog posts or on twitter. I really don’t care which format we use, as long as we can have a common one.

 

Additional thoughts:

  - This has been preached by the small tools manifesto for a wider class of problems.
  - Earlier this year there was a discussion about similar issues in the context of scaffolders.

 

Late update (July 22): Heng Li has a new proposal for a file format on his blog; the discussion in the comments is quite active.


Analyzing data while downloading

I'm at HiTSeq14 and ISMB14 and gave my KmerStream talk yesterday. The best thing about giving a talk is that it makes people look at what you've been doing in new ways. So yesterday I was talking to Shaun Jackman about potential applications, and I joked that you could analyze data while downloading it.

And then I thought about it for a bit and implemented it in KmerStream. It's now in the GitHub repository, under the online branch. It just adds an --online flag, which prints an estimate of the k-mer statistics every 100K reads.

What you can then do is download data sets, tee the input, and process it at the same time. It is a bit of a hack with the shell commands, but just watching it run is worth it.

curl -s http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz | tee frag_1.fastq.gz | gzcat -f | ./KmerStream -k 31 -o tmp --online /dev/fd/0

So what is happening is that curl -s downloads the data, doesn't print any progress reports, and writes to stdout. tee takes the stdout, saves it to the file, and passes it along; gzcat unzips the stream (use gzcat on a Mac and regular zcat on Linux). KmerStream doesn't read from stdin, but you can always read stdin through the special file /dev/fd/0, so that fixes it.

Here is the output in all its glory, but it doesn’t do it justice compared to seeing it in action.

100K reads  -- 1009K repeated, 4M distinct, 3M singletons, 6M total k-mers processed
200K reads  -- 1M repeated, 7M distinct, 6M singletons, 13M total k-mers processed
300K reads  -- 2M repeated, 11M distinct, 8M singletons, 20M total k-mers processed
400K reads  -- 2M repeated, 13M distinct, 10M singletons, 26M total k-mers processed
500K reads  -- 2M repeated, 15M distinct, 12M singletons, 33M total k-mers processed
...

It is also possible to break off once you've seen enough and keep what you've downloaded. If it is a gzipped file it will be a bit broken, since the end is missing. However, the data is still there; you just need to fix the FASTQ file by removing the truncated read at the end.
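If you want to automate that cleanup, here is a minimal sketch, assuming plain gzip and 4-line FASTQ records (the function name is made up):

import gzip

def salvage_fastq(path, out_path):
    """Copy complete 4-line FASTQ records from a truncated gzipped file."""
    buf = []
    with open(out_path, "w") as out:
        try:
            with gzip.open(path, "rt") as fh:
                for line in fh:
                    buf.append(line)
                    if len(buf) == 4:
                        if buf[3].endswith("\n"):  # guard against a cut-off quality line
                            out.writelines(buf)
                        buf = []
        except EOFError:
            pass  # the stream broke off mid-record; drop the partial read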