Dear assemblers, we need to talk … together

This post was co-authored by Michael R. Crusoe @biocrusoe and thanks to Shaun Jackman @sjackman for discussions and reading the draft.

tl;dr we need a common file format for contig graphs and other stuff too

The state of assembly software right now is essentially dominated by a few end-to-end solutions. These include SPAdes, ALLPATHS, ABySS, and SOAPdenovo (if your tool is not here, don’t take it the wrong way). This integration is good for consumers who want good methods that get them from raw fastq files to an assembled genome.

On the other hand this tight integration is not a good situation if you want to get into the assembly business. Here there are two scenarios which make entry into the field difficult. First if you have some good ideas about how to improve some part of the assembly process, you have two choices: a) write an assembler from scratch, b) try to modify the some of the programs mentioned before or work your way around them. The second scenario arises when you have a solution along the lines of “oh, I’ll just do an assembly with some minor modifications and the tweak the output a bit”.

The solution to these problems is to break up the “assembly” problem to a few well defined pieces. The analogy we should be thinking about is what is done in variant calling. In this case the process goes from raw reads to aligned reads in BAM files, then every variant caller reads the BAM file and spits out a VCF (say what you will about VCF, but at least it is a file format). This is exactly what the field of assembly needs right now. A few file formats that capture the common tasks that every assembler goes through.

The common tasks are

  1. Create a graph (de Bruijn/overlap/contig/string … )
  2. Simplify the graph
  3. Map reads to graphs
  4. Produce an assembly

Some of the assemblers do these steps with several programs, SOAPdenovo and ABySS for instance have a separate program for each part. Modularity at the software level is great, but it doesn’t allow practitioners to mix and match across different software platforms.

The best way to open up the field is to embrace the Unix mantra that everything is a file. A corollary of this is that the best API is the file. And this is the key point here. How you do 1,2,3, and 4 shouldn’t matter and shouldn’t be standardized, but rather the intermediate files between 1 and 2, etc. Right now we have good file formats for before point 1, namely FASTQ files, and after 4, either FASTA or FASTG.

What is missing is a file format and an agreement across the field to use those for the intermediate results. Many of the programs have these intermediate files, but they are often tied to a particular representation.

We need a common file format for the contig graph and read alignment to a contig graph. The graph format would take care of 1-2 and 2-3, since the output is a graph in both cases. Similarly the mapping of reads would allow you to go from 3 to 4. Once we have common file formats we would be free to choose whatever program for performing any of these steps.

Perhaps you would want to write a new algorithm for resolving repeats using next-next-next-generation sequencing you could take as input a contig graph from the old days and combine that with your new 74.5% error rate long-long read stuff. End users could then use your method with any assembler they think is doing best at the moment.

Let’s take a more concrete example and say you want to build an RNA-seq assembler, first step is to take reads and generate contigs. All you would have to do is pick a contig assembler and tell it to not worry about aggressive error correction, assumptions about uniform coverage and all sorts of fancy stuff and just give you the graph, you can take it over from there.

This brings me to the next issue which is graph read alignment. This entails taking reads and mapping them against the graph. Instead of mapping to a linear sequence like a chromosome in a well defined assembly, or just a contig/scaffold, we are looking for an alignment to a path in the graph. Of course paths in the graph should correspond to a contiguous stretch of the underlying genome or transcriptome. The alignment will necessarily be fragmented when you have reads that can span multiple contigs.

A tool that can take as input a graph and raw reads and map the reads to the graph would be very valuable. This is currently something people are working on, both within and outside assemblers, but what is missing is the interoperability in terms of (again) the graph file format and more importantly the alignment format.

So what is the solution? I have no idea.

Ok, that’s not right, I have a few ideas, but I’m willing to change my mind.

First things first, the graph file format. There is already a graph file format for sequences, FASTG. But I have a few issues. It was meant as a format for representing variability in the final output of an assembly, e.g. SNPs, indels, tandem repeats etc, that doesn’t really break up a contig. Both SPAdes and ALLPATHS output FASTG files, but from what I can tell they don’t use it for intermediate results. Also there is no notion of coverage in FASTG, but you can put that as a property of the edge. Also, the graph stores sequences on the edges instead of nodes. I don’t like it, but if you can convince me I’m wrong I’ll reverse my position.

But unfortunately there is very little library support for FASTG. Writing a full parser for it is not impossible, but if you look at the standard you’ll see that there’s plenty of work to do. One idea would be to restrict the contig graph to a smaller subset of the FASTG standard. Essentially this would mean no funny business in the sequence (read the standard and you’ll be amazed at what is possible), so that is still a valid FASTA file (sort of), and encode the coverage and other information in the comment. So a simple example would be

>A:B:cov=10;
GATTA
>B:C':cov=9;
AC
>C::cov=19;
T

Another way to represent the graph is the way ABySS does it. Hat tip to Shaun Jackman for telling me about this (but he works on ABySS and is a part of the establishment, so take this with a grain of salt). The idea is to keep contigs in a FASTA file and have a separate graph file in the Graphviz dot file format, you can see the nitty gritty details. The key here is to use a subset of the dot file format popularized by Graphviz. This buys you a whole lot of library wonderfulness and plenty of programs that read the format. Essentially you keep track of edges of the graph by listing them in a bidirectional format so “A+” ->  “B+” is forward strand of A is followed by forward strand of B, whereas “B+” -> “C-” is forward strand of B is followed by reverse complement of C. If you don’t know why the bidirectional stuff is needed you have to read Gene Myers’ String Graph paper. We’ll wait.

What is really cute is that you can specify properties of the edges like overlap, which allow you to store more context in sequences for the contigs. If the above example had been formed with a de Bruijn graph with k=4, the resulting contigs might have been

>A
GATTA
>B
TTAC
>C
TGTA

and the dot file would be

digraph g {
A+ -> B+ [overlap=3]
B+ -> C- [overlap=3]
}

The advantage of this way is that both dot files and fasta files are standard with libraries that can parse this information. Also since the sequence and the graph are separated it allows you to work with the graph directly without loading the sequence. For instance if you would want to trim down the graph or only work with a particular subgraph you would only need to load the graph file, do your graph theory business and then work with the small subgraph. If you need to load a subset of the sequences you can have your FASTA files indexed and load only the sequences that you need.

I’ll leave the issue of graph alignment and how to represent the output until later. For now I just wanted to throw it out there because I think this is something that needs to happen. The implementation for outputting the graph files is not a lot of work if you have an assembly program, what’s keeping them back is that there haven’t been any requests for doing it and no common file format.

Let’s talk about this, either in comments, blog posts or on twitter. I really don’t care which format we use, as long as we can have a common one.

 

Additional thoughts:

This has been preached by the small tools manifesto for a wider class of problems.
Earlier this year there was discussion about similar issues in the context of scaffolders.

 

late update (july 22): Heng Li has a new proposal on his blog on a file format, the discussion in the comments is quite active.

7 thoughts on “Dear assemblers, we need to talk … together

  1. Thanks for this and previous posts on de Bruijn graphs / assembly. I am thinking about how to customize an assembler to output the data I need. Your discussion is helpful.

  2. Great blog post.

    For 1->2, 2->3 and after 4, FASTG would be adequate (it’s a format for any sequence graph, whether simplified or not), and is the current best candidate, isn’t it?

    For the missing step 3->4. I would like to say ‘FASTG’ again but I’m not sure if/how to encode read mapping information in FASTG (i.e. a path with annotations), does anyone know if it’s even possible in that format?

    Although I agree that storing sequences in edges (instead of, say, overlap information) feels also unnatural to me. Could anyone who’ve already successfully implemented FASTG output in their assembler (e.g. Allpaths, Spades) comment on that?

  3. SPAdes outputs FASTG from the set of paths which is the result of the repeat resolution via ExSPAnder. Unfortunately, FASTG itself is not flexible enough to be an intermediate format in SPAdes case. What we’re having now (pretty simplified):

    1. Average k-mer coverage for each edge
    2. Flanking coverage around the vertices
    3. Mapping k-mer => edge stored as a perfect hash map
    4. Various k-mer maps which are populated during simplification (use of them helps a lot during read alignment step)

    At various stages we’re also having bunch of other information including the estimated distances between edges, how the long reads traverse the graph, etc. Unfortunately, the assembly is not a simple 1-2-3-4 procedure as it was outlined above… 😦 in SPAdes world they use heavily the aux information from prev. stages to speed-up things and to improve the precision.

    I must note, though, that the stuff in SPAdes is made into separate “stages” internally and there is intermediate format there, so one can surely write his / her own repeat resolver on top of current SPAdes simplification routines. We would be happy to help if someone will ask for help here.

  4. I couldn’t agree more with this article. One of the benefits of AMOS was the shared file formats for assembler development. It allowed developers to test new modules (overlappers, layout, etc.) without having to reinvent the wheel. However, nowadays the biggest issue is that the AMOS formats are a bit outdated/inefficient.

  5. Rayan: fastg is definitely a viable option, although the more I think about it I like the idea of using dot and fasta.

    For the read alignment (and we’ll get to this later) it would be best to have a format that is searchable, like sorted bam files are. It is not only meant to capture the information that is there, but also make it easier for others to use it.

  6. Anton: Thanks for the insight into the spades pipeline.

    Of course it will be more efficient to keep track of all the information you might need to speed up your algorithm in later stages. I think that is perfectly valid and I’m not advocating for this to be the only intermediate format, but rather for users to have the option of asking for intermediate results. Even if it is less efficient to generate the files and they lose some resolution, it is still beneficial to the community.

    It might even be possible to have the intermediate output be split into common parts that everyone can use and implementation specific things that the pipeline will use. For example you mention the k-mer map for alignment, this is tied to a specific implementation, whereas the contig graph is something more implementation independent.

    I’m not arguing that all the intermediate results have to be in one file, but it would be great if the graph itself was there so that others could plug into that information.

  7. I fully agree with the needs described here. There was an attempt a couple of years ago (by Daniel Zerbino, myself, David Jaffe, Richard Durbin, and others) to design a file format(s) to exchange intermediate results. There was even a draft file format called Sequence Graph Format, from Richard Durbin, which was in fairly advanced stages, I thought. At some point, the attempt stalled, at least from what I could tell. I would be very glad to get involved again.

Leave a reply to Anton Korobeynikov Cancel reply