Analyzing data while downloading

I’m at HiTSeq14 and ISMB14 and just gave my KmerStream talk yesterday. The best thing about giving a talk is that people look at what you’ve been doing in new ways. So yesterday I was talking to Shaun Jackman about potential applications and I joked that you could analyze data while downloading it.

And then I thought about it for a bit and implemented it in KmerStream. It’s now in the GitHub repository under the online branch. It just adds an --online flag which will print an estimate of the k-mer statistics every 100K reads.

What you can then do is download a data set, tee the input and process it at the same time. It is a bit of a hack, with the shell command, but just watching it run is worth it.

curl -s http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz | tee frag_1.fastq.gz | gzcat -f | ./KmerStream -k 31 -o tmp --online /dev/fd/0

So what is happening is that curl -s downloads the data without printing any progress reports and writes it to stdout. tee takes that stream, saves a copy to the file and passes it along. gzcat takes the input and unzips it (use gzcat on Mac and regular zcat on Linux). KmerStream doesn’t read from stdin, but you can always read stdin through the special file /dev/fd/0, so that fixes it.
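If you want to try the plumbing without a network connection or KmerStream installed, here is a minimal sketch of the same pattern. Everything in it is a stand-in I made up for illustration: a tiny local gzipped FASTQ replaces the curl download, gzip -dc replaces gzcat/zcat (it should behave the same on Mac and Linux), and wc -l reading /dev/fd/0 replaces KmerStream.

```shell
# Stand-in for the remote file: two FASTQ records, gzip-compressed.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/source.fastq.gz"

# Same shape as the curl pipeline: cat plays the role of curl -s,
# tee saves a copy, gzip -dc decompresses, and the consumer (wc here,
# KmerStream in the post) is given the path /dev/fd/0 instead of stdin.
lines=$(cat "$workdir/source.fastq.gz" \
  | tee "$workdir/copy.fastq.gz" \
  | gzip -dc \
  | wc -l /dev/fd/0 \
  | awk '{ print $1 }')
echo "lines seen by the consumer: $lines"

# The copy saved by tee is byte-identical to what was streamed.
cmp -s "$workdir/source.fastq.gz" "$workdir/copy.fastq.gz" && echo "saved copy intact"
```

The point is just that tee lets the compressed bytes go to disk untouched while the decompressed stream is analyzed, so nothing is downloaded twice.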

Here is the output in all its glory, but it doesn’t do it justice compared to seeing it in action.

100K reads  -- 1009K repeated, 4M distinct, 3M singletons, 6M total k-mers processed
200K reads  -- 1M repeated, 7M distinct, 6M singletons, 13M total k-mers processed
300K reads  -- 2M repeated, 11M distinct, 8M singletons, 20M total k-mers processed
400K reads  -- 2M repeated, 13M distinct, 10M singletons, 26M total k-mers processed
500K reads  -- 2M repeated, 15M distinct, 12M singletons, 33M total k-mers processed
...

It is also possible to break off once you’ve seen enough and keep what you’ve downloaded. If it is a gzipped file, it will be a bit broken since the end is missing. However, the data is still there; you just need to fix the fastq file by removing the broken read at the end.
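One way to do that cleanup, as a rough sketch rather than anything KmerStream provides: decompress whatever gzip can still decode and keep only complete four-line records. The file names are hypothetical, the truncated file is simulated by cutting a small test file in half, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines).

```shell
# Simulate an interrupted download: compress 2000 records, keep half the bytes.
workdir=$(mktemp -d)
for i in $(seq 1 2000); do
  printf '@read%d\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' "$i"
done | gzip -c > "$workdir/full.fastq.gz"
size=$(wc -c < "$workdir/full.fastq.gz")
head -c $((size / 2)) "$workdir/full.fastq.gz" > "$workdir/partial.fastq.gz"

# gzip -dc writes out what it can decode and then fails with an
# "unexpected end of file" error, so silence and ignore that failure.
# awk buffers lines in groups of four and prints a record only when all
# four are present and the quality line is as long as the sequence line,
# which drops a read that was cut off mid-record.
{ gzip -dc "$workdir/partial.fastq.gz" 2>/dev/null || true; } \
  | awk '{ rec[NR % 4] = $0 }
         NR % 4 == 0 && length(rec[2]) == length($0) {
           print rec[1]; print rec[2]; print rec[3]; print rec[0]
         }' > "$workdir/salvaged.fastq"

records=$(( $(wc -l < "$workdir/salvaged.fastq") / 4 ))
echo "salvaged $records complete reads"
```

The length check is a heuristic: it only catches a final read whose quality line was truncated, which is the case that matters when the file simply stops early.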
