I’m at HiTSeq14 and ISMB14 and just gave my KmerStream talk yesterday. The best thing about giving a talk is that people look at what you’ve been doing in new ways. So yesterday I was talking to Shaun Jackman about potential applications and I joked that you could analyze data while downloading it.
And then I thought about it for a bit and implemented it in
KmerStream. It’s now on the GitHub repository under the online branch. It just adds an --online flag which prints an estimate of the k-mer statistics every 100K reads.
What you can then do is download a data set, tee the input, and process it at the same time. The shell command is a bit of a hack, but just watching it run is worth it.
curl -s http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz | tee frag_1.fastq.gz | gzcat -f | ./KmerStream -k 31 -o tmp --online /dev/fd/0
So what is happening is that
curl -s downloads the data, suppresses any progress reports, and prints to stdout.
tee passes stdout through while saving a copy to the file.
gzcat takes the input and unzips it (use
gzcat on Mac and regular
zcat on Linux). KmerStream doesn’t read from stdin, but you can always reach stdin through the special file
/dev/fd/0, so that fixes it.
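The /dev/fd/0 trick is general: any tool that insists on a filename argument can be pointed at a pipe this way. A minimal illustration, with cat standing in for KmerStream:

```shell
# /dev/fd/0 is a special file that refers to the process's own stdin,
# so a tool that expects a filename can still read from a pipe.
echo "ACGTACGT" | cat /dev/fd/0
# prints ACGTACGT
```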
Here is the output in all its glory, though a static listing doesn’t do justice to seeing it in action.
100K reads -- 1009K repeated, 4M distinct, 3M singletons, 6M total k-mers processed
200K reads -- 1M repeated, 7M distinct, 6M singletons, 13M total k-mers processed
300K reads -- 2M repeated, 11M distinct, 8M singletons, 20M total k-mers processed
400K reads -- 2M repeated, 13M distinct, 10M singletons, 26M total k-mers processed
500K reads -- 2M repeated, 15M distinct, 12M singletons, 33M total k-mers processed
...
It is also possible to break off once you’ve seen enough and keep what you’ve downloaded. If it is a gzipped file, it will be slightly broken since the end is missing. The data up to that point is still there, however; you just need to fix the fastq file by removing the truncated final read.
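One way to do that repair is to keep only complete four-line records. A sketch, assuming standard four-line FASTQ with no tabs in the data (the file names here are made up for illustration):

```shell
# Simulate a truncated download: one complete read, one cut off mid-record.
printf '@read1\nACGT\n+\nIIII\n@read2\nAC' > truncated.fastq

# Fold every 4 lines into one tab-separated row, keep only rows whose
# quality field (column 4) is non-empty, i.e. complete records,
# then unfold back into FASTQ.
paste -d'\t' - - - - < truncated.fastq \
  | awk -F'\t' '$4 != ""' \
  | tr '\t' '\n' > fixed.fastq

cat fixed.fastq   # only @read1 survives
```

Decompressing with gzcat -f first (as in the pipeline above) lets you apply the same filter to the partially downloaded .gz file, since gzcat emits everything it can before hitting the truncated tail.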