[Bioc-sig-seq] Read big fastq files in chunks
Martin Morgan
mtmorgan at fhcrc.org
Tue Sep 27 14:33:19 CEST 2011
On 09/27/2011 02:39 AM, Anita Lerch wrote:
> Hi,
> I should filter a big fastq file (HiSeq ca.>30'000'000, which does not
> fit in memory) for low qualities e.g. to many N in the sequence.
>
> I plan to read the file in chunks, filter the ShortReads and write them
> into a new file.
> This is no problem for fasta files, see the example:
>
> filename<- "data/DmS2DRSC_RNA_seqs_subsample.fa"
> eof<- FALSE
> append<- FALSE
> cycle<- 1L
> while(!eof){
> chunk<- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
> nFilt<- nFilter(5)
> writeFasta(chunk[nFilt(chunk)],
> sprintf("%s/filtered_%s",
> dirname(filename), basename(filename)),
> append=append)
> append<- TRUE
> cycle<- cycle + 1
> if(length(chunk) == 0L)
> eof<- TRUE
> }
>
> If I try to do the same for fastq file e.g.
>
> filename<- "data/test_data.fq"
> eof<- FALSE
> mode<- "w"
> cycle<- 1L
> while(!eof){
> chunk<- readFastq(filename)
> nFilt<- nFilter(5)
> writeFastq(chunk[nFilt(chunk)],
> sprintf("%s/filtered_%s",
> dirname(filename), basename(filename)),
> mode=mode)
> mode<- "a"
> cycle<- cycle + 1
> if(length(chunk) == 0L)
> eof<- TRUE
> }
>
> I facing some problems when I read the ShortReads:
> - read.DNAStringSet ignores the quality sequence.
> - readFastq can't read in chunks
> - FastqSampler choose a random sample of ShortRead
>
> Of course I can do a workaround e.g writing my own fastq parser, but I
> prefer a clean solution or to solve the problem.
> So my questions, is there a clean solution I'm not aware of it?
> If I'm right and there is this problem, what is the plan of the
> community to solve this problem or is this already work in progress?
> Can I help to solve this problem e.g. rewriting the readFastq()
> function.
Hi Anita -- this needs to be added to ShortRead, and I can do so later
this week. But if you'd like to help, I would like to create a reference
class
.FastqStreamer <- setRefClass("FastqStreamer", <...>)
a constructor
FastqStreamer <- function(con, n=1e6..., verbose=FALSE)
and a method
yield,FastqStreamer-method
so that streaming through a file looks like
f <- FastqStreamer(con, n=1e6, verbose=FALSE)
while (length(srq <- yield(f))) {
## work
}
I think .FastqStreamer would share some infrastructure with
.FastqSampler, and would take the same approach as in
R/methods-Sampler.R -- treat the file at the R level as a connection
that is input with readBin() so that R's connection interface can be
used, do some R-level buffering, and translate the raw() to ShortReadQ
objects in C (like src/sampler.c:sampler_rec_parser). Probably there is
some code re-factoring and a little reference class hierarchy .FastqFile
as parent of .FastqStreamer and .FastqSampler.
Martin
>
> Thanks for the answer.
> Greetings
> Anita
>
>> sessionInfo()
> R Under development (unstable) (2011-09-20 r57033)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.11.37 latticeExtra_0.6-18 RColorBrewer_1.0-5 Rsamtools_1.5.65
> [5] lattice_0.19-33 Biostrings_2.21.9 GenomicRanges_1.5.47 IRanges_1.11.26
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.13.9 BiocInstaller_1.1.27 BSgenome_1.21.3 grid_2.14.0
> [5] hwriter_1.3 QuasR_0.1.1 RCurl_1.6-10 rtracklayer_1.13.13
> [9] tools_2.14.0 XML_3.4-3 zlibbioc_0.1.7
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing
mailing list