[Bioc-sig-seq] Memory bottlenecks while using ShortRead
Martin Morgan
mtmorgan at fhcrc.org
Sun Jun 20 04:52:10 CEST 2010
Hi Cei --
On 06/19/2010 02:34 PM, Cei Abreu-Goodger wrote:
> Hello all,
>
> I've been very pleased with what I can do with ShortRead, but I was
> wondering if I could significantly decrease the amount of memory I'm using.
For the specific problem below alphabetByCycle(quality(sr)) returns a
matrix of quality score x cycle, with entries the tally of the number of
reads with that score. So a column therefore represent the weights on
which quantiles can be calculated, I guess 'by hand' rather than using
'quantiles'.
More generally I'd like to implement something to process larger files.
The conversations I've had have involved 'streams' where the file
essentially gets processed in chunks. The user would have to provide a
function that processed the appropriate data structure, e.g., an
'AlignedRead' or 'ShortReadQ' object, and possibly also a function that
stitched the results from several chunks back together.
I don't think this solution is really the best possible. For instance
the user might want to do several things with their data (transform
reads, qa, down-stream analysis, ...) and then one is faced either with
performing a stream for each type of analysis, or creating a complicated
object after processing each chunk and then having to stitch this
complicated object back together. The streaming approach might also
require some more clever algorithms associated with operations that
depend non-trivially on the entire data set (e.g., choosing a random
subset), some of which might be provided by ShortRead, but others of
which might make analysis quite complicated.
So I'd be interested in hearing other ideas and use cases.
Martin
>
> For example, a simple QC script for Illumina fastq files:
>
>
> library(ShortRead)
>
> # Read in fastq file
> sr <- readFastq("fastqfile.fq.gz", withIds=FALSE)
>
> # obtain base frequency per cycle
> abc <- alphabetByCycle(sread(sr))
> abcFreq <- abc / length(sr)
> write.table(abcFreq, "abcFreq.txt")
> rm(abc,abcFreq)
> gc()
> # used (Mb) gc trigger (Mb) max used (Mb)
> #Ncells 1272586 68.0 1967602 105.1 1476915 78.9
> #Vcells 475890241 3630.8 880996994 6721.5 875431136 6679.1
>
>
> # obtain quality values per cycle
> numQuals <- as(quality(sr), "matrix")
> qualQ <- apply(numQuals,2,function(x) quantile(x,c(0.9,0.5,0.1)))
> write.table(qualQ, "qualQ.txt")
> rm(numQuals,qualQ)
> gc()
> # used (Mb) gc trigger (Mb) max used (Mb)
> #Ncells 1274699 68.1 1967602 105.1 1476915 78.9
> #Vcells 475890548 3630.8 1921228277 14657.9 2390924743 18241.4
>
>
>
> While this is very simple to write and modify, it comes at the expense
> of a larger and larger amounts of RAM for newer fastq files.
>
>
> Regarding ways of reducing this memory footprint:
>
> 1) I could of course split the fastq file into smaller sets and run the
> exact same code. I could store the base counts and at the end calculate
> frequency. What about the quality quantiles though? I could still store
> them, and then average them, and it would still be useful if not
> identical to the quantiles calculated on the full data.
>
> 2) Is there a more memory efficient way of calculating quantiles in R?
>
>
> Does anyone have a better suggestion? Of course, the above code was
> given as a simple example, but further processing, like removing adapter
> sequences, quality filtering, etc, can quickly increase the memory
> footprint for processing a full fastq file to ~30-100 GB.
>
> Many thanks,
>
> Cei
>
>
>
> My sessionInfo:
>
>
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices datasets utils methods base
>
> other attached packages:
> [1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.18-5
> [4] Biostrings_2.16.0 GenomicRanges_1.0.3 IRanges_1.6.2
> [7] Biobase_2.8.0
>
> loaded via a namespace (and not attached):
> [1] grid_2.11.0 hwriter_1.2
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-sig-sequencing
mailing list