[Bioc-sig-seq] Memory bottlenecks while using ShortRead
Cei Abreu-Goodger
cei at ebi.ac.uk
Sat Jun 19 14:34:08 CEST 2010
Hello all,
I've been very pleased with what I can do with ShortRead, but I was
wondering if I could significantly decrease the amount of memory I'm using.
For example, a simple QC script for Illumina fastq files:
library(ShortRead)
# Read in fastq file
sr <- readFastq("fastqfile.fq.gz", withIds=FALSE)
# obtain base frequency per cycle
abc <- alphabetByCycle(sread(sr))
abcFreq <- abc / length(sr)
write.table(abcFreq, "abcFreq.txt")
rm(abc,abcFreq)
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 1272586 68.0 1967602 105.1 1476915 78.9
#Vcells 475890241 3630.8 880996994 6721.5 875431136 6679.1
# obtain quality values per cycle
numQuals <- as(quality(sr), "matrix")
qualQ <- apply(numQuals,2,function(x) quantile(x,c(0.9,0.5,0.1)))
write.table(qualQ, "qualQ.txt")
rm(numQuals,qualQ)
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
#Ncells 1274699 68.1 1967602 105.1 1476915 78.9
#Vcells 475890548 3630.8 1921228277 14657.9 2390924743 18241.4
While this is very simple to write and modify, it comes at the expense
of a larger and larger amounts of RAM for newer fastq files.
Regarding ways of reducing this memory footprint:
1) I could of course split the fastq file into smaller sets and run the
exact same code. I could store the base counts and at the end calculate
frequency. What about the quality quantiles though? I could still store
them, and then average them, and it would still be useful if not
identical to the quantiles calculated on the full data.
2) Is there a more memory efficient way of calculating quantiles in R?
Does anyone have a better suggestion? Of course, the above code was
given as a simple example, but further processing, like removing adapter
sequences, quality filtering, etc, can quickly increase the memory
footprint for processing a full fastq file to ~30-100 GB.
Many thanks,
Cei
My sessionInfo:
R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.18-5
[4] Biostrings_2.16.0 GenomicRanges_1.0.3 IRanges_1.6.2
[7] Biobase_2.8.0
loaded via a namespace (and not attached):
[1] grid_2.11.0 hwriter_1.2
More information about the Bioc-sig-sequencing
mailing list