[Bioc-sig-seq] Memory bottlenecks while using ShortRead

Sat Jun 19 14:34:08 CEST 2010

Hello all,

I've been very pleased with what I can do with ShortRead, but I was 
wondering if I could significantly decrease the amount of memory I'm using.

For example, a simple QC script for Illumina fastq files:

library(ShortRead)

# Read in fastq file
sr <- readFastq("fastqfile.fq.gz", withIds=FALSE)

# obtain base frequency per cycle
abc <- alphabetByCycle(sread(sr))
abcFreq <- abc / length(sr)
write.table(abcFreq, "abcFreq.txt")
rm(abc,abcFreq)
gc()
#            used   (Mb) gc trigger   (Mb)  max used   (Mb)
#Ncells   1272586   68.0    1967602  105.1   1476915   78.9
#Vcells 475890241 3630.8  880996994 6721.5 875431136 6679.1

# obtain quality values per cycle
numQuals <- as(quality(sr), "matrix")
qualQ <- apply(numQuals,2,function(x) quantile(x,c(0.9,0.5,0.1)))
write.table(qualQ, "qualQ.txt")
rm(numQuals,qualQ)
gc()
#            used   (Mb) gc trigger    (Mb)   max used    (Mb)
#Ncells   1274699   68.1    1967602   105.1    1476915    78.9
#Vcells 475890548 3630.8 1921228277 14657.9 2390924743 18241.4

While this is very simple to write and modify, it comes at the expense 
of a larger and larger amounts of RAM for newer fastq files.

Regarding ways of reducing this memory footprint:

1) I could of course split the fastq file into smaller sets and run the 
exact same code. I could store the base counts and at the end calculate 
frequency. What about the quality quantiles though? I could still store 
them, and then average them, and it would still be useful if not 
identical to the quantiles calculated on the full data.

2) Is there a more memory efficient way of calculating quantiles in R?

Does anyone have a better suggestion? Of course, the above code was 
given as a simple example, but further processing, like removing adapter 
sequences, quality filtering, etc, can quickly increase the memory 
footprint for processing a full fastq file to ~30-100 GB.

Many thanks,

Cei

My sessionInfo:

R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu

locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] ShortRead_1.6.2     Rsamtools_1.0.1     lattice_0.18-5
[4] Biostrings_2.16.0   GenomicRanges_1.0.3 IRanges_1.6.2
[7] Biobase_2.8.0

loaded via a namespace (and not attached):
[1] grid_2.11.0 hwriter_1.2