[Bioc-sig-seq] understanding memory size of AlignedRead objects
Janet Young
jayoung at fhcrc.org
Tue May 10 23:47:01 CEST 2011
Hi, (probably hello to you, Martin)
I'm looking at some Illumina seq data, and trying to be more rigorous than I have been in the past about memory usage and tidying up unused variables. I'm a little mystified by something - I wonder if you can help me understand?
I'm starting with a big AlignedRead object (one full lane of seq data) and then I've been using [] on AlignedRead objects to take various subsets of the data (and then looking at quality scores, map positions, etc). I'm also taking some very small subsets (e.g. just the first 100 reads) to test and optimize some functions I'm writing.
My confusion comes because even though I'm cutting down the number of seq reads by a lot (e.g. from 18 million to just 100 reads), the new AlignedRead object still takes up a lot of memory.
Two examples are given below - in both cases object.size() reports the small object at about half the size of the original, even though it contains far fewer reads.
Is this expected behavior? Do you have any suggestions as to how I might reduce the memory footprint of the subsetted AlignedRead object?
thanks very much,
Janet
library(ShortRead)
exptPath <- system.file("extdata", package = "ShortRead")
sp <- SolexaPath(exptPath)
aln <- readAligned(sp, "s_2_export.txt")
aln ## aln has 1000 reads
aln_small <- aln[1:2] ## aln_small has 2 reads
object.size(aln)
# 165156 bytes
object.size(aln_small)
# 82220 bytes
as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
# [1] 0.4978324
read2Dir <- "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
my_reads <- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
my_reads_verysmall <- my_reads[1:100]
length(my_reads)
# [1] 17894091
length(my_reads_verysmall)
# [1] 100
object.size(my_reads)
# 3190125528 bytes
object.size(my_reads_verysmall)
# 1753653496 bytes
as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
# [1] 0.549713
sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.10.0 Rsamtools_1.4.1 lattice_0.19-26 Biostrings_2.20.0
[5] GenomicRanges_1.4.3 IRanges_1.10.0
loaded via a namespace (and not attached):
[1] Biobase_2.12.1 grid_2.13.0 hwriter_1.3