[Bioc-sig-seq] understanding memory size of AlignedRead objects

Sat May 14 10:19:45 CEST 2011

Hi Janet,

On 11-05-13 04:09 PM, Janet Young wrote:
> Hi Herve,
>
> Thanks for this, as well as your long email of yesterday.
>
> I'll try the code, but first a quick question:  I did update.packages, and got the new IRanges (1.10.2) and ShortRead (1.10.2) versions as below, but it looks like Biostrings that's online is still 2.20.0 - maybe there was a glitch in uploading?

Please try again tomorrow. Biostrings 2.20.1 failed to pass check today
because of a mistake in one of the examples I updated, sorry:

http://bioconductor.org/checkResults/2.8/bioc-LATEST/Biostrings/lamb1-checksrc.html

Hopefully the light will turn green on the report tomorrow and the
package will propagate to the public repo...

H.

>
> thanks,
>
> Janet
>
>
> On May 13, 2011, at 3:22 AM, Hervé Pagès wrote:
>
>> Hi Janet,
>>
>> I've committed a few changes to IRanges/Biostrings/ShortRead that
>> should address the problem you are experiencing with compact():
>>
>>   library(ShortRead)
>>   dirPath<- "Data/C1-36Firecrest/Bustard/GERALD"
>>   bigaln<- readAligned(dirPath, "big_s_2_export.txt", "SolexaExport")
>>
>>   length(bigaln)  # 16384000
>>   object.size(bigaln)  # 2457626736 bytes
>>
>>   ## Size overhead for any AlignedRead object:
>>   emptyaln<- bigaln[FALSE]
>>   ALNovh<- object.size(emptyaln)  # 20736 bytes
>>
>>   ## Bytes per read:
>>   (object.size(bigaln) - ALNovh) / length(bigaln)  # 150 bytes
>>
>>   ## Take the first 10000 elts:
>>   first10000<- bigaln[1:10000]
>>   first10000_compact<- compact(first10000)
>>
>>   ## Bytes per read:
>>   (object.size(first10000_compact) - ALNovh) / 10000  # 150 bytes
>>
>> Make sure you have IRanges>= 1.10.2, Biostrings>= 2.20.1 and
>> ShortRead>= 1.10.2 (should become available in release on
>> Friday around 11am Seattle time).
>>
>> Cheers,
>> H.
>>
>>> sessionInfo()
>> R version 2.13.0 (2011-04-13)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.10.2    Rsamtools_1.4.1     lattice_0.19-26
>> [4] Biostrings_2.20.1   GenomicRanges_1.4.3 IRanges_1.10.3
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.12.1 grid_2.13.0    hwriter_1.3
>>
>>
>>
>> On 11-05-10 05:16 PM, Janet Young wrote:
>>> Hi again,
>>>
>>> I've been looking into this question a bit more, now using compact(). Things seemed to go well with the example dataset provided with ShortRead (as far as I can tell). But when I use our own much larger dataset the compaction factor is not very good (my small object is the first 100 reads, the full set is ~18 million reads, but the small object is 40% the size in memory of the full object).    I think it looks like quality and sreads haven't really been compacted nearly as much as I would expect. The code is below.
>>>
>>> Martin - if you would like access to this dataset I can email you about that privately, but I'm guessing the same would happen with any big dataset?
>>>
>>> Janet
>>>
>>>
>>>
>>> library(ShortRead)
>>>
>>> ########## a big dataset (our data)
>>>
>>> read2Dir<- "solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>>>
>>> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
>>> my_reads_verysmall<- my_reads[1:100]
>>> my_reads_verysmall_compact<- compact(my_reads[1:100])
>>>
>>> length(my_reads)
>>> # [1] 17894091
>>> length(my_reads_verysmall)
>>> # [1] 100
>>> length(my_reads_verysmall_compact)
>>> # [1] 100
>>>
>>> object.size(my_reads)
>>> # 3190125528 bytes
>>> object.size(my_reads_verysmall)
>>> # 1753653496 bytes
>>> object.size(my_reads_verysmall_compact)
>>> # 1253663384 bytes
>>>
>>> as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
>>> # [1] 0.549713
>>>
>>> as.numeric(object.size(my_reads_verysmall_compact)) / as.numeric(object.size(my_reads))
>>> # [1] 0.3929825
>>>
>>> ##### where is most of that memory?
>>> object.size(position(my_reads_verysmall_compact))
>>> # 440 bytes
>>> object.size(chromosome(my_reads_verysmall_compact))
>>> # 3040 bytes
>>> object.size(sread(my_reads_verysmall_compact))
>>> # 626820976 bytes
>>> object.size(quality(my_reads_verysmall_compact))
>>> # 626821568 bytes
>>>
>>> myDNA2<-  DNAStringSet(as.character(sread(my_reads_verysmall_compact)))
>>> object.size(myDNA2)
>>> # 9944 bytes
>>>
>>>
>>> ###### and ShortRead's example dataset, 1000 reads
>>>
>>> exptPath<- system.file("extdata", package = "ShortRead")
>>> sp<- SolexaPath(exptPath)
>>> aln<- readAligned(sp, "s_2_export.txt")
>>>
>>> aln  ## aln has 1000 reads
>>> aln_small<- aln[1:2]   ### aln 2 has 2 reads
>>> aln_small_compact<- compact(aln[1:2])
>>>
>>> object.size(aln)
>>> # 175968 bytes
>>> object.size(aln_small)
>>> # 91488 bytes
>>> object.size(aln_small_compact)
>>> # 21744 bytes
>>>
>>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
>>> # [1] 0.5199127
>>>
>>> as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln))
>>> # [1] 0.1235679
>>>
>>> myDNA<- DNAStringSet( as.character(sread(aln_small_compact)) )
>>> identical(myDNA,sread(aln_small_compact))
>>> [1] TRUE
>>>
>>> aln_medium_compact<- compact(aln[1:300])
>>> myDNA_medium<- DNAStringSet( as.character(sread(aln_medium_compact)) )
>>> identical(myDNA,sread(aln_medium_compact))
>>> # [1] FALSE
>>>
>>> aln_10_compact<- compact(aln[1:10])
>>> myDNA_10<- DNAStringSet( as.character(sread(aln_10_compact)) )
>>> identical(myDNA,sread(aln_10_compact))
>>> # [1] FALSE
>>>
>>> as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln))
>>> # [1] 0.1317512
>>>
>>> #########
>>>
>>>
>>> sessionInfo()
>>>
>>> R version 2.13.0 (2011-04-13)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.10.0    Rsamtools_1.4.0     lattice_0.19-23
>>> [4] Biostrings_2.20.0   GenomicRanges_1.4.0 IRanges_1.10.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.12.0 grid_2.13.0    hwriter_1.3
>>>
>>>
>>>
>>>
>>> On May 10, 2011, at 4:25 PM, Martin Morgan wrote:
>>>
>>>> On 05/10/2011 02:47 PM, Janet Young wrote:
>>>>> Hi, (probably hello to you, Martin)
>>>>
>>>> Hi (or is that hello?) Janet --
>>>>
>>>>> I'm looking at some Illumina seq data, and trying to be more rigorous
>>>>> than I have been in the past about memory usage and tidying up unused
>>>>> variables. I'm a little mystified by something - I wonder if you can
>>>>> help me understand?
>>>>>
>>>>> I'm starting with a big AlignedRead object (one full lane of seq
>>>>> data) and then I've been using [] on AlignedRead objects to take
>>>>> various subsets of the data (and then looking at quality scores, map
>>>>> positions, etc).   I'm also taking some very small subsets (e.g. just
>>>>> the first 100 reads) to test and optimize some functions I'm
>>>>> writing.
>>>>>
>>>>> My confusion comes because even though I'm cutting down the number of
>>>>> seq reads by a lot (e.g. from 18 million to just 100 reads), the new
>>>>> AlignedRead object still takes up a lot of memory.
>>>>
>>>> XStringSet (including DNAStringSet and BStringSet, which are used to hold AlignedRead sequence and quality information) are actually 'views' (start and end coordinates) on an underlying XString; when you subset a DNAStringSet, you subset the view but not the underlying DNAString. This might sound bad, but actually you have just one instance of the DNAString, and two light-weight views (the full view, and the subset view). If you're done with the full DNAString, you can force it to be compacated with compact(), e.g., after your first example
>>>>
>>>>> object.size(s1)
>>>> 50840 bytes
>>>>> object.size(s1[1:10]) # mostly smaller 'view'
>>>> 38984 bytes
>>>>> object.size(s1[1]) # a little smaller 'view', same big DNAString
>>>> 38864 bytes
>>>>> object.size(compact(s1[1]))
>>>> 3912 bytes
>>>>
>>>> or
>>>>
>>>>> object.size(aln)
>>>> 175968 bytes
>>>>> object.size(aln[1])
>>>> 91488 bytes
>>>>> object.size(compact(aln[1]))
>>>> 21584 bytes
>>>>
>>>> (I think Herve is working on your weighted coverage question)
>>>>
>>>> Martin
>>>>
>>>>>
>>>>> Two examples are given below - in both cases the small object takes
>>>>> about half as much memory as the original, even though the number of
>>>>> reads is now very much smaller.
>>>>>
>>>>> Do you have any suggestions as to how I might reduce the memory
>>>>> footprint of the subsetted AlignedRead object?  Is this an expected
>>>>> behavior?
>>>>>
>>>>> thanks very much,
>>>>>
>>>>> Janet
>>>>>
>>>>>
>>>>> library(ShortRead)
>>>>>
>>>>> exptPath<- system.file("extdata", package = "ShortRead") sp<-
>>>>> SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")
>>>>>
>>>>> aln  ## aln has 1000 reads aln_small<- aln[1:2]   ### aln 2 has 2
>>>>> reads
>>>>>
>>>>> object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes
>>>>>
>>>>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
>>>>> #### [1] 0.4978324
>>>>>
>>>>> read2Dir<-
>>>>> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>>>>>
>>>>>
>>>> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
>>>>> my_reads_verysmall<- my_reads[1:100]
>>>>>
>>>>> length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100
>>>>>
>>>>> object.size(my_reads) # 3190125528 bytes
>>>>> object.size(my_reads_verysmall) # 1753653496 bytes
>>>>>
>>>>> as.numeric(object.size(my_reads_verysmall)) /
>>>>> as.numeric(object.size(my_reads)) # [1] 0.549713
>>>>>
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>>
>>>>> R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
>>>>> (32-bit)
>>>>>
>>>>> locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages: [1] stats     graphics  grDevices utils
>>>>> datasets  methods   base
>>>>>
>>>>> other attached packages: [1] ShortRead_1.10.0    Rsamtools_1.4.1
>>>>> lattice_0.19-26     Biostrings_2.20.0 [5] GenomicRanges_1.4.3
>>>>> IRanges_1.10.0
>>>>>
>>>>> loaded via a namespace (and not attached): [1] Biobase_2.12.1
>>>>> grid_2.13.0    hwriter_1.3
>>>>>
>>>>> _______________________________________________ Bioc-sig-sequencing
>>>>> mailing list Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>>
>>>> --
>>>> Computational Biology
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: M1-B861
>>>> Telephone: 206 667-2793
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319