[Bioc-sig-seq] understanding memory size of AlignedRead objects
Janet Young
jayoung at fhcrc.org
Sat May 14 01:35:29 CEST 2011
Hi again,
Thank you, Herve! That made perfect sense and was very helpful (it explains some issues I had a while ago too). I'm slowly learning, with help from you all.
I'll need to think a little more about how I monitor my memory usage now that I understand what you've told me about object.size and sharing.
I started getting into this whole question because I noticed I'd done something to clog up the memory and swap in a large-memory machine, so my first thought was to look at what big objects were around in my session. Are there any other tools I should look into to help me figure out where all my memory is being used? (I suspect there was something wrong in that particular session - I haven't had the same trouble since doing very similar things).
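(For what it's worth, what I'd been doing to look for big objects was roughly the following, which I now see can over-count shared sequence data:)
sizes <- sapply(ls(), function(nm) object.size(get(nm)))
head(sort(sizes, decreasing=TRUE))   # biggest objects first, but shared data gets counted per object
gc()                                 # the "used" columns give the real total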
Have a good weekend,
Janet
On May 11, 2011, at 3:11 PM, Hervé Pagès wrote:
> Hi Janet,
>
> You might have found an issue with the "compact" method for AlignedRead
> objects (not sure, I'll look more into this); however, I have a long
> story for you about why you might not even need to use compact().
>
> First thing to understand is that seeing a big "object size" is not
> necessarily a problem. This is because object.size() is not doing a
> very good job on objects that contain external pointers (e.g.
> DNAString, DNAStringSet) or environments.
>
> Here is an example:
>
> library(BSgenome.Celegans.UCSC.ce2)
> chrI <- Celegans$chrI
>
> Then:
>
> > chrI
> 15080483-letter "DNAString" instance
> seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
>
> > object.size(chrI)
> 15082800 bytes
>
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 1129166 60.4 1590760 85.0 1368491 73.1
> Vcells 2386831 18.3 2968425 22.7 2397912 18.3
>
> > object.size(list(chrI, chrI))
> 30165656 bytes
>
> > x <- rep.int(list(chrI), 100000)
>
> > object.size(x)
> 1508280800040 bytes
>
> Hey, how is this possible? I don't have that amount of memory on my
> laptop!
>
> Also:
>
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 1129383 60.4 1710298 91.4 1590760 85.0
> Vcells 2486879 19.0 3196846 24.4 2948256 22.5
>
> Clearly object.size() is not telling the truth.
>
> What happens is that, strictly speaking, a DNAString object doesn't
> "contain" the sequence data; it contains a pointer to the memory
> location where the sequence data is stored. This allows the sequence
> data to be shared by several objects. For example, other DNAString
> instances can point to the same data as chrI:
>
> > subject <- chrI
>
> > subject@shared
> SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> > chrI@shared
> SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> > exon <- subseq(chrI, start=1001, width=200)
>
> > exon
> 200-letter "DNAString" instance
> seq: TTTTTCGGGTTTTTTGAAATGAATATCGTAGCTACA...CAAAATTTTTGGAAATTAGTTTAAAAATCTCACATT
>
> > exon@shared
> SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> Note that even if 'exon' is only "looking" at a small portion of the
> shared data, it's still pointing to the same memory location (i.e. same
> starting address). The bookkeeping of exactly which part of the shared
> data is being looked at is handled via the "offset" and "length"
> slots:
>
> > str(chrI)
> Formal class 'DNAString' [package "Biostrings"] with 5 slots
> ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
> .. .. ..@ xp :<externalptr>
> .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8>
> ..@ offset : int 0
> ..@ length : int 15080483
> ..@ elementMetadata: NULL
> ..@ metadata : list()
>
> > str(exon)
> Formal class 'DNAString' [package "Biostrings"] with 5 slots
> ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
> .. .. ..@ xp :<externalptr>
> .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8>
> ..@ offset : int 1000
> ..@ length : int 200
> ..@ elementMetadata: NULL
> ..@ metadata : list()
>
> The sequence data can also be shared by other types of objects. For example:
>
> > rna <- RNAString(exon)
>
> > rna
> 200-letter "RNAString" instance
> seq: UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUACA...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU
>
> > rna@shared
> SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> > m <- matchPattern("UA", rna)
>
> > m
> Views on a 200-letter RNAString subject
> subject: UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUA...AAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU
> views:
> start end width
> [1] 24 25 2 [UA]
> [2] 29 30 2 [UA]
> [3] 33 34 2 [UA]
> [4] 151 152 2 [UA]
> [5] 154 155 2 [UA]
> [6] 181 182 2 [UA]
> [7] 186 187 2 [UA]
>
> > subject(m)@shared
> SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> > m[[4]]
> 2-letter "RNAString" instance
> seq: UA
>
> > m[[4]]@shared
> SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> > dnaset <- DNAStringSet(m)
>
> > dnaset
> A DNAStringSet instance of length 7
> width seq
> [1] 2 TA
> [2] 2 TA
> [3] 2 TA
> [4] 2 TA
> [5] 2 TA
> [6] 2 TA
> [7] 2 TA
>
> > dnaset@pool
> SharedRaw_Pool of length 1
> 1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> As you can see, all these operations happen without copying the original
> sequence data (chromosome I). Hence they are very fast, and, most of the
> time (but not always, more on this below) very memory efficient too,
> despite what's reported by object.size(). If the original sequence data
> needs to be modified, then it will be copied before modification:
>
> > subseq(rna, start=7, width=3) <- RNAString("AA")
>
> > rna
> 199-letter "RNAString" instance
> seq: UUUUUCAAUUUUUUGAAAUGAAUAUCGUAGCUACAG...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU
>
> > rna@shared
> SharedRaw of length 199 (data starting at address 0x33a3b58)
>
> Concatenation too will trigger a copy:
>
> > uaua <- c(m[[4]], m[[1]])
>
> > uaua
> 4-letter "RNAString" instance
> seq: UAUA
>
> > uaua@shared
> SharedRaw of length 4 (data starting at address 0x53a1120)
>
> For DNAStringSet objects (and XStringSet objects in general) the
> mechanism used for sharing sequence data is a little bit more
> complicated than for XString objects: instead of a single pointer,
> they use a list of pointers (stored in the "pool" slot) so the
> sequence data can be fragmented into several memory chunks (each
> chunk can be shared with other objects). For example:
>
> > dnaset2 <- c(DNAStringSet(rna), DNAStringSet(uaua))
>
> > dnaset2
> A DNAStringSet instance of length 2
> width seq
> [1] 199 TTTTTCAATTTTTTGAAATGAATATCGTAGCTAC...AATTTTTGGAAATTAGTTTAAAAATCTCACATT
> [2] 4 TATA
>
> > dnaset2@pool
> SharedRaw_Pool of length 2
> 1: SharedRaw of length 199 (data starting at address 0x33a3b58)
> 2: SharedRaw of length 4 (data starting at address 0x53a1120)
>
> To summarize, all XString, XStringSet, XStringViews and MaskedXString
> objects use SharedRaw objects internally to store the sequence data
> (the SharedRaw class is an internal class defined in IRanges).
> You can think of those SharedRaw objects as the basic storage units
> for sequence data, i.e. each of them is a chunk of memory containing
> sequence data. Such a chunk can be shared (more than 1 object is
> "using it") or not (only 1 object is "using it"). If no object
> is using it, then it's a candidate for garbage collection.
>
> For example, at this point, the SharedRaw object containing
> chromosome I is used by the following objects: 'chrI', 'x',
> 'subject', 'exon', 'm' and 'dnaset'. So all of them would need
> to be removed (with rm()) for the SharedRaw object to be garbage
> collected.
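>
> If you were done with all of them, something like this (a sketch, no
> output shown) would let that memory go:
>
> rm(chrI, x, subject, exon, m, dnaset)   # drop every remaining reference
> gc()                                    # the chromosome I SharedRaw can now be collected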
>
> Note that IRanges/Biostrings don't contain any code for handling
> this; they just rely on the standard behaviour of external pointers
> in R (a SharedRaw object is basically an R external pointer pointing
> to a raw vector) and on how the garbage collector handles them.
>
> OK, now back to your original question. When you subset your
> AlignedRead object, the "small" AlignedRead object is still
> pointing to the original sequence data:
>
> > aln0
> class: AlignedRead
> length: 1000 reads; width: 35 cycles
> chromosome: NM NM ... chr5.fa 29:255:255
> position: NA NA ... 71805980 NA
> strand: NA NA ... + NA
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
>
> > sread(aln0)@pool
> SharedRaw_Pool of length 1
> 1: SharedRaw of length 35000 (data starting at address 0x810f418)
>
> > sread(aln0[1:10])@pool
> SharedRaw_Pool of length 1
> 1: SharedRaw of length 35000 (data starting at address 0x810f418)
>
> Hence the strange result reported by object.size():
>
> > object.size(aln0)
> 175968 bytes
>
> > object.size(aln0[1:10])
> 92480 bytes
>
> 'aln0[1:10]' has a smaller size, but not as small as one would expect.
>
> Does it matter? It depends.
>
> If you want to save the small object to a file (serialization), then
> yes (the whole SharedRaw object will be saved, which is not good).
> In that case, you want to "compact" the object first. compact() was
> specifically introduced for this use case: it creates a "full" copy
> of the original object ("full" here means that even the sequence data
> are copied), but only the sequence data that are effectively "being
> looked at" are copied. The resulting object is semantically equivalent
> to the original but its internal representation and size are different:
>
> > dnaset
> A DNAStringSet instance of length 7
> width seq
> [1] 2 TA
> [2] 2 TA
> [3] 2 TA
> [4] 2 TA
> [5] 2 TA
> [6] 2 TA
> [7] 2 TA
>
> > dnaset@pool
> SharedRaw_Pool of length 1
> 1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)
>
> > object.size(dnaset)
> 15084424 bytes
>
> > compact(dnaset)
> A DNAStringSet instance of length 7
> width seq
> [1] 2 TA
> [2] 2 TA
> [3] 2 TA
> [4] 2 TA
> [5] 2 TA
> [6] 2 TA
> [7] 2 TA
>
> > compact(dnaset)@pool
> SharedRaw_Pool of length 1
> 1: SharedRaw of length 14 (data starting at address 0x80ed740)
>
> > object.size(compact(dnaset))
> 3952 bytes
>
> Note that, in this particular situation, compact() could take
> advantage of the fact that the elements in 'dnaset' are repeated
> (from a sequence content point of view, not from a memory location
> point of view) to achieve greater size reduction... but that's
> another story.
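>
> Coming back to the serialization point above: a quick way to see the
> effect on disk (the file names and the dnaset_c name are just made up
> for illustration):
>
> save(dnaset, file="dnaset.rda")             # serializes the whole SharedRaw (chromosome I)
> dnaset_c <- compact(dnaset)
> save(dnaset_c, file="dnaset_compact.rda")   # serializes only the 14 looked-at bytes
> file.info(c("dnaset.rda", "dnaset_compact.rda"))$size   # compare sizes on disk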
>
> Also if you work one chromosome at a time in a loop, and if the
> objects you create within the loop are pointing to the chromosome
> sequence, and if those references are going to persist after you
> are done with the chromosome, then yes, you want to compact those
> objects. By compacting them you remove the references to the
> chromosome sequence and therefore allow it to be garbage collected.
> Otherwise you will end up with all the chromosome sequences loaded
> into memory!
>
> For example, you wouldn't want to do something like this:
>
> lapply(seqnames(Celegans),
>        function(seqname)
>            matchPattern("CAACTTAC", Celegans[[seqname]]))
>
> because the result would be a list of XStringViews objects, one
> per chromosome, each of them pointing to the chromosome sequence.
> That will work for Celegans but for Hsapiens you will end up with
> all the chromosomes in memory!
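>
> If all you need are the match coordinates, one way around it (just a
> sketch of an alternative to compact(), not the only option) is to keep
> plain vectors inside the loop so that nothing keeps pointing at the
> chromosome sequences:
>
> hits <- lapply(seqnames(Celegans),
>     function(seqname) {
>         m <- matchPattern("CAACTTAC", Celegans[[seqname]])
>         ## keep ordinary integer coordinates only; the XStringViews
>         ## object (and its reference to the chromosome) is dropped
>         data.frame(seqname=rep(seqname, length(m)),
>                    start=start(m), end=end(m))
>     })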
>
> For your use case, the benefit of using compact() on the subsetted
> AlignedRead objects depends on what you will do with them. If those
> small objects are just temporary (e.g. you will do coverage() on
> them, and you won't care about the subsetted AlignedRead objects
> anymore), then I would recommend that you do not use compact(): it
> will slow things down and you will end up using more memory than
> you need (because compact() creates a copy). But if you want to keep
> them around, and also want to be able to remove the big AlignedRead
> object in order to save memory, then you might want to compact them.
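>
> In code, the two situations would look roughly like this (a sketch
> only; names are arbitrary and it assumes nothing else references
> 'aln0'):
>
> ## temporary use: no compact() needed, just use the subset and let it go
> tmp <- aln0[1:100]
> ## keeping the subset around while freeing the big object: compact first
> keep <- compact(aln0[1:100])
> rm(aln0, tmp)
> gc()   # the big SharedRaw behind 'aln0' can now be collected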
>
> One thing to remember is that object.size() should tell the truth
> on a compacted object. And also summing the individual object.size()
> for a collection of compacted objects tells the truth about the
> memory effectively used by the collection. But summing the
> individual object.size() in general does not tell the truth:
> it might overestimate the amount of memory effectively used,
> because it might count the memory used by a SharedRaw object
> several times (as many times as the SharedRaw object is used).
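>
> So if you want a per-object breakdown that adds up, one way (at the
> cost of the copies that compact() makes) is to sum over compacted
> copies, e.g.:
>
> sizes <- sapply(list(reads=aln0[1:10], dnaset=dnaset),
>                 function(x) object.size(compact(x)))
> sizes        # per-object sizes that don't double-count shared data
> sum(sizes)   # an honest total for this small collection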
>
> Hope the story was helpful and not too long ;-)
>
> H.
>
>
> On 11-05-10 05:16 PM, Janet Young wrote:
>> Hi again,
>>
>> I've been looking into this question a bit more, now using compact(). Things seemed to go well with the example dataset provided with ShortRead (as far as I can tell). But when I use our own much larger dataset the compaction factor is not very good: my small object is the first 100 reads of a full set of ~18 million reads, yet it is still 40% the size in memory of the full object. It looks like the quality and sread components haven't really been compacted nearly as much as I would expect. The code is below.
>
>
>>
>> Martin - if you would like access to this dataset I can email you about that privately, but I'm guessing the same would happen with any big dataset?
>>
>> Janet
>>
>>
>>
>> library(ShortRead)
>>
>> ########## a big dataset (our data)
>>
>> read2Dir<- "solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>>
>> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
>> my_reads_verysmall<- my_reads[1:100]
>> my_reads_verysmall_compact<- compact(my_reads[1:100])
>>
>> length(my_reads)
>> # [1] 17894091
>> length(my_reads_verysmall)
>> # [1] 100
>> length(my_reads_verysmall_compact)
>> # [1] 100
>>
>> object.size(my_reads)
>> # 3190125528 bytes
>> object.size(my_reads_verysmall)
>> # 1753653496 bytes
>> object.size(my_reads_verysmall_compact)
>> # 1253663384 bytes
>>
>> as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
>> # [1] 0.549713
>>
>> as.numeric(object.size(my_reads_verysmall_compact)) / as.numeric(object.size(my_reads))
>> # [1] 0.3929825
>>
>> ##### where is most of that memory?
>> object.size(position(my_reads_verysmall_compact))
>> # 440 bytes
>> object.size(chromosome(my_reads_verysmall_compact))
>> # 3040 bytes
>> object.size(sread(my_reads_verysmall_compact))
>> # 626820976 bytes
>> object.size(quality(my_reads_verysmall_compact))
>> # 626821568 bytes
>>
>> myDNA2<- DNAStringSet(as.character(sread(my_reads_verysmall_compact)))
>> object.size(myDNA2)
>> # 9944 bytes
>>
>>
>> ###### and ShortRead's example dataset, 1000 reads
>>
>> exptPath<- system.file("extdata", package = "ShortRead")
>> sp<- SolexaPath(exptPath)
>> aln<- readAligned(sp, "s_2_export.txt")
>>
>> aln ## aln has 1000 reads
>> aln_small<- aln[1:2] ### aln 2 has 2 reads
>> aln_small_compact<- compact(aln[1:2])
>>
>> object.size(aln)
>> # 175968 bytes
>> object.size(aln_small)
>> # 91488 bytes
>> object.size(aln_small_compact)
>> # 21744 bytes
>>
>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
>> # [1] 0.5199127
>>
>> as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln))
>> # [1] 0.1235679
>>
>> myDNA<- DNAStringSet( as.character(sread(aln_small_compact)) )
>> identical(myDNA,sread(aln_small_compact))
>> # [1] TRUE
>>
>> aln_medium_compact<- compact(aln[1:300])
>> myDNA_medium<- DNAStringSet( as.character(sread(aln_medium_compact)) )
>> identical(myDNA,sread(aln_medium_compact))
>> # [1] FALSE
>>
>> aln_10_compact<- compact(aln[1:10])
>> myDNA_10<- DNAStringSet( as.character(sread(aln_10_compact)) )
>> identical(myDNA,sread(aln_10_compact))
>> # [1] FALSE
>>
>> as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln))
>> # [1] 0.1317512
>>
>> #########
>>
>>
>> sessionInfo()
>>
>> R version 2.13.0 (2011-04-13)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] ShortRead_1.10.0 Rsamtools_1.4.0 lattice_0.19-23
>> [4] Biostrings_2.20.0 GenomicRanges_1.4.0 IRanges_1.10.0
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.12.0 grid_2.13.0 hwriter_1.3
>>
>>
>>
>>
>> On May 10, 2011, at 4:25 PM, Martin Morgan wrote:
>>
>>> On 05/10/2011 02:47 PM, Janet Young wrote:
>>>> Hi, (probably hello to you, Martin)
>>>
>>> Hi (or is that hello?) Janet --
>>>
>>>> I'm looking at some Illumina seq data, and trying to be more rigorous
>>>> than I have been in the past about memory usage and tidying up unused
>>>> variables. I'm a little mystified by something - I wonder if you can
>>>> help me understand?
>>>>
>>>> I'm starting with a big AlignedRead object (one full lane of seq
>>>> data) and then I've been using [] on AlignedRead objects to take
>>>> various subsets of the data (and then looking at quality scores, map
>>>> positions, etc). I'm also taking some very small subsets (e.g. just
>>>> the first 100 reads) to test and optimize some functions I'm
>>>> writing.
>>>>
>>>> My confusion comes because even though I'm cutting down the number of
>>>> seq reads by a lot (e.g. from 18 million to just 100 reads), the new
>>>> AlignedRead object still takes up a lot of memory.
>>>
>>> XStringSet (including DNAStringSet and BStringSet, which are used to hold AlignedRead sequence and quality information) are actually 'views' (start and end coordinates) on an underlying XString; when you subset a DNAStringSet, you subset the view but not the underlying DNAString. This might sound bad, but actually you have just one instance of the DNAString, and two light-weight views (the full view, and the subset view). If you're done with the full DNAString, you can force it to be compacted with compact(), e.g., after your first example
>>>
>>>> object.size(s1)
>>> 50840 bytes
>>>> object.size(s1[1:10]) # mostly smaller 'view'
>>> 38984 bytes
>>>> object.size(s1[1]) # a little smaller 'view', same big DNAString
>>> 38864 bytes
>>>> object.size(compact(s1[1]))
>>> 3912 bytes
>>>
>>> or
>>>
>>>> object.size(aln)
>>> 175968 bytes
>>>> object.size(aln[1])
>>> 91488 bytes
>>>> object.size(compact(aln[1]))
>>> 21584 bytes
>>>
>>> (I think Herve is working on your weighted coverage question)
>>>
>>> Martin
>>>
>>>>
>>>> Two examples are given below - in both cases the small object takes
>>>> about half as much memory as the original, even though the number of
>>>> reads is now very much smaller.
>>>>
>>>> Do you have any suggestions as to how I might reduce the memory
>>>> footprint of the subsetted AlignedRead object? Is this an expected
>>>> behavior?
>>>>
>>>> thanks very much,
>>>>
>>>> Janet
>>>>
>>>>
>>>> library(ShortRead)
>>>>
>>>> exptPath<- system.file("extdata", package = "ShortRead")
>>>> sp<- SolexaPath(exptPath)
>>>> aln<- readAligned(sp, "s_2_export.txt")
>>>>
>>>> aln ## aln has 1000 reads
>>>> aln_small<- aln[1:2] ### aln 2 has 2 reads
>>>>
>>>> object.size(aln) # 165156 bytes
>>>> object.size(aln_small) # 82220 bytes
>>>>
>>>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
>>>> #### [1] 0.4978324
>>>>
>>>> read2Dir<-
>>>> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>>>>
>>>>
>>>> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
>>>> my_reads_verysmall<- my_reads[1:100]
>>>>
>>>> length(my_reads) # [1] 17894091
>>>> length(my_reads_verysmall) # [1] 100
>>>>
>>>> object.size(my_reads) # 3190125528 bytes
>>>> object.size(my_reads_verysmall) # 1753653496 bytes
>>>>
>>>> as.numeric(object.size(my_reads_verysmall)) /
>>>> as.numeric(object.size(my_reads)) # [1] 0.549713
>>>>
>>>>
>>>>
>>>> sessionInfo()
>>>>
>>>> R version 2.13.0 (2011-04-13)
>>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] ShortRead_1.10.0 Rsamtools_1.4.1 lattice_0.19-26 Biostrings_2.20.0
>>>> [5] GenomicRanges_1.4.3 IRanges_1.10.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.12.1 grid_2.13.0 hwriter_1.3
>>>>
>>>
>>>
>>> --
>>> Computational Biology
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>> Location: M1-B861
>>> Telephone: 206 667-2793
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319