[Bioc-sig-seq] Consolidate AlignedRead objects

Martin Morgan mtmorgan at fhcrc.org
Fri Aug 28 18:58:47 CEST 2009


Hi Ivan --

Ivan Gregoretti wrote:
> Hello Martin and Everybody,
> 
> I tried your suggestion and it works nicely when the number of reads
> is not so big.
> 
> Successful example:
> 
> If I have three instances, aln000, aln050 and aln100, like this
> 
>> aln000
> class: AlignedRead
> length: 9465484 reads; width: 36 cycles
> chromosome: chr11.fa chr13.fa ... chr6.fa chr6.fa
> position: 100667123 52735524 ... 121341376 25134423
> strand: + + ... + +
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
>> aln050
> class: AlignedRead
> length: 8918057 reads; width: 36 cycles
> chromosome: chr5.fa chr15.fa ... chr16.fa chr8.fa
> position: 149155914 57872637 ... 95751778 36611628
> strand: + + ... + +
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
>> aln100
> class: AlignedRead
> length: 11261186 reads; width: 36 cycles
> chromosome: chr4.fa chr5.fa ... chr10.fa chr1.fa
> position: 66224960 140647218 ... 69579797 16009268
> strand: + + ... + +
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
> 
> I can successfully apply the consolidating function:
> 
> 
>> superDuperConsolidator <- function(...) Reduce(append, list(...))
>> aln_000_100 <- superDuperConsolidator(aln000, aln050, aln100)
> 
>> aln_000_100
> class: AlignedRead
> length: 29644727 reads; width: 36 cycles
> chromosome: chr11.fa chr13.fa ... chr10.fa chr1.fa
> position: 100667123 52735524 ... 69579797 16009268
> strand: + + ... + +
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
> 
> 
> Not successful example:
> 
> Now I try to consolidate AlignedRead instances that are twice as big
> 
>> aln000
> class: AlignedRead
> length: 21845985 reads; width: 36 cycles
> chromosome: chr17.fa chr1.fa ... chr18.fa chr9.fa
> position: 41890422 142562489 ... 57003322 108499164
> strand: - - ... - +
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
>> aln050
> class: AlignedRead
> length: 21961352 reads; width: 36 cycles
> chromosome: chr18.fa chr16.fa ... chr15.fa chr9.fa
> position: 88900833 22029306 ... 102993167 83200074
> strand: - - ... + -
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
>> aln100
> class: AlignedRead
> length: 20865366 reads; width: 36 cycles
> chromosome: chr1.fa chr12.fa ... chr15.fa chr9.fa
> position: 99986382 14243887 ... 93339870 75136974
> strand: + - ... - +
> alignQuality: NumericQuality
> alignData varLabels: run lane ... filtering contig
> 
>> superDuperConsolidator <- function(...) Reduce(append, list(...))
>> aln_000_100 <- superDuperConsolidator(aln000, aln050, aln100)
> Error in .local(.Object, ...) :
>   'length' must be a single non-negative integer

This is really an internal integer overflow; see the warning just below.

> In addition: Warning message:
> In width1 + width2 : NAs produced by integer overflow
> 
> I tried that with two different data sets; both failed. So, it is not
> the data itself but the amount of data, I believe. The append()
> function also fails when trying to consolidate two AlignedRead
> instances, 50 million tags each.
> 
> Do you think that I have reached a limit or is there a way to "grow"
> AlignedRead instances slowly and gently?

You've hit a limit in the current implementation of DNAStringSet: the
letters of the whole string set are stored in one long vector, and the
total length of that vector must be a little less than 2^31. There are
some ideas about how to address this on our end, but it won't be an
overnight fix.
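
To make the arithmetic concrete, here is a rough base-R check using the
read counts and the 36-cycle width from the failing example quoted above.
It is only a sketch: the variable names are illustrative, and it assumes
one stored letter per cycle per read, ignoring any other overhead.

```r
# Read counts and read width copied from the failing example in this thread.
reads <- c(aln000 = 21845985, aln050 = 21961352, aln100 = 20865366)
width <- 36

# These are doubles, so the check itself cannot overflow.
# cumsum() mimics Reduce(append, ...), which grows the result one object at a time.
letters_needed <- cumsum(reads) * width
limit <- .Machine$integer.max          # 2^31 - 1

# TRUE marks the step at which the combined letter vector would be too long.
print(letters_needed > limit)

# The same final sum done in integer arithmetic gives NA, with the warning
# "NAs produced by integer overflow" (suppressed here).
overflowed <- suppressWarnings(1577064132L + 751153176L)
print(is.na(overflowed))
```

Under these assumptions the first append (about 1.58e9 letters) still
fits, and it is the second append (about 2.33e9 letters) that crosses the
limit. The successful example, at 29644727 reads of width 36, needs about
1.07e9 letters and stays below it.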

The path forward is either to continue managing the various aln* objects
separately, or to strip information other than sread(), quality(), and
id() from the AlignedRead objects and store this other information in
relevant data structures, e.g., IRanges or DataFrame from the IRanges
package.
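
The first option, keeping the aln* objects separate, might look roughly
like the sketch below: summarise each object on its own and combine only
the small summaries. countByChromosome() and its getChromosome argument
are hypothetical names, not part of ShortRead, and small data frames
stand in for the AlignedRead objects so that the example runs in base R.

```r
# Hypothetical helper: tabulate each chunk separately, then add up the
# per-chunk tables, so the full data are never held in one object.
countByChromosome <- function(chunks, getChromosome) {
    perChunk <- lapply(chunks, function(x) table(getChromosome(x)))
    allNames <- sort(unique(unlist(lapply(perChunk, names))))
    total <- setNames(numeric(length(allNames)), allNames)
    for (tab in perChunk) {
        total[names(tab)] <- total[names(tab)] + as.numeric(tab)
    }
    total
}

# Stand-in data (an assumption for illustration, not real alignments).
chunks <- list(
    data.frame(chromosome = c("chr1.fa", "chr2.fa", "chr1.fa")),
    data.frame(chromosome = c("chr2.fa", "chr3.fa"))
)
counts <- countByChromosome(chunks, function(x) x$chromosome)
print(counts)
```

With real data one would presumably pass list(aln000, aln050, aln100) and
the chromosome() accessor instead of the stand-ins; that usage is
untested here and assumes chromosome() returns something table() can
tabulate.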

Martin

> By the way, I am using a server with very large memory now. So, memory
> efficiency is far less important than successful consolidation.
> sessionInfo() is the same.
> 
> Thank you,
> 
> Ivan
> 
> 
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
> 
> 
> 
> On Thu, Aug 27, 2009 at 6:45 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>> Hi Ivan --
>>
>> Ivan Gregoretti wrote:
>>> Hello everybody,
>>>
>>> Is there any memory efficient way to consolidate multiple AlignedRead
>>> objects into one?
>>>
>>>
>>> Example:
>>>
>>> Let's say that I have 10 AlignedRead instances, 10 million tags each.
>>> Let's call those instances aln01 through aln10.
>>>
>>> I can consolidate two of them like this:
>>>
>>> aln <- append(aln01, aln02)
>> I don't think there's anything built-in. You could try this
>>
>>  superDuperConsolidator <- function(...)
>>     Reduce(append, list(...))
>>
>> it might not be too bad memory-wise.
>>
>> Martin
>>
>>> Can I consolidate all AlignedRead instances in a single shot? Something like
>>> this:
>>>
>>> aln <- superDuperConsolidator(aln01, aln02, aln03, ..., aln10)
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>> #########################################################
>>>> sessionInfo()
>>> R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.3.27   lattice_0.17-25    BSgenome_1.13.10   Biostrings_2.13.34
>>> [5] IRanges_1.3.60
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1
>>>
>>> #########################################################
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
> 


