[Bioc-sig-seq] RangedData objects. Redefining widths with conditions.
Ivan Gregoretti
ivangreg at gmail.com
Fri Apr 23 20:28:53 CEST 2010
Hi Michael,
With the GRanges object, resizing becomes a breeze. Thank you.
For the purpose of leaving this operation documented, I will
copy/paste my minimalist code:
library(rtracklayer) # needed by import()
library(BSgenome.Mmusculus.UCSC.mm9) # needed for chromosome lengths
# load the features
A <- import('hundredmilliontags.bed.gz', 'bed')
# coerce to GRanges
A <- as(A, 'GRanges')
# Be elegant, supply chromosome lengths
seqlengths(A) <- sapply(names(seqlengths(A)),
                        function(x){length(Mmusculus[[x]])})
# voila, proper resizing
resize(A, width=200)
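
For readers without the BED file at hand, the same strand-aware behaviour can be sketched on a toy GRanges object. This is only an illustration: the chromosome names, coordinates and scores below are made up, and it assumes the GenomicRanges package is installed.

```r
library(GenomicRanges)  # provides GRanges and the strand-aware resize()

# Hypothetical toy data: one '+' range and two '-' ranges
gr <- GRanges(
  seqnames = c("chrA", "chrA", "chrB"),
  ranges   = IRanges(start = c(1001, 1006, 2006), width = c(3, 4, 4)),
  strand   = c("+", "-", "-"),
  score    = c(2, 3, 1)
)

# With the default fix="start", '+' ranges keep their start and
# '-' ranges keep their end; the metadata column is carried along
resized <- resize(gr, width = 200)
resized
```

No per-range fix argument and no rebuilding of the value columns is needed here, which is the convenience Michael pointed out.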
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
On Fri, Apr 23, 2010 at 11:08 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Fri, Apr 23, 2010 at 7:42 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> Hi Steve,
>>
>> What you showed worked, no question, but I found that resize() is not
>> set up for convenient use with RangedData objects.
>>
>> For example, consider a more biological set of data
>>
>> Z <- RangedData(
>> RangesList(
>> chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
>> chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4))),
>> score = c( 2, 7, 3, 1, 1, 1 ),
>> strand= c('+','+','-','+','-','-') )
>>
>> > Z
>> RangedData with 6 rows and 2 value columns across 2 spaces
>> space ranges | score strand
>> <character> <IRanges> | <numeric> <character>
>> 1 chrA [1, 3] | 2 +
>> 2 chrA [4, 5] | 7 +
>> 3 chrA [6, 9] | 3 -
>> 4 chrB [1, 3] | 1 +
>> 5 chrB [3, 5] | 1 -
>> 6 chrB [6, 9] | 1 -
>>
>> Here is the inconvenience with resize():
>>
>> resize(Z, width=200, fix=ifelse(Z$strand=='+','start','end'))
>> Error in function (classes, fdef, mtable) :
>> unable to find an inherited method for function "resize", for
>> signature "RangedData"
>>
>> What does work is ranges(Z) rather than Z itself:
>> > resize(ranges(Z), width=200, fix=ifelse(Z$strand=='+','start','end'))
>> SimpleRangesList of length 2
>> $chrA
>> IRanges of length 3
>> start end width
>> [1] 1 200 200
>> [2] 4 203 200
>> [3] -190 9 200
>>
>> $chrB
>> IRanges of length 3
>> start end width
>> [1] 1 200 200
>> [2] 3 202 200
>> [3] -190 9 200
>>
>> but as you see, the RangedData object is lost. You have to coerce it:
>>
>> > as(resize(ranges(Z), width=200,
>> > fix=ifelse(Z$strand=='+','start','end')), 'RangedData')
>> RangedData with 6 rows and 0 value columns across 2 spaces
>> space ranges |
>> <character> <IRanges> |
>> 1 chrA [ 1, 200] |
>> 2 chrA [ 4, 203] |
>> 3 chrA [-190, 9] |
>> 4 chrB [ 1, 200] |
>> 5 chrB [ 3, 202] |
>> 6 chrB [-190, 9] |
>>
>> Now I got a RangedData object but the value columns are still lost. I
>> have to reconstruct it.
>>
>> [warning: the following command is obnoxious]
>>
>>
>> > as(cbind(as.data.frame(as(resize(ranges(Z), width=200,
>> > fix=ifelse(Z$strand=='+','start','end')), 'RangedData')),
>> > as.data.frame(Z)[, -(1:4)]), 'RangedData')
>> RangedData with 6 rows and 2 value columns across 2 spaces
>> space ranges | score strand
>> <character> <IRanges> | <numeric> <factor>
>> 1 chrA [ 1, 200] | 2 +
>> 2 chrA [ 4, 203] | 7 +
>> 3 chrA [-190, 9] | 3 -
>> 4 chrB [ 1, 200] | 1 +
>> 5 chrB [ 3, 202] | 1 -
>> 6 chrB [-190, 9] | 1 -
>>
>> Granted, it works, but wouldn't this be more convenient?
>>
>> resize(Z, width=200, fix=ifelse(Z$strand=='+','start','end'))
>>
>> Z is a tiny toy example; biological sets regularly run to millions of
>> rows. My set is over 100 million rows; as I write this, my 144GB RAM
>> machine is doing the resizing the 'long way round', as obnoxiously
>> shown above. Still working...
>>
>> I wonder if there is a 'cheaper' way to resize a large RangedData
>> instance. A better solution would be to upgrade resize(), but I am not
>> that R-skilled. I hope the developers will consider it.
>>
>
> This would be a simple addition, but there is the bigger question of whether
> RangedData should implement the Ranges API. It's really more of a "dataset
> with ranges" than "ranges with data". RangedData *does* implement the
> findOverlaps family of functions since they are used so commonly. There are
> also "short cuts" to the starts, ends and widths.
>
> You might find GRanges more convenient for your use-case. resize,GRanges
> automatically considers the strand in the expected way.
>
> Also, there is a short-cut like:
>
> resizedRanges <- resize(ranges(Z), width=200, fix=ifelse(Z$strand=='+',
> 'start','end'))
> ranges(Z) <- resizedRanges
>
> Michael
>
>>
>> Thank you,
>>
>> Ivan
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>>
>>
>>
>> On Thu, Apr 22, 2010 at 5:11 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>> > Hi,
>> >
>> > On Thu, Apr 22, 2010 at 4:17 PM, Ivan Gregoretti <ivangreg at gmail.com>
>> > wrote:
>> >> Hello everybody,
>> >>
>> >> How do you resize() the ranges of a RangedData object?
>> >>
>> >>
>> >> In the past (IRanges 1.4.11), I could
>> >>
>> >> 1) extend forward 200 bases from the start in '+' ranges OR
>> >> 2) extend backward 200 bases from the end in '-' ranges.
>> >>
>> >> The syntax was something like this:
>> >>
>> >> resize(ranges(A), width = 200, start = A$strand == "+")
>> >>
>> >> In IRanges 1.5.70, the "start" argument of resize() has been
>> >> deprecated and replaced by "fix".
>> >>
>> >> Can somebody show how to get the task accomplished with the new
>> >> resize()?
>> >
>> > I'm pretty sure you use `fix` just like you use start:
>> >
>> > R> strands <- c("+", '-', '+', '-', '-')
>> > R> ir <- IRanges(c(1,10,20,30, 40), width=5)
>> > R> ir
>> > IRanges of length 5
>> > start end width
>> > [1] 1 5 5
>> > [2] 10 14 5
>> > [3] 20 24 5
>> > [4] 30 34 5
>> > [5] 40 44 5
>> >
>> > R> resize(ir, width=8, fix=ifelse(strands == '+', 'start', 'end'))
>> > IRanges of length 5
>> > start end width
>> > [1] 1 8 8
>> > [2] 7 14 8
>> > [3] 20 27 8
>> > [4] 27 34 8
>> > [5] 37 44 8
>> >
>> > --
>> > Steve Lianoglou
>> > Graduate Student: Computational Systems Biology
>> > | Memorial Sloan-Kettering Cancer Center
>> > | Weill Medical College of Cornell University
>> > Contact Info: http://cbio.mskcc.org/~lianos/contact
>> >
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>