[BioC] reducing RangedData

Sat Oct 24 21:16:39 CEST 2009

Patrick, it looks great to me, and i think is going to be useful to
many people.

thanks for handling this issue so quickly.

best,
robert.

On Sat, 24 Oct 2009 07:09:07 +0200, Patrick Aboyoun <paboyoun at fhcrc.org>
wrote:

> Robert,
> I just added a reduce method for RangedData objects to BioC 2.5 for R  
> 2.10, which is the current development branch and will become the  
> release branch on October 28. The signature for this method is
>
>     reduce(x, by, with.inframe.attrib = FALSE)
>
> where by specifies the grouping variables from the values columns. So if  
> you just have strand in your values columns, a simple reduce(rd) call  
> will do. If you have columns in addition to your strand column that you  
> don't want to use for grouping during the reduction operation, the call  
> would become reduce(rd, "strand"). I also recently changed the show  
> methods for many IRanges objects to display the first 10 or so values so  
> the user feels more connected to their data. Here is what the code looks  
> like now:
>
>  > suppressMessages(library(IRanges))
>  > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
>  > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
>  > str <- rep(c("+", "+", "-", "-"), 2)
>  > chr <- rep(c("chr1", "chr2"), each = 4)
>  > rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>  > reduce(rd)
> RangedData with 4 rows and 1 value column across 2 spaces
>         space    ranges |      strand
>   <character> <IRanges> | <character>
> 1        chr1    [5, 7] |           -
> 2        chr1    [1, 4] |           +
> 3        chr2    [4, 7] |           -
> 4        chr2    [2, 4] |           +
>  > # if strand is a factor variable, the results are
>  > reduce(RangedData(IRanges(start=sta, end=end), strand=factor(str),  
> space=chr))
> RangedData with 4 rows and 1 value column across 2 spaces
>         space    ranges |   strand
>   <character> <IRanges> | <factor>
> 1        chr1    [5, 7] |        -
> 2        chr1    [1, 4] |        +
> 3        chr2    [4, 7] |        -
> 4        chr2    [2, 4] |        +
>  > sessionInfo()
> R version 2.10.0 RC (2009-10-18 r50160)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] IRanges_1.3.97
>
>
>
> Patrick Aboyoun wrote:
>> Robert,
>> To address your case of data extraction from the original RangedData  
>> object, I would recommend using "[" with i = <<RangesList>>, e.g.  
>> origRD[ranges(reducedRd),].
>>
>> The situation becomes more involved when, for example, someone wants to  
>> sum up the "score" column values for all the ranges that overlapped.  
>> These data summary operations are open ended and don't lend themselves  
>> to simple interfaces, although we can provide targeted functions, or  
>> add arguments to the reduce function, that support specific data  
>> summary operations on ancillary columns.
>>
>>
>> Patrick
>>
>>
>> Robert Castelo wrote:
>>> hi Patrick,
>>>
>>> thanks a lot, the concise code snippet below is what i was looking for,
>>> i'd had written at least twice as much to do the same job :)
>>>
>>> regarding whether it would be nice to have a reduction operation with a
>>> signature that includes other stuff i think that for the particular  
>>> case
>>> of the strand it is definitely useful to project (reduce in RangedData
>>> terminology) genomic coordinates on genomic space taking it into  
>>> account
>>> because many of the functional elements in the genome are stranded and
>>> projecting doesn't imply always that one is not interested in the  
>>> strand
>>> of the projected intervals.
>>>
>>> with respect to whether other information contained in additional
>>> columns should somehow be "reduced" i think it would suffice to enable
>>> the reduction of the "names" of a RangedData (taking, for instance,
>>> arbitrarily one name between two overlapping ranges) because those  
>>> names
>>> would allow to go back to the original (un-reduced) data and pull
>>> whatever we're interested in (actually, including the strand).
>>>
>>> my two-cents about this, and thanks again for your quick help.
>>>
>>> robert.
>>>
>>> On Thu, 2009-10-22 at 00:58 -0700, Patrick Aboyoun wrote:
>>>
>>>> Robert,
>>>> I wrote you some code that you can use below. I've seen other  
>>>> reduction operations where the user wanted to reduce a RangedData  
>>>> type object by more than one column (e.g. strand and score), so  
>>>> perhaps an appropriate signature for a reduce method for RangedData  
>>>> would be
>>>>
>>>> reduce(x, by, ...)
>>>>
>>>> If this method where written, the remaining question would be should  
>>>> it support reducing transformations of the remaining values columns  
>>>> or should non "by" columns be dropped from the output. I'm inclined  
>>>> towards the latter, since a reduce(x, by, valuesFUN) method wouldn't  
>>>> be much simpler than the rdapply method given below. Would a  
>>>> straight-forward reduce(x = rd, by = "strand") approach for  
>>>> RangedData meet your needs?
>>>>
>>>>
>>>>  > suppressMessages(library(IRanges))
>>>>  > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
>>>>  > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
>>>>  > str <- rep(c("+", "+", "-", "-"), 2)
>>>>  > chr <- rep(c("chr1", "chr2"), each = 4)
>>>>  > rd <- RangedData(IRanges(start=sta, end=end), strand=str,  
>>>> space=chr)
>>>>  > rd
>>>> RangedData: 8 ranges by 1 columns
>>>> columns(1): strand
>>>> sequences(2): chr1 chr2
>>>>  >
>>>>  > # Interesting bit starts here
>>>>  > reduceStranded <- function(x) {
>>>> +     strand <- x$strand
>>>> +     rngs <- ranges(x)[[1]]
>>>> +     ans <- IRangesList("+" = reduce(rngs[strand == "+"]),
>>>> +                        "-" = reduce(rngs[strand == "-"]))
>>>> +     RangedData(unlist(ans, use.names = FALSE),
>>>> +                strand = rep(c("+", "-"), elementLengths(ans)),
>>>>                  space = names(x))
>>>> + }  > params <- RDApplyParams(rd, reduceStranded, reducerFun =  
>>>> function(x) do.call(c, unname(x)))
>>>>  > rdapply(params)
>>>> RangedData: 4 ranges by 1 columns
>>>> columns(1): strand
>>>> sequences(2): chr1 chr2
>>>>  > as.data.frame(rdapply(params))
>>>>   space start end width strand
>>>> 1  chr1     1   4     4      +
>>>> 2  chr1     5   7     3      -
>>>> 3  chr2     2   4     3      +
>>>> 4  chr2     4   7     4      -
>>>>  > sessionInfo()
>>>> R version 2.9.2 Patched (2009-08-24 r49420)
>>>> i386-apple-darwin9.8.0
>>>>
>>>> locale:
>>>> C/en_US.UTF-8/C/C/C/C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods    
>>>> base   other attached packages:
>>>> [1] IRanges_1.2.3
>>>>
>>>>
>>>>
>>>>
>>>> Robert Castelo wrote:
>>>>
>>>>> dear list,
>>>>>
>>>>> i'd like to project a set of exons on genomic space but keeping
>>>>> the strand information. let's assume for simplicity that no exons
>>>>> overlap in opposite strands so no conflicts need to be resolved
>>>>> regarding that case. in principle this is easy without keeping the
>>>>> strand using a RangeList and the reduce method. however i've been
>>>>> struggling for a while without success to figure out how can i
>>>>> "reduce" my coordinates without loosing the strand information.
>>>>>
>>>>> my guess is that to carry out the strand information i need to
>>>>> use a RangedData object:
>>>>>
>>>>> sta <- c(1, 1, 1)
>>>>> end <- c(3, 4, 5)
>>>>> str <- c("+", "+", "-")
>>>>> rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>>>>> rd
>>>>> RangedData: 3 ranges by 1 columns
>>>>> columns(1): strand
>>>>> sequences(2): chr1 chr2
>>>>>
>>>>> and then use rdapply to reduce on each of the chromosomes but i
>>>>> either don't know how to directly apply reduce:
>>>>>
>>>>> params <- RDApplyParams(rd, reduce)
>>>>> rdapply(params)
>>>>> Error in function (classes, fdef, mtable)  :
>>>>>   unable to find an inherited method for function "reduce", for  
>>>>> signature "RangedData"
>>>>>
>>>>>
>>>>> or i loose the strand information:
>>>>>
>>>>> params <- RDApplyParams(rd, function(rd) reduce(ranges(rd)))
>>>>> rdapply(params)
>>>>> $chr1
>>>>> RangesList: 1 range
>>>>> 1: 1:4
>>>>>
>>>>> $chr2
>>>>> RangesList: 1 range
>>>>> 1: 1:5
>>>>>
>>>>> thanks for your help!
>>>>> robert.
>>>>>
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.9.1 (2009-06-26)
>>>>> i386-apple-darwin8.11.1
>>>>>
>>>>> locale:
>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>>  [1] rtracklayer_1.4.1                     RCurl_0.98-1
>>>>>  [3] bitops_1.0-4.1                        GOstats_2.10.0
>>>>>  [5] graph_1.22.2                          Category_2.10.1
>>>>>  [7] geneplotter_1.22.0                    lattice_0.17-25
>>>>>  [9] limma_2.18.2                          bgSpJncArrayDm1.db_1.0.1
>>>>> [11] org.Dm.eg.db_2.2.11                   RSQLite_0.7-1
>>>>> [13] DBI_0.2-4                              
>>>>> BSgenome.Dmelanogaster.UCSC.dm1_1.0.0
>>>>> [15] BSgenome_1.12.3                       RColorBrewer_1.0-2
>>>>> [17] Biostrings_2.12.8                     IRanges_1.2.3
>>>>> [19] annotate_1.22.0                       AnnotationDbi_1.6.1
>>>>> [21] Biobase_2.4.1                         xtable_1.5-5
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] genefilter_1.24.2 GO.db_2.2.11      grid_2.9.1         
>>>>> GSEABase_1.6.1
>>>>> [5] RBGL_1.20.0       splines_2.9.1     survival_2.35-4   tools_2.9.1
>>>>> [9] XML_2.5-3
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:  
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:  
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>