[BioC] reducing RangedData
Robert Castelo
robert.castelo at upf.edu
Sat Oct 24 21:16:39 CEST 2009
Patrick, it looks great to me, and i think is going to be useful to
many people.
thanks for handling this issue so quickly.
best,
robert.
On Sat, 24 Oct 2009 07:09:07 +0200, Patrick Aboyoun <paboyoun at fhcrc.org>
wrote:
> Robert,
> I just added a reduce method for RangedData objects to BioC 2.5 for R
> 2.10, which is the current development branch and will become the
> release branch on October 28. The signature for this method is
>
> reduce(x, by, with.inframe.attrib = FALSE)
>
> where by specifies the grouping variables from the values columns. So if
> you just have strand in your values columns, a simple reduce(rd) call
> will do. If you have columns in addition to your strand column that you
> don't want to use for grouping during the reduction operation, the call
> would become reduce(rd, "strand"). I also recently changed the show
> methods for many IRanges objects to display the first 10 or so values so
> the user feels more connected to their data. Here is what the code looks
> like now:
>
> > suppressMessages(library(IRanges))
> > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
> > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
> > str <- rep(c("+", "+", "-", "-"), 2)
> > chr <- rep(c("chr1", "chr2"), each = 4)
> > rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
> > reduce(rd)
> RangedData with 4 rows and 1 value column across 2 spaces
> space ranges | strand
> <character> <IRanges> | <character>
> 1 chr1 [5, 7] | -
> 2 chr1 [1, 4] | +
> 3 chr2 [4, 7] | -
> 4 chr2 [2, 4] | +
> > # if strand is a factor variable, the results are
> > reduce(RangedData(IRanges(start=sta, end=end), strand=factor(str),
> space=chr))
> RangedData with 4 rows and 1 value column across 2 spaces
> space ranges | strand
> <character> <IRanges> | <factor>
> 1 chr1 [5, 7] | -
> 2 chr1 [1, 4] | +
> 3 chr2 [4, 7] | -
> 4 chr2 [2, 4] | +
> > sessionInfo()
> R version 2.10.0 RC (2009-10-18 r50160)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] IRanges_1.3.97
>
>
>
> Patrick Aboyoun wrote:
>> Robert,
>> To address your case of data extraction from the original RangedData
>> object, I would recommend using "[" with i = <<RangesList>>, e.g.
>> origRD[ranges(reducedRd),].
>>
>> The situation becomes more involved when, for example, someone wants to
>> sum up the "score" column values for all the ranges that overlapped.
>> These data summary operations are open ended and don't lend themselves
>> to simple interfaces, although we can provide targeted functions, or
>> add arguments to the reduce function, that support specific data
>> summary operations on ancillary columns.
>>
>>
>> Patrick
>>
>>
>> Robert Castelo wrote:
>>> hi Patrick,
>>>
>>> thanks a lot, the concise code snippet below is what i was looking for,
>>> i'd had written at least twice as much to do the same job :)
>>>
>>> regarding whether it would be nice to have a reduction operation with a
>>> signature that includes other stuff i think that for the particular
>>> case
>>> of the strand it is definitely useful to project (reduce in RangedData
>>> terminology) genomic coordinates on genomic space taking it into
>>> account
>>> because many of the functional elements in the genome are stranded and
>>> projecting doesn't imply always that one is not interested in the
>>> strand
>>> of the projected intervals.
>>>
>>> with respect to whether other information contained in additional
>>> columns should somehow be "reduced" i think it would suffice to enable
>>> the reduction of the "names" of a RangedData (taking, for instance,
>>> arbitrarily one name between two overlapping ranges) because those
>>> names
>>> would allow to go back to the original (un-reduced) data and pull
>>> whatever we're interested in (actually, including the strand).
>>>
>>> my two-cents about this, and thanks again for your quick help.
>>>
>>> robert.
>>>
>>> On Thu, 2009-10-22 at 00:58 -0700, Patrick Aboyoun wrote:
>>>
>>>> Robert,
>>>> I wrote you some code that you can use below. I've seen other
>>>> reduction operations where the user wanted to reduce a RangedData
>>>> type object by more than one column (e.g. strand and score), so
>>>> perhaps an appropriate signature for a reduce method for RangedData
>>>> would be
>>>>
>>>> reduce(x, by, ...)
>>>>
>>>> If this method where written, the remaining question would be should
>>>> it support reducing transformations of the remaining values columns
>>>> or should non "by" columns be dropped from the output. I'm inclined
>>>> towards the latter, since a reduce(x, by, valuesFUN) method wouldn't
>>>> be much simpler than the rdapply method given below. Would a
>>>> straight-forward reduce(x = rd, by = "strand") approach for
>>>> RangedData meet your needs?
>>>>
>>>>
>>>> > suppressMessages(library(IRanges))
>>>> > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
>>>> > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
>>>> > str <- rep(c("+", "+", "-", "-"), 2)
>>>> > chr <- rep(c("chr1", "chr2"), each = 4)
>>>> > rd <- RangedData(IRanges(start=sta, end=end), strand=str,
>>>> space=chr)
>>>> > rd
>>>> RangedData: 8 ranges by 1 columns
>>>> columns(1): strand
>>>> sequences(2): chr1 chr2
>>>> >
>>>> > # Interesting bit starts here
>>>> > reduceStranded <- function(x) {
>>>> + strand <- x$strand
>>>> + rngs <- ranges(x)[[1]]
>>>> + ans <- IRangesList("+" = reduce(rngs[strand == "+"]),
>>>> + "-" = reduce(rngs[strand == "-"]))
>>>> + RangedData(unlist(ans, use.names = FALSE),
>>>> + strand = rep(c("+", "-"), elementLengths(ans)),
>>>> space = names(x))
>>>> + } > params <- RDApplyParams(rd, reduceStranded, reducerFun =
>>>> function(x) do.call(c, unname(x)))
>>>> > rdapply(params)
>>>> RangedData: 4 ranges by 1 columns
>>>> columns(1): strand
>>>> sequences(2): chr1 chr2
>>>> > as.data.frame(rdapply(params))
>>>> space start end width strand
>>>> 1 chr1 1 4 4 +
>>>> 2 chr1 5 7 3 -
>>>> 3 chr2 2 4 3 +
>>>> 4 chr2 4 7 4 -
>>>> > sessionInfo()
>>>> R version 2.9.2 Patched (2009-08-24 r49420)
>>>> i386-apple-darwin9.8.0
>>>>
>>>> locale:
>>>> C/en_US.UTF-8/C/C/C/C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods
>>>> base other attached packages:
>>>> [1] IRanges_1.2.3
>>>>
>>>>
>>>>
>>>>
>>>> Robert Castelo wrote:
>>>>
>>>>> dear list,
>>>>>
>>>>> i'd like to project a set of exons on genomic space but keeping
>>>>> the strand information. let's assume for simplicity that no exons
>>>>> overlap in opposite strands so no conflicts need to be resolved
>>>>> regarding that case. in principle this is easy without keeping the
>>>>> strand using a RangeList and the reduce method. however i've been
>>>>> struggling for a while without success to figure out how can i
>>>>> "reduce" my coordinates without loosing the strand information.
>>>>>
>>>>> my guess is that to carry out the strand information i need to
>>>>> use a RangedData object:
>>>>>
>>>>> sta <- c(1, 1, 1)
>>>>> end <- c(3, 4, 5)
>>>>> str <- c("+", "+", "-")
>>>>> rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>>>>> rd
>>>>> RangedData: 3 ranges by 1 columns
>>>>> columns(1): strand
>>>>> sequences(2): chr1 chr2
>>>>>
>>>>> and then use rdapply to reduce on each of the chromosomes but i
>>>>> either don't know how to directly apply reduce:
>>>>>
>>>>> params <- RDApplyParams(rd, reduce)
>>>>> rdapply(params)
>>>>> Error in function (classes, fdef, mtable) :
>>>>> unable to find an inherited method for function "reduce", for
>>>>> signature "RangedData"
>>>>>
>>>>>
>>>>> or i loose the strand information:
>>>>>
>>>>> params <- RDApplyParams(rd, function(rd) reduce(ranges(rd)))
>>>>> rdapply(params)
>>>>> $chr1
>>>>> RangesList: 1 range
>>>>> 1: 1:4
>>>>>
>>>>> $chr2
>>>>> RangesList: 1 range
>>>>> 1: 1:5
>>>>>
>>>>> thanks for your help!
>>>>> robert.
>>>>>
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.9.1 (2009-06-26)
>>>>> i386-apple-darwin8.11.1
>>>>>
>>>>> locale:
>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] rtracklayer_1.4.1 RCurl_0.98-1
>>>>> [3] bitops_1.0-4.1 GOstats_2.10.0
>>>>> [5] graph_1.22.2 Category_2.10.1
>>>>> [7] geneplotter_1.22.0 lattice_0.17-25
>>>>> [9] limma_2.18.2 bgSpJncArrayDm1.db_1.0.1
>>>>> [11] org.Dm.eg.db_2.2.11 RSQLite_0.7-1
>>>>> [13] DBI_0.2-4
>>>>> BSgenome.Dmelanogaster.UCSC.dm1_1.0.0
>>>>> [15] BSgenome_1.12.3 RColorBrewer_1.0-2
>>>>> [17] Biostrings_2.12.8 IRanges_1.2.3
>>>>> [19] annotate_1.22.0 AnnotationDbi_1.6.1
>>>>> [21] Biobase_2.4.1 xtable_1.5-5
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] genefilter_1.24.2 GO.db_2.2.11 grid_2.9.1
>>>>> GSEABase_1.6.1
>>>>> [5] RBGL_1.20.0 splines_2.9.1 survival_2.35-4 tools_2.9.1
>>>>> [9] XML_2.5-3
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list