[Bioc-sig-seq] GRangesList with duplicate names
Martin Morgan
mtmorgan at fhcrc.org
Fri Feb 25 18:46:02 CET 2011
On 02/25/2011 07:05 AM, Steve Lianoglou wrote:
> Hi,
>
> I think I'm with Ivan and leaning towards not allowing duplicate names
> in a GRangesList, even though normal lists in R do allow duplicate
> names.
>
> As Ivan suggested, I also often use the names of any R list when I
> want to use the list as something similar to a Python dictionary.
I cast my vote in the same direction, for similar reasons.
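As a minimal sketch of the dictionary-style concern, using only an ordinary base R list rather than a GRangesList (the object name 'samples' and its contents are made up for illustration):

```r
## An ordinary R list may carry duplicated names
samples <- list(Cancer = 1:3, Cancer = 4:6, Normal = 7:9)

## Dictionary-style lookup silently returns only the first match
first_hit <- samples[["Cancer"]]

## Recovering every match requires an explicit comparison on names()
all_hits <- samples[names(samples) == "Cancer"]

## anyDuplicated() gives the index of the first duplicated name, or 0 if none
dup_index <- anyDuplicated(names(samples))
```

Here the second "Cancer" element is never reached by name-based lookup, which is the kind of surprise that unique names would rule out.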
Dario's use case offered a different perspective on GRangesList, which I had thought
of as a collection of GRanges in a hierarchical relationship, like
exons-within-genes. Maybe this is just me, though.
I also wanted to suggest some alternatives, using
a <- GRanges("A", IRanges(1:3, width=5))
b <- GRanges("B", IRanges(5:7, width=10))
c <- GRanges("C", IRanges(10:12, width=15))
The first is to use a GRangesList but store the case / control status as
elementMetadata / values and take advantage of the flexibility that
offers to record them as a factor
> grl <- GRangesList(a=a, b=b, c=c)
> values(grl)[["Status"]] <- factor(c("Cancer", "Cancer", "Control"))
The second is to more-or-less honor the notion of GRangesList as a
hierarchy, hence use a different data structure
> lst <- SimpleList(a=a, b=b, c=c)
> df <- DataFrame(Status=factor(c("Cancer", "Cancer", "Control")))
> elementMetadata(lst) <- df
The third might be relevant if the GRanges ('regions of interest') are
actually common across samples, e.g.,
d <- GRanges("D", IRanges(c(1,5, 10), c(7, 16, 26)))
perhaps with measurements made on each
assays <- SimpleList(asinhCounts=matrix(rnorm(9, 6, 2), 3))
and coordinated in a SummarizedExperiment
> ## some additional annotation on rows / cols
> names(d) <- paste("roi", seq_len(length(d)), sep="")
> rownames(df) <- paste("sample", seq_len(nrow(df)), sep="")
> sx <- SummarizedExperiment(assays, rowData=d, colData=df)
> sx
class: SummarizedExperiment
dim: 3 3
assays(1): asinhCounts
rownames(3): roi1 roi2 roi3
rowData values names(0):
colnames(3): sample1 sample2 sample3
colData names(1): Status
where measurements (e.g., asinh-transformed counts) associated with
ranges in all samples are part of 'assays', marginal values associated
with rows / ranges (e.g., significance values associated with
differential expression) are values(rowData(sx)), and marginal values
associated with columns / samples are colData(sx).
Martin
> Still, if the consensus turns out to allow duplicate names in
> *RangesList(s), perhaps it'd be nice for the validity method to
> fire off a warning that duplicate names exist in the list so the user
> knows something might be fishy.
>
> -steve
>
> On Fri, Feb 25, 2011 at 9:48 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>> Hello Hervé,
>>
>> While we wait for comments from "power users", I just wanted to say
>> that non-unique names open the door for potentially more problems than
>> solutions.
>>
>> Imagine a Python dictionary or a Perl hash with multiple values per key.
>>
>> I wonder how many R/Bioconductor functions exploit the vector's
>> capability to hold multiple elements with the same name.
>>
>> Regardless, thanks for asking users' opinions.
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Fri, Feb 25, 2011 at 3:08 AM, Pages, Herve <hpages at fhcrc.org> wrote:
>>> Hi Dario,
>>>
>>> A GRangesList object with duplicated names is apparently
>>> considered broken:
>>>
>>>> grl <- GRangesList(GRanges(), GRanges())
>>>> names(grl) <- c("a", "a")
>>>> validObject(grl)
>>> Error in `rownames<-`(`*tmp*`, value = c("a", "a")) :
>>> duplicate rownames not allowed
>>>
>>> If we are ok with this feature, we should fix the "names<-"
>>> method (and any other code around that lets the user generate
>>> broken objects).
>>>
>>> But if we are not ok with this feature, we should modify
>>> the validity method for GRangesList objects. I tend to prefer
>>> this solution for 3 reasons:
>>>
>>> 1. Consistency with ordinary vectors: the names of a vector
>>> in R are not required to be unique.
>>>
>>> 2. It's not uncommon to see the same name used for 2 different
>>> genes. One might still want to be able to stick those names
>>> on a GRangesList object where each top-level element corresponds
>>> to a gene (e.g. exons grouped by gene).
>>>
>>> 3. It's easier to modify the validity method than to go around
>>> trying to find and fix every piece of code in GenomicRanges
>>> (and maybe other places) that can potentially produce a
>>> GRangesList object with duplicated names.
>>>
>>> How do our power users feel about this?
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> ----- Original Message -----
>>> From: "Dario Strbenac" <D.Strbenac at garvan.org.au>
>>> To: bioc-sig-sequencing at r-project.org
>>> Sent: Thursday, February 24, 2011 10:00:11 PM
>>> Subject: [Bioc-sig-seq] GRangesList with duplicate names
>>>
>>> Hello,
>>>
>>> It is possible to create a GRangesList with duplicated names, but not to re-order it.
>>>
>>>> summary(grl)
>>> Length Class Mode
>>> 3 GRangesList S4
>>>> names(grl) <- c("Cancer", "Cancer", "Normal")
>>>> grl[3:1]
>>> Error in `rownames<-`(`*tmp*`, value = c("Normal", "Cancer", "Cancer")) :
>>> duplicate rownames not allowed
>>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
>>> [5] LC_MONETARY=C LC_MESSAGES=en_AU.UTF-8
>>> [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.2.3 IRanges_1.8.9
>>>
>>> --------------------------------------
>>> Dario Strbenac
>>> Research Assistant
>>> Cancer Epigenetics
>>> Garvan Institute of Medical Research
>>> Darlinghurst NSW 2010
>>> Australia
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793