[Bioc-sig-seq] seqselect on SimpleRleList and RangesList - bug? and request

Mon Jun 14 20:37:39 CEST 2010

Thank you both - sounds like this touches on a lot of different  
functions.  I've used findOverlaps quite a bit, so I had assumed other  
functions would work similarly, using the space names as indices. It  
would be good to make it really obvious to the user that isn't the case.

 From my naive user's point of view, it would be really useful to be  
able to use the chromosome names to select portions of a bunch of Rles  
(without requiring parallel space naming), in an analogous way to  
using findOverlaps on RangedData objects. Perhaps a switch on  
seqselect (usenames=TRUE)?  Or introduce another function?

Or, if you go with Patrick's idea below about requiring space names  
either to be all NULL or all distinct and non-empty, then (by analogy  
with findOverlaps), you can use space names if they're present, and if  
they're not present go by list position. As for your question about  
what do when name sets are not identical, I haven't looked at how you  
solved that for findOverlaps but perhaps something analagous could  
work for seqselect.

I guess I can do what I need to do by redefining my range information  
to use ordered factors as space names but it'd be really nice not to  
have to take those extra steps - it seems a lot more complicated than  
it needs to be. I think that'll work, because the way I was using  
seqselect on my real data was to define my ranges as RangedData  
objects, and then pass that to seqselect using  
ranges(my_rangeddata_object).

I can imagine a lot of cases where the user might want to select  
scores from the whole genome on just a subset of ranges that don't  
include one on every chromosome (and where chromosomes might not be  
sorted in the same way as the genome).  Am I thinking about this the  
wrong way?  Maybe there are better ways to represent the scores than  
SimpleRleList that would allow this more easily.

I don't know enough about the inner workings of IRanges to push  
strongly for any particular solution, but from the biologist's point  
of view here are a couple of questions to stimulate discussion.   Why  
allow names to be specified at all if they're not meaningful?  What  
kind of situation would a user be trying to represent with the  
"RangesList(a = IRanges(1,1), a = IRanges(1,2))" example? Could this  
situation simply be disallowed - would that mess up any real examples?

thanks again,

Janet

On Jun 12, 2010, at 6:04 AM, Patrick Aboyoun wrote:

> But what is the scope of space? For example, the reduce operation  
> has no concept of space (see below). In GenomicRanges, we introduced  
> the concept of seqlengths to a number of classes including GRanges  
> and GRangesList. There are certain restrictions of what can be held  
> in a seqlengths slot, for example you can't mix NAs with non-NAs.  
> Perhaps we can formalize space for all List objects so that you  
> either have names of NULL or all the names must be distinct, non- 
> empty strings. We would also have to define what happens in a binary  
> operation involving two List objects when name sets are not identical.
>
>
> > RangesList(a = IRanges(1,1), a = IRanges(1,2))
> SimpleRangesList of length 2
> $a
> IRanges of length 1
>     start end width
> [1]     1   1     1
>
> $a
> IRanges of length 1
>     start end width
> [1]     1   2     2
>
> > validObject(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
> [1] TRUE
> > reduce(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
> SimpleRangesList of length 2
> $a
> IRanges of length 1
>     start end width
> [1]     1   1     1
>
> $a
> IRanges of length 1
>     start end width
> [1]     1   2     2
>
>
>
> Patrick
>
>
>
> On 6/12/10 5:47 AM, Michael Lawrence wrote:
>>
>>
>>
>> On Sat, Jun 12, 2010 at 12:17 AM, Patrick Aboyoun  
>> <paboyoun at fhcrc.org> wrote:
>> Janet,
>> Most function in the IRanges package follows the R convention of  
>> considering the elements of names to be loosely linked attributes  
>> rather than rigid keys. For convenience, functions such as $, [,  
>> [[ treat a list as a hash if it has names, but in most  
>> circumstances the names are ignored or copied without use. Even  
>> when there are names on elements, there are some odd corner cases  
>> that can cause problems. For example, if I wanted to have multiple  
>> list elements with the same name, then some important operations  
>> give unexpected results:
>>
>> > list(a = 1, a = 2)["a"]
>> $a
>> [1] 1
>>
>> If the issue is limited to enhance the seqselect function to make  
>> it name aware, it probably makes sense to go ahead with the  
>> enhancement. But the scope of this issue can grow quite large. For  
>> example, should names be used when adding to RleList objects? What  
>> should the following produce
>>
>> RleList(a = Rle(1)) + RleList(a = Rle(2), a = Rle(3), b = Rle(4))
>>
>> Due to these types of ambiguities, I would rather focus on  
>> educating the user to be mindful that these are position-oriented  
>> rather than key-oriented objects and have them ensure that elements  
>> are in alignment.
>>
>> Thoughts?
>>
>>
>>
>> Sometimes in IRanges the names have a special semantic -- that of a  
>> "space". I guess this is limited to RangesList. Other data  
>> structures, like RleList, are often treated as being separated by  
>> space or chromosome, though their names have never explicitly been  
>> treated as the space. This inconsistency is probably OK, but it  
>> needs to be documented.
>>
>> Patrick
>>
>>
>>
>>
>> On 6/11/10 4:06 PM, Janet Young wrote:
>> Hi,
>>
>> I've been playing around with seqselect on scores stored in a  
>> SimpleRleList object to get subregions defined in a RangesList  
>> object.
>>
>> I found a couple of things:  first an enhancement request - would  
>> it be possible to allow seqselect to deal with cases where not  
>> every space (name) in the SimpleRleList has a corresponding space/ 
>> name in the RangesList object?
>>
>> The second is either bug or else I've misunderstood the way  
>> seqselect is supposed to work, in a dangerous way - it looks like  
>> seqselect doesn't use the names of the list items to select scores,  
>> it just assumes that in the two lists the elements have the same  
>> names in the same order.
>>
>> The code below should explain both issues problem much better than  
>> those descriptions.
>>
>> thanks,
>>
>> Janet
>>
>>
>>
>> > library(IRanges)
>>
>> Attaching package: 'IRanges'
>>
>> The following object(s) are masked from 'package:base':
>>
>>    cbind, Map, mapply, order, paste, pmax, pmax.int, pmin,  
>> pmin.int, rbind, rep.int, table
>>
>> >
>> > ### generate some arbitrary scores
>> > track <- RangedData(RangesList(chrA = IRanges(start = c(1, 4, 6),  
>> width=c(3, 2, 4)),chrB = IRanges(start = c(1, 3, 6), width=c(3, 3,  
>> 4))) )
>> > trackCoverage <- coverage(track,  
>> weight=list(chrA=c(2,7,3),chrB=c(1,1,1)) )
>> >
>> > ### define subregions
>> > exons <- RangesList(chrA = IRanges(start = c(2, 4), width =  
>> c(2,2)),chrB = IRanges(start = 3, width = 5))
>> >
>> > ### seqselect works if all spaces in trackCoverage have an  
>> element in exons
>> > seqselect(trackCoverage,exons )
>> SimpleRleList of length 2
>> $chrA
>> 'integer' Rle of length 4 with 2 runs
>>  Lengths: 2 2
>>  Values : 2 7
>>
>> $chrB
>> 'integer' Rle of length 5 with 2 runs
>>  Lengths: 1 4
>>  Values : 2 1
>>
>> >
>> > ### define subregions only on one chr
>> > exons_chrAonly <- RangesList(chrA = IRanges(start = c(2, 4),  
>> width = c(2, 2)))
>> > ### now seqselect doesn't work if some spaces don't have any  
>> elements
>> > seqselect(trackCoverage,exons_chrAonly )
>> Error in seqselect(trackCoverage, exons_chrAonly) :
>>  'length(start)' must equal 'length(x)' when 'end' and 'width' are  
>> NULL
>> >
>> >
>> > ##### also, defining the regions with spaces in a different order  
>> seems to cause trouble as seqselect doesn't seem to be using the  
>> list's names - just going by order of elements
>> > exons_reorderchrs <- RangesList(chrB = IRanges(start = 3, width =  
>> 5),chrA = IRanges(start = c(2, 4), width = c(2,2)))
>> > seqselect(trackCoverage,exons_reorderchrs )
>> SimpleRleList of length 2
>> $chrA
>> 'integer' Rle of length 5 with 3 runs
>>  Lengths: 1 2 2
>>  Values : 2 7 3
>>
>> $chrB
>> 'integer' Rle of length 4 with 3 runs
>>  Lengths: 1 1 2
>>  Values : 1 2 1
>>
>> >
>> > identical ( seqselect(trackCoverage,exons ) ,  
>> seqselect(trackCoverage,exons_reorderchrs )  )
>> [1] FALSE
>> >
>> > sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] IRanges_1.6.6
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>