[Bioc-sig-seq] aggregating a RangedData over an IRanges object

Thu Jul 16 08:52:21 CEST 2009

Simon,
I have checked in changes to IRanges (now version 1.3.36), BioC 2.5 
(devel), that improve the RangedData constructor to accept a RangesList 
object for its ranges object as well as added passed the element names 
when performing view* operations. I'm going to consider the aggregate 
use case some more and get back to you shortly on that one. Here is a 
revised example:

 > suppressMessages(library(IRanges))
 > track <- RangedData(
+     RangesList(
+        chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
+        chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
+     score = c( 2, 7, 3, 1, 1, 1 ) )
 > track
RangedData: 6 ranges by 1 column on 2 sequences
colnames(1): score
names(2): chrA chrB
 > trackCoverage <- coverage(track, weight="score" )
 > trackCoverage
SimpleRleList: 2 elements
names(2): chrA chrB
 > exons <- RangesList(
+   chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
+   chrB = IRanges( start = 3, width = 5 ) )
 > exons
SimpleRangesList: 2 elements
names(2): chrA chrB
 > exonView <- RleViewsList( rleList=trackCoverage, rangesList=exons )
 > exonView
SimpleRleViewsList: 2 elements
names(2): chrA chrB
 > viewSums(exonView)
SimpleIntegerList: 2 elements
names(2): chrA chrB
 > viewSums(exonView)[["chrA"]]
[1]  4 14
 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] IRanges_1.3.36

Patrick Aboyoun wrote:
> Simon,
> Answering your issues in turn:
>
> Issue 1) Given this, it seems to me now that viewApply and coverage 
> have very similar use cases. Do we need both?
>
> Reponse 1) Yes. The viewApply function is a general purpose tool for 
> implementing a function over a set of Views. The coverage function 
> performs a (weighted) depth count of interval/ranges over a chosen 
> domain. The viewSums, viewMaxs, etc. are specializations of the 
> viewApply function that are built for performance reasons, much like 
> the colSums, etc. functions are for matrices.
>
>
> Issue 2) I am still quite worried that nobody outside the Hutch will 
> be able to figure out which class to use when. There are simply so 
> many classes in IRanges. Do you have a class hierarchy diagram?
>
> Response 2) As a first step Michael and I have fleshed out the IRanges 
> vignette
>
> http://bioconductor.org/packages/2.5/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdf 
>
>
> to explain more about the IRanges make up. The next step in the 
> documentation is to map the entities in biological sequence analysis 
> to IRanges classes. The chipseq package used an amorphous GenomeData 
> class for this purpose, but that is just a glorified list that in one 
> context can store coverage information and in another context can 
> store exon boundaries. I have been trying to revise the chipseq 
> package to be more specific on what data type should represent what 
> sequence analysis entity. So for example, an RleList object is a 
> natural representation of mapped read coverage information across 
> chromosomes/contigs. If you want to associate exon boundaries to 
> coverage data, then RleViewsList objects serve as a natural 
> representation. Once we list out all of the sequence analysis 
> entities, we can create a two column table mapping the entity to the 
> Bioconductor class. We can then create something akin to a UML 
> sequence diagram to show how you get from one object type to another 
> in a typical sequence analysis workflow.
>
>
> Issue 3) The RangedData constructor isn't intuitive. The following 
> command
>
> v3b <- RangedData(
>     RangesList(
>     chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
>    chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
>    score = c( 2, 7, 3, 1, 1, 1 ) )
>
> doesn't work as expected.
>
> Response 3) The constructor wasn't general enough. I will fix this 
> shortly.
>
>
> Issue 4) The aggregate method used for *List objects aren't vectorized 
> over list elements and the error message isn't useful.
>
> Reponse 4) I need to think about this one for a bit because one could 
> feasibly wish to aggregate across list elements and not within them. 
> Perhaps I could check the nature of the by argument and if it is a 
> "list", then aggregate within elements. Otherwise, aggregate between 
> elements as it currently does.
>
>
> Issue 5) The view* functions don't keep element names from original 
> object.
>
> Reponse 5) This was an oversight on my part and I will correct this 
> shortly.
>
>
> I'll send out an e-mail to the group once I check in these changes.
>
>
>
> Patrick
>
>
>
>
> Simon Anders wrote:
>> Hi Patrick
>>
>> Patrick Aboyoun wrote:
>>  
>>> I've made further changes to the IRanges package to make it easier to
>>> perform the coverage operations across chromosomes/contigs. I've added
>>> RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
>>> IRanges as well as a RleViewsList constructor and viewSums and other
>>> view* summary methods for RleViewList objects. The code below looks 
>>> much
>>> cleaner now due to the RleViewsList constructor and the viewSums 
>>> method.
>>> Also, I point out the viewApply function, which serves in the capacity
>>> of the aggregate function that you were experimenting with in your
>>> e-mails. Do these steps fit into your workflow?
>>>     
>>
>> Given this, it seems to me now that viewApply and coverage have very
>> similar use cases. Do we need both?
>>
>> I am still quite worried that nobody outside the Hutch will be able to
>> figure out which class to use when. There are simply so many classes in
>> IRanges. Do you have a class hierarchy diagram?
>>
>> A justification for all the RangedData-style classes is, of course, that
>> one usually deals with multiple chromosomes. So, I went through the
>> example of my last mail again, now with two chromosomes.
>>
>> Last time, we had this here, a RangedData with a single, unnamed
>> sequence, and one column, 'score':
>>
>>  
>>> v3 <- RangedData(
>>>     
>> +    IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
>> +    score = c( 2, 7, 3 ) )
>>  
>>> v3
>>>     
>> RangedData: 3 ranges by 1 column on 1 sequence
>> colnames(1): score
>>
>> Now, I attempt to make a RangedData with two sequences, 'chrA' and 
>> 'chrB':
>>
>>  
>>> v3b <- RangedData(
>>>     
>> +    RangesList(
>> +       chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
>> +       chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
>> +    score = c( 2, 7, 3, 1, 1, 1 ) )
>>  
>>> v3b
>>>     
>> RangedData: 6 ranges by 0 columns on 2 sequences
>> colnames(0):
>> names(2): chrA chrB
>>
>> As you can see, the 'score' argument was ignored. I've tried a couple of
>> other syntaxes, but could not get it in. What's the trick?
>>
>> To proceed, I add my scores manually:
>>
>>  
>>> v3b$score <- c( 2, 7, 3, 1, 1, 1 )
>>>     
>>
>> I still feel a bit uneasy about the fact that the split between the
>> sequences is not apparent here, but maybe it is simplest like this.
>>
>> I now switch over to coverage vectors:
>>
>>  
>>> v2b <- coverage( v3b, weight="score" )
>>>     
>>
>> This has now nicely assembled everything over both chromosomes:
>>
>>  
>>> as.vector( v2b$chrA )
>>>     
>> [1] 2 2 2 7 7 3 3 3 3
>>  
>>> as.vector( v2b$chrB )
>>>     
>> [1] 1 1 2 1 1 1 1 1 1
>>
>> Let's get some exons, two on chrA and one on chrB:
>>
>>  
>>> exons <- RangesList(
>>>     
>> +    chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
>> +    chrB = IRanges( start = 3, width = 5 ) )
>>
>> Now, 'aggregate' fails, with a rather misleading error message:
>>
>>  
>>> aggregate( v2b, exons, sum )
>>>     
>> Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA,  :
>>   Invalid sequence coordinates.
>>   Please make sure the supplied 'start', 'end' and 'width' arguments
>>   are defining a region that is within the limits of the sequence.
>>
>> So, I'll try with views:
>>
>>  
>>> v2exons <- RleViewsList( rleList=v2b, rangesList=exons )
>>>     
>>
>>  
>>> viewSums( v2exons )
>>>     
>> SimpleIntegerList: 2 elements
>>
>>  
>>> viewSums( v2exons )[[1]]
>>>     
>> [1]  4 14
>>  
>>> viewSums( v2exons )[[2]]
>>>     
>> [1] 6
>>
>> This worked, but we lost the sequence names:
>>
>>  
>>> viewSums( v2exons )[["chrA"]]
>>>     
>> NULL
>>  
>>> viewSums( v2exons )$chrA
>>>     
>> NULL
>>
>> I suppose that we again rely here on the sequences being in the same
>> order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
>> other, say, 'chrB' and 'chrC', this would not get spotted, or would it?
>>
>>
>> Sorry for the length nit-picking mails but this gets all a bit 
>> complex...
>>
>> Cheers
>>   Simon
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing