[Bioc-sig-seq] aggregating a RangedData over an IRanges object

Thu Jul 16 00:11:05 CEST 2009

Simon,
Answering your issues in turn:

Issue 1) Given this, it seems to me now that viewApply and coverage have 
very similar use cases. Do we need both?

Reponse 1) Yes. The viewApply function is a general purpose tool for 
implementing a function over a set of Views. The coverage function 
performs a (weighted) depth count of interval/ranges over a chosen 
domain. The viewSums, viewMaxs, etc. are specializations of the 
viewApply function that are built for performance reasons, much like the 
colSums, etc. functions are for matrices.

Issue 2) I am still quite worried that nobody outside the Hutch will be 
able to figure out which class to use when. There are simply so many 
classes in IRanges. Do you have a class hierarchy diagram?

Response 2) As a first step Michael and I have fleshed out the IRanges 
vignette

http://bioconductor.org/packages/2.5/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdf

to explain more about the IRanges make up. The next step in the 
documentation is to map the entities in biological sequence analysis to 
IRanges classes. The chipseq package used an amorphous GenomeData class 
for this purpose, but that is just a glorified list that in one context 
can store coverage information and in another context can store exon 
boundaries. I have been trying to revise the chipseq package to be more 
specific on what data type should represent what sequence analysis 
entity. So for example, an RleList object is a natural representation of 
mapped read coverage information across chromosomes/contigs. If you want 
to associate exon boundaries to coverage data, then RleViewsList objects 
serve as a natural representation. Once we list out all of the sequence 
analysis entities, we can create a two column table mapping the entity 
to the Bioconductor class. We can then create something akin to a UML 
sequence diagram to show how you get from one object type to another in 
a typical sequence analysis workflow.

Issue 3) The RangedData constructor isn't intuitive. The following command

v3b <- RangedData(
     RangesList(
     chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
    chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
    score = c( 2, 7, 3, 1, 1, 1 ) )

doesn't work as expected.

Response 3) The constructor wasn't general enough. I will fix this shortly.

Issue 4) The aggregate method used for *List objects aren't vectorized 
over list elements and the error message isn't useful.

Reponse 4) I need to think about this one for a bit because one could 
feasibly wish to aggregate across list elements and not within them. 
Perhaps I could check the nature of the by argument and if it is a 
"list", then aggregate within elements. Otherwise, aggregate between 
elements as it currently does.

Issue 5) The view* functions don't keep element names from original object.

Reponse 5) This was an oversight on my part and I will correct this shortly.

I'll send out an e-mail to the group once I check in these changes.

Patrick

Simon Anders wrote:
> Hi Patrick
>
> Patrick Aboyoun wrote:
>   
>> I've made further changes to the IRanges package to make it easier to
>> perform the coverage operations across chromosomes/contigs. I've added
>> RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
>> IRanges as well as a RleViewsList constructor and viewSums and other
>> view* summary methods for RleViewList objects. The code below looks much
>> cleaner now due to the RleViewsList constructor and the viewSums method.
>> Also, I point out the viewApply function, which serves in the capacity
>> of the aggregate function that you were experimenting with in your
>> e-mails. Do these steps fit into your workflow?
>>     
>
> Given this, it seems to me now that viewApply and coverage have very
> similar use cases. Do we need both?
>
> I am still quite worried that nobody outside the Hutch will be able to
> figure out which class to use when. There are simply so many classes in
> IRanges. Do you have a class hierarchy diagram?
>
> A justification for all the RangedData-style classes is, of course, that
> one usually deals with multiple chromosomes. So, I went through the
> example of my last mail again, now with two chromosomes.
>
> Last time, we had this here, a RangedData with a single, unnamed
> sequence, and one column, 'score':
>
>   
>> v3 <- RangedData(
>>     
> +    IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
> +    score = c( 2, 7, 3 ) )
>   
>> v3
>>     
> RangedData: 3 ranges by 1 column on 1 sequence
> colnames(1): score
>
> Now, I attempt to make a RangedData with two sequences, 'chrA' and 'chrB':
>
>   
>> v3b <- RangedData(
>>     
> +    RangesList(
> +       chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
> +       chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
> +    score = c( 2, 7, 3, 1, 1, 1 ) )
>   
>> v3b
>>     
> RangedData: 6 ranges by 0 columns on 2 sequences
> colnames(0):
> names(2): chrA chrB
>
> As you can see, the 'score' argument was ignored. I've tried a couple of
> other syntaxes, but could not get it in. What's the trick?
>
> To proceed, I add my scores manually:
>
>   
>> v3b$score <- c( 2, 7, 3, 1, 1, 1 )
>>     
>
> I still feel a bit uneasy about the fact that the split between the
> sequences is not apparent here, but maybe it is simplest like this.
>
> I now switch over to coverage vectors:
>
>   
>> v2b <- coverage( v3b, weight="score" )
>>     
>
> This has now nicely assembled everything over both chromosomes:
>
>   
>> as.vector( v2b$chrA )
>>     
> [1] 2 2 2 7 7 3 3 3 3
>   
>> as.vector( v2b$chrB )
>>     
> [1] 1 1 2 1 1 1 1 1 1
>
> Let's get some exons, two on chrA and one on chrB:
>
>   
>> exons <- RangesList(
>>     
> +    chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
> +    chrB = IRanges( start = 3, width = 5 ) )
>
> Now, 'aggregate' fails, with a rather misleading error message:
>
>   
>> aggregate( v2b, exons, sum )
>>     
> Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA,  :
>   Invalid sequence coordinates.
>   Please make sure the supplied 'start', 'end' and 'width' arguments
>   are defining a region that is within the limits of the sequence.
>
> So, I'll try with views:
>
>   
>> v2exons <- RleViewsList( rleList=v2b, rangesList=exons )
>>     
>
>   
>> viewSums( v2exons )
>>     
> SimpleIntegerList: 2 elements
>
>   
>> viewSums( v2exons )[[1]]
>>     
> [1]  4 14
>   
>> viewSums( v2exons )[[2]]
>>     
> [1] 6
>
> This worked, but we lost the sequence names:
>
>   
>> viewSums( v2exons )[["chrA"]]
>>     
> NULL
>   
>> viewSums( v2exons )$chrA
>>     
> NULL
>
> I suppose that we again rely here on the sequences being in the same
> order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
> other, say, 'chrB' and 'chrC', this would not get spotted, or would it?
>
>
> Sorry for the length nit-picking mails but this gets all a bit complex...
>
> Cheers
>   Simon
>