[Bioc-sig-seq] aggregating a RangedData over an IRanges object
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Jul 16 08:52:21 CEST 2009
Simon,
I have checked in changes to IRanges (now version 1.3.36), BioC 2.5
(devel), that improve the RangedData constructor to accept a RangesList
object for its ranges object as well as added passed the element names
when performing view* operations. I'm going to consider the aggregate
use case some more and get back to you shortly on that one. Here is a
revised example:
> suppressMessages(library(IRanges))
> track <- RangedData(
+ RangesList(
+ chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
+ chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
+ score = c( 2, 7, 3, 1, 1, 1 ) )
> track
RangedData: 6 ranges by 1 column on 2 sequences
colnames(1): score
names(2): chrA chrB
> trackCoverage <- coverage(track, weight="score" )
> trackCoverage
SimpleRleList: 2 elements
names(2): chrA chrB
> exons <- RangesList(
+ chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
+ chrB = IRanges( start = 3, width = 5 ) )
> exons
SimpleRangesList: 2 elements
names(2): chrA chrB
> exonView <- RleViewsList( rleList=trackCoverage, rangesList=exons )
> exonView
SimpleRleViewsList: 2 elements
names(2): chrA chrB
> viewSums(exonView)
SimpleIntegerList: 2 elements
names(2): chrA chrB
> viewSums(exonView)[["chrA"]]
[1] 4 14
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.36
Patrick Aboyoun wrote:
> Simon,
> Answering your issues in turn:
>
> Issue 1) Given this, it seems to me now that viewApply and coverage
> have very similar use cases. Do we need both?
>
> Reponse 1) Yes. The viewApply function is a general purpose tool for
> implementing a function over a set of Views. The coverage function
> performs a (weighted) depth count of interval/ranges over a chosen
> domain. The viewSums, viewMaxs, etc. are specializations of the
> viewApply function that are built for performance reasons, much like
> the colSums, etc. functions are for matrices.
>
>
> Issue 2) I am still quite worried that nobody outside the Hutch will
> be able to figure out which class to use when. There are simply so
> many classes in IRanges. Do you have a class hierarchy diagram?
>
> Response 2) As a first step Michael and I have fleshed out the IRanges
> vignette
>
> http://bioconductor.org/packages/2.5/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdf
>
>
> to explain more about the IRanges make up. The next step in the
> documentation is to map the entities in biological sequence analysis
> to IRanges classes. The chipseq package used an amorphous GenomeData
> class for this purpose, but that is just a glorified list that in one
> context can store coverage information and in another context can
> store exon boundaries. I have been trying to revise the chipseq
> package to be more specific on what data type should represent what
> sequence analysis entity. So for example, an RleList object is a
> natural representation of mapped read coverage information across
> chromosomes/contigs. If you want to associate exon boundaries to
> coverage data, then RleViewsList objects serve as a natural
> representation. Once we list out all of the sequence analysis
> entities, we can create a two column table mapping the entity to the
> Bioconductor class. We can then create something akin to a UML
> sequence diagram to show how you get from one object type to another
> in a typical sequence analysis workflow.
>
>
> Issue 3) The RangedData constructor isn't intuitive. The following
> command
>
> v3b <- RangedData(
> RangesList(
> chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
> chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
> score = c( 2, 7, 3, 1, 1, 1 ) )
>
> doesn't work as expected.
>
> Response 3) The constructor wasn't general enough. I will fix this
> shortly.
>
>
> Issue 4) The aggregate method used for *List objects aren't vectorized
> over list elements and the error message isn't useful.
>
> Reponse 4) I need to think about this one for a bit because one could
> feasibly wish to aggregate across list elements and not within them.
> Perhaps I could check the nature of the by argument and if it is a
> "list", then aggregate within elements. Otherwise, aggregate between
> elements as it currently does.
>
>
> Issue 5) The view* functions don't keep element names from original
> object.
>
> Reponse 5) This was an oversight on my part and I will correct this
> shortly.
>
>
> I'll send out an e-mail to the group once I check in these changes.
>
>
>
> Patrick
>
>
>
>
> Simon Anders wrote:
>> Hi Patrick
>>
>> Patrick Aboyoun wrote:
>>
>>> I've made further changes to the IRanges package to make it easier to
>>> perform the coverage operations across chromosomes/contigs. I've added
>>> RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
>>> IRanges as well as a RleViewsList constructor and viewSums and other
>>> view* summary methods for RleViewList objects. The code below looks
>>> much
>>> cleaner now due to the RleViewsList constructor and the viewSums
>>> method.
>>> Also, I point out the viewApply function, which serves in the capacity
>>> of the aggregate function that you were experimenting with in your
>>> e-mails. Do these steps fit into your workflow?
>>>
>>
>> Given this, it seems to me now that viewApply and coverage have very
>> similar use cases. Do we need both?
>>
>> I am still quite worried that nobody outside the Hutch will be able to
>> figure out which class to use when. There are simply so many classes in
>> IRanges. Do you have a class hierarchy diagram?
>>
>> A justification for all the RangedData-style classes is, of course, that
>> one usually deals with multiple chromosomes. So, I went through the
>> example of my last mail again, now with two chromosomes.
>>
>> Last time, we had this here, a RangedData with a single, unnamed
>> sequence, and one column, 'score':
>>
>>
>>> v3 <- RangedData(
>>>
>> + IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
>> + score = c( 2, 7, 3 ) )
>>
>>> v3
>>>
>> RangedData: 3 ranges by 1 column on 1 sequence
>> colnames(1): score
>>
>> Now, I attempt to make a RangedData with two sequences, 'chrA' and
>> 'chrB':
>>
>>
>>> v3b <- RangedData(
>>>
>> + RangesList(
>> + chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
>> + chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
>> + score = c( 2, 7, 3, 1, 1, 1 ) )
>>
>>> v3b
>>>
>> RangedData: 6 ranges by 0 columns on 2 sequences
>> colnames(0):
>> names(2): chrA chrB
>>
>> As you can see, the 'score' argument was ignored. I've tried a couple of
>> other syntaxes, but could not get it in. What's the trick?
>>
>> To proceed, I add my scores manually:
>>
>>
>>> v3b$score <- c( 2, 7, 3, 1, 1, 1 )
>>>
>>
>> I still feel a bit uneasy about the fact that the split between the
>> sequences is not apparent here, but maybe it is simplest like this.
>>
>> I now switch over to coverage vectors:
>>
>>
>>> v2b <- coverage( v3b, weight="score" )
>>>
>>
>> This has now nicely assembled everything over both chromosomes:
>>
>>
>>> as.vector( v2b$chrA )
>>>
>> [1] 2 2 2 7 7 3 3 3 3
>>
>>> as.vector( v2b$chrB )
>>>
>> [1] 1 1 2 1 1 1 1 1 1
>>
>> Let's get some exons, two on chrA and one on chrB:
>>
>>
>>> exons <- RangesList(
>>>
>> + chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
>> + chrB = IRanges( start = 3, width = 5 ) )
>>
>> Now, 'aggregate' fails, with a rather misleading error message:
>>
>>
>>> aggregate( v2b, exons, sum )
>>>
>> Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA, :
>> Invalid sequence coordinates.
>> Please make sure the supplied 'start', 'end' and 'width' arguments
>> are defining a region that is within the limits of the sequence.
>>
>> So, I'll try with views:
>>
>>
>>> v2exons <- RleViewsList( rleList=v2b, rangesList=exons )
>>>
>>
>>
>>> viewSums( v2exons )
>>>
>> SimpleIntegerList: 2 elements
>>
>>
>>> viewSums( v2exons )[[1]]
>>>
>> [1] 4 14
>>
>>> viewSums( v2exons )[[2]]
>>>
>> [1] 6
>>
>> This worked, but we lost the sequence names:
>>
>>
>>> viewSums( v2exons )[["chrA"]]
>>>
>> NULL
>>
>>> viewSums( v2exons )$chrA
>>>
>> NULL
>>
>> I suppose that we again rely here on the sequences being in the same
>> order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
>> other, say, 'chrB' and 'chrC', this would not get spotted, or would it?
>>
>>
>> Sorry for the length nit-picking mails but this gets all a bit
>> complex...
>>
>> Cheers
>> Simon
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list