[Bioc-sig-seq] aggregating a RangedData over an IRanges object
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Jul 16 00:11:05 CEST 2009
Simon,
Answering your issues in turn:
Issue 1) Given this, it seems to me now that viewApply and coverage have
very similar use cases. Do we need both?
Reponse 1) Yes. The viewApply function is a general purpose tool for
implementing a function over a set of Views. The coverage function
performs a (weighted) depth count of interval/ranges over a chosen
domain. The viewSums, viewMaxs, etc. are specializations of the
viewApply function that are built for performance reasons, much like the
colSums, etc. functions are for matrices.
Issue 2) I am still quite worried that nobody outside the Hutch will be
able to figure out which class to use when. There are simply so many
classes in IRanges. Do you have a class hierarchy diagram?
Response 2) As a first step Michael and I have fleshed out the IRanges
vignette
http://bioconductor.org/packages/2.5/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdf
to explain more about the IRanges make up. The next step in the
documentation is to map the entities in biological sequence analysis to
IRanges classes. The chipseq package used an amorphous GenomeData class
for this purpose, but that is just a glorified list that in one context
can store coverage information and in another context can store exon
boundaries. I have been trying to revise the chipseq package to be more
specific on what data type should represent what sequence analysis
entity. So for example, an RleList object is a natural representation of
mapped read coverage information across chromosomes/contigs. If you want
to associate exon boundaries to coverage data, then RleViewsList objects
serve as a natural representation. Once we list out all of the sequence
analysis entities, we can create a two column table mapping the entity
to the Bioconductor class. We can then create something akin to a UML
sequence diagram to show how you get from one object type to another in
a typical sequence analysis workflow.
Issue 3) The RangedData constructor isn't intuitive. The following command
v3b <- RangedData(
RangesList(
chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
score = c( 2, 7, 3, 1, 1, 1 ) )
doesn't work as expected.
Response 3) The constructor wasn't general enough. I will fix this shortly.
Issue 4) The aggregate method used for *List objects aren't vectorized
over list elements and the error message isn't useful.
Reponse 4) I need to think about this one for a bit because one could
feasibly wish to aggregate across list elements and not within them.
Perhaps I could check the nature of the by argument and if it is a
"list", then aggregate within elements. Otherwise, aggregate between
elements as it currently does.
Issue 5) The view* functions don't keep element names from original object.
Reponse 5) This was an oversight on my part and I will correct this shortly.
I'll send out an e-mail to the group once I check in these changes.
Patrick
Simon Anders wrote:
> Hi Patrick
>
> Patrick Aboyoun wrote:
>
>> I've made further changes to the IRanges package to make it easier to
>> perform the coverage operations across chromosomes/contigs. I've added
>> RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
>> IRanges as well as a RleViewsList constructor and viewSums and other
>> view* summary methods for RleViewList objects. The code below looks much
>> cleaner now due to the RleViewsList constructor and the viewSums method.
>> Also, I point out the viewApply function, which serves in the capacity
>> of the aggregate function that you were experimenting with in your
>> e-mails. Do these steps fit into your workflow?
>>
>
> Given this, it seems to me now that viewApply and coverage have very
> similar use cases. Do we need both?
>
> I am still quite worried that nobody outside the Hutch will be able to
> figure out which class to use when. There are simply so many classes in
> IRanges. Do you have a class hierarchy diagram?
>
> A justification for all the RangedData-style classes is, of course, that
> one usually deals with multiple chromosomes. So, I went through the
> example of my last mail again, now with two chromosomes.
>
> Last time, we had this here, a RangedData with a single, unnamed
> sequence, and one column, 'score':
>
>
>> v3 <- RangedData(
>>
> + IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
> + score = c( 2, 7, 3 ) )
>
>> v3
>>
> RangedData: 3 ranges by 1 column on 1 sequence
> colnames(1): score
>
> Now, I attempt to make a RangedData with two sequences, 'chrA' and 'chrB':
>
>
>> v3b <- RangedData(
>>
> + RangesList(
> + chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
> + chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
> + score = c( 2, 7, 3, 1, 1, 1 ) )
>
>> v3b
>>
> RangedData: 6 ranges by 0 columns on 2 sequences
> colnames(0):
> names(2): chrA chrB
>
> As you can see, the 'score' argument was ignored. I've tried a couple of
> other syntaxes, but could not get it in. What's the trick?
>
> To proceed, I add my scores manually:
>
>
>> v3b$score <- c( 2, 7, 3, 1, 1, 1 )
>>
>
> I still feel a bit uneasy about the fact that the split between the
> sequences is not apparent here, but maybe it is simplest like this.
>
> I now switch over to coverage vectors:
>
>
>> v2b <- coverage( v3b, weight="score" )
>>
>
> This has now nicely assembled everything over both chromosomes:
>
>
>> as.vector( v2b$chrA )
>>
> [1] 2 2 2 7 7 3 3 3 3
>
>> as.vector( v2b$chrB )
>>
> [1] 1 1 2 1 1 1 1 1 1
>
> Let's get some exons, two on chrA and one on chrB:
>
>
>> exons <- RangesList(
>>
> + chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
> + chrB = IRanges( start = 3, width = 5 ) )
>
> Now, 'aggregate' fails, with a rather misleading error message:
>
>
>> aggregate( v2b, exons, sum )
>>
> Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA, :
> Invalid sequence coordinates.
> Please make sure the supplied 'start', 'end' and 'width' arguments
> are defining a region that is within the limits of the sequence.
>
> So, I'll try with views:
>
>
>> v2exons <- RleViewsList( rleList=v2b, rangesList=exons )
>>
>
>
>> viewSums( v2exons )
>>
> SimpleIntegerList: 2 elements
>
>
>> viewSums( v2exons )[[1]]
>>
> [1] 4 14
>
>> viewSums( v2exons )[[2]]
>>
> [1] 6
>
> This worked, but we lost the sequence names:
>
>
>> viewSums( v2exons )[["chrA"]]
>>
> NULL
>
>> viewSums( v2exons )$chrA
>>
> NULL
>
> I suppose that we again rely here on the sequences being in the same
> order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
> other, say, 'chrB' and 'chrC', this would not get spotted, or would it?
>
>
> Sorry for the length nit-picking mails but this gets all a bit complex...
>
> Cheers
> Simon
>
More information about the Bioc-sig-sequencing
mailing list