[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Mon Jul 6 20:03:28 CEST 2009

Simon,
As you noted, there are some discrepancies between Map and gdreduce. 
I'll explain what I have done so far and start by taking a step back. 
The IRanges package contains a number of S4 class definitions starting 
with Sequence that mimic the S3 conventions developed for vector in base 
R. (The Sequence man page in IRanges contains all the methods that have 
been developed for Sequence and its subclasses. Most of these methods 
are for "generics" originally developed for S3 vector objects.) Given 
the situation that you were encountering, I added Reduce, Map, Find, 
Filter, Position, and mapply methods for the Sequence class. Given that 
the names attributes for S3 vectors generally serve as labels rather 
than true metadata attributes, functions like Map and mapply don't use 
the names attributes to align elements of vectors (including lists) 
before operating on them. Here is an example using S3 list objects:

 > a <- list(chr1 = 1:10, chr2 = 11:20, chr3 = 21:30)
 > b <- list(chr1 = -(1:10), chr3 = -(21:30), chr2 = -(11:20))
 > Map("+", a, b)
$chr1
 [1] 0 0 0 0 0 0 0 0 0 0

$chr2
 [1] -10 -10 -10 -10 -10 -10 -10 -10 -10 -10

$chr3
 [1] 10 10 10 10 10 10 10 10 10 10

 > as.data.frame(a) + as.data.frame(b)
   chr1 chr2 chr3
1     0  -10   10
2     0  -10   10
3     0  -10   10
4     0  -10   10
5     0  -10   10
6     0  -10   10
7     0  -10   10
8     0  -10   10
9     0  -10   10
10    0  -10   10
 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

If the elements of a and b were aligned by the names attributes, then 
all of the values would have been 0. There are pros and cons to making 
Sequence treat names differently than vector. On the pro side, promoting 
names to a true metadata attribute would avoid the problem you describe. 
On the con side, this complicates the user's software design because 
they can no longer simply implement code for Sequence and its subclasses 
as they would for vector and expect to obtain the same conceptual result.

The GenomeData class is an interesting beast because it was conceptually 
designed as a list with some metadata laid on top to indicate what 
genome the data are related to. There is currently nothing in the class 
that guarantees the names for the list correspond to the names of 
chromosomes for the particular genome in question or even if the list 
elements have names at all:

 > GenomeData(list(1:10, 1:100))
A GenomeData instance
chromosomes(2): 1 2
 > names(GenomeData(list(1:10, 1:100)))
NULL
 > validObject(GenomeData(list(1:10, 1:100)))
[1] TRUE

Therefore methods for GenomeData that assume the elements are named are 
not guaranteed to work for all valid GenomeData objects. I could change 
this by requiring GenomeData to have element names, but I need to talk 
to the class designers to see if this is what they intended and if it 
would break workflows they have designed.

Based on what I have been seeing so far, it appears that GenomeData and 
RangedData are dually serving in a role similar to what 
AnnotatedDataFrame has in the microarray space. We just need to find the 
right mix of flexibility (so people can stores different data types) and 
restrictions (so we can define methods that perform non-trivial 
operations) for these two classes so the users don't have to spend 
needless amounts of time creating clever workarounds if we are too 
restrictive or reinventing the wheel if we are not restrictive enough.

Patrick

Simon Anders wrote:
> Hi Patrick
>
> Patrick Aboyoun wrote:
>   
>> I worked on beefing up the low-level IRanges Sequence class, which has
>> around 80 subclasses including GenomeData, GenomeDataList, RleList, to
>> include functional programming tools like Reduce, Filter, Find, Map,
>>     
> [...]
>
> Thanks, this looks quite useful.
>
>
> I've just played with it and found one issue with chromsome names.
>
> It seems that you Map function finds corresponding Rle vectors in two
> GenomeData objects by looking at the list indices but not at the names.
>
> Here you have two GenomeData objects which have the same values but
> swapped chromosome names:
>
>   
>> gd1 <- GenomeData( list(
>>     
>     chrA = Rle( rep( c(0,3,0), each=20 ) ),
>     chrB = Rle( rep( c(0,2,0), each=30 ) ) ) )
>
>   
>> gd2 <- GenomeData( list(
>>     
>     chrB = Rle( rep( c(0,3,0), each=20 ) ),
>     chrA = Rle( rep( c(0,2,0), each=30 ) ) ) )
>
> If I add them up, I do not add chromosome A to chromosome A, but the
> first chromsome to the first one:
>
>   
>> Map( "+", gd1, gd2 )
>>     
> $chrA
>   'numeric' Rle instance of length 60 with 3 runs
>   Lengths:  20 20 20
>   Values :  0 6 0
>
> $chrB
>   'numeric' Rle instance of length 90 with 3 runs
>   Lengths:  30 30 30
>   Values :  0 4 0
>
> This is of course consistent with what base::Map does but violated the
> semantics of the GenomeData object. Imagine you use 'coverage' on two
> sets of aligned reads, and for some reasons, one chromosome never
> appears in the first set, and another one never appears in the second
> one. You will have two GenomeData objects with the same number of
> chromosomes, and their sum will be completely wrong.
>
>
> Martin's gdreduce function, as I've just noticed, does not have this issue:
>
>   
>> str( gdreduce( "+", gd1, gd2 ) )
>>     
> Formal class 'GenomeData' [package "BSgenome"] with 4 slots
>   ..@ listData       :List of 2
>   .. ..$ chrA:Formal class 'Rle' [package "IRanges"] with 5 slots
>   .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
>   .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
>   .. .. .. ..@ elementMetadata: NULL
>   .. .. .. ..@ elementType    : chr "ANYTHING"
>   .. .. .. ..@ metadata       : list()
>   .. ..$ chrB:Formal class 'Rle' [package "IRanges"] with 5 slots
>   .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
>   .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
>   .. .. .. ..@ elementMetadata: NULL
>   .. .. .. ..@ elementType    : chr "ANYTHING"
>   .. .. .. ..@ metadata       : list()
>   ..@ elementMetadata: NULL
>   ..@ elementType    : chr "ANYTHING"
>   ..@ metadata       :List of 3
>   .. ..$ organism       : NULL
>   .. ..$ provider       : NULL
>   .. ..$ providerVersion: NULL
> Warning messages:
> 1: In f(init, x[[i]]) :
>   longer object length is not a multiple of shorter object length
> 2: In f(init, x[[i]]) :
>   longer object length is not a multiple of shorter object length
>
>
> Finally: Why did you call it 'map' and Martin called it 'reduce'?
> Actually, I think, 'map' is correct for the two-argument case. See the
> following use-case here for an example of reducing with a map function.
> Assuming that 'lanes' is a list of 'AlignedRead' objects, I expect to
> get the coverage, summed over all these lanes (which are from the same
> condition), by writing something like
>
> coverageOfAllLanes <-
>    Reduce(
>       function( gd1, gd2 ) Map( "+", gd1, gd2 ),
>       lapply( lanes, coverage, width = chromLengths ) )
>
>
> Cheers
>   Simon
>