[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes
Patrick Aboyoun
paboyoun at fhcrc.org
Mon Jul 6 20:03:28 CEST 2009
Simon,
As you noted, there are some discrepancies between Map and gdreduce.
I'll explain what I have done so far and start by taking a step back.
The IRanges package contains a number of S4 class definitions starting
with Sequence that mimic the S3 conventions developed for vector in base
R. (The Sequence man page in IRanges contains all the methods that have
been developed for Sequence and its subclasses. Most of these methods
are for "generics" originally developed for S3 vector objects.) Given
the situation that you were encountering, I added Reduce, Map, Find,
Filter, Position, and mapply methods for the Sequence class. Given that
the names attributes for S3 vectors generally serve as labels rather
than true metadata attributes, functions like Map and mapply don't use
the names attributes to align elements of vectors (including lists)
before operating on them. Here is an example using S3 list objects:
> a <- list(chr1 = 1:10, chr2 = 11:20, chr3 = 21:30)
> b <- list(chr1 = -(1:10), chr3 = -(21:30), chr2 = -(11:20))
> Map("+", a, b)
$chr1
[1] 0 0 0 0 0 0 0 0 0 0
$chr2
[1] -10 -10 -10 -10 -10 -10 -10 -10 -10 -10
$chr3
[1] 10 10 10 10 10 10 10 10 10 10
> as.data.frame(a) + as.data.frame(b)
chr1 chr2 chr3
1 0 -10 10
2 0 -10 10
3 0 -10 10
4 0 -10 10
5 0 -10 10
6 0 -10 10
7 0 -10 10
8 0 -10 10
9 0 -10 10
10 0 -10 10
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
If the elements of a and b were aligned by the names attributes, then
all of the values would have been 0. There are pros and cons to making
Sequence treat names differently than vector. On the pro side, promoting
names to a true metadata attribute would avoid the problem you describe.
On the con side, this complicates the user's software design because
they can no longer simply implement code for Sequence and its subclasses
as they would for vector and expect to obtain the same conceptual result.
The GenomeData class is an interesting beast because it was conceptually
designed as a list with some metadata laid on top to indicate what
genome the data are related to. There is currently nothing in the class
that guarantees the names for the list correspond to the names of
chromosomes for the particular genome in question or even if the list
elements have names at all:
> GenomeData(list(1:10, 1:100))
A GenomeData instance
chromosomes(2): 1 2
> names(GenomeData(list(1:10, 1:100)))
NULL
> validObject(GenomeData(list(1:10, 1:100)))
[1] TRUE
Therefore methods for GenomeData that assume the elements are named are
not guaranteed to work for all valid GenomeData objects. I could change
this by requiring GenomeData to have element names, but I need to talk
to the class designers to see if this is what they intended and if it
would break workflows they have designed.
Based on what I have been seeing so far, it appears that GenomeData and
RangedData are dually serving in a role similar to what
AnnotatedDataFrame has in the microarray space. We just need to find the
right mix of flexibility (so people can stores different data types) and
restrictions (so we can define methods that perform non-trivial
operations) for these two classes so the users don't have to spend
needless amounts of time creating clever workarounds if we are too
restrictive or reinventing the wheel if we are not restrictive enough.
Patrick
Simon Anders wrote:
> Hi Patrick
>
> Patrick Aboyoun wrote:
>
>> I worked on beefing up the low-level IRanges Sequence class, which has
>> around 80 subclasses including GenomeData, GenomeDataList, RleList, to
>> include functional programming tools like Reduce, Filter, Find, Map,
>>
> [...]
>
> Thanks, this looks quite useful.
>
>
> I've just played with it and found one issue with chromsome names.
>
> It seems that you Map function finds corresponding Rle vectors in two
> GenomeData objects by looking at the list indices but not at the names.
>
> Here you have two GenomeData objects which have the same values but
> swapped chromosome names:
>
>
>> gd1 <- GenomeData( list(
>>
> chrA = Rle( rep( c(0,3,0), each=20 ) ),
> chrB = Rle( rep( c(0,2,0), each=30 ) ) ) )
>
>
>> gd2 <- GenomeData( list(
>>
> chrB = Rle( rep( c(0,3,0), each=20 ) ),
> chrA = Rle( rep( c(0,2,0), each=30 ) ) ) )
>
> If I add them up, I do not add chromosome A to chromosome A, but the
> first chromsome to the first one:
>
>
>> Map( "+", gd1, gd2 )
>>
> $chrA
> 'numeric' Rle instance of length 60 with 3 runs
> Lengths: 20 20 20
> Values : 0 6 0
>
> $chrB
> 'numeric' Rle instance of length 90 with 3 runs
> Lengths: 30 30 30
> Values : 0 4 0
>
> This is of course consistent with what base::Map does but violated the
> semantics of the GenomeData object. Imagine you use 'coverage' on two
> sets of aligned reads, and for some reasons, one chromosome never
> appears in the first set, and another one never appears in the second
> one. You will have two GenomeData objects with the same number of
> chromosomes, and their sum will be completely wrong.
>
>
> Martin's gdreduce function, as I've just noticed, does not have this issue:
>
>
>> str( gdreduce( "+", gd1, gd2 ) )
>>
> Formal class 'GenomeData' [package "BSgenome"] with 4 slots
> ..@ listData :List of 2
> .. ..$ chrA:Formal class 'Rle' [package "IRanges"] with 5 slots
> .. .. .. ..@ values : num [1:6] 0 3 5 2 0 3
> .. .. .. ..@ lengths : int [1:6] 20 10 10 20 20 10
> .. .. .. ..@ elementMetadata: NULL
> .. .. .. ..@ elementType : chr "ANYTHING"
> .. .. .. ..@ metadata : list()
> .. ..$ chrB:Formal class 'Rle' [package "IRanges"] with 5 slots
> .. .. .. ..@ values : num [1:6] 0 3 5 2 0 3
> .. .. .. ..@ lengths : int [1:6] 20 10 10 20 20 10
> .. .. .. ..@ elementMetadata: NULL
> .. .. .. ..@ elementType : chr "ANYTHING"
> .. .. .. ..@ metadata : list()
> ..@ elementMetadata: NULL
> ..@ elementType : chr "ANYTHING"
> ..@ metadata :List of 3
> .. ..$ organism : NULL
> .. ..$ provider : NULL
> .. ..$ providerVersion: NULL
> Warning messages:
> 1: In f(init, x[[i]]) :
> longer object length is not a multiple of shorter object length
> 2: In f(init, x[[i]]) :
> longer object length is not a multiple of shorter object length
>
>
> Finally: Why did you call it 'map' and Martin called it 'reduce'?
> Actually, I think, 'map' is correct for the two-argument case. See the
> following use-case here for an example of reducing with a map function.
> Assuming that 'lanes' is a list of 'AlignedRead' objects, I expect to
> get the coverage, summed over all these lanes (which are from the same
> condition), by writing something like
>
> coverageOfAllLanes <-
> Reduce(
> function( gd1, gd2 ) Map( "+", gd1, gd2 ),
> lapply( lanes, coverage, width = chromLengths ) )
>
>
> Cheers
> Simon
>
More information about the Bioc-sig-sequencing
mailing list