[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes
Patrick Aboyoun
paboyoun at fhcrc.org
Sun Jul 5 06:42:56 CEST 2009
Simon,
I have a further update. Martin and I have added methods to perform
the "+" operation for GenomeData objects containing coverage
information in succinct function calls.
I worked on beefing up the low-level IRanges Sequence class, which has
around 80 subclasses including GenomeData, GenomeDataList, RleList, to
include functional programming tools like Reduce, Filter, Find, Map,
Position, and mapply. These methods behave like the standard methods
in base so if you have two GenomeData objects containing Rle objects,
you can use the Map function with f = "+" to perform element-wise
addition. As with the Map function from base, the Map method for
Sequence returns a list object. Martin has also added a gdreduce
function to BSgenome that behaves like the Map function except it
returns a GenomeData object. I need to talk to Martin to see if he
sees a need for a gdreduce function when we can simply make a Map
method for GenomeData objects that does the same thing. My preference
is to replace gdreduce with a Map method for GenomeData since there
isn't much benefit to having two functions that are designed to
perform the same operation.
> suppressMessages(library(BSgenome))
> x <- GenomeData(list(chr1 = Rle(1:10), chr2 = Rle(10:1)))
> Map("+", x, x)
$chr1
'integer' Rle instance of length 10 with 10 runs
Lengths: 1 1 1 1 1 1 1 1 1 1
Values : 2 4 6 8 10 12 14 16 18 20
$chr2
'integer' Rle instance of length 10 with 10 runs
Lengths: 1 1 1 1 1 1 1 1 1 1
Values : 20 18 16 14 12 10 8 6 4 2
> gdreduce("+", x, x)
A GenomeData instance
chromosomes(2): chr1 chr2
> gdreduce("+", x, x)[[1]]
'integer' Rle instance of length 10 with 10 runs
Lengths: 1 1 1 1 1 1 1 1 1 1
Values : 2 4 6 8 10 12 14 16 18 20
> gdreduce("+", x, x)[[2]]
'integer' Rle instance of length 10 with 10 runs
Lengths: 1 1 1 1 1 1 1 1 1 1
Values : 20 18 16 14 12 10 8 6 4 2
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome_1.13.8 Biostrings_2.13.23 IRanges_1.3.30
loaded via a namespace (and not attached):
[1] Biobase_2.5.4
Quoting Patrick Aboyoun <paboyoun at fhcrc.org>:
> Simon,
> I checked in a speedup for coverage calculations to the IRanges
> package. It should be about 30 - 50% faster now. On my laptop, it took
> around 50 seconds to calculate the coverage of 29 million ranges over a
> 1e8 sequence domain on my laptop. Under the old coverage code, this
> calculation took about 100 seconds.
>
>> suppressMessages(library(IRanges))
>> N <- 29e6
>> set.seed(0)
>> x <- IRanges(start=sample(1e8 - 36, N, replace = TRUE), width = 36)
>> system.time(coverage(x, width = 1e8))
> user system elapsed
> 49.747 4.262 54.305
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
> i386-apple-darwin9.7.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] IRanges_1.3.28
>
>
>
> Patrick
>
>
> Quoting Simon Anders <anders at ebi.ac.uk>:
>
>> Hi
>>
>> Patrick Aboyoun wrote:
>>> Simon,
>>> Could you provide some profiling information to show where the
>>> bottlenecks are?
>>
>> I don't know if there is really a clear bottleneck. 9 minutes to
>> calculate the coverage of 29 mio reads is 20 seconds per mio reads;
>> this is probably what the coverage function always needed. So, in the
>> code given in my mail, the summing up of the GenomeData objects is just
>> awkward but not a performance penalty.
>>
>>> I am also wondering if I should be building up the
>>> functionality for RleList, which could have `+` and other Math
>>> operations. We have a lot of classes in the Sequence space and it is
>>> not clear yet which classes are going to be part of the winning
>>> solution.
>>
>> I'd say that this is the main issue. I discover new classes every day.
>> You just mentioned 'RleList', Michael mentions 'GenomeDataList', and
>> Martin has another way to go again.
>>
>> I'm sorry to say that, at least for me, this has become hopelessly
>> confusing, and I imagine that many other users fell the same. You write
>> that "it is not clear yet which classes are going to be part of the
>> winning solution" and I completely agree that it makes more sense to
>> have a few good classes rather than adding functionality to any class
>> on demand. So, maybe don't bother with a `+` operation for now.
>>
>> Best regards
>> Simon
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list