[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes
Patrick Aboyoun
paboyoun at fhcrc.org
Wed Jul 1 05:01:21 CEST 2009
Simon,
I checked in a speedup for coverage calculations to the IRanges
package. It should be about 30 - 50% faster now. On my laptop, it took
around 50 seconds to calculate the coverage of 29 million ranges over
a 1e8 sequence domain on my laptop. Under the old coverage code, this
calculation took about 100 seconds.
> suppressMessages(library(IRanges))
> N <- 29e6
> set.seed(0)
> x <- IRanges(start=sample(1e8 - 36, N, replace = TRUE), width = 36)
> system.time(coverage(x, width = 1e8))
user system elapsed
49.747 4.262 54.305
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.28
Patrick
Quoting Simon Anders <anders at ebi.ac.uk>:
> Hi
>
> Patrick Aboyoun wrote:
>> Simon,
>> Could you provide some profiling information to show where the
>> bottlenecks are?
>
> I don't know if there is really a clear bottleneck. 9 minutes to
> calculate the coverage of 29 mio reads is 20 seconds per mio reads;
> this is probably what the coverage function always needed. So, in the
> code given in my mail, the summing up of the GenomeData objects is just
> awkward but not a performance penalty.
>
>> I am also wondering if I should be building up the
>> functionality for RleList, which could have `+` and other Math
>> operations. We have a lot of classes in the Sequence space and it is
>> not clear yet which classes are going to be part of the winning
>> solution.
>
> I'd say that this is the main issue. I discover new classes every day.
> You just mentioned 'RleList', Michael mentions 'GenomeDataList', and
> Martin has another way to go again.
>
> I'm sorry to say that, at least for me, this has become hopelessly
> confusing, and I imagine that many other users fell the same. You write
> that "it is not clear yet which classes are going to be part of the
> winning solution" and I completely agree that it makes more sense to
> have a few good classes rather than adding functionality to any class
> on demand. So, maybe don't bother with a `+` operation for now.
>
> Best regards
> Simon
More information about the Bioc-sig-sequencing
mailing list