[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes
Simon Anders
anders at ebi.ac.uk
Sun Jul 5 16:16:13 CEST 2009
Hi,
sorry for sending so many mails, but my thought process seems to be a
bit iterative today.
I'd like to add to my previous mail with an observation regarding
performance.
Let's say we define a group generic as suggested:
Simon Anders wrote:
> What about adding a group generic as follows?
>
> setMethod( "Arith", signature( e1 = "GenomeData", e2 = "GenomeData" ),
> function( e1, e2 )
> gdreduce( # or: GenomeData( Map(
> function( r1, r2 ) get(.Generic)( r1, r2 ),
> e1, e2 ) )
I'll now take the Braski et al. H3K4me1 data from my vignette, and after
some filtering, I'll work with the variable 'lanes', which is a list of
seven AlignedData objects, witha total of 27 mio reads:
> sapply( lanes, length )
run13_lane4.map run15_lane2.map run16_lane2.map run16_lane3.map
839361 3140651 3867821 3934167
run2_lane6.map run2_lane7.map run2_lane8.map
4930304 5588437 5092628
> sum( sapply( lanes, length ) )
[1] 27393369
Given the `+` method defined above, the natural way to get the coverage
of all these lanes together is the following simple line:
cvg1 <- Reduce( `+`, lapply( lanes, coverage, width=seqlens ) )
This is elegant and standard functional-programming style, but it is
slow, compared to a for loop:
> system.time(
+ cvg1 <- Reduce( `+`, lapply( lanes, coverage, width=seqlens ) )
+ )
user system elapsed
374.873 104.028 478.975
> system.time({
+ cvg2 <- GenomeData( list() )
+ for( i in length(lanes) )
+ cvg2 <- cvg2 + coverage( lanes[[i]], width=seqlens )
+ })
user system elapsed
56.551 9.050 65.634
Somehow, this is not very satisfying. To me, it suggests that one really
needs in-place operations to deal with such huge amounts of data.
Simon
More information about the Bioc-sig-sequencing
mailing list