[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes
Martin Morgan
mtmorgan at fhcrc.org
Tue Jun 30 18:49:48 CEST 2009
Michael Lawrence wrote:
> On Tue, Jun 30, 2009 at 8:32 AM, Simon Anders <anders at ebi.ac.uk> wrote:
>
>> Hi Michael et al.
>>
>> Assume I have a list of AlignedRead objects, with data from several Solexa
>> lanes. I would like to get a coverage vector over all the lanes.
>>
>> As 'coverage' takes only one AlignedRead object, I have two possibilities:
>>
>> (a) Concatenate the AlignedRead objects to a single big one. As far as I
>> can see, 'rbind2' is not implemented for AlignedRead, and 'append' seemed
>> very slow to me. It is probably not posible to do this without making a copy
>> of all the data in memory.
>>
>
> I think this is what combineLanes() in the chipseq package does, and it
> seems to work fine for us. Note that it takes a GenomeDataList, rather than
> a list of AlignedReads, but ShortRead makes it easy to get a GenomeDataList,
> if I remember right.
GenomeData is under-specified in terms of its content, and combineLanes
is I think expecting distinct plus and minus strand information. This
also makes it difficult to implement a '+' on GenomeData.
I have a "Reduce"-like function I'll add to BSgenome later today.
Martin
> Of course, optimization is welcome.
>
>
>> (b) Calculate the coverage for each AlignedRead object separately and add
>> up the GenomeData objects. The `+` operator is not supported for GenomeData
>> objects but it is for Rle objects. So I need to loop through the
>> chromosomes.
>>
>
>> I've now written this short function for the purpose:
>>
>> sumUpCoverage <- function( lanes, seqLens )
>> {
>> res <- NULL
>> for( i in 1:length(lanes) ) {
>> cvg <- coverage( lanes, width=seqLens )
>> if( is.null( res ) )
>> res <- cvg
>> else {
>> stopifnot( all( names(res) == names(cvg) ) )
>> for( seq in names(res) )
>> res[[seq]] <- res[[seq]] + cvg[[seq]]
>> }
>> }
>> res
>> }
>>
>> This does not seem very elegant (and it takes 9 minutes, which, however,
>> might be ok for 29 mio reads). Any idea how to do it better? (The use of
>> 'for' instead of 'sapply' is on purpose: I hope to save memory that way.)
>>
>> And would it make sense to overload the `+` operator for GenomeData
>> objects? If so, could I suggest adding this to BSgenome?
>>
>> Cheers
>> Simon
>>
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list