[Bioc-sig-seq] Loading large BAM files
Martin Morgan
mtmorgan at fhcrc.org
Wed Jul 13 22:57:45 CEST 2011
On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
> Hi everybody,
>
> As I wait for my large BAM to be read in by scanBam, I can't help but wonder:
>
> Has anybody tried combining scanBam with multicore to load the
> chromosomes in parallel?
>
> That would require
>
> 1) to merge the chunks at the end and
>
> 2) the original BAM to be indexed.
>
> Does anybody have any experience to share?
I was wondering how large a file, and how long a wait, we're talking about?
Use of ScanBamParam(what=...) can help by restricting input to just the
fields you need.
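For instance, to read only three fields instead of everything scanBamWhat()
lists (a minimal sketch; 'aligned.bam' is a placeholder file name):

    library(Rsamtools)

    ## Read only three fields; much less memory and I/O than the default
    param <- ScanBamParam(what=c("rname", "pos", "cigar"))
    res <- scanBam("aligned.bam", param=param)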
For some tasks I'd choose a coarser granularity, e.g., parallelizing across
multiple BAM files, so that the data reduction (to a vector of tens of
thousands of counts) occurs on each core:
library(Rsamtools)        # BAM input
library(GenomicRanges)    # readGappedAlignments, countOverlaps
library(multicore)        # mclapply

counter <- function(fl, genes) {
    aln <- readGappedAlignments(fl)       # load one BAM's alignments
    strand(aln) <- "*"                    # count without regard to strand
    hits <- countOverlaps(aln, genes)     # genes hit by each alignment
    countOverlaps(genes, aln[hits == 1])  # keep only uniquely-hitting reads
}

simplify2array(mclapply(bamFiles, counter, genes))
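That said, if you do want scanBam itself running per-chromosome in parallel
(your original question), something along these lines should work against an
indexed BAM. This is a sketch: 'aligned.bam' is again a placeholder (its
index 'aligned.bam.bai' must exist), and the merge in step (2) is shown for a
single field only:

    library(Rsamtools)
    library(GenomicRanges)
    library(multicore)

    fl <- "aligned.bam"
    targets <- scanBamHeader(fl)[[1]][["targets"]]   # named chromosome lengths
    which <- GRanges(names(targets), IRanges(1, targets))

    chunks <- mclapply(seq_along(which), function(i) {
        p <- ScanBamParam(what=c("rname", "pos", "cigar"), which=which[i])
        scanBam(fl, param=p)[[1]]                    # fields, one chromosome
    })

    ## the merge step, e.g., for the 'pos' field:
    pos <- do.call(c, lapply(chunks, "[[", "pos"))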
One issue I understand people have is that mclapply uses serialize() to
convert the return value of each function call to a raw vector. Raw vectors
have the same total length limit as any other R vector (2^31 - 1 elements),
so this places a limit on the size of the chunk returned by each core. I
also believe that exceeding the limit can silently corrupt the data (i.e., a
bug), though this is second-hand information.
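For a sense of scale (a quick check in R) of the limit, and of why the
reduction to counts above keeps each core's return well under it:

    .Machine$integer.max                           # 2147483647 == 2^31 - 1
    print(object.size(integer(25000)), units="Kb") # a counts vector: ~100 Kb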
Martin
>
> Thank you,
>
> Ivan
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793