[Bioc-sig-seq] Loading large BAM files

Ivan Gregoretti ivangreg at gmail.com
Wed Jul 13 23:25:51 CEST 2011


Hi Martin,

I will have to play a little bit both with the counter function and
with simplify2array. It's not evident for me what they are intended to
do.

I can answer your question about BAM size. Our samples are all from
Illumina HiSeq now. So, we are talking about no less than 100 million
good quality tags per lane.

In different occasions I have run into the problem of the 2^31 -1
limit and I am always thinking about ways to process large amounts of
information by chunks and in parallel. With time, sequencers will only
output more tags so we will will run into this problems only more
often.

The suggestion of using ScanBamParam is good, thank you, but at this
early stage of sample analysis I need to load the whole BAM. Some
sample quality assessments are in order.

Thank you,

Ivan




On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 07/13/2011 01:57 PM, Martin Morgan wrote:
>>
>> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
>>>
>>> Hi everybody,
>>>
>>> As I wait for my large BAM to be read in by scanBAM, I can't help but
>>> to wonder:
>>>
>>> Has anybody tried combining scanBam with multicore to load the
>>> chromosomes in parallel?
>>>
>>> That would require
>>>
>>> 1) to merge the chunks at the end and
>>>
>>> 2) the original BAM to be indexed.
>>>
>>> Does anybody have any experience to share?
>>
>> Was wondering how large and long we're talking about?
>>
>> Use of ScanBamParam(what=...) can help.
>>
>> For some tasks I'd think of a coarser granularity, e.g., in the context
>> of multiple bam files so that the data reduction (to a vector of
>> 10,000's of counts) occurs on each core.
>>
>> counter = function(fl, genes) {
>> aln = readGappedAlignments(fl)
>> strand(aln) = "*"
>> hits = countOverlaps(aln, genes)
>> countOverlaps(genes, aln[hits==1])
>> }
>> simplify2array(mclapply(bamFiles, counter, genes))
>>
>> One issue I understand people have is that mclapply uses 'serialize()'
>> to convert the return value of each function to a raw vector. raw
>> vectors have the same total length limit as any other vector (2^31 -1
>> elements) and this places a limit on the size of chunk returned by each
>> core. Also I believe that exceeding the limit can silently corrupt the
>> data (i.e., a bug). This is second-hand information.
>
> Should also have mentioned that casting a GRanges object to RangesList
> provides the appropriate list to iterate over chromosomes, and that
> ScanBamParam will accept a which of class RangesList.
>
> Martin
>
>>
>> Martin
>>
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>



More information about the Bioc-sig-sequencing mailing list