[Bioc-sig-seq] Loading large BAM files

Ivan Gregoretti ivangreg at gmail.com
Thu Jul 14 17:47:29 CEST 2011


Hello Martin,

I am currently reading in data with scanBam and making use of the
ScanBamParam function.

My BAM data looks like this:

> names(bam1)
[1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783"

> class(bam1)
[1] "list"

> class(bam1$"chr1:1-197195432")
[1] "list"

> class(bam1$"chr1:1-197195432"$qname)
[1] "character"

How do we consolidate all three lists (chr1, chr2, chr3) into a single list?

I read the Rsamtools documentation but the examples do not quite apply
to this case.
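
The closest I have come is something along these lines (untested),
concatenating each field of the per-chromosome lists:

bam <- do.call(Map, c(c, unname(bam1)))  # c() each field across chromosomes

but I am not sure it treats the factor fields (rname, strand) correctly.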

Thank you,

Ivan





On Wed, Jul 13, 2011 at 5:25 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hi Martin,
>
> I will have to play a bit with both the counter function and
> simplify2array; it's not evident to me what they are intended to do.
>
> I can answer your question about BAM size. Our samples are all from
> Illumina HiSeq now, so we are talking about no fewer than 100 million
> good-quality tags per lane.
>
> On different occasions I have run into the 2^31 - 1 limit, and I am
> always thinking about ways to process large amounts of information in
> chunks and in parallel. Over time, sequencers will only output more
> tags, so we will run into these problems more and more often.
>
> The suggestion of using ScanBamParam is good, thank you, but at this
> early stage of sample analysis I need to load the whole BAM. Some
> sample quality assessments are in order.
>
> Thank you,
>
> Ivan
>
>
>
>
> On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 07/13/2011 01:57 PM, Martin Morgan wrote:
>>>
>>> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> As I wait for my large BAM to be read in by scanBam, I can't help
>>>> but wonder:
>>>>
>>>> Has anybody tried combining scanBam with multicore to load the
>>>> chromosomes in parallel?
>>>>
>>>> That would require
>>>>
>>>> 1) the chunks to be merged at the end, and
>>>>
>>>> 2) the original BAM to be indexed.
>>>>
>>>> Does anybody have any experience to share?
>>>
>>> I was wondering how large and how long we're talking about?
>>>
>>> Use of ScanBamParam(what=...) can help.
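>>>
>>> For instance, to pull in just the fields you need (a sketch; "my.bam"
>>> stands in for your file):
>>>
>>> param = ScanBamParam(what=c("rname", "strand", "pos", "qwidth"))
>>> bam = scanBam("my.bam", param=param)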
>>>
>>> For some tasks I'd think of a coarser granularity, e.g., parallelizing
>>> over multiple BAM files, so that the data reduction (to a vector of
>>> tens of thousands of counts) occurs on each core:
>>>
>>> counter = function(fl, genes) {
>>>     aln = readGappedAlignments(fl)      # one whole BAM per worker
>>>     strand(aln) = "*"                   # ignore strand when counting
>>>     hits = countOverlaps(aln, genes)    # genes overlapped by each read
>>>     countOverlaps(genes, aln[hits==1])  # keep reads hitting exactly one gene
>>> }
>>> simplify2array(mclapply(bamFiles, counter, genes))
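>>>
>>> (mclapply also takes an mc.cores argument if you want to cap the number
>>> of workers, e.g., mclapply(bamFiles, counter, genes, mc.cores=4).)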
>>>
>>> One issue I understand people have is that mclapply uses 'serialize()'
>>> to convert the return value of each function to a raw vector. Raw
>>> vectors have the same total length limit as any other vector (2^31 - 1
>>> elements), and this places a limit on the size of the chunk returned by
>>> each core. I also believe that exceeding the limit can silently corrupt
>>> the data (i.e., a bug). This is second-hand information.
>>
>> I should also have mentioned that casting a GRanges object to a
>> RangesList provides the appropriate list to iterate over chromosomes,
>> and that ScanBamParam will accept a 'which' of class RangesList.
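>>
>> Roughly, for your three chromosomes (an untested sketch; 'fl' stands in
>> for your indexed BAM file, and mclapply is from multicore):
>>
>> gr = GRanges(c("chr1", "chr2", "chr3"),
>>              IRanges(1, c(197195432, 181748087, 159599783)))
>> rl = as(gr, "RangesList")   # one element per chromosome
>> res = mclapply(seq_along(rl), function(i)
>>     scanBam(fl, param=ScanBamParam(which=rl[i])))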
>>
>> Martin
>>
>>>
>>> Martin
>>>
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>>
>


