[Bioc-sig-seq] large BAM files and large BED files

Martin Morgan mtmorgan at fhcrc.org
Tue Sep 20 15:32:05 CEST 2011


On 09/20/2011 05:56 AM, Rene Paradis wrote:
> On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:
>> On 09/19/2011 11:55 AM, Michael Lawrence wrote:
>>>
>>>
>>> On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan<mtmorgan at fhcrc.org
>>> <mailto:mtmorgan at fhcrc.org>>  wrote:
>>>
>>>      On 09/19/2011 11:26 AM, Rene Paradis wrote:
>>>
>>>          Thanks Martin and Michael for your constructive advices,
>>>
>>>          I used the ScanBamParam object to successfully load a part of
>>>          the Chr1
>>>          from a Bam file via ScanBam. Honestly I do not know what are the
>>>          differences between readGappedAlignments, readBamGappedAlignment and
>>>          ScanBam. The last two of them can take a  ScanBamParam object.
>>>
>>>
>>>      scanBam returns a list-of-lists, it's the most flexible but least
>>>      'user-friendly'.
>>>
>>>      readGappedAlignments is meant to be a 'front end' to read
>>>      GappedAlignments from several different sources, and
>>>      readBamGappedAlignments is meant to be one of those sources; usually
>>>      the 'user' would readGappedAlignments.
>>>
>>>
>>>          But I wished I could select the seqname in GRanges to retrieve
>>>          all the
>>>          chr1 (as an example) data from the Bam file. It seems I must
>>>          select a
>>>          range. So I put a value that goes beyond the range of the chr1
>>>          because I
>>>          do not know that range, and I got an<<INTEGER () can only be
>>>          applied to
>>>          a 'integer', not a special>>.
>>>
>>>
>>> Couldn't Rsamtools give something more informative?
>>
>> The info in the original post isn't enough to understand how the error
>> occurs.
>>
> To reproduce the original error:
> choose a 30GB Bam file. and run:
> dataIP<- read.table(fileName,header = TRUE, colClasses -
> c("factor","integer","factor"))
>
> after a while, it will trow the error.
>
>
> the seqinfo above did not work for me, probably I do not have the latest
> devel version of Seqinfo.
>
>>   fl
> [1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"
>> seqinfo(open(BamFile(fl)))
> Error in function (classes, fdef, mtable)  :
>    unable to find an inherited method for function "seqinfo", for
> signature "BamFile"
>>
>
>
> otherwise, we have splitted the Big bam into chromosomes with samtools.
> We were then able to load these smaller files.

Hi Rene --

Here's a BAM file, for example

   library(Rsamtools)
   fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

Get the information about each sequence length from the BAM file, and 
make it a GRanges object, 1 ranges per sequence

   t <- scanBamHeader(fl)[[1]][["targets"]]
   which <- GRanges(names(t), IRanges(1, unname(t)))

Then iterate over the object, either in a 'for' loop

   for (i in seq_along(which)) {
       ## read one chromosome
       param <- ScanBamParam(which=which[i], what=character())
       aln <- readGappedAlignments(fl, param=param)
       ## do more work...
   }

or lapply

   lapply(seq_along(which), function(i, fl, which) {
       param <- ScanBamParam(which=which[i], what=character())
       aln <- readGappedAlignments(fl, param=param)
       ## do more work
       table(width(aln))
   }, fl, which)

Hope that helps,

Martin


>
>
>
>>>
>>>          There must be something I missed that
>>>          could help me doing that.
>>>
>>>
>>>      see ?scanBamHeader, e.g.,
>>>
>>>       >   fl<- system.file("extdata", "ex1.bam", package="Rsamtools")
>>>       >  scanBamHeader(fl)[[1]]$targets
>>>      seq1 seq2
>>>      1575 1584
>>>
>>> Would be nice to have a method for getting a Seqinfo out of a BAM
>>> header. Then one can just coerce that to a GRanges. rtracklayer does the
>>> equivalent for BigWig.
>>
>> In devel,
>>
>>   >  seqinfo(open(BamFile(fl)))
>> Seqinfo of length 2
>> seqnames seqlengths isCircular
>> seq1           1575         NA
>> seq2           1584         NA
>>
>> I think this needs to be updated to deal with recent changes to Seqinfo
>> to store the reference genome (which is sometimes also present in the
>> BAM file).
>>
> I think this update would be useful for us. Thank you for your support
> Martin and Michael.
>
> Rene
>
>> Maritn
>>>
>>> Michael
>>>
>>>      Martin
>>>
>>>
>>>
>>>          ultimately, I want to launch a PICS analysis that requires a
>>>          segReadsList object.
>>>
>>>          Overall I definitely progressed by your help, thank you.
>>>
>>>          Rene
>>>
>>>
>>>
>>>
>>>          On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
>>>
>>>              On 09/16/2011 02:11 PM, Michael Lawrence wrote:
>>>
>>>                  It sounds like you're trying to use BED as an
>>>                  alternative to BAM? Probably
>>>                  not a good idea, especially at this scale. Why are you
>>>                  aiming for a
>>>                  GenomeData? A GappedAlignments might be more
>>>                  appropriate. See
>>>                  GenomicRanges::__readGappedAlignments() for bringing a
>>>                  BAM into a
>>>                  GappedAlignments.
>>>
>>>
>>>              Hi Rene
>>>
>>>              the 'which' argument to readGappedAlignments (it'll become
>>>              'param' with
>>>              the next release, and be a ScanBamParam object) allows you
>>>              to select
>>>              regions to process, e.g., chromosome-at-a-time, to help with
>>>              file size.
>>>
>>>              Martin
>>>
>>>
>>>                  This page might help:
>>>                  http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
>>>                  <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
>>>
>>>                  But it could really be improved.
>>>
>>>                  Michael
>>>
>>>                  On Fri, Sep 16, 2011 at 1:44 PM, Rene
>>>                  Paradis<rene.paradis at genome.__ulaval.ca
>>>                  <mailto:rene.paradis at genome.ulaval.ca>
>>>
>>>                      wrote:
>>>
>>>
>>>                      Hello,
>>>
>>>                      I am experiencing a problem regarding the load in
>>>                      memory of bed files of
>>>                      30 GB. my function read.table unleash the error :
>>>                      Error in unique(x) :
>>>                      length xxxxxx is too large for hashing.
>>>
>>>                      this is generated by the function MKsetup of the
>>>                      unique.c file. Even by
>>>                      increasing by 10 000x the value, the error persists.
>>>                      I believe the
>>>                      function pushes more data in ram, but I am not sure
>>>                      this is the good way
>>>                      to focus on.
>>>
>>>                      Ultimately, I would like to produce a GenomeData
>>>                      object from either a
>>>                      BAM file or a bed file.
>>>
>>>                      has someone ever worked with very very big BAM files
>>>                      (about 30 GB)
>>>
>>>                      thanks
>>>
>>>                      Rene paradis
>>>
>>>                      _________________________________________________
>>>                      Bioc-sig-sequencing mailing list
>>>                      Bioc-sig-sequencing at r-project.__org
>>>                      <mailto:Bioc-sig-sequencing at r-project.org>
>>>                      https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>>>                      <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>>>
>>>
>>>                          [[alternative HTML version deleted]]
>>>
>>>                  _________________________________________________
>>>                  Bioc-sig-sequencing mailing list
>>>                  Bioc-sig-sequencing at r-project.__org
>>>                  <mailto:Bioc-sig-sequencing at r-project.org>
>>>                  https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>>>                  <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>      --
>>>      Computational Biology
>>>      Fred Hutchinson Cancer Research Center
>>>      1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>>      Location: M1-B861
>>>      Telephone: 206 667-2793<tel:206%20667-2793>
>>>
>>>      _________________________________________________
>>>      Bioc-sig-sequencing mailing list
>>>      Bioc-sig-sequencing at r-project.__org
>>>      <mailto:Bioc-sig-sequencing at r-project.org>
>>>      https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>>>      <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>>>
>>>
>>
>>
>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-sig-sequencing mailing list