[Bioc-sig-seq] large BAM files and large BED files

Wed Sep 21 17:33:10 CEST 2011

On Tue, 2011-09-20 at 06:32 -0700, Martin Morgan wrote:
> On 09/20/2011 05:56 AM, Rene Paradis wrote:
> > On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:
> >> On 09/19/2011 11:55 AM, Michael Lawrence wrote:
> >>>
> >>>
> >>> On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan<mtmorgan at fhcrc.org
> >>> <mailto:mtmorgan at fhcrc.org>>  wrote:
> >>>
> >>>      On 09/19/2011 11:26 AM, Rene Paradis wrote:
> >>>
> >>>          Thanks Martin and Michael for your constructive advices,
> >>>
> >>>          I used the ScanBamParam object to successfully load a part of
> >>>          the Chr1
> >>>          from a Bam file via ScanBam. Honestly I do not know what are the
> >>>          differences between readGappedAlignments, readBamGappedAlignment and
> >>>          ScanBam. The last two of them can take a  ScanBamParam object.
> >>>
> >>>
> >>>      scanBam returns a list-of-lists, it's the most flexible but least
> >>>      'user-friendly'.
> >>>
> >>>      readGappedAlignments is meant to be a 'front end' to read
> >>>      GappedAlignments from several different sources, and
> >>>      readBamGappedAlignments is meant to be one of those sources; usually
> >>>      the 'user' would readGappedAlignments.
> >>>
> >>>
> >>>          But I wished I could select the seqname in GRanges to retrieve
> >>>          all the
> >>>          chr1 (as an example) data from the Bam file. It seems I must
> >>>          select a
> >>>          range. So I put a value that goes beyond the range of the chr1
> >>>          because I
> >>>          do not know that range, and I got an<<INTEGER () can only be
> >>>          applied to
> >>>          a 'integer', not a special>>.
> >>>
> >>>
> >>> Couldn't Rsamtools give something more informative?
> >>
> >> The info in the original post isn't enough to understand how the error
> >> occurs.
> >>
> > To reproduce the original error:
> > choose a 30GB Bam file. and run:
> > dataIP<- read.table(fileName,header = TRUE, colClasses -
> > c("factor","integer","factor"))
> >
> > after a while, it will trow the error.
> >
> >
> > the seqinfo above did not work for me, probably I do not have the latest
> > devel version of Seqinfo.
> >
> >>   fl
> > [1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"
> >> seqinfo(open(BamFile(fl)))
> > Error in function (classes, fdef, mtable)  :
> >    unable to find an inherited method for function "seqinfo", for
> > signature "BamFile"
> >>
> >
> >
> > otherwise, we have splitted the Big bam into chromosomes with samtools.
> > We were then able to load these smaller files.
> 
> Hi Rene --
> 
> Here's a BAM file, for example
> 
>    library(Rsamtools)
>    fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> 
> Get the information about each sequence length from the BAM file, and 
> make it a GRanges object, 1 ranges per sequence
> 
>    t <- scanBamHeader(fl)[[1]][["targets"]]
>    which <- GRanges(names(t), IRanges(1, unname(t)))
> 
> Then iterate over the object, either in a 'for' loop
> 
>    for (i in seq_along(which)) {
>        ## read one chromosome
>        param <- ScanBamParam(which=which[i], what=character())
>        aln <- readGappedAlignments(fl, param=param)
>        ## do more work...
>    }
> 
> or lapply
> 
>    lapply(seq_along(which), function(i, fl, which) {
>        param <- ScanBamParam(which=which[i], what=character())
>        aln <- readGappedAlignments(fl, param=param)
>        ## do more work
>        table(width(aln))
>    }, fl, which)
> 
> Hope that helps,
> 
> Martin
> 
Thanks Martin, your advices were really useful.

Here is what I did:

t <- scanBamHeader("example.bam")[[1]][["targets"]]
which <- GRanges(names(t)[22], IRanges(1, unname(t)[22]))
param <- ScanBamParam(which=which, what=character())
aln <- readGappedAlignments("example.bam", param=param)
Error in validObject(.Object) :
   invalid class "GappedAlignments" objects: 'ranges' contains values
outside of sequence bounds

This bam file comes from an Illumina sequencing machine and it is human
genome.

I think now this is not a problem regarding a large BAM file, but a
problem in the bam file itself.

Rene

> 
> >
> >
> >
> >>>
> >>>          There must be something I missed that
> >>>          could help me doing that.
> >>>
> >>>
> >>>      see ?scanBamHeader, e.g.,
> >>>
> >>>       >   fl<- system.file("extdata", "ex1.bam", package="Rsamtools")
> >>>       >  scanBamHeader(fl)[[1]]$targets
> >>>      seq1 seq2
> >>>      1575 1584
> >>>
> >>> Would be nice to have a method for getting a Seqinfo out of a BAM
> >>> header. Then one can just coerce that to a GRanges. rtracklayer does the
> >>> equivalent for BigWig.
> >>
> >> In devel,
> >>
> >>   >  seqinfo(open(BamFile(fl)))
> >> Seqinfo of length 2
> >> seqnames seqlengths isCircular
> >> seq1           1575         NA
> >> seq2           1584         NA
> >>
> >> I think this needs to be updated to deal with recent changes to Seqinfo
> >> to store the reference genome (which is sometimes also present in the
> >> BAM file).
> >>
> > I think this update would be useful for us. Thank you for your support
> > Martin and Michael.
> >
> > Rene
> >
> >> Maritn
> >>>
> >>> Michael
> >>>
> >>>      Martin
> >>>
> >>>
> >>>
> >>>          ultimately, I want to launch a PICS analysis that requires a
> >>>          segReadsList object.
> >>>
> >>>          Overall I definitely progressed by your help, thank you.
> >>>
> >>>          Rene
> >>>
> >>>
> >>>
> >>>
> >>>          On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
> >>>
> >>>              On 09/16/2011 02:11 PM, Michael Lawrence wrote:
> >>>
> >>>                  It sounds like you're trying to use BED as an
> >>>                  alternative to BAM? Probably
> >>>                  not a good idea, especially at this scale. Why are you
> >>>                  aiming for a
> >>>                  GenomeData? A GappedAlignments might be more
> >>>                  appropriate. See
> >>>                  GenomicRanges::__readGappedAlignments() for bringing a
> >>>                  BAM into a
> >>>                  GappedAlignments.
> >>>
> >>>
> >>>              Hi Rene
> >>>
> >>>              the 'which' argument to readGappedAlignments (it'll become
> >>>              'param' with
> >>>              the next release, and be a ScanBamParam object) allows you
> >>>              to select
> >>>              regions to process, e.g., chromosome-at-a-time, to help with
> >>>              file size.
> >>>
> >>>              Martin
> >>>
> >>>
> >>>                  This page might help:
> >>>                  http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
> >>>                  <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
> >>>
> >>>                  But it could really be improved.
> >>>
> >>>                  Michael
> >>>
> >>>                  On Fri, Sep 16, 2011 at 1:44 PM, Rene
> >>>                  Paradis<rene.paradis at genome.__ulaval.ca
> >>>                  <mailto:rene.paradis at genome.ulaval.ca>
> >>>
> >>>                      wrote:
> >>>
> >>>
> >>>                      Hello,
> >>>
> >>>                      I am experiencing a problem regarding the load in
> >>>                      memory of bed files of
> >>>                      30 GB. my function read.table unleash the error :
> >>>                      Error in unique(x) :
> >>>                      length xxxxxx is too large for hashing.
> >>>
> >>>                      this is generated by the function MKsetup of the
> >>>                      unique.c file. Even by
> >>>                      increasing by 10 000x the value, the error persists.
> >>>                      I believe the
> >>>                      function pushes more data in ram, but I am not sure
> >>>                      this is the good way
> >>>                      to focus on.
> >>>
> >>>                      Ultimately, I would like to produce a GenomeData
> >>>                      object from either a
> >>>                      BAM file or a bed file.
> >>>
> >>>                      has someone ever worked with very very big BAM files
> >>>                      (about 30 GB)
> >>>
> >>>                      thanks
> >>>
> >>>                      Rene paradis
> >>>
> >>>                      _________________________________________________
> >>>                      Bioc-sig-sequencing mailing list
> >>>                      Bioc-sig-sequencing at r-project.__org
> >>>                      <mailto:Bioc-sig-sequencing at r-project.org>
> >>>                      https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >>>                      <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >>>
> >>>
> >>>                          [[alternative HTML version deleted]]
> >>>
> >>>                  _________________________________________________
> >>>                  Bioc-sig-sequencing mailing list
> >>>                  Bioc-sig-sequencing at r-project.__org
> >>>                  <mailto:Bioc-sig-sequencing at r-project.org>
> >>>                  https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >>>                  <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>      --
> >>>      Computational Biology
> >>>      Fred Hutchinson Cancer Research Center
> >>>      1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >>>
> >>>      Location: M1-B861
> >>>      Telephone: 206 667-2793<tel:206%20667-2793>
> >>>
> >>>      _________________________________________________
> >>>      Bioc-sig-sequencing mailing list
> >>>      Bioc-sig-sequencing at r-project.__org
> >>>      <mailto:Bioc-sig-sequencing at r-project.org>
> >>>      https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >>>      <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >>>
> >>>
> >>
> >>
> >
> >
> 
>