[Bioc-sig-seq] large BAM files and large BED files
Rene Paradis
rene.paradis at genome.ulaval.ca
Wed Sep 21 17:33:10 CEST 2011
On Tue, 2011-09-20 at 06:32 -0700, Martin Morgan wrote:
> On 09/20/2011 05:56 AM, Rene Paradis wrote:
> > On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:
> >> On 09/19/2011 11:55 AM, Michael Lawrence wrote:
> >>>
> >>>
> >>> On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan<mtmorgan at fhcrc.org
> >>> <mailto:mtmorgan at fhcrc.org>> wrote:
> >>>
> >>> On 09/19/2011 11:26 AM, Rene Paradis wrote:
> >>>
> >>> Thanks Martin and Michael for your constructive advices,
> >>>
> >>> I used the ScanBamParam object to successfully load a part of
> >>> the Chr1
> >>> from a Bam file via ScanBam. Honestly I do not know what are the
> >>> differences between readGappedAlignments, readBamGappedAlignment and
> >>> ScanBam. The last two of them can take a ScanBamParam object.
> >>>
> >>>
> >>> scanBam returns a list-of-lists, it's the most flexible but least
> >>> 'user-friendly'.
> >>>
> >>> readGappedAlignments is meant to be a 'front end' to read
> >>> GappedAlignments from several different sources, and
> >>> readBamGappedAlignments is meant to be one of those sources; usually
> >>> the 'user' would readGappedAlignments.
> >>>
> >>>
> >>> But I wished I could select the seqname in GRanges to retrieve
> >>> all the
> >>> chr1 (as an example) data from the Bam file. It seems I must
> >>> select a
> >>> range. So I put a value that goes beyond the range of the chr1
> >>> because I
> >>> do not know that range, and I got an<<INTEGER () can only be
> >>> applied to
> >>> a 'integer', not a special>>.
> >>>
> >>>
> >>> Couldn't Rsamtools give something more informative?
> >>
> >> The info in the original post isn't enough to understand how the error
> >> occurs.
> >>
> > To reproduce the original error:
> > choose a 30GB Bam file. and run:
> > dataIP<- read.table(fileName,header = TRUE, colClasses -
> > c("factor","integer","factor"))
> >
> > after a while, it will trow the error.
> >
> >
> > the seqinfo above did not work for me, probably I do not have the latest
> > devel version of Seqinfo.
> >
> >> fl
> > [1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"
> >> seqinfo(open(BamFile(fl)))
> > Error in function (classes, fdef, mtable) :
> > unable to find an inherited method for function "seqinfo", for
> > signature "BamFile"
> >>
> >
> >
> > otherwise, we have splitted the Big bam into chromosomes with samtools.
> > We were then able to load these smaller files.
>
> Hi Rene --
>
> Here's a BAM file, for example
>
> library(Rsamtools)
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>
> Get the information about each sequence length from the BAM file, and
> make it a GRanges object, 1 ranges per sequence
>
> t <- scanBamHeader(fl)[[1]][["targets"]]
> which <- GRanges(names(t), IRanges(1, unname(t)))
>
> Then iterate over the object, either in a 'for' loop
>
> for (i in seq_along(which)) {
> ## read one chromosome
> param <- ScanBamParam(which=which[i], what=character())
> aln <- readGappedAlignments(fl, param=param)
> ## do more work...
> }
>
> or lapply
>
> lapply(seq_along(which), function(i, fl, which) {
> param <- ScanBamParam(which=which[i], what=character())
> aln <- readGappedAlignments(fl, param=param)
> ## do more work
> table(width(aln))
> }, fl, which)
>
> Hope that helps,
>
> Martin
>
Thanks Martin, your advices were really useful.
Here is what I did:
t <- scanBamHeader("example.bam")[[1]][["targets"]]
which <- GRanges(names(t)[22], IRanges(1, unname(t)[22]))
param <- ScanBamParam(which=which, what=character())
aln <- readGappedAlignments("example.bam", param=param)
Error in validObject(.Object) :
invalid class "GappedAlignments" objects: 'ranges' contains values
outside of sequence bounds
This bam file comes from an Illumina sequencing machine and it is human
genome.
I think now this is not a problem regarding a large BAM file, but a
problem in the bam file itself.
Rene
>
> >
> >
> >
> >>>
> >>> There must be something I missed that
> >>> could help me doing that.
> >>>
> >>>
> >>> see ?scanBamHeader, e.g.,
> >>>
> >>> > fl<- system.file("extdata", "ex1.bam", package="Rsamtools")
> >>> > scanBamHeader(fl)[[1]]$targets
> >>> seq1 seq2
> >>> 1575 1584
> >>>
> >>> Would be nice to have a method for getting a Seqinfo out of a BAM
> >>> header. Then one can just coerce that to a GRanges. rtracklayer does the
> >>> equivalent for BigWig.
> >>
> >> In devel,
> >>
> >> > seqinfo(open(BamFile(fl)))
> >> Seqinfo of length 2
> >> seqnames seqlengths isCircular
> >> seq1 1575 NA
> >> seq2 1584 NA
> >>
> >> I think this needs to be updated to deal with recent changes to Seqinfo
> >> to store the reference genome (which is sometimes also present in the
> >> BAM file).
> >>
> > I think this update would be useful for us. Thank you for your support
> > Martin and Michael.
> >
> > Rene
> >
> >> Maritn
> >>>
> >>> Michael
> >>>
> >>> Martin
> >>>
> >>>
> >>>
> >>> ultimately, I want to launch a PICS analysis that requires a
> >>> segReadsList object.
> >>>
> >>> Overall I definitely progressed by your help, thank you.
> >>>
> >>> Rene
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
> >>>
> >>> On 09/16/2011 02:11 PM, Michael Lawrence wrote:
> >>>
> >>> It sounds like you're trying to use BED as an
> >>> alternative to BAM? Probably
> >>> not a good idea, especially at this scale. Why are you
> >>> aiming for a
> >>> GenomeData? A GappedAlignments might be more
> >>> appropriate. See
> >>> GenomicRanges::__readGappedAlignments() for bringing a
> >>> BAM into a
> >>> GappedAlignments.
> >>>
> >>>
> >>> Hi Rene
> >>>
> >>> the 'which' argument to readGappedAlignments (it'll become
> >>> 'param' with
> >>> the next release, and be a ScanBamParam object) allows you
> >>> to select
> >>> regions to process, e.g., chromosome-at-a-time, to help with
> >>> file size.
> >>>
> >>> Martin
> >>>
> >>>
> >>> This page might help:
> >>> http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
> >>> <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
> >>>
> >>> But it could really be improved.
> >>>
> >>> Michael
> >>>
> >>> On Fri, Sep 16, 2011 at 1:44 PM, Rene
> >>> Paradis<rene.paradis at genome.__ulaval.ca
> >>> <mailto:rene.paradis at genome.ulaval.ca>
> >>>
> >>> wrote:
> >>>
> >>>
> >>> Hello,
> >>>
> >>> I am experiencing a problem regarding the load in
> >>> memory of bed files of
> >>> 30 GB. my function read.table unleash the error :
> >>> Error in unique(x) :
> >>> length xxxxxx is too large for hashing.
> >>>
> >>> this is generated by the function MKsetup of the
> >>> unique.c file. Even by
> >>> increasing by 10 000x the value, the error persists.
> >>> I believe the
> >>> function pushes more data in ram, but I am not sure
> >>> this is the good way
> >>> to focus on.
> >>>
> >>> Ultimately, I would like to produce a GenomeData
> >>> object from either a
> >>> BAM file or a bed file.
> >>>
> >>> has someone ever worked with very very big BAM files
> >>> (about 30 GB)
> >>>
> >>> thanks
> >>>
> >>> Rene paradis
> >>>
> >>> _________________________________________________
> >>> Bioc-sig-sequencing mailing list
> >>> Bioc-sig-sequencing at r-project.__org
> >>> <mailto:Bioc-sig-sequencing at r-project.org>
> >>> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >>> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >>>
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> _________________________________________________
> >>> Bioc-sig-sequencing mailing list
> >>> Bioc-sig-sequencing at r-project.__org
> >>> <mailto:Bioc-sig-sequencing at r-project.org>
> >>> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >>> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Computational Biology
> >>> Fred Hutchinson Cancer Research Center
> >>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >>>
> >>> Location: M1-B861
> >>> Telephone: 206 667-2793<tel:206%20667-2793>
> >>>
> >>> _________________________________________________
> >>> Bioc-sig-sequencing mailing list
> >>> Bioc-sig-sequencing at r-project.__org
> >>> <mailto:Bioc-sig-sequencing at r-project.org>
> >>> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >>> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >>>
> >>>
> >>
> >>
> >
> >
>
>
More information about the Bioc-sig-sequencing
mailing list