[Bioc-sig-seq] large BAM files and large BED files
Rene Paradis
rene.paradis at genome.ulaval.ca
Tue Sep 20 14:56:29 CEST 2011
On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:
> On 09/19/2011 11:55 AM, Michael Lawrence wrote:
> >
> >
> > On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan <mtmorgan at fhcrc.org
> > <mailto:mtmorgan at fhcrc.org>> wrote:
> >
> > On 09/19/2011 11:26 AM, Rene Paradis wrote:
> >
> > Thanks Martin and Michael for your constructive advices,
> >
> > I used the ScanBamParam object to successfully load a part of
> > the Chr1
> > from a Bam file via ScanBam. Honestly I do not know what are the
> > differences between readGappedAlignments, readBamGappedAlignment and
> > ScanBam. The last two of them can take a ScanBamParam object.
> >
> >
> > scanBam returns a list-of-lists, it's the most flexible but least
> > 'user-friendly'.
> >
> > readGappedAlignments is meant to be a 'front end' to read
> > GappedAlignments from several different sources, and
> > readBamGappedAlignments is meant to be one of those sources; usually
> > the 'user' would readGappedAlignments.
> >
> >
> > But I wished I could select the seqname in GRanges to retrieve
> > all the
> > chr1 (as an example) data from the Bam file. It seems I must
> > select a
> > range. So I put a value that goes beyond the range of the chr1
> > because I
> > do not know that range, and I got an<<INTEGER () can only be
> > applied to
> > a 'integer', not a special>>.
> >
> >
> > Couldn't Rsamtools give something more informative?
>
> The info in the original post isn't enough to understand how the error
> occurs.
>
To reproduce the original error:
choose a 30GB Bam file. and run:
dataIP <- read.table(fileName,header = TRUE, colClasses -
c("factor","integer","factor"))
after a while, it will trow the error.
the seqinfo above did not work for me, probably I do not have the latest
devel version of Seqinfo.
> fl
[1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"
> seqinfo(open(BamFile(fl)))
Error in function (classes, fdef, mtable) :
unable to find an inherited method for function "seqinfo", for
signature "BamFile"
>
otherwise, we have splitted the Big bam into chromosomes with samtools.
We were then able to load these smaller files.
> >
> > There must be something I missed that
> > could help me doing that.
> >
> >
> > see ?scanBamHeader, e.g.,
> >
> > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> > > scanBamHeader(fl)[[1]]$targets
> > seq1 seq2
> > 1575 1584
> >
> > Would be nice to have a method for getting a Seqinfo out of a BAM
> > header. Then one can just coerce that to a GRanges. rtracklayer does the
> > equivalent for BigWig.
>
> In devel,
>
> > seqinfo(open(BamFile(fl)))
> Seqinfo of length 2
> seqnames seqlengths isCircular
> seq1 1575 NA
> seq2 1584 NA
>
> I think this needs to be updated to deal with recent changes to Seqinfo
> to store the reference genome (which is sometimes also present in the
> BAM file).
>
I think this update would be useful for us. Thank you for your support
Martin and Michael.
Rene
> Maritn
> >
> > Michael
> >
> > Martin
> >
> >
> >
> > ultimately, I want to launch a PICS analysis that requires a
> > segReadsList object.
> >
> > Overall I definitely progressed by your help, thank you.
> >
> > Rene
> >
> >
> >
> >
> > On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
> >
> > On 09/16/2011 02:11 PM, Michael Lawrence wrote:
> >
> > It sounds like you're trying to use BED as an
> > alternative to BAM? Probably
> > not a good idea, especially at this scale. Why are you
> > aiming for a
> > GenomeData? A GappedAlignments might be more
> > appropriate. See
> > GenomicRanges::__readGappedAlignments() for bringing a
> > BAM into a
> > GappedAlignments.
> >
> >
> > Hi Rene
> >
> > the 'which' argument to readGappedAlignments (it'll become
> > 'param' with
> > the next release, and be a ScanBamParam object) allows you
> > to select
> > regions to process, e.g., chromosome-at-a-time, to help with
> > file size.
> >
> > Martin
> >
> >
> > This page might help:
> > http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
> > <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
> >
> > But it could really be improved.
> >
> > Michael
> >
> > On Fri, Sep 16, 2011 at 1:44 PM, Rene
> > Paradis<rene.paradis at genome.__ulaval.ca
> > <mailto:rene.paradis at genome.ulaval.ca>
> >
> > wrote:
> >
> >
> > Hello,
> >
> > I am experiencing a problem regarding the load in
> > memory of bed files of
> > 30 GB. my function read.table unleash the error :
> > Error in unique(x) :
> > length xxxxxx is too large for hashing.
> >
> > this is generated by the function MKsetup of the
> > unique.c file. Even by
> > increasing by 10 000x the value, the error persists.
> > I believe the
> > function pushes more data in ram, but I am not sure
> > this is the good way
> > to focus on.
> >
> > Ultimately, I would like to produce a GenomeData
> > object from either a
> > BAM file or a bed file.
> >
> > has someone ever worked with very very big BAM files
> > (about 30 GB)
> >
> > thanks
> >
> > Rene paradis
> >
> > _________________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.__org
> > <mailto:Bioc-sig-sequencing at r-project.org>
> > https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> > <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >
> >
> > [[alternative HTML version deleted]]
> >
> > _________________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.__org
> > <mailto:Bioc-sig-sequencing at r-project.org>
> > https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> > <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >
> >
> >
> >
> >
> >
> >
> > --
> > Computational Biology
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >
> > Location: M1-B861
> > Telephone: 206 667-2793 <tel:206%20667-2793>
> >
> > _________________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.__org
> > <mailto:Bioc-sig-sequencing at r-project.org>
> > https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> > <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >
> >
>
>
More information about the Bioc-sig-sequencing
mailing list