[Bioc-sig-seq] large BAM files and large BED files
Martin Morgan
mtmorgan at fhcrc.org
Mon Sep 19 21:10:31 CEST 2011
On 09/19/2011 11:55 AM, Michael Lawrence wrote:
>
>
> On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
> On 09/19/2011 11:26 AM, Rene Paradis wrote:
>
> Thanks Martin and Michael for your constructive advices,
>
> I used the ScanBamParam object to successfully load a part of
> the Chr1
> from a Bam file via ScanBam. Honestly I do not know what are the
> differences between readGappedAlignments, readBamGappedAlignment and
> ScanBam. The last two of them can take a ScanBamParam object.
>
>
> scanBam returns a list-of-lists, it's the most flexible but least
> 'user-friendly'.
>
> readGappedAlignments is meant to be a 'front end' to read
> GappedAlignments from several different sources, and
> readBamGappedAlignments is meant to be one of those sources; usually
> the 'user' would readGappedAlignments.
>
>
> But I wished I could select the seqname in GRanges to retrieve
> all the
> chr1 (as an example) data from the Bam file. It seems I must
> select a
> range. So I put a value that goes beyond the range of the chr1
> because I
> do not know that range, and I got an<<INTEGER () can only be
> applied to
> a 'integer', not a special>>.
>
>
> Couldn't Rsamtools give something more informative?
The info in the original post isn't enough to understand how the error
occurs.
>
> There must be something I missed that
> could help me doing that.
>
>
> see ?scanBamHeader, e.g.,
>
> > fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> > scanBamHeader(fl)[[1]]$targets
> seq1 seq2
> 1575 1584
>
> Would be nice to have a method for getting a Seqinfo out of a BAM
> header. Then one can just coerce that to a GRanges. rtracklayer does the
> equivalent for BigWig.
In devel,
> seqinfo(open(BamFile(fl)))
Seqinfo of length 2
seqnames seqlengths isCircular
seq1 1575 NA
seq2 1584 NA
I think this needs to be updated to deal with recent changes to Seqinfo
to store the reference genome (which is sometimes also present in the
BAM file).
Maritn
>
> Michael
>
> Martin
>
>
>
> ultimately, I want to launch a PICS analysis that requires a
> segReadsList object.
>
> Overall I definitely progressed by your help, thank you.
>
> Rene
>
>
>
>
> On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
>
> On 09/16/2011 02:11 PM, Michael Lawrence wrote:
>
> It sounds like you're trying to use BED as an
> alternative to BAM? Probably
> not a good idea, especially at this scale. Why are you
> aiming for a
> GenomeData? A GappedAlignments might be more
> appropriate. See
> GenomicRanges::__readGappedAlignments() for bringing a
> BAM into a
> GappedAlignments.
>
>
> Hi Rene
>
> the 'which' argument to readGappedAlignments (it'll become
> 'param' with
> the next release, and be a ScanBamParam object) allows you
> to select
> regions to process, e.g., chromosome-at-a-time, to help with
> file size.
>
> Martin
>
>
> This page might help:
> http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
> <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
>
> But it could really be improved.
>
> Michael
>
> On Fri, Sep 16, 2011 at 1:44 PM, Rene
> Paradis<rene.paradis at genome.__ulaval.ca
> <mailto:rene.paradis at genome.ulaval.ca>
>
> wrote:
>
>
> Hello,
>
> I am experiencing a problem regarding the load in
> memory of bed files of
> 30 GB. my function read.table unleash the error :
> Error in unique(x) :
> length xxxxxx is too large for hashing.
>
> this is generated by the function MKsetup of the
> unique.c file. Even by
> increasing by 10 000x the value, the error persists.
> I believe the
> function pushes more data in ram, but I am not sure
> this is the good way
> to focus on.
>
> Ultimately, I would like to produce a GenomeData
> object from either a
> BAM file or a bed file.
>
> has someone ever worked with very very big BAM files
> (about 30 GB)
>
> thanks
>
> Rene paradis
>
> _________________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.__org
> <mailto:Bioc-sig-sequencing at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.__org
> <mailto:Bioc-sig-sequencing at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>
>
>
>
>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793 <tel:206%20667-2793>
>
> _________________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.__org
> <mailto:Bioc-sig-sequencing at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing
mailing list