[Bioc-sig-seq] large BAM files and large BED files

Mon Sep 19 21:10:31 CEST 2011

On 09/19/2011 11:55 AM, Michael Lawrence wrote:
>
>
> On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>     On 09/19/2011 11:26 AM, Rene Paradis wrote:
>
>         Thanks Martin and Michael for your constructive advices,
>
>         I used the ScanBamParam object to successfully load a part of
>         the Chr1
>         from a Bam file via ScanBam. Honestly I do not know what are the
>         differences between readGappedAlignments, readBamGappedAlignment and
>         ScanBam. The last two of them can take a  ScanBamParam object.
>
>
>     scanBam returns a list-of-lists, it's the most flexible but least
>     'user-friendly'.
>
>     readGappedAlignments is meant to be a 'front end' to read
>     GappedAlignments from several different sources, and
>     readBamGappedAlignments is meant to be one of those sources; usually
>     the 'user' would readGappedAlignments.
>
>
>         But I wished I could select the seqname in GRanges to retrieve
>         all the
>         chr1 (as an example) data from the Bam file. It seems I must
>         select a
>         range. So I put a value that goes beyond the range of the chr1
>         because I
>         do not know that range, and I got an<<INTEGER () can only be
>         applied to
>         a 'integer', not a special>>.
>
>
> Couldn't Rsamtools give something more informative?

The info in the original post isn't enough to understand how the error 
occurs.

>
>         There must be something I missed that
>         could help me doing that.
>
>
>     see ?scanBamHeader, e.g.,
>
>      >  fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>      > scanBamHeader(fl)[[1]]$targets
>     seq1 seq2
>     1575 1584
>
> Would be nice to have a method for getting a Seqinfo out of a BAM
> header. Then one can just coerce that to a GRanges. rtracklayer does the
> equivalent for BigWig.

In devel,

 > seqinfo(open(BamFile(fl)))
Seqinfo of length 2
seqnames seqlengths isCircular
seq1           1575         NA
seq2           1584         NA

I think this needs to be updated to deal with recent changes to Seqinfo 
to store the reference genome (which is sometimes also present in the 
BAM file).

Maritn
>
> Michael
>
>     Martin
>
>
>
>         ultimately, I want to launch a PICS analysis that requires a
>         segReadsList object.
>
>         Overall I definitely progressed by your help, thank you.
>
>         Rene
>
>
>
>
>         On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
>
>             On 09/16/2011 02:11 PM, Michael Lawrence wrote:
>
>                 It sounds like you're trying to use BED as an
>                 alternative to BAM? Probably
>                 not a good idea, especially at this scale. Why are you
>                 aiming for a
>                 GenomeData? A GappedAlignments might be more
>                 appropriate. See
>                 GenomicRanges::__readGappedAlignments() for bringing a
>                 BAM into a
>                 GappedAlignments.
>
>
>             Hi Rene
>
>             the 'which' argument to readGappedAlignments (it'll become
>             'param' with
>             the next release, and be a ScanBamParam object) allows you
>             to select
>             regions to process, e.g., chromosome-at-a-time, to help with
>             file size.
>
>             Martin
>
>
>                 This page might help:
>                 http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
>                 <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
>
>                 But it could really be improved.
>
>                 Michael
>
>                 On Fri, Sep 16, 2011 at 1:44 PM, Rene
>                 Paradis<rene.paradis at genome.__ulaval.ca
>                 <mailto:rene.paradis at genome.ulaval.ca>
>
>                     wrote:
>
>
>                     Hello,
>
>                     I am experiencing a problem regarding the load in
>                     memory of bed files of
>                     30 GB. my function read.table unleash the error :
>                     Error in unique(x) :
>                     length xxxxxx is too large for hashing.
>
>                     this is generated by the function MKsetup of the
>                     unique.c file. Even by
>                     increasing by 10 000x the value, the error persists.
>                     I believe the
>                     function pushes more data in ram, but I am not sure
>                     this is the good way
>                     to focus on.
>
>                     Ultimately, I would like to produce a GenomeData
>                     object from either a
>                     BAM file or a bed file.
>
>                     has someone ever worked with very very big BAM files
>                     (about 30 GB)
>
>                     thanks
>
>                     Rene paradis
>
>                     _________________________________________________
>                     Bioc-sig-sequencing mailing list
>                     Bioc-sig-sequencing at r-project.__org
>                     <mailto:Bioc-sig-sequencing at r-project.org>
>                     https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>                     <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>
>
>                         [[alternative HTML version deleted]]
>
>                 _________________________________________________
>                 Bioc-sig-sequencing mailing list
>                 Bioc-sig-sequencing at r-project.__org
>                 <mailto:Bioc-sig-sequencing at r-project.org>
>                 https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>                 <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>
>
>
>
>
>
>
>     --
>     Computational Biology
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
>     Location: M1-B861
>     Telephone: 206 667-2793 <tel:206%20667-2793>
>
>     _________________________________________________
>     Bioc-sig-sequencing mailing list
>     Bioc-sig-sequencing at r-project.__org
>     <mailto:Bioc-sig-sequencing at r-project.org>
>     https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>     <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>
>

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793