[Bioc-sig-seq] large BAM files and large BED files
Martin Morgan
mtmorgan at fhcrc.org
Tue Sep 20 15:32:05 CEST 2011
On 09/20/2011 05:56 AM, Rene Paradis wrote:
> On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:
>> On 09/19/2011 11:55 AM, Michael Lawrence wrote:
>>>
>>>
>>> On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan<mtmorgan at fhcrc.org
>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>
>>> On 09/19/2011 11:26 AM, Rene Paradis wrote:
>>>
>>> Thanks Martin and Michael for your constructive advices,
>>>
>>> I used the ScanBamParam object to successfully load a part of
>>> the Chr1
>>> from a Bam file via ScanBam. Honestly I do not know what are the
>>> differences between readGappedAlignments, readBamGappedAlignment and
>>> ScanBam. The last two of them can take a ScanBamParam object.
>>>
>>>
>>> scanBam returns a list-of-lists, it's the most flexible but least
>>> 'user-friendly'.
>>>
>>> readGappedAlignments is meant to be a 'front end' to read
>>> GappedAlignments from several different sources, and
>>> readBamGappedAlignments is meant to be one of those sources; usually
>>> the 'user' would readGappedAlignments.
>>>
>>>
>>> But I wished I could select the seqname in GRanges to retrieve
>>> all the
>>> chr1 (as an example) data from the Bam file. It seems I must
>>> select a
>>> range. So I put a value that goes beyond the range of the chr1
>>> because I
>>> do not know that range, and I got an<<INTEGER () can only be
>>> applied to
>>> a 'integer', not a special>>.
>>>
>>>
>>> Couldn't Rsamtools give something more informative?
>>
>> The info in the original post isn't enough to understand how the error
>> occurs.
>>
> To reproduce the original error:
> choose a 30GB Bam file. and run:
> dataIP<- read.table(fileName,header = TRUE, colClasses -
> c("factor","integer","factor"))
>
> after a while, it will trow the error.
>
>
> the seqinfo above did not work for me, probably I do not have the latest
> devel version of Seqinfo.
>
>> fl
> [1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"
>> seqinfo(open(BamFile(fl)))
> Error in function (classes, fdef, mtable) :
> unable to find an inherited method for function "seqinfo", for
> signature "BamFile"
>>
>
>
> otherwise, we have splitted the Big bam into chromosomes with samtools.
> We were then able to load these smaller files.
Hi Rene --
Here's a BAM file, for example
library(Rsamtools)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
Get the information about each sequence length from the BAM file, and
make it a GRanges object, 1 ranges per sequence
t <- scanBamHeader(fl)[[1]][["targets"]]
which <- GRanges(names(t), IRanges(1, unname(t)))
Then iterate over the object, either in a 'for' loop
for (i in seq_along(which)) {
## read one chromosome
param <- ScanBamParam(which=which[i], what=character())
aln <- readGappedAlignments(fl, param=param)
## do more work...
}
or lapply
lapply(seq_along(which), function(i, fl, which) {
param <- ScanBamParam(which=which[i], what=character())
aln <- readGappedAlignments(fl, param=param)
## do more work
table(width(aln))
}, fl, which)
Hope that helps,
Martin
>
>
>
>>>
>>> There must be something I missed that
>>> could help me doing that.
>>>
>>>
>>> see ?scanBamHeader, e.g.,
>>>
>>> > fl<- system.file("extdata", "ex1.bam", package="Rsamtools")
>>> > scanBamHeader(fl)[[1]]$targets
>>> seq1 seq2
>>> 1575 1584
>>>
>>> Would be nice to have a method for getting a Seqinfo out of a BAM
>>> header. Then one can just coerce that to a GRanges. rtracklayer does the
>>> equivalent for BigWig.
>>
>> In devel,
>>
>> > seqinfo(open(BamFile(fl)))
>> Seqinfo of length 2
>> seqnames seqlengths isCircular
>> seq1 1575 NA
>> seq2 1584 NA
>>
>> I think this needs to be updated to deal with recent changes to Seqinfo
>> to store the reference genome (which is sometimes also present in the
>> BAM file).
>>
> I think this update would be useful for us. Thank you for your support
> Martin and Michael.
>
> Rene
>
>> Maritn
>>>
>>> Michael
>>>
>>> Martin
>>>
>>>
>>>
>>> ultimately, I want to launch a PICS analysis that requires a
>>> segReadsList object.
>>>
>>> Overall I definitely progressed by your help, thank you.
>>>
>>> Rene
>>>
>>>
>>>
>>>
>>> On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
>>>
>>> On 09/16/2011 02:11 PM, Michael Lawrence wrote:
>>>
>>> It sounds like you're trying to use BED as an
>>> alternative to BAM? Probably
>>> not a good idea, especially at this scale. Why are you
>>> aiming for a
>>> GenomeData? A GappedAlignments might be more
>>> appropriate. See
>>> GenomicRanges::__readGappedAlignments() for bringing a
>>> BAM into a
>>> GappedAlignments.
>>>
>>>
>>> Hi Rene
>>>
>>> the 'which' argument to readGappedAlignments (it'll become
>>> 'param' with
>>> the next release, and be a ScanBamParam object) allows you
>>> to select
>>> regions to process, e.g., chromosome-at-a-time, to help with
>>> file size.
>>>
>>> Martin
>>>
>>>
>>> This page might help:
>>> http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
>>> <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
>>>
>>> But it could really be improved.
>>>
>>> Michael
>>>
>>> On Fri, Sep 16, 2011 at 1:44 PM, Rene
>>> Paradis<rene.paradis at genome.__ulaval.ca
>>> <mailto:rene.paradis at genome.ulaval.ca>
>>>
>>> wrote:
>>>
>>>
>>> Hello,
>>>
>>> I am experiencing a problem regarding the load in
>>> memory of bed files of
>>> 30 GB. my function read.table unleash the error :
>>> Error in unique(x) :
>>> length xxxxxx is too large for hashing.
>>>
>>> this is generated by the function MKsetup of the
>>> unique.c file. Even by
>>> increasing by 10 000x the value, the error persists.
>>> I believe the
>>> function pushes more data in ram, but I am not sure
>>> this is the good way
>>> to focus on.
>>>
>>> Ultimately, I would like to produce a GenomeData
>>> object from either a
>>> BAM file or a bed file.
>>>
>>> has someone ever worked with very very big BAM files
>>> (about 30 GB)
>>>
>>> thanks
>>>
>>> Rene paradis
>>>
>>> _________________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.__org
>>> <mailto:Bioc-sig-sequencing at r-project.org>
>>> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>>> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _________________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.__org
>>> <mailto:Bioc-sig-sequencing at r-project.org>
>>> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>>> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biology
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>> Location: M1-B861
>>> Telephone: 206 667-2793<tel:206%20667-2793>
>>>
>>> _________________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.__org
>>> <mailto:Bioc-sig-sequencing at r-project.org>
>>> https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
>>> <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
>>>
>>>
>>
>>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing
mailing list