[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq
Nicolas Delhomme
delhomme at embl.de
Thu Mar 29 18:33:48 CEST 2012
Thanks Michael!
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 29 Mar 2012, at 18:28, Michael Lawrence wrote:
> In devel rtracklayer, SeqinfoForUCSCGenome("hg19") does the equivalent of Herve's code. There is also a SeqinfoForBSGenome(). It returns NULL if the genome package is not found, in which case you can fall back to UCSC. That's what rtracklayer does internally when you pass a genome argument to import.
>
> Michael
>
> On Thu, Mar 29, 2012 at 5:39 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
> Hi Steve,
>
> Yes that's true, I've been using that too, but I still encounter BAM files without an header section. Combining both your input will be a way to avoid BSgenome :-)
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On 29 Mar 2012, at 03:20, Steve Lianoglou wrote:
>
> > Hi,
> >
> > 2012/3/28 Hervé Pagès <hpages at fhcrc.org>:
> >> Hi Nico, Milica,
> >>
> >>
> >> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote:
> >>>
> >>> Hi Milica,
> >>>
> >>> I do the exact same thing in my package (easyRNASeq, still in the devel
> >>> branch of Bioc) and it definitely does not require 20 hours to read "only"
> >>> 20 million reads. Are you sure you are not getting your machine to swap?
> >>> I.e. did you monitor the memory usage?
> >>>
> >>> It would be interesting (for me, at least) if you could try my package to
> >>> get your count table. You can either retrieve the annotation from biomaRt or
> >>> provide a GFF file. See the vignette of the package for the details and
> >>> maybe these two posts on that mailing list:
> >>>
> >>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html
> >>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124.html
> >>
> >>
> >> Slightly off-topic but probably worth mentioning is that you don't
> >> need to use a BSgenome package in order to fetch the chromosome
> >> lengths of an organism. For example in the code you show in the
> >> 2 above posts, you could use makeTranscriptDbFromUCSC() (or
> >> makeTranscriptDbFromBiomart(), both from the GenomicFeatures
> >> package) to make a TranscriptDb object, and then to do seqlengths()
> >> on that object (BTW it would be nice if this worked with
> >> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages
> >> for hg19 or mm9 already installed, that should be faster than
> >> installing them.
> >
> > Also, the BAM file you are using as the source of your reads has the
> > seqlength info in it, eg:
> >
> > R> library(Rsamtools)
> > R> bf <- BamFile('/path/to/your.bam')
> > R> seqlengths(bf) ## hg19
> > chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8
> > 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022
> > chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16
> > 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753
> > chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
> > 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566
> > chrM
> > 16571
> >
> > Perhaps that's a helpful piece of trivia to know, too ...
> >
> > -steve
> >
> >
> >>
> >> Or, even faster:
> >>
> >>> library(rtracklayer)
> >>> session <- browserSession()
> >>> genome(session) <- "mm9"
> >>> seqlengths(session)
> >> chr1 chr1_random chr2 chr3 chr3_random chr4
> >> 197195432 1231697 181748087 159599783 41899 155630120
> >> chr4_random chr5 chr5_random chr6 chr7 chr7_random
> >> 160594 152537259 357350 149517037 152524553 362490
> >> chr8 chr8_random chr9 chr9_random chr10 chr11
> >> 131738871 849593 124076172 449403 129993255 121843856
> >> chr12 chr13 chr13_random chr14 chr15 chr16
> >> 121257530 120284312 400311 125194864 103494974 98319150
> >> chr16_random chr17 chr17_random chr18 chr19 chrX
> >> 3994 95272651 628739 90772031 61342430 166650296
> >> chrX_random chrY chrY_random chrUn_random chrM
> >> 1785075 15902555 58682461 5900358 16299
> >>
> >> However, those alternative require internet access...
> >>
> >> Cheers,
> >> H.
> >>
> >>
> >>>
> >>> For addressing if from the countOverlaps / summarizedOverlaps point of
> >>> view, it would help if you could post your code and sessionInfo().
> >>>
> >>> HTH,
> >>>
> >>> Nico
> >>>
> >>> ---------------------------------------------------------------
> >>> Nicolas Delhomme
> >>>
> >>> Genome Biology Computational Support
> >>>
> >>> European Molecular Biology Laboratory
> >>>
> >>> Tel: +49 6221 387 8310
> >>> Email: nicolas.delhomme at embl.de
> >>> Meyerhofstrasse 1 - Postfach 10.2209
> >>> 69102 Heidelberg, Germany
> >>> ---------------------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 28 Mar 2012, at 13:04, Milica Krunic wrote:
> >>>
> >>>> Hello!
> >>>>
> >>>>
> >>>>
> >>>> I am working with cat RNA Seq data and after mapping I wanted to get the
> >>>> count tables. So, I tried to do it using countOverlaps and
> >>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R,
> >>>> per
> >>>> one sorted .bam file (~20*10^6 reads), no matter which previously
> >>>> mentioned
> >>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R
> >>>> methods I used GRangesList object downloaded directly in R from Biomart.
> >>>> In
> >>>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene
> >>>> width
> >>>> is about 44000 in GRangesList.
> >>>> Does anyone know why getting count tables in R takes so long compared to
> >>>> HTSeq?
> >>>>
> >>>>
> >>>> Thank you!
> >>>>
> >>>> Best,
> >>>> Milica
> >>>>
> >>>> [[alternative HTML version deleted]]
> >>>>
> >>>> _______________________________________________
> >>>> Bioconductor mailing list
> >>>> Bioconductor at r-project.org
> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> Search the archives:
> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Program in Computational Biology
> >> Division of Public Health Sciences
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514
> >> P.O. Box 19024
> >> Seattle, WA 98109-1024
> >>
> >> E-mail: hpages at fhcrc.org
> >> Phone: (206) 667-5791
> >> Fax: (206) 667-1319
> >>
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
> >
> > --
> > Steve Lianoglou
> > Graduate Student: Computational Systems Biology
> > | Memorial Sloan-Kettering Cancer Center
> > | Weill Medical College of Cornell University
> > Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list