[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq

Thu Mar 29 18:33:48 CEST 2012

Thanks Michael!

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 29 Mar 2012, at 18:28, Michael Lawrence wrote:

> In devel rtracklayer, SeqinfoForUCSCGenome("hg19") does the equivalent of Herve's code. There is also a SeqinfoForBSGenome(). It returns NULL if the genome package is not found, in which case you can fall back to UCSC. That's what rtracklayer does internally when you pass a genome argument to import.
> 
> Michael
> 
> On Thu, Mar 29, 2012 at 5:39 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
> Hi Steve,
> 
> Yes that's true, I've been using that too, but I still encounter BAM files without an header section. Combining both your input will be a way to avoid BSgenome :-)
> 
> Cheers,
> 
> Nico
> 
> ---------------------------------------------------------------
> Nicolas Delhomme
> 
> Genome Biology Computational Support
> 
> European Molecular Biology Laboratory
> 
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
> 
> 
> 
> 
> 
> On 29 Mar 2012, at 03:20, Steve Lianoglou wrote:
> 
> > Hi,
> >
> > 2012/3/28 Hervé Pagès <hpages at fhcrc.org>:
> >> Hi Nico, Milica,
> >>
> >>
> >> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote:
> >>>
> >>> Hi Milica,
> >>>
> >>> I do the exact same thing in my package (easyRNASeq, still in the devel
> >>> branch of Bioc) and it definitely does not require 20 hours to read "only"
> >>> 20 million reads. Are you sure you are not getting your machine to swap?
> >>> I.e. did you monitor the memory usage?
> >>>
> >>> It would be interesting (for me, at least) if you could try my package to
> >>> get your count table. You can either retrieve the annotation from biomaRt or
> >>> provide a GFF file. See the vignette of the package for the details and
> >>> maybe these two posts on that mailing list:
> >>>
> >>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html
> >>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124.html
> >>
> >>
> >> Slightly off-topic but probably worth mentioning is that you don't
> >> need to use a BSgenome package in order to fetch the chromosome
> >> lengths of an organism. For example in the code you show in the
> >> 2 above posts, you could use makeTranscriptDbFromUCSC() (or
> >> makeTranscriptDbFromBiomart(), both from the GenomicFeatures
> >> package) to make a TranscriptDb object, and then to do seqlengths()
> >> on that object (BTW it would be nice if this worked with
> >> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages
> >> for hg19 or mm9 already installed, that should be faster than
> >> installing them.
> >
> > Also, the BAM file you are using as the source of your reads has the
> > seqlength info in it, eg:
> >
> > R> library(Rsamtools)
> > R> bf <- BamFile('/path/to/your.bam')
> > R> seqlengths(bf)   ## hg19
> >     chr1      chr2      chr3      chr4      chr5      chr6      chr7      chr8
> > 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022
> >     chr9     chr10     chr11     chr12     chr13     chr14     chr15     chr16
> > 141213431 135534747 135006516 133851895 115169878 107349540 102531392  90354753
> >    chr17     chr18     chr19     chr20     chr21     chr22      chrX      chrY
> > 81195210  78077248  59128983  63025520  48129895  51304566 155270560  59373566
> >     chrM
> >    16571
> >
> > Perhaps that's a helpful piece of trivia to know, too ...
> >
> > -steve
> >
> >
> >>
> >> Or, even faster:
> >>
> >>> library(rtracklayer)
> >>> session <- browserSession()
> >>> genome(session) <- "mm9"
> >>> seqlengths(session)
> >>        chr1  chr1_random         chr2         chr3  chr3_random   chr4
> >>   197195432      1231697    181748087    159599783        41899 155630120
> >>  chr4_random         chr5  chr5_random         chr6         chr7 chr7_random
> >>      160594    152537259       357350    149517037    152524553 362490
> >>        chr8  chr8_random         chr9  chr9_random        chr10  chr11
> >>   131738871       849593    124076172       449403    129993255 121843856
> >>       chr12        chr13 chr13_random        chr14        chr15  chr16
> >>   121257530    120284312       400311    125194864    103494974 98319150
> >> chr16_random        chr17 chr17_random        chr18        chr19  chrX
> >>        3994     95272651       628739     90772031     61342430 166650296
> >>  chrX_random         chrY  chrY_random chrUn_random         chrM
> >>     1785075     15902555     58682461      5900358        16299
> >>
> >> However, those alternative require internet access...
> >>
> >> Cheers,
> >> H.
> >>
> >>
> >>>
> >>> For addressing if from the countOverlaps / summarizedOverlaps point of
> >>> view, it would help if you could post your code and sessionInfo().
> >>>
> >>> HTH,
> >>>
> >>> Nico
> >>>
> >>> ---------------------------------------------------------------
> >>> Nicolas Delhomme
> >>>
> >>> Genome Biology Computational Support
> >>>
> >>> European Molecular Biology Laboratory
> >>>
> >>> Tel: +49 6221 387 8310
> >>> Email: nicolas.delhomme at embl.de
> >>> Meyerhofstrasse 1 - Postfach 10.2209
> >>> 69102 Heidelberg, Germany
> >>> ---------------------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 28 Mar 2012, at 13:04, Milica Krunic wrote:
> >>>
> >>>> Hello!
> >>>>
> >>>>
> >>>>
> >>>> I am working with cat RNA Seq data and after mapping I wanted to get the
> >>>> count tables. So, I tried to do it using countOverlaps and
> >>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R,
> >>>> per
> >>>> one sorted .bam file (~20*10^6 reads), no matter which previously
> >>>> mentioned
> >>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R
> >>>> methods I used GRangesList object downloaded directly in R from Biomart.
> >>>> In
> >>>> HTSeq I used GTF file provided on Ensembl homepage. Average  cat gene
> >>>> width
> >>>> is about 44000 in GRangesList.
> >>>> Does anyone know why getting count tables in R takes so long compared to
> >>>> HTSeq?
> >>>>
> >>>>
> >>>> Thank you!
> >>>>
> >>>> Best,
> >>>> Milica
> >>>>
> >>>>        [[alternative HTML version deleted]]
> >>>>
> >>>> _______________________________________________
> >>>> Bioconductor mailing list
> >>>> Bioconductor at r-project.org
> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> Search the archives:
> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Program in Computational Biology
> >> Division of Public Health Sciences
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514
> >> P.O. Box 19024
> >> Seattle, WA 98109-1024
> >>
> >> E-mail: hpages at fhcrc.org
> >> Phone:  (206) 667-5791
> >> Fax:    (206) 667-1319
> >>
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
> >
> > --
> > Steve Lianoglou
> > Graduate Student: Computational Systems Biology
> >  | Memorial Sloan-Kettering Cancer Center
> >  | Weill Medical College of Cornell University
> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>