[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq
Hervé Pagès
hpages at fhcrc.org
Thu Mar 29 02:57:56 CEST 2012
Hi Nico, Milica,
On 03/28/2012 04:21 AM, Nicolas Delhomme wrote:
> Hi Milica,
>
> I do the exact same thing in my package (easyRNASeq, still in the devel branch of Bioc) and it definitely does not require 20 hours to read "only" 20 million reads. Are you sure you are not getting your machine to swap? I.e. did you monitor the memory usage?
>
> It would be interesting (for me, at least) if you could try my package to get your count table. You can either retrieve the annotation from biomaRt or provide a GFF file. See the vignette of the package for the details and maybe these two posts on that mailing list:
>
> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html
> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124.html
Slightly off-topic but probably worth mentioning is that you don't
need to use a BSgenome package in order to fetch the chromosome
lengths of an organism. For example in the code you show in the
2 above posts, you could use makeTranscriptDbFromUCSC() (or
makeTranscriptDbFromBiomart(), both from the GenomicFeatures
package) to make a TranscriptDb object, and then to do seqlengths()
on that object (BTW it would be nice if this worked with
makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages
for hg19 or mm9 already installed, that should be faster than
installing them.
Or, even faster:
> library(rtracklayer)
> session <- browserSession()
> genome(session) <- "mm9"
> seqlengths(session)
chr1 chr1_random chr2 chr3 chr3_random
chr4
197195432 1231697 181748087 159599783 41899
155630120
chr4_random chr5 chr5_random chr6 chr7
chr7_random
160594 152537259 357350 149517037 152524553
362490
chr8 chr8_random chr9 chr9_random chr10
chr11
131738871 849593 124076172 449403 129993255
121843856
chr12 chr13 chr13_random chr14 chr15
chr16
121257530 120284312 400311 125194864 103494974
98319150
chr16_random chr17 chr17_random chr18 chr19
chrX
3994 95272651 628739 90772031 61342430
166650296
chrX_random chrY chrY_random chrUn_random chrM
1785075 15902555 58682461 5900358 16299
However, those alternative require internet access...
Cheers,
H.
>
> For addressing if from the countOverlaps / summarizedOverlaps point of view, it would help if you could post your code and sessionInfo().
>
> HTH,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On 28 Mar 2012, at 13:04, Milica Krunic wrote:
>
>> Hello!
>>
>>
>>
>> I am working with cat RNA Seq data and after mapping I wanted to get the
>> count tables. So, I tried to do it using countOverlaps and
>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R, per
>> one sorted .bam file (~20*10^6 reads), no matter which previously mentioned
>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R
>> methods I used GRangesList object downloaded directly in R from Biomart. In
>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene width
>> is about 44000 in GRangesList.
>> Does anyone know why getting count tables in R takes so long compared to
>> HTSeq?
>>
>>
>> Thank you!
>>
>> Best,
>> Milica
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list