[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq
Nicolas Delhomme
delhomme at embl.de
Thu Mar 29 14:39:03 CEST 2012
Hi Steve,
Yes that's true, I've been using that too, but I still encounter BAM files without an header section. Combining both your input will be a way to avoid BSgenome :-)
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 29 Mar 2012, at 03:20, Steve Lianoglou wrote:
> Hi,
>
> 2012/3/28 Hervé Pagès <hpages at fhcrc.org>:
>> Hi Nico, Milica,
>>
>>
>> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote:
>>>
>>> Hi Milica,
>>>
>>> I do the exact same thing in my package (easyRNASeq, still in the devel
>>> branch of Bioc) and it definitely does not require 20 hours to read "only"
>>> 20 million reads. Are you sure you are not getting your machine to swap?
>>> I.e. did you monitor the memory usage?
>>>
>>> It would be interesting (for me, at least) if you could try my package to
>>> get your count table. You can either retrieve the annotation from biomaRt or
>>> provide a GFF file. See the vignette of the package for the details and
>>> maybe these two posts on that mailing list:
>>>
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html
>>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124.html
>>
>>
>> Slightly off-topic but probably worth mentioning is that you don't
>> need to use a BSgenome package in order to fetch the chromosome
>> lengths of an organism. For example in the code you show in the
>> 2 above posts, you could use makeTranscriptDbFromUCSC() (or
>> makeTranscriptDbFromBiomart(), both from the GenomicFeatures
>> package) to make a TranscriptDb object, and then to do seqlengths()
>> on that object (BTW it would be nice if this worked with
>> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages
>> for hg19 or mm9 already installed, that should be faster than
>> installing them.
>
> Also, the BAM file you are using as the source of your reads has the
> seqlength info in it, eg:
>
> R> library(Rsamtools)
> R> bf <- BamFile('/path/to/your.bam')
> R> seqlengths(bf) ## hg19
> chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8
> 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022
> chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16
> 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753
> chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
> 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566
> chrM
> 16571
>
> Perhaps that's a helpful piece of trivia to know, too ...
>
> -steve
>
>
>>
>> Or, even faster:
>>
>>> library(rtracklayer)
>>> session <- browserSession()
>>> genome(session) <- "mm9"
>>> seqlengths(session)
>> chr1 chr1_random chr2 chr3 chr3_random chr4
>> 197195432 1231697 181748087 159599783 41899 155630120
>> chr4_random chr5 chr5_random chr6 chr7 chr7_random
>> 160594 152537259 357350 149517037 152524553 362490
>> chr8 chr8_random chr9 chr9_random chr10 chr11
>> 131738871 849593 124076172 449403 129993255 121843856
>> chr12 chr13 chr13_random chr14 chr15 chr16
>> 121257530 120284312 400311 125194864 103494974 98319150
>> chr16_random chr17 chr17_random chr18 chr19 chrX
>> 3994 95272651 628739 90772031 61342430 166650296
>> chrX_random chrY chrY_random chrUn_random chrM
>> 1785075 15902555 58682461 5900358 16299
>>
>> However, those alternative require internet access...
>>
>> Cheers,
>> H.
>>
>>
>>>
>>> For addressing if from the countOverlaps / summarizedOverlaps point of
>>> view, it would help if you could post your code and sessionInfo().
>>>
>>> HTH,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On 28 Mar 2012, at 13:04, Milica Krunic wrote:
>>>
>>>> Hello!
>>>>
>>>>
>>>>
>>>> I am working with cat RNA Seq data and after mapping I wanted to get the
>>>> count tables. So, I tried to do it using countOverlaps and
>>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R,
>>>> per
>>>> one sorted .bam file (~20*10^6 reads), no matter which previously
>>>> mentioned
>>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R
>>>> methods I used GRangesList object downloaded directly in R from Biomart.
>>>> In
>>>> HTSeq I used GTF file provided on Ensembl homepage. Average cat gene
>>>> width
>>>> is about 44000 in GRangesList.
>>>> Does anyone know why getting count tables in R takes so long compared to
>>>> HTSeq?
>>>>
>>>>
>>>> Thank you!
>>>>
>>>> Best,
>>>> Milica
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list