[Bioc-sig-seq] write.XStringSet() terribly slow
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Fri Apr 16 15:55:24 CEST 2010
I don't know if the code has been refactored since, but a while ago I
sent a patch to writeFASTA that made it orders of magnitude faster, so
you should perhaps try that one. With the patch it is pretty fast to
dump entire BSgenomes into FASTA files.
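
For illustration, here is a rough sketch in plain R of the general idea behind this kind of speed-up. It is not the actual patch, which is not reproduced in this message, and write_fasta_fast() is a hypothetical helper, not a Biostrings function. It assumes the sequence is available as a single character string; it cuts that string into fixed-width lines in one vectorized step and hands them to a single writeLines() call, rather than issuing one small write per line.

```r
# Hypothetical helper (an assumption for illustration, not part of Biostrings).
# dna:   a single character string holding the sequence
# name:  the record name to put on the ">" header line
# file:  path of the FASTA file to create
# width: number of sequence characters per output line
write_fasta_fast <- function(dna, name, file, width = 80L) {
  n <- nchar(dna)
  if (n == 0L) {
    lines <- character(0)
  } else {
    # Start and end position of every output line, computed in one go
    starts <- seq.int(1L, n, by = width)
    ends <- pmin(starts + width - 1L, n)
    # substring() is vectorized over the start/end positions
    lines <- substring(dna, starts, ends)
  }
  # Write the header and all sequence lines with a single call
  writeLines(c(paste0(">", name), lines), con = file)
  invisible(file)
}
```

With a DNAString x, one would presumably call it as write_fasta_fast(as.character(x), "Chr1", "/tmp/test.fasta"); the record name here is only a placeholder.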
Kasper
On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <sneumann at ipb-halle.de> wrote:
> Hi,
>
> I have some major performance problems writing fasta files
> with Biostrings. I have the full Arabidopsis Chr1 (30 MByte) in one DNAString,
> and writing that to a file takes ages, as you can see from the strace output
> below: I get ~5 lines (80 chars each) per second. The runtime
> of the system call itself <in brackets> is negligible.
>
> library(Biostrings)
> chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>
> Is there a fundamental flaw in my thinking?
> Is there an alternative to write.XStringSet()?
> This happens both on my laptop and a beefy server.
>
> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
> and get ~11 lines per second.
>
> Yours,
> Steffen
>
> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
>
> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.14.12 IRanges_1.4.16
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0
>
> --
> IPB Halle AG Massenspektrometrie & Bioinformatik
> Dr. Steffen Neumann http://www.IPB-Halle.DE
> Weinberg 3 http://msbi.bic-gh.de
> 06120 Halle Tel. +49 (0) 345 5582 - 1470
> +49 (0) 345 5582 - 0
> sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409
>