[Bioc-sig-seq] write.XStringSet() terribly slow
Hans-Ulrich Klein
h.klein at uni-muenster.de
Wed May 5 15:37:56 CEST 2010
Hi,
I have the same problem. I want to write ~4 million short (25 bp)
sequences into one FASTA file. write.XStringSet() is very slow, and
writeFASTA() is also very slow: only about 1500 sequences are written per minute.
Are there any alternatives?
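One workaround that might be worth trying for many short reads, as a
minimal sketch only: build all FASTA records with one vectorized paste0()
and write them with a single writeLines() call. The helper name
write_fasta_fast is hypothetical (it is not part of Biostrings or
ShortRead), and the sketch assumes the reads are already available as a
character vector plus a character vector of ids, e.g. from
as.character(reads) and names(reads). It does no line wrapping, so it is
only reasonable for short sequences such as 25 bp reads.

```r
# Hypothetical helper (not part of Biostrings): write many short sequences
# to a FASTA file with one vectorized paste0() and one writeLines() call.
write_fasta_fast <- function(seqs, ids, path) {
  stopifnot(is.character(seqs), is.character(ids),
            length(seqs) == length(ids))
  # Each record is a header line followed by one unwrapped sequence line.
  records <- paste0(">", ids, "\n", seqs)
  writeLines(records, con = path)
  invisible(length(records))
}

# Toy data standing in for as.character(reads) and names(reads).
seqs <- c("ACGTACGTACGTACGTACGTACGTA", "TTTTACGTACGTACGTACGTACGGG")
ids  <- c("read1", "read2")
out  <- tempfile(fileext = ".fasta")
write_fasta_fast(seqs, ids, out)
```

I have not timed this on the full 4 million reads, so whether it is fast
enough here is an open question.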
Best wishes,
Hans-Ulrich
> sessionInfo()
R version 2.11.0 RC (2010-04-19 r51778)
x86_64-pc-linux-gnu
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.18-5
[4] Biostrings_2.16.0 GenomicRanges_1.0.1 IRanges_1.6.0
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 grid_2.11.0 hwriter_1.2 tools_2.11.0
Steffen Neumann wrote:
> Hi,
>
> I have some major performance problems writing FASTA files
> with Biostrings. I have the full Arabidopsis Chr1 (30 MB) in one DNAString,
> and writing that to a file takes ages, as you can see from the strace output
> below: I get ~5 lines (80 chars each) per second. The runtime
> of the system call itself (in angle brackets) is negligible.
>
> library(Biostrings)
> chromosome<-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>
> Is there a fundamental flaw in my thinking?
> Is there an alternative to write.XStringSet()?
> This happens both on my laptop and a beefy server.
>
> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
> and get ~11 lines per second.
>
> Yours,
> Steffen
>
> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80<0.000137>
> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80<0.000142>
> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80<0.000133>
> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80<0.000159>
> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80<0.000133>
> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80<0.000136>
> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80<0.000594>
>
> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.14.12 IRanges_1.4.16
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0
>
>
--
Hans-Ulrich Klein
Department of Medical Informatics and Biomathematics
University of Münster
Domagkstrasse 9
48149 Münster, Germany
Tel.: +49 (0)251 83-58405