[Bioc-sig-seq] write.XStringSet() terribly slow
Hervé Pagès
hpages at fhcrc.org
Wed May 5 23:09:27 CEST 2010
Hans-Ulrich Klein wrote:
> Hi,
>
> I have the same problem. I want to write ~4 million short (25 bp)
> sequences into one FASTA file. write.XStringSet() is very slow, and
> writeFASTA() is also very slow: only about 1500 sequences are written per minute.
OK, I guess it's time to bite the bullet as they say.
It has been on my TODO list for a long time to implement
write.XStringSet() in C so I will work on this and let you
know when it's ready.
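
In the meantime, one possible stopgap for many short reads is a plain
base-R writer that builds all lines first and hands them to a single
writeLines() call. The sketch below is only an illustration:
write_fasta_simple() is a hypothetical helper, not a Biostrings function,
and it assumes the sequences are already available as a named character
vector with one short sequence per element (no line wrapping).

```r
# Hypothetical helper (not part of Biostrings): write a named character
# vector as FASTA with one vectorised writeLines() call.
write_fasta_simple <- function(seqs, path) {
  stopifnot(is.character(seqs), !is.null(names(seqs)))
  # Interleave header and sequence lines: >name1, seq1, >name2, seq2, ...
  headers <- paste(">", names(seqs), sep = "")
  lines <- as.vector(rbind(headers, unname(seqs)))
  writeLines(lines, con = path)
  invisible(length(seqs))
}

# Toy input standing in for the 25 bp reads
reads <- c(read1 = "ACGTACGTACGTACGTACGTACGTA",
           read2 = "TTTTGGGGCCCCAAAATTTTGGGGC")
out <- tempfile(fileext = ".fasta")
write_fasta_simple(reads, out)
```

With a DNAStringSet x, as.character(x) should give a suitable named
character vector, at the cost of holding a second copy of the data in
memory. This has not been timed against write.XStringSet() here.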
Cheers,
H.
>
> Are there any alternatives?
>
> Best wishes,
> Hans-Ulrich
>
>
> > sessionInfo()
> R version 2.11.0 RC (2010-04-19 r51778)
> x86_64-pc-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.6.2 Rsamtools_1.0.1 lattice_0.18-5
> [4] Biostrings_2.16.0 GenomicRanges_1.0.1 IRanges_1.6.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0 grid_2.11.0 hwriter_1.2 tools_2.11.0
>
> Steffen Neumann wrote:
>> Hi,
>>
>> I have some major performance problems writing fasta files
>> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one
>> DNAString,
>> and writing that to a file takes ages, as you see from the strace output
>> below: I obtain ~5 lines (80 chars each) per second. The runtime
>> of the system call (in angle brackets) is negligible.
>>
>> library(Biostrings)
>> chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
>> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>>
>> Is there a fundamental flaw in my thinking?
>> Is there an alternative to write.XStringSet()?
>> This happens both on my laptop and a beefy server.
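
For a single long sequence such as a whole chromosome, one possible
alternative is to cut the sequence into fixed-width lines in one
vectorised step and write them all at once. A minimal sketch, assuming
the sequence is available as one plain character string (for a DNAString,
as.character() should provide that); wrap_sequence() is a hypothetical
helper, not a Biostrings function:

```r
# Hypothetical helper (not part of Biostrings): cut one long string into
# lines of at most `width` characters.
wrap_sequence <- function(s, width = 80L) {
  n <- nchar(s)
  if (n == 0L) return(character(0))
  starts <- seq(1L, n, by = width)
  substring(s, starts, pmin(starts + width - 1L, n))
}

# Toy sequence of 170 bases standing in for the chromosome
chr <- paste(rep(c("A", "C", "G", "T"), length.out = 170), collapse = "")
out <- tempfile(fileext = ".fasta")
# Header line first, then all sequence lines, in one writeLines() call
writeLines(c(">Chr1_toy", wrap_sequence(chr)), con = out)
```

The toy input produces two full 80-character lines and one 10-character
line. How this behaves on a 30 MByte string has not been measured here.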
>>
>> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
>> and get ~11 lines per second.
>>
>> Yours,
>> Steffen
>>
>> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
>> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
>> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
>> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
>> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
>> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
>> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
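
The timestamps in the strace output above can be turned into a rough
estimate of the total write time. This is a back-of-the-envelope sketch
only: it assumes the seven sampled write() calls are representative and
that the file holds about 30 million bases at 80 characters per line.

```r
# Timestamps (seconds within minute 13:06) copied from the strace output
t <- c(9.949290, 10.138835, 10.328395, 10.537475,
       10.727281, 10.916854, 11.105687)

gap   <- mean(diff(t))   # average time between consecutive write() calls
rate  <- 1 / gap         # lines written per second
lines <- 30e6 / 80       # assumed: ~30 MByte of sequence, 80 chars per line
hours <- lines / rate / 3600

round(c(lines_per_second = rate, estimated_hours = hours), 1)
```

This gives about 5.2 lines per second and an estimate of roughly 20 hours
for the whole chromosome, while the write() calls themselves take well
under a millisecond each, so the time is spent between the system calls.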
>>
>> sessionInfo()
>> R version 2.10.0 (2009-10-26)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.14.12 IRanges_1.4.16
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.6.0
>>
>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319