[Bioc-sig-seq] write.XStringSet() terribly slow
Steffen Neumann
sneumann at ipb-halle.de
Fri Apr 16 20:48:28 CEST 2010
Hi,
Thanks for the code snippet,
it's working like a charm.
Yours,
Steffen
Am Friday, den 16.04.2010, 14:17 -0400 schrieb Kasper Daniel Hansen:
> It has been incorporated into Biostrings a while ago. But it turns
> out that it only really works if you call it like I did in an earlier
> email. In other words, there were severe problems with argument
> checking etc. And don't use write.XString - that is pretty
> inefficient. I will get it fixed so that write.XStringSet just calls
> my new writeFASTA or similar.
>
> You can test my "new" patch below. It is very fast for the genome
> case (a few, very long sequences) and (I believe) horrible inefficient
> for the short read case (many, short sequences). I know what I need
> to do to fix up the short read case, but I don't have time to do so
> until later this weekend/early next week. For your use case it may be
> fast, depending on how many sequences you have. Otherwise you will
> have to wait for me fixing the short read case.
>
> Please write back with possible bugs. Below 'x' could be a XStringSet
> (your use case) and the names of the sequences will be used as the
> FASTA record names, unless you give it a "desc" argument (character
> vector of record names).
>
> writeFASTA <- function(x, file="", desc=NULL, append=FALSE, width=80)
> {
> if (!isTRUEorFALSE(append))
> stop("'append' must be TRUE or FALSE")
> if (isSingleString(file)) {
> if (file == "") {
> file <- stdout()
> } else {
> file <- file(file, ifelse(append, "a", "w"))
> on.exit(close(file))
> }
> } else if (inherits(file, "connection")) {
> if (!isOpen(file)) {
> file <- file(file, ifelse(append, "a", "w"))
> on.exit(close(file))
> }
> } else {
> stop("'file' must be a single string or connection")
> }
> if (!isSingleNumber(width))
> stop("'width' must be an integer >= 1")
> if (!is.integer(width))
> width <- as.integer(width)
> if (width < 1L)
> stop("'width' must be an integer >= 1")
> if(!(is.character(x) || is(x, "XString") || is(x, "XStringSet") ||
> is(x, "BSgenome") || (is.list(x) && "seq" %in% names(x))))
> stop("'x' does not have the appropriate type")
> #browser()
> if(is.character(x))
> x <- BStringSet(x, use.names = TRUE)
> if(is.list(x)) {
> if(is.null(desc))
> desc <- x$desc
> x <- BStringSet(x$seq)
> }
> if(is(x, "XString")) {
> nLengths <- length(x)
> }
> if(is(x, "XStringSet")) {
> nLengths <- width(x)
> }
> if(is(x, "BSgenome")) {
> nLengths <- seqlengths(x)
> }
>
> if (!is.null(desc) && !(is.character(desc) && length(desc) ==
> length(nLengths)))
> stop("when specified, 'desc' must be a character vector of the
> same length as the 'x' object")
> if(is.null(desc))
> desc <- names(x)
> if(is.null(desc))
> desc <- rep("", length(nLengths))
> if(length(nLengths) != length(desc))
> stop("wrong length of 'desc'")
>
> writeBString <- function(bstring)
> {
> if (length(bstring) == 0L)
> return()
> nlines <- (length(bstring) - 1L) %/% width + 1L
> lineIdx <- seq_len(nlines)
> start <- (lineIdx - 1L) * width + 1L
> end <- start + width - 1L
> if (end[nlines] > length(bstring))
> end[nlines] <- length(bstring)
> bigstring <- paste(
> as.character(Views(bstring, start = start, end = end)),
> collapse="\n"
> )
> cat(bigstring, "\n", file=file, sep="")
> }
>
> if(is(x, "XString")) {
> cat(">", desc, "\n", file = file, sep = "")
> writeBString(x)
> } else {
> for (ii in seq_len(length(nLengths))) {
> cat(">", desc[ii], "\n", file = file, sep = "")
> writeBString(x[[ii]])
> }
> }
> }
>
>
> On Fri, Apr 16, 2010 at 1:55 PM, Neumann, Steffen <sneumann at ipb-halle.de> wrote:
> > Hi,
> >
> > thanks for your blazing fast answers.
> > Where did you send the patch ?
> > Can I find that somewhere ?
> >
> > Steffen
> >
> >
> > -----Original Message-----
> > From: Kasper Daniel Hansen [mailto:kasperdanielhansen at gmail.com]
> > Sent: Fri 4/16/2010 15:55
> > To: Neumann, Steffen
> > Cc: bioc-sig-sequencing at stat.math.ethz.ch
> > Subject: Re: [Bioc-sig-seq] write.XStringSet() terribly slow
> >
> > I don't know if there has been a refactoring of the code, but I while
> > ago I send a patch to writeFASTA making it magnitudes faster, so you
> > should perhaps try that one. The patch makes it pretty fast to dump
> > entire bsgenomes into fasta files.
> >
> > Kasper
> >
> > On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <sneumann at ipb-halle.de>
> > wrote:
> >> Hi,
> >>
> >> I have some major performance problems writing fasta files
> >> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one
> >> DNAString,
> >> and writing that to a file takes ages, as you see from the strace output
> >> below: I obtain ~5 lines (80 chars each) per second. The runtime
> >> of the system call <in brackets> is neglectible.
> >>
> >> library(Biostrings)
> >> chromosome <-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
> >> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
> >>
> >> Is there a fundamental flaw in my thinking ?
> >> Is there an alternative to write.XStringSet() ?
> >> This happens both on my laptop and a beefy server.
> >>
> >> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
> >> and get ~11 lines per second.
> >>
> >> Yours,
> >> Steffen
> >>
> >> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80
> >> <0.000137>
> >> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80
> >> <0.000142>
> >> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80
> >> <0.000133>
> >> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80
> >> <0.000159>
> >> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80
> >> <0.000133>
> >> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80
> >> <0.000136>
> >> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80
> >> <0.000594>
> >>
> >> sessionInfo()
> >> R version 2.10.0 (2009-10-26)
> >> x86_64-unknown-linux-gnu
> >>
> >> locale:
> >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> >> [9] LC_ADDRESS=C LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods base
> >>
> >> other attached packages:
> >> [1] Biostrings_2.14.12 IRanges_1.4.16
> >>
> >> loaded via a namespace (and not attached):
> >> [1] Biobase_2.6.0
> >>
> >> --
> >> IPB Halle AG Massenspektrometrie & Bioinformatik
> >> Dr. Steffen Neumann http://www.IPB-Halle.DE
> >> Weinberg 3 http://msbi.bic-gh.de
> >> 06120 Halle Tel. +49 (0) 345 5582 - 1470
> >> +49 (0) 345 5582 - 0
> >> sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409
> >>
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> Bioc-sig-sequencing at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
> >
> >
--
IPB Halle AG Massenspektrometrie & Bioinformatik
Dr. Steffen Neumann http://www.IPB-Halle.DE
Weinberg 3 http://msbi.bic-gh.de
06120 Halle Tel. +49 (0) 345 5582 - 1470
+49 (0) 345 5582 - 0
sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409
More information about the Bioc-sig-sequencing
mailing list