[Bioc-sig-seq] write.XStringSet() terribly slow

Fri Apr 16 20:48:28 CEST 2010

Hi,

Thanks for the code snippet,
it's working like a charm.

Yours,
Steffen

Am Friday, den 16.04.2010, 14:17 -0400 schrieb Kasper Daniel Hansen:
> It has been incorporated into Biostrings a while ago.  But it turns
> out that it only really works if you call it like I did in an earlier
> email.  In other words, there were severe problems with argument
> checking etc.  And don't use write.XString - that is pretty
> inefficient.  I will get it fixed so that write.XStringSet just calls
> my new writeFASTA or similar.
> 
> You can test my "new" patch below.  It is very fast for the genome
> case (a few, very long sequences) and (I believe) horrible inefficient
> for the short read case (many, short sequences).  I know what I need
> to do to fix up the short read case, but I don't have time to do so
> until later this weekend/early next week.  For your use case it may be
> fast, depending on how many sequences you have.  Otherwise you will
> have to wait for me fixing the short read case.
> 
> Please write back with possible bugs.  Below 'x' could be a XStringSet
> (your use case) and the names of the sequences will be used as the
> FASTA record names, unless you give it a "desc" argument (character
> vector of record names).
> 
> writeFASTA <- function(x, file="", desc=NULL, append=FALSE, width=80)
> {
>     if (!isTRUEorFALSE(append))
>         stop("'append' must be TRUE or FALSE")
>     if (isSingleString(file)) {
>         if (file == "") {
>             file <- stdout()
>         } else {
>             file <- file(file, ifelse(append, "a", "w"))
>             on.exit(close(file))
>         }
>     } else if (inherits(file, "connection")) {
>         if (!isOpen(file)) {
>             file <- file(file, ifelse(append, "a", "w"))
>             on.exit(close(file))
>         }
>     } else {
>         stop("'file' must be a single string or connection")
>     }
>     if (!isSingleNumber(width))
>         stop("'width' must be an integer >= 1")
>     if (!is.integer(width))
>         width <- as.integer(width)
>     if (width < 1L)
>         stop("'width' must be an integer >= 1")
>     if(!(is.character(x) || is(x, "XString") || is(x, "XStringSet") ||
>          is(x, "BSgenome") || (is.list(x) && "seq" %in% names(x))))
>         stop("'x' does not have the appropriate type")
>     #browser()
>     if(is.character(x))
>         x <- BStringSet(x, use.names = TRUE)
>     if(is.list(x)) {
>         if(is.null(desc))
>             desc <- x$desc
>         x <- BStringSet(x$seq)
>     }
>     if(is(x, "XString")) {
>         nLengths <- length(x)
>     }
>     if(is(x, "XStringSet")) {
>         nLengths <- width(x)
>     }
>     if(is(x, "BSgenome")) {
>         nLengths <- seqlengths(x)
>     }
> 
>     if (!is.null(desc) && !(is.character(desc) && length(desc) ==
> length(nLengths)))
>         stop("when specified, 'desc' must be a character vector of the
> same length as the 'x' object")
>     if(is.null(desc))
>         desc <- names(x)
>     if(is.null(desc))
>         desc <- rep("", length(nLengths))
>     if(length(nLengths) != length(desc))
>         stop("wrong length of 'desc'")
> 
>     writeBString <- function(bstring)
>     {
>         if (length(bstring) == 0L)
>             return()
>         nlines <- (length(bstring) - 1L) %/% width + 1L
>         lineIdx <- seq_len(nlines)
>         start <- (lineIdx - 1L) * width + 1L
>         end <- start + width - 1L
>         if (end[nlines] > length(bstring))
>             end[nlines] <- length(bstring)
>         bigstring <- paste(
>             as.character(Views(bstring, start = start, end = end)),
>             collapse="\n"
>         )
>         cat(bigstring, "\n", file=file, sep="")
>     }
> 
>     if(is(x, "XString")) {
>         cat(">", desc, "\n", file = file, sep = "")
>         writeBString(x)
>     } else {
>         for (ii in seq_len(length(nLengths))) {
>             cat(">", desc[ii], "\n", file = file, sep = "")
>             writeBString(x[[ii]])
>         }
>     }
> }
> 
> 
> On Fri, Apr 16, 2010 at 1:55 PM, Neumann, Steffen <sneumann at ipb-halle.de> wrote:
> > Hi,
> >
> > thanks for your blazing fast answers.
> > Where did you send the patch ?
> > Can I find that somewhere ?
> >
> > Steffen
> >
> >
> > -----Original Message-----
> > From:   Kasper Daniel Hansen [mailto:kasperdanielhansen at gmail.com]
> > Sent:   Fri 4/16/2010 15:55
> > To:     Neumann, Steffen
> > Cc:     bioc-sig-sequencing at stat.math.ethz.ch
> > Subject:        Re: [Bioc-sig-seq] write.XStringSet() terribly slow
> >
> > I don't know if there has been a refactoring of the code, but I while
> > ago I send a patch to writeFASTA making it magnitudes faster, so you
> > should perhaps try that one.  The patch makes it pretty fast to dump
> > entire bsgenomes into fasta files.
> >
> > Kasper
> >
> > On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <sneumann at ipb-halle.de>
> > wrote:
> >> Hi,
> >>
> >> I have some major performance problems writing fasta files
> >> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one
> >> DNAString,
> >> and writing that to a file takes ages, as you see from the strace output
> >> below: I obtain ~5 lines (80 chars each) per second. The runtime
> >> of the system call <in brackets> is neglectible.
> >>
> >> library(Biostrings)
> >> chromosome <-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
> >> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
> >>
> >> Is there a fundamental flaw in my thinking ?
> >> Is there an alternative to write.XStringSet() ?
> >> This happens both on my laptop and a beefy server.
> >>
> >> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
> >> and get ~11 lines per second.
> >>
> >> Yours,
> >> Steffen
> >>
> >> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80
> >> <0.000137>
> >> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80
> >> <0.000142>
> >> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80
> >> <0.000133>
> >> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80
> >> <0.000159>
> >> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80
> >> <0.000133>
> >> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80
> >> <0.000136>
> >> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80
> >> <0.000594>
> >>
> >> sessionInfo()
> >> R version 2.10.0 (2009-10-26)
> >> x86_64-unknown-linux-gnu
> >>
> >> locale:
> >>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>
> >> other attached packages:
> >> [1] Biostrings_2.14.12 IRanges_1.4.16
> >>
> >> loaded via a namespace (and not attached):
> >> [1] Biobase_2.6.0
> >>
> >> --
> >> IPB Halle                    AG Massenspektrometrie & Bioinformatik
> >> Dr. Steffen Neumann          http://www.IPB-Halle.DE
> >> Weinberg 3                   http://msbi.bic-gh.de
> >> 06120 Halle                  Tel. +49 (0) 345 5582 - 1470
> >>                                  +49 (0) 345 5582 - 0
> >> sneumann(at)IPB-Halle.DE     Fax. +49 (0) 345 5582 - 1409
> >>
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> Bioc-sig-sequencing at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
> >
> >

-- 
IPB Halle                    AG Massenspektrometrie & Bioinformatik
Dr. Steffen Neumann          http://www.IPB-Halle.DE
Weinberg 3                   http://msbi.bic-gh.de
06120 Halle                  Tel. +49 (0) 345 5582 - 1470
                                  +49 (0) 345 5582 - 0
sneumann(at)IPB-Halle.DE     Fax. +49 (0) 345 5582 - 1409