[Bioc-sig-seq] Optimizing the generation of reads w/Biostrings.

Hervé Pagès hpages at fhcrc.org
Sat Mar 20 20:28:33 CET 2010


Hi Jim,

This is an interesting use case and it reveals some of the limitations
of the current Biostrings infrastructure. For example some operations
like sample() don't work natively on DNAString or DNAStringSet objects.
Having this capability would certainly help here but there are other
bits that would need to be added to make things even easier, faster,
and more memory efficient than the workaround I propose below.

Writing the code below (and get it right, I think) took me a fair
amount of time but was a really helpful exercise because it helped
me identify/understand the bottlenecks. It still does too many
coercions between the XString, character, and raw representations,
and some of them could (and should) definitely be avoided if we had
those missing bits in Biostrings.

So for now one option is to do the sampling in the character space
which requires that you start converting your big DNAStringSet object
back to a standard character vector. Then I see that you use
strsplit( , NULL) on this character vector which generates a huge
object (8G for your 14 millions 75-mers on a 64 bit machine, because
each 1-letter string here will take 8 bytes in memory).

Using a "break-process-recombine" approach i.e. breaking the big
object into smaller pieces, processing them separately, and then
recombining the results can help reduce the memory footprint
dramatically without impacting too much the speed. In the end, I'm
able to mutate your 14 million 75-mers with a substitution matrix
that is cycle dependent, in about 3 minutes and with about 3.8Gb
of RAM (I did this on my laptop).

But first I'll start with an illustration of the
"break-process-recombine" approach for the reads generation.

   library(Biostrings)

   breakInBlocks <- function(totalsize, bsize)
   {
     quot <- totalsize %/% bsize
     ans_width <- rep.int(bsize, quot)
     rem <- totalsize %% bsize
     if (rem > 0L)
         ans_width <- c(ans_width, rem)
     IRanges(end=cumsum(ans_width), width=ans_width)
   }

   randomReads <- function(nread, readlength)
   {
     totalsize <- nread * readlength
     bsize <- 500000L
     blocks <- breakInBlocks(totalsize, bsize)
     tmp <- lapply(width(blocks),
              function(bs)
                DNAString(paste(sample(DNA_BASES, replace=TRUE, size=bs),
                                collapse="")))
     ans_subject <- do.call(c, tmp)
     successiveViews(ans_subject, rep.int(readlength, nread))
   }

Then generating 2 million 75-mers with 'randomReads(2000000L, 75L)'
takes 37 sec. and requires 343Mb of memory on my system whereas
without the "break-process-recombine" approach it requires 10x
more memory!

Generate the 14 million 75-mers:

   > system.time(reads <- randomReads(14000000L, 75L))
      user  system elapsed
   208.817   2.356 211.534
   > gc()
               used   (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells    924564   49.4    1590760   85.0   1590760   85.0
   Vcells 145696189 1111.6  353180113 2694.6 332717020 2538.5

The same approach can be used to mutate the reads. In a first pass
I address the simpler case where the substitution matrix is cycle
independent, which allows for even more optimization. One additional
optimization here is to use a raw vector instead of a vector of 1-letter
character strings as the input and output of sample():

   ## Here 's' must be a single character string.
   mutateString <- function(s, SubM)
   {
     bytes <- charToRaw(s)
     colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
     for (letter in rownames(SubM)) {
         b <- charToRaw(letter)
         idx <- bytes == b
         bytes[bytes == b] <- sample(colbytes, sum(idx),
                                     replace=TRUE, prob=SubM[letter, ])
     }
     rawToChar(bytes)
   }

   ## Here 'v' must be a Views object with a DNAString subject.
   mutateReads <- function(v, SubM)
   {
     totalsize <- length(subject(v))
     bsize <- 1000000L
     blocks <- breakInBlocks(totalsize, bsize)
     tmp <- lapply(seq_len(length(blocks)),
              function(i) {
                s <- subseq(subject(v),
                       start=start(blocks)[i], width=width(blocks)[i])
                DNAString(mutateString(as.character(s), SubM))
              })
     ans_subject <- do.call(c, tmp)
     successiveViews(ans_subject, width(v))
   }

   SubM <- matrix(0.0033, nrow=4, ncol=4)
   diag(SubM) <- 1 - sum(SubM[1, 2:4])
   rownames(SubM) <- colnames(SubM) <- DNA_BASES

   > system.time(mreads <- mutateReads(reads, SubM))
      user  system elapsed
   127.828   1.584 129.674
   > gc()
               used   (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells    925420   49.5    1590760   85.0   1590760   85.0
   Vcells 290946618 2219.8  450911592 3440.2 436207956 3328.1


Finally the more general case where the substitution matrix is cycle
dependent:

   ## Here 'm' must be a raw matrix object with 1 read per row
   ## and 'SubMs' a tridimensional numeric array where 'SubMs[ , , i]'
   ## is the substitution matrix for cycle 'i'.
   mutateRawMatrix <- function(m, SubMs)
   {
     for (i in seq_len(ncol(m))) {
         bytes <- m[ , i]
         SubM <- SubMs[ , , i]
         colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
         for (letter in rownames(SubM)) {
             b <- charToRaw(letter)
             idx <- bytes == b
             bytes[bytes == b] <- sample(colbytes, sum(idx),
                                    replace=TRUE, prob=SubM[letter, ])
         }
         m[ , i] <- bytes
     }
     m
   }

   mkSubM <- function(mutprob=0)
   {
     SubM <- matrix(mutprob, nrow=4, ncol=4)
     diag(SubM) <- 1 - sum(SubM[1, 2:4])
     rownames(SubM) <- colnames(SubM) <- DNA_BASES
     SubM
   }

   SubMs <- array(
     do.call(c, lapply(1:75, function(i) mkSubM(0.0033+(i-1)*0.002))),
     dim=c(4,4,75), dimnames=list(DNA_BASES, DNA_BASES, NULL))

   mutateReads2 <- function(v, SubMs)
   {
     if (!isConstant(width(v)))
         stop("'v' must be rectangular")
     vwidth <- width(v)[1L]
     if (length(dim(SubMs)) != 3L || dim(SubMs)[3L] < vwidth)
         stop("invalid 'SubMs' (wrong dimensions)")
     totalsize <- length(v)
     bsize <- 250000L
     blocks <- breakInBlocks(totalsize, bsize)
     tmp <- lapply(seq_len(length(blocks)),
              function(i) {
                vv <- v[as.integer(blocks[i])]
                s <- subseq(subject(v),
                       start=start(vv)[1L], end=end(vv)[length(vv)])
                m <- matrix(charToRaw(as.character(s)),
                       ncol=vwidth, byrow=TRUE)
                DNAString(rawToChar(t(mutateRawMatrix(m, SubMs))))
              })
     ans_subject <- do.call(c, tmp)
     successiveViews(ans_subject, width(v))
   }

   > system.time(mreads2 <- mutateReads2(reads, SubMs))
      user  system elapsed
   178.043   1.812 180.331
   > gc()
               used   (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells    955107   51.1    1590760   85.0   1590760   85.0
   Vcells 293551309 2239.7  497294029 3794.1 491180530 3747.5

Cheers,
H.


bullard at stat.berkeley.edu wrote:
> Hi All, I have been trying (rather futilely) to optimize the performance
> of some simulations which I am trying to do. Briefly, I have a set of
> reads which are sampled from a genome. These reads are stored in a
> DNAStringSet. The next thing I want to do is apply an error model to
> 'mutate' the reads.
> 
> Below, are three implementations for the same procedure. The first is very
> slow and doesn't leverage vectorization. The second is faster, leverages
> vectorization, but is done in a way that is counter-intuitive (one thing
> to note is that in practice the substitution matrices might be cycle
> dependent, hence why I walk over the columns). The third way seemed good,
> but there is so much conversion that for 14,000,000 reads I use > 20 Gigs
> of memory? Naively, that many reads should be about 1 Gig.
> 
> I am looking for low-level Biostrings functionality. something like
> applyString, or applyBase. I cannot fathom why this should take so long or
> consume so much memory.
> 
> thanks, jim
> 
> ## the read-set (note that I want to do this for 1e7 reads)
> BASES <- c("A", "C", "G", "T")
> reads <- DNAStringSet(sapply(1:2000, function(i) {
>   paste(sample(BASES, replace = T, size = 75),
>         collapse = "")
> }))
> 
> SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99,
>                     0.0033,  0.0033, 0.0033, 0.0033, 0.99, 0.0033,
>                     0.0033, 0.0033, 0.0033,  0.99),
>                   .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES))
> 
> f.naive <- function(ds) {
>   DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) {
>     paste(sapply(read, function(b) sample(BASES, prob = SubM[b,], size =
> 1)), collapse = "")
>   }))
> }
> 
> f.lessSo <- function(ds) {
>   cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>             'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
> 
>   V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>   V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) {
>     sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = TRUE)
>   })), 1, paste, collapse = "")
> 
>   DNAStringSet(V)
> }
> 
> f.withViews <- function(ds) {
>   cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>             'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
> 
>   V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>   V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V),
> function(i) {
>     sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = TRUE)
>   })))), collapse = ""))
>   DNAStringSet(Views(V, start = seq(1, length(V), by = 75),
>                      end = seq(75, length(V), by = 75)))
> }
> 
> 
> system.time(r <- f.naive(reads))
> ###    user  system elapsed
> ###   2.820   0.000   2.822
> 
> system.time(r <- f.lessSo(reads))
> ###    user  system elapsed
> ###   0.190   0.000   0.189
> 
> system.time(r <- f.withViews(reads))
> ###    user  system elapsed
> ###   0.090   0.010   0.097
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list