[Bioc-sig-seq] Optimizing the generation of reads w/Biostrings.

Sun Mar 21 00:14:47 CET 2010

I need to correct this a little bit. Yes sample() works on DNAString
and DNAStringSet objects (and on any object with a "length" and a "["
method). So randomReads() is better implemented like this:

   randomReads <- function(nread, readlength)
   {
     totalsize <- nread * readlength
     bsize <- 500000L
     blocks <- breakInBlocks(totalsize, bsize)
     tmp <- lapply(width(blocks),
              function(bs)
                sample(DNAString(paste(DNA_BASES, collapse="")),
                       bs, replace=TRUE))
     ans_subject <- do.call(c, tmp)
     successiveViews(ans_subject, rep.int(readlength, nread))
   }

It's about 3x faster than my previous version.

Cheers,
H.

Hervé Pagès wrote:
> Hi Jim,
> 
> This is an interesting use case and it reveals some of the limitations
> of the current Biostrings infrastructure. For example some operations
> like sample() don't work natively on DNAString or DNAStringSet objects.
> Having this capability would certainly help here but there are other
> bits that would need to be added to make things even easier, faster,
> and more memory efficient than the workaround I propose below.
> 
> Writing the code below (and get it right, I think) took me a fair
> amount of time but was a really helpful exercise because it helped
> me identify/understand the bottlenecks. It still does too many
> coercions between the XString, character, and raw representations,
> and some of them could (and should) definitely be avoided if we had
> those missing bits in Biostrings.
> 
> So for now one option is to do the sampling in the character space
> which requires that you start converting your big DNAStringSet object
> back to a standard character vector. Then I see that you use
> strsplit( , NULL) on this character vector which generates a huge
> object (8G for your 14 millions 75-mers on a 64 bit machine, because
> each 1-letter string here will take 8 bytes in memory).
> 
> Using a "break-process-recombine" approach i.e. breaking the big
> object into smaller pieces, processing them separately, and then
> recombining the results can help reduce the memory footprint
> dramatically without impacting too much the speed. In the end, I'm
> able to mutate your 14 million 75-mers with a substitution matrix
> that is cycle dependent, in about 3 minutes and with about 3.8Gb
> of RAM (I did this on my laptop).
> 
> But first I'll start with an illustration of the
> "break-process-recombine" approach for the reads generation.
> 
>   library(Biostrings)
> 
>   breakInBlocks <- function(totalsize, bsize)
>   {
>     quot <- totalsize %/% bsize
>     ans_width <- rep.int(bsize, quot)
>     rem <- totalsize %% bsize
>     if (rem > 0L)
>         ans_width <- c(ans_width, rem)
>     IRanges(end=cumsum(ans_width), width=ans_width)
>   }
> 
>   randomReads <- function(nread, readlength)
>   {
>     totalsize <- nread * readlength
>     bsize <- 500000L
>     blocks <- breakInBlocks(totalsize, bsize)
>     tmp <- lapply(width(blocks),
>              function(bs)
>                DNAString(paste(sample(DNA_BASES, replace=TRUE, size=bs),
>                                collapse="")))
>     ans_subject <- do.call(c, tmp)
>     successiveViews(ans_subject, rep.int(readlength, nread))
>   }
> 
> Then generating 2 million 75-mers with 'randomReads(2000000L, 75L)'
> takes 37 sec. and requires 343Mb of memory on my system whereas
> without the "break-process-recombine" approach it requires 10x
> more memory!
> 
> Generate the 14 million 75-mers:
> 
>   > system.time(reads <- randomReads(14000000L, 75L))
>      user  system elapsed
>   208.817   2.356 211.534
>   > gc()
>               used   (Mb) gc trigger   (Mb)  max used   (Mb)
>   Ncells    924564   49.4    1590760   85.0   1590760   85.0
>   Vcells 145696189 1111.6  353180113 2694.6 332717020 2538.5
> 
> The same approach can be used to mutate the reads. In a first pass
> I address the simpler case where the substitution matrix is cycle
> independent, which allows for even more optimization. One additional
> optimization here is to use a raw vector instead of a vector of 1-letter
> character strings as the input and output of sample():
> 
>   ## Here 's' must be a single character string.
>   mutateString <- function(s, SubM)
>   {
>     bytes <- charToRaw(s)
>     colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
>     for (letter in rownames(SubM)) {
>         b <- charToRaw(letter)
>         idx <- bytes == b
>         bytes[bytes == b] <- sample(colbytes, sum(idx),
>                                     replace=TRUE, prob=SubM[letter, ])
>     }
>     rawToChar(bytes)
>   }
> 
>   ## Here 'v' must be a Views object with a DNAString subject.
>   mutateReads <- function(v, SubM)
>   {
>     totalsize <- length(subject(v))
>     bsize <- 1000000L
>     blocks <- breakInBlocks(totalsize, bsize)
>     tmp <- lapply(seq_len(length(blocks)),
>              function(i) {
>                s <- subseq(subject(v),
>                       start=start(blocks)[i], width=width(blocks)[i])
>                DNAString(mutateString(as.character(s), SubM))
>              })
>     ans_subject <- do.call(c, tmp)
>     successiveViews(ans_subject, width(v))
>   }
> 
>   SubM <- matrix(0.0033, nrow=4, ncol=4)
>   diag(SubM) <- 1 - sum(SubM[1, 2:4])
>   rownames(SubM) <- colnames(SubM) <- DNA_BASES
> 
>   > system.time(mreads <- mutateReads(reads, SubM))
>      user  system elapsed
>   127.828   1.584 129.674
>   > gc()
>               used   (Mb) gc trigger   (Mb)  max used   (Mb)
>   Ncells    925420   49.5    1590760   85.0   1590760   85.0
>   Vcells 290946618 2219.8  450911592 3440.2 436207956 3328.1
> 
> 
> Finally the more general case where the substitution matrix is cycle
> dependent:
> 
>   ## Here 'm' must be a raw matrix object with 1 read per row
>   ## and 'SubMs' a tridimensional numeric array where 'SubMs[ , , i]'
>   ## is the substitution matrix for cycle 'i'.
>   mutateRawMatrix <- function(m, SubMs)
>   {
>     for (i in seq_len(ncol(m))) {
>         bytes <- m[ , i]
>         SubM <- SubMs[ , , i]
>         colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
>         for (letter in rownames(SubM)) {
>             b <- charToRaw(letter)
>             idx <- bytes == b
>             bytes[bytes == b] <- sample(colbytes, sum(idx),
>                                    replace=TRUE, prob=SubM[letter, ])
>         }
>         m[ , i] <- bytes
>     }
>     m
>   }
> 
>   mkSubM <- function(mutprob=0)
>   {
>     SubM <- matrix(mutprob, nrow=4, ncol=4)
>     diag(SubM) <- 1 - sum(SubM[1, 2:4])
>     rownames(SubM) <- colnames(SubM) <- DNA_BASES
>     SubM
>   }
> 
>   SubMs <- array(
>     do.call(c, lapply(1:75, function(i) mkSubM(0.0033+(i-1)*0.002))),
>     dim=c(4,4,75), dimnames=list(DNA_BASES, DNA_BASES, NULL))
> 
>   mutateReads2 <- function(v, SubMs)
>   {
>     if (!isConstant(width(v)))
>         stop("'v' must be rectangular")
>     vwidth <- width(v)[1L]
>     if (length(dim(SubMs)) != 3L || dim(SubMs)[3L] < vwidth)
>         stop("invalid 'SubMs' (wrong dimensions)")
>     totalsize <- length(v)
>     bsize <- 250000L
>     blocks <- breakInBlocks(totalsize, bsize)
>     tmp <- lapply(seq_len(length(blocks)),
>              function(i) {
>                vv <- v[as.integer(blocks[i])]
>                s <- subseq(subject(v),
>                       start=start(vv)[1L], end=end(vv)[length(vv)])
>                m <- matrix(charToRaw(as.character(s)),
>                       ncol=vwidth, byrow=TRUE)
>                DNAString(rawToChar(t(mutateRawMatrix(m, SubMs))))
>              })
>     ans_subject <- do.call(c, tmp)
>     successiveViews(ans_subject, width(v))
>   }
> 
>   > system.time(mreads2 <- mutateReads2(reads, SubMs))
>      user  system elapsed
>   178.043   1.812 180.331
>   > gc()
>               used   (Mb) gc trigger   (Mb)  max used   (Mb)
>   Ncells    955107   51.1    1590760   85.0   1590760   85.0
>   Vcells 293551309 2239.7  497294029 3794.1 491180530 3747.5
> 
> Cheers,
> H.
> 
> 
> bullard at stat.berkeley.edu wrote:
>> Hi All, I have been trying (rather futilely) to optimize the performance
>> of some simulations which I am trying to do. Briefly, I have a set of
>> reads which are sampled from a genome. These reads are stored in a
>> DNAStringSet. The next thing I want to do is apply an error model to
>> 'mutate' the reads.
>>
>> Below, are three implementations for the same procedure. The first is 
>> very
>> slow and doesn't leverage vectorization. The second is faster, leverages
>> vectorization, but is done in a way that is counter-intuitive (one thing
>> to note is that in practice the substitution matrices might be cycle
>> dependent, hence why I walk over the columns). The third way seemed good,
>> but there is so much conversion that for 14,000,000 reads I use > 20 Gigs
>> of memory? Naively, that many reads should be about 1 Gig.
>>
>> I am looking for low-level Biostrings functionality. something like
>> applyString, or applyBase. I cannot fathom why this should take so 
>> long or
>> consume so much memory.
>>
>> thanks, jim
>>
>> ## the read-set (note that I want to do this for 1e7 reads)
>> BASES <- c("A", "C", "G", "T")
>> reads <- DNAStringSet(sapply(1:2000, function(i) {
>>   paste(sample(BASES, replace = T, size = 75),
>>         collapse = "")
>> }))
>>
>> SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99,
>>                     0.0033,  0.0033, 0.0033, 0.0033, 0.99, 0.0033,
>>                     0.0033, 0.0033, 0.0033,  0.99),
>>                   .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES))
>>
>> f.naive <- function(ds) {
>>   DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) {
>>     paste(sapply(read, function(b) sample(BASES, prob = SubM[b,], size =
>> 1)), collapse = "")
>>   }))
>> }
>>
>> f.lessSo <- function(ds) {
>>   cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>>             'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>
>>   V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>>   V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) {
>>     sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = 
>> TRUE)
>>   })), 1, paste, collapse = "")
>>
>>   DNAStringSet(V)
>> }
>>
>> f.withViews <- function(ds) {
>>   cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>>             'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>
>>   V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>>   V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V),
>> function(i) {
>>     sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = 
>> TRUE)
>>   })))), collapse = ""))
>>   DNAStringSet(Views(V, start = seq(1, length(V), by = 75),
>>                      end = seq(75, length(V), by = 75)))
>> }
>>
>>
>> system.time(r <- f.naive(reads))
>> ###    user  system elapsed
>> ###   2.820   0.000   2.822
>>
>> system.time(r <- f.lessSo(reads))
>> ###    user  system elapsed
>> ###   0.190   0.000   0.189
>>
>> system.time(r <- f.withViews(reads))
>> ###    user  system elapsed
>> ###   0.090   0.010   0.097
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319