[Bioc-sig-seq] Optimizing the generation of reads w/Biostrings.
James Bullard
bullard at stat.berkeley.edu
Mon Mar 22 02:39:42 CET 2010
Hi Herve, this is great. I am going to take your code and benchmark it on my system. In the meantime, I wrote a Python version, which did outperform my R versions, but now we might be much more competitive.
thanks again, jim
On Mar 20, 2010, at 4:14 PM, Hervé Pagès wrote:
> I need to correct this a little bit. Yes sample() works on DNAString
> and DNAStringSet objects (and on any object with a "length" and a "["
> method). So randomReads() is better implemented like this:
>
> randomReads <- function(nread, readlength)
> {
>     totalsize <- nread * readlength
>     bsize <- 500000L
>     blocks <- breakInBlocks(totalsize, bsize)
>     tmp <- lapply(width(blocks),
>         function(bs)
>             sample(DNAString(paste(DNA_BASES, collapse="")),
>                    bs, replace=TRUE))
>     ans_subject <- do.call(c, tmp)
>     successiveViews(ans_subject, rep.int(readlength, nread))
> }
>
> It's about 3x faster than my previous version.
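> (As a quick illustration of the point above: since sample() only needs a
> "length" and a "[" method, sampling straight from a DNAString works.)
>
>     x <- DNAString("ACGT")
>     sample(x, 10, replace=TRUE)  # a 10-letter DNAString drawn from A/C/G/T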
>
> Cheers,
> H.
>
>
> Hervé Pagès wrote:
>> Hi Jim,
>> This is an interesting use case and it reveals some of the limitations
>> of the current Biostrings infrastructure. For example some operations
>> like sample() don't work natively on DNAString or DNAStringSet objects.
>> Having this capability would certainly help here but there are other
>> bits that would need to be added to make things even easier, faster,
>> and more memory efficient than the workaround I propose below.
>> Writing the code below (and getting it right, I think) took me a fair
>> amount of time but was a really helpful exercise because it helped
>> me identify/understand the bottlenecks. It still does too many
>> coercions between the XString, character, and raw representations,
>> and some of them could (and should) definitely be avoided if we had
>> those missing bits in Biostrings.
>> So for now one option is to do the sampling in the character space, which
>> requires converting your big DNAStringSet object back to a standard
>> character vector first. Then I see that you use strsplit( , NULL) on this
>> character vector, which generates a huge object (8 Gb for your 14 million
>> 75-mers on a 64-bit machine, because each 1-letter string here will take
>> 8 bytes in memory).
>> Using a "break-process-recombine" approach i.e. breaking the big
>> object into smaller pieces, processing them separately, and then
>> recombining the results can help reduce the memory footprint
>> dramatically without impacting too much the speed. In the end, I'm
>> able to mutate your 14 million 75-mers with a substitution matrix
>> that is cycle dependent, in about 3 minutes and with about 3.8Gb
>> of RAM (I did this on my laptop).
>> But first I'll start with an illustration of the
>> "break-process-recombine" approach for the reads generation.
>> library(Biostrings)
>> breakInBlocks <- function(totalsize, bsize)
>> {
>>     quot <- totalsize %/% bsize
>>     ans_width <- rep.int(bsize, quot)
>>     rem <- totalsize %% bsize
>>     if (rem > 0L)
>>         ans_width <- c(ans_width, rem)
>>     IRanges(end=cumsum(ans_width), width=ans_width)
>> }
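>> For example, breaking 1.2e6 positions into blocks of 500000 gives an
>> IRanges with widths 500000, 500000 and 200000:
>>
>>     breakInBlocks(1200000L, 500000L)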
>> randomReads <- function(nread, readlength)
>> {
>>     totalsize <- nread * readlength
>>     bsize <- 500000L
>>     blocks <- breakInBlocks(totalsize, bsize)
>>     tmp <- lapply(width(blocks),
>>         function(bs)
>>             DNAString(paste(sample(DNA_BASES, replace=TRUE, size=bs),
>>                             collapse="")))
>>     ans_subject <- do.call(c, tmp)
>>     successiveViews(ans_subject, rep.int(readlength, nread))
>> }
>> Then generating 2 million 75-mers with 'randomReads(2000000L, 75L)'
>> takes 37 sec. and requires 343 Mb of memory on my system, whereas
>> without the "break-process-recombine" approach it requires 10x
>> more memory!
>> Generate the 14 million 75-mers:
>> > system.time(reads <- randomReads(14000000L, 75L))
>>    user  system elapsed
>> 208.817   2.356 211.534
>> > gc()
>>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
>> Ncells    924564   49.4    1590760   85.0   1590760   85.0
>> Vcells 145696189 1111.6  353180113 2694.6 332717020 2538.5
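>> For scale: a DNAString stores 1 byte per base, so the sequence data alone
>> accounts for roughly 1 Gb of the 1111.6 Mb reported above:
>>
>>     14000000 * 75 / 2^20  # ~1001 Mb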
>> The same approach can be used to mutate the reads. In a first pass
>> I address the simpler case where the substitution matrix is cycle
>> independent, which allows for even more optimization. One additional
>> optimization here is to use a raw vector instead of a vector of 1-letter
>> character strings as the input and output of sample():
>> ## Here 's' must be a single character string.
>> mutateString <- function(s, SubM)
>> {
>>     bytes0 <- charToRaw(s)  # snapshot of the original bytes: positions are
>>     bytes <- bytes0         # looked up here so a base is never mutated twice
>>     colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
>>     for (letter in rownames(SubM)) {
>>         idx <- bytes0 == charToRaw(letter)
>>         bytes[idx] <- sample(colbytes, sum(idx),
>>                              replace=TRUE, prob=SubM[letter, ])
>>     }
>>     rawToChar(bytes)
>> }
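>> A quick way to eyeball what it does, on a short string (using the SubM
>> defined just below):
>>
>>     set.seed(1)
>>     mutateString(paste(rep("A", 50), collapse=""), SubM)
>>
>> which returns the same string with each A having roughly a 1% chance
>> (3 x 0.0033) of having been substituted.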
>> ## Here 'v' must be a Views object with a DNAString subject.
>> mutateReads <- function(v, SubM)
>> {
>>     totalsize <- length(subject(v))
>>     bsize <- 1000000L
>>     blocks <- breakInBlocks(totalsize, bsize)
>>     tmp <- lapply(seq_len(length(blocks)),
>>         function(i) {
>>             s <- subseq(subject(v),
>>                         start=start(blocks)[i], width=width(blocks)[i])
>>             DNAString(mutateString(as.character(s), SubM))
>>         })
>>     ans_subject <- do.call(c, tmp)
>>     successiveViews(ans_subject, width(v))
>> }
>> SubM <- matrix(0.0033, nrow=4, ncol=4)
>> diag(SubM) <- 1 - sum(SubM[1, 2:4])
>> rownames(SubM) <- colnames(SubM) <- DNA_BASES
>> > system.time(mreads <- mutateReads(reads, SubM))
>>    user  system elapsed
>> 127.828   1.584 129.674
>> > gc()
>>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
>> Ncells    925420   49.5    1590760   85.0   1590760   85.0
>> Vcells 290946618 2219.8  450911592 3440.2 436207956 3328.1
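>> As a rough sanity check on the first 10 million bases, the observed
>> per-base substitution rate should come out close to 3 x 0.0033, i.e.
>> about 1%:
>>
>>     s0 <- subseq(subject(reads),  start=1, end=10000000)
>>     s1 <- subseq(subject(mreads), start=1, end=10000000)
>>     mean(charToRaw(as.character(s0)) != charToRaw(as.character(s1)))  # ~0.0099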
>> Finally, the more general case where the substitution matrix is cycle
>> dependent:
>> ## Here 'm' must be a raw matrix object with 1 read per row
>> ## and 'SubMs' a tridimensional numeric array where 'SubMs[ , , i]'
>> ## is the substitution matrix for cycle 'i'.
>> mutateRawMatrix <- function(m, SubMs)
>> {
>>     for (i in seq_len(ncol(m))) {
>>         bytes0 <- m[ , i]   # original bytes for cycle 'i' (looked up below
>>         bytes <- bytes0     # so a base is never mutated twice)
>>         SubM <- SubMs[ , , i]
>>         colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
>>         for (letter in rownames(SubM)) {
>>             idx <- bytes0 == charToRaw(letter)
>>             bytes[idx] <- sample(colbytes, sum(idx),
>>                                  replace=TRUE, prob=SubM[letter, ])
>>         }
>>         m[ , i] <- bytes
>>     }
>>     m
>> }
>> mkSubM <- function(mutprob=0)
>> {
>>     SubM <- matrix(mutprob, nrow=4, ncol=4)
>>     diag(SubM) <- 1 - sum(SubM[1, 2:4])
>>     rownames(SubM) <- colnames(SubM) <- DNA_BASES
>>     SubM
>> }
>> SubMs <- array(
>>     do.call(c, lapply(1:75, function(i) mkSubM(0.0033+(i-1)*0.002))),
>>     dim=c(4,4,75), dimnames=list(DNA_BASES, DNA_BASES, NULL))
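>> (Each 4x4 slice of 'SubMs' should still have rows summing to 1; a quick
>> check:)
>>
>>     stopifnot(all(abs(apply(SubMs, 3, rowSums) - 1) < 1e-12))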
>> mutateReads2 <- function(v, SubMs)
>> {
>>     if (!isConstant(width(v)))
>>         stop("'v' must be rectangular")
>>     vwidth <- width(v)[1L]
>>     if (length(dim(SubMs)) != 3L || dim(SubMs)[3L] < vwidth)
>>         stop("invalid 'SubMs' (wrong dimensions)")
>>     totalsize <- length(v)
>>     bsize <- 250000L
>>     blocks <- breakInBlocks(totalsize, bsize)
>>     tmp <- lapply(seq_len(length(blocks)),
>>         function(i) {
>>             vv <- v[as.integer(blocks[i])]
>>             s <- subseq(subject(v),
>>                         start=start(vv)[1L], end=end(vv)[length(vv)])
>>             m <- matrix(charToRaw(as.character(s)),
>>                         ncol=vwidth, byrow=TRUE)
>>             DNAString(rawToChar(t(mutateRawMatrix(m, SubMs))))
>>         })
>>     ans_subject <- do.call(c, tmp)
>>     successiveViews(ans_subject, width(v))
>> }
>> > system.time(mreads2 <- mutateReads2(reads, SubMs))
>>    user  system elapsed
>> 178.043   1.812 180.331
>> > gc()
>>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
>> Ncells    955107   51.1    1590760   85.0   1590760   85.0
>> Vcells 293551309 2239.7  497294029 3794.1 491180530 3747.5
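>> And a rough check that the error rate does increase with the cycle as
>> specified: on the first 10000 reads the per-cycle mismatch fraction should
>> go from about 0.01 at cycle 1 up to about 0.45 at cycle 75.
>>
>>     m0 <- do.call(rbind, strsplit(as.character(reads[1:10000]), ""))
>>     m1 <- do.call(rbind, strsplit(as.character(mreads2[1:10000]), ""))
>>     round(colMeans(m0 != m1), 3)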
>> Cheers,
>> H.
>> bullard at stat.berkeley.edu wrote:
>>> Hi All, I have been trying (rather futilely) to optimize the performance
>>> of some simulations I am running. Briefly, I have a set of reads which are
>>> sampled from a genome. These reads are stored in a DNAStringSet. The next
>>> thing I want to do is apply an error model to 'mutate' the reads.
>>>
>>> Below are three implementations of the same procedure. The first is very
>>> slow and doesn't leverage vectorization. The second is faster and leverages
>>> vectorization, but is written in a way that is counter-intuitive (one thing
>>> to note is that in practice the substitution matrices might be cycle
>>> dependent, which is why I walk over the columns). The third way seemed good,
>>> but there is so much conversion that for 14,000,000 reads I use > 20 Gigs
>>> of memory. Naively, that many reads should be about 1 Gig.
>>>
>>> I am looking for low-level Biostrings functionality, something like
>>> applyString or applyBase. I cannot fathom why this should take so long or
>>> consume so much memory.
>>>
>>> thanks, jim
>>>
>>> ## the read-set (note that I want to do this for 1e7 reads)
>>> BASES <- c("A", "C", "G", "T")
>>> reads <- DNAStringSet(sapply(1:2000, function(i) {
>>>     paste(sample(BASES, replace = T, size = 75),
>>>           collapse = "")
>>> }))
>>>
>>> SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99,
>>>                     0.0033, 0.0033, 0.0033, 0.0033, 0.99, 0.0033,
>>>                     0.0033, 0.0033, 0.0033, 0.99),
>>>                   .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES))
>>>
>>> ## SM: list of per-cycle substitution matrices (assumed; here just 75
>>> ## copies of SubM so that f.lessSo() and f.withViews() below run as-is)
>>> SM <- rep(list(SubM), 75)
>>>
>>> f.naive <- function(ds) {
>>>     DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) {
>>>         paste(sapply(read, function(b) sample(BASES, prob = SubM[b, ],
>>>                                               size = 1)),
>>>               collapse = "")
>>>     }))
>>> }
>>>
>>> f.lessSo <- function(ds) {
>>>     cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>>>               'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>>
>>>     V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>>>     V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) {
>>>         sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V),
>>>                replace = TRUE)
>>>     })), 1, paste, collapse = "")
>>>
>>>     DNAStringSet(V)
>>> }
>>>
>>> f.withViews <- function(ds) {
>>>     cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>>>               'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>>
>>>     V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>>>     V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V),
>>>         function(i) {
>>>             sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V),
>>>                    replace = TRUE)
>>>         })))), collapse = ""))
>>>     DNAStringSet(Views(V, start = seq(1, length(V), by = 75),
>>>                        end = seq(75, length(V), by = 75)))
>>> }
>>>
>>>
>>> system.time(r <- f.naive(reads))
>>> ### user system elapsed
>>> ### 2.820 0.000 2.822
>>>
>>> system.time(r <- f.lessSo(reads))
>>> ### user system elapsed
>>> ### 0.190 0.000 0.189
>>>
>>> system.time(r <- f.withViews(reads))
>>> ### user system elapsed
>>> ### 0.090 0.010 0.097
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319