[Bioc-sig-seq] Optimizing the generation of reads w/Biostrings.
James Bullard
bullard at stat.berkeley.edu
Mon Mar 22 02:39:42 CET 2010
Hi Herve, this is great. I am going to take your code and benchmark it on my system. In the meantime, I wrote a Python version, which did outperform my R versions, but now we might be much more competitive.
thanks again, jim
On Mar 20, 2010, at 4:14 PM, Hervé Pagès wrote:
> I need to correct this a little bit. Yes sample() works on DNAString
> and DNAStringSet objects (and on any object with a "length" and a "["
> method). So randomReads() is better implemented like this:
>
> randomReads <- function(nread, readlength)
> {
>     totalsize <- nread * readlength
>     bsize <- 500000L
>     blocks <- breakInBlocks(totalsize, bsize)
>     tmp <- lapply(width(blocks),
>         function(bs)
>             sample(DNAString(paste(DNA_BASES, collapse="")),
>                    bs, replace=TRUE))
>     ans_subject <- do.call(c, tmp)
>     successiveViews(ans_subject, rep.int(readlength, nread))
> }
>
> It's about 3x faster than my previous version.
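> (As a quick illustration of the point above: since sample() only needs a
> "length" and a "[" method, sampling straight from a DNAString works.)
>
>     x <- DNAString("ACGT")
>     sample(x, 10, replace=TRUE)  # a 10-letter DNAString drawn from A/C/G/T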
>
> Cheers,
> H.
>
>
> Hervé Pagès wrote:
>> Hi Jim,
>> This is an interesting use case and it reveals some of the limitations
>> of the current Biostrings infrastructure. For example some operations
>> like sample() don't work natively on DNAString or DNAStringSet objects.
>> Having this capability would certainly help here but there are other
>> bits that would need to be added to make things even easier, faster,
>> and more memory efficient than the workaround I propose below.
>> Writing the code below (and getting it right, I think) took me a fair
>> amount of time but was a really helpful exercise because it helped
>> me identify/understand the bottlenecks. It still does too many
>> coercions between the XString, character, and raw representations,
>> and some of them could (and should) definitely be avoided if we had
>> those missing bits in Biostrings.
>> So for now one option is to do the sampling in the character space, which
>> requires converting your big DNAStringSet object back to a standard
>> character vector first. Then I see that you use strsplit( , NULL) on this
>> character vector, which generates a huge object (8 Gb for your 14 million
>> 75-mers on a 64-bit machine, because each 1-letter string here will take
>> 8 bytes in memory).
>> Using a "break-process-recombine" approach i.e. breaking the big
>> object into smaller pieces, processing them separately, and then
>> recombining the results can help reduce the memory footprint
>> dramatically without impacting too much the speed. In the end, I'm
>> able to mutate your 14 million 75-mers with a substitution matrix
>> that is cycle dependent, in about 3 minutes and with about 3.8Gb
>> of RAM (I did this on my laptop).
>> But first I'll start with an illustration of the
>> "break-process-recombine" approach for the reads generation.
>> library(Biostrings)
>> breakInBlocks <- function(totalsize, bsize)
>> {
>>     quot <- totalsize %/% bsize
>>     ans_width <- rep.int(bsize, quot)
>>     rem <- totalsize %% bsize
>>     if (rem > 0L)
>>         ans_width <- c(ans_width, rem)
>>     IRanges(end=cumsum(ans_width), width=ans_width)
>> }
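>> For example, breaking 1.2e6 positions into blocks of 500000 gives an
>> IRanges with widths 500000, 500000 and 200000:
>>
>>     breakInBlocks(1200000L, 500000L)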
>> randomReads <- function(nread, readlength)
>> {
>>     totalsize <- nread * readlength
>>     bsize <- 500000L
>>     blocks <- breakInBlocks(totalsize, bsize)
>>     tmp <- lapply(width(blocks),
>>         function(bs)
>>             DNAString(paste(sample(DNA_BASES, replace=TRUE, size=bs),
>>                             collapse="")))
>>     ans_subject <- do.call(c, tmp)
>>     successiveViews(ans_subject, rep.int(readlength, nread))
>> }
>> Then generating 2 million 75-mers with 'randomReads(2000000L, 75L)'
>> takes 37 sec. and requires 343 Mb of memory on my system, whereas
>> without the "break-process-recombine" approach it requires 10x
>> more memory!
>> Generate the 14 million 75-mers:
>> > system.time(reads <- randomReads(14000000L, 75L))
>>    user  system elapsed
>> 208.817   2.356 211.534
>> > gc()
>>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
>> Ncells    924564   49.4    1590760   85.0   1590760   85.0
>> Vcells 145696189 1111.6  353180113 2694.6 332717020 2538.5
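>> For scale: a DNAString stores 1 byte per base, so the sequence data alone
>> accounts for roughly 1 Gb of the 1111.6 Mb reported above:
>>
>>     14000000 * 75 / 2^20  # ~1001 Mb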
>> The same approach can be used to mutate the reads. In a first pass
>> I address the simpler case where the substitution matrix is cycle
>> independent, which allows for even more optimization. One additional
>> optimization here is to use a raw vector instead of a vector of 1-letter
>> character strings as the input and output of sample():
>> ## Here 's' must be a single character string.
>> mutateString <- function(s, SubM)
>> {
>>     bytes0 <- charToRaw(s)  # snapshot of the original bytes: positions are
>>     bytes <- bytes0         # looked up here so a base is never mutated twice
>>     colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
>>     for (letter in rownames(SubM)) {
>>         idx <- bytes0 == charToRaw(letter)
>>         bytes[idx] <- sample(colbytes, sum(idx),
>>                              replace=TRUE, prob=SubM[letter, ])
>>     }
>>     rawToChar(bytes)
>> }
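>> A quick way to eyeball what it does, on a short string (using the SubM
>> defined just below):
>>
>>     set.seed(1)
>>     mutateString(paste(rep("A", 50), collapse=""), SubM)
>>
>> which returns the same string with each A having roughly a 1% chance
>> (3 x 0.0033) of having been substituted.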
>> ## Here 'v' must be a Views object with a DNAString subject.
>> mutateReads <- function(v, SubM)
>> {
>>     totalsize <- length(subject(v))
>>     bsize <- 1000000L
>>     blocks <- breakInBlocks(totalsize, bsize)
>>     tmp <- lapply(seq_len(length(blocks)),
>>         function(i) {
>>             s <- subseq(subject(v),
>>                         start=start(blocks)[i], width=width(blocks)[i])
>>             DNAString(mutateString(as.character(s), SubM))
>>         })
>>     ans_subject <- do.call(c, tmp)
>>     successiveViews(ans_subject, width(v))
>> }
>> SubM <- matrix(0.0033, nrow=4, ncol=4)
>> diag(SubM) <- 1 - sum(SubM[1, 2:4])
>> rownames(SubM) <- colnames(SubM) <- DNA_BASES
>> > system.time(mreads <- mutateReads(reads, SubM))
>>    user  system elapsed
>> 127.828   1.584 129.674
>> > gc()
>>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
>> Ncells    925420   49.5    1590760   85.0   1590760   85.0
>> Vcells 290946618 2219.8  450911592 3440.2 436207956 3328.1
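>> As a rough sanity check on the first 10 million bases, the observed
>> per-base substitution rate should come out close to 3 x 0.0033, i.e.
>> about 1%:
>>
>>     s0 <- subseq(subject(reads),  start=1, end=10000000)
>>     s1 <- subseq(subject(mreads), start=1, end=10000000)
>>     mean(charToRaw(as.character(s0)) != charToRaw(as.character(s1)))  # ~0.0099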
>> Finally, the more general case where the substitution matrix is cycle
>> dependent:
>> ## Here 'm' must be a raw matrix object with 1 read per row
>> ## and 'SubMs' a tridimensional numeric array where 'SubMs[ , , i]'
>> ## is the substitution matrix for cycle 'i'.
>> mutateRawMatrix <- function(m, SubMs)
>> {
>>     for (i in seq_len(ncol(m))) {
>>         bytes0 <- m[ , i]   # original bytes for cycle 'i' (looked up below
>>         bytes <- bytes0     # so a base is never mutated twice)
>>         SubM <- SubMs[ , , i]
>>         colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
>>         for (letter in rownames(SubM)) {
>>             idx <- bytes0 == charToRaw(letter)
>>             bytes[idx] <- sample(colbytes, sum(idx),
>>                                  replace=TRUE, prob=SubM[letter, ])
>>         }
>>         m[ , i] <- bytes
>>     }
>>     m
>> }
>> mkSubM <- function(mutprob=0)
>> {
>>     SubM <- matrix(mutprob, nrow=4, ncol=4)
>>     diag(SubM) <- 1 - sum(SubM[1, 2:4])
>>     rownames(SubM) <- colnames(SubM) <- DNA_BASES
>>     SubM
>> }
>> SubMs <- array(
>>     do.call(c, lapply(1:75, function(i) mkSubM(0.0033+(i-1)*0.002))),
>>     dim=c(4,4,75), dimnames=list(DNA_BASES, DNA_BASES, NULL))
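>> (Each 4x4 slice of 'SubMs' should still have rows summing to 1; a quick
>> check:)
>>
>>     stopifnot(all(abs(apply(SubMs, 3, rowSums) - 1) < 1e-12))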
>> mutateReads2 <- function(v, SubMs)
>> {
>>     if (!isConstant(width(v)))
>>         stop("'v' must be rectangular")
>>     vwidth <- width(v)[1L]
>>     if (length(dim(SubMs)) != 3L || dim(SubMs)[3L] < vwidth)
>>         stop("invalid 'SubMs' (wrong dimensions)")
>>     totalsize <- length(v)
>>     bsize <- 250000L
>>     blocks <- breakInBlocks(totalsize, bsize)
>>     tmp <- lapply(seq_len(length(blocks)),
>>         function(i) {
>>             vv <- v[as.integer(blocks[i])]
>>             s <- subseq(subject(v),
>>                         start=start(vv)[1L], end=end(vv)[length(vv)])
>>             m <- matrix(charToRaw(as.character(s)),
>>                         ncol=vwidth, byrow=TRUE)
>>             DNAString(rawToChar(t(mutateRawMatrix(m, SubMs))))
>>         })
>>     ans_subject <- do.call(c, tmp)
>>     successiveViews(ans_subject, width(v))
>> }
>> > system.time(mreads2 <- mutateReads2(reads, SubMs))
>>    user  system elapsed
>> 178.043   1.812 180.331
>> > gc()
>>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
>> Ncells    955107   51.1    1590760   85.0   1590760   85.0
>> Vcells 293551309 2239.7  497294029 3794.1 491180530 3747.5
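>> And a rough check that the error rate does increase with the cycle as
>> specified: on the first 10000 reads the per-cycle mismatch fraction should
>> go from about 0.01 at cycle 1 up to about 0.45 at cycle 75.
>>
>>     m0 <- do.call(rbind, strsplit(as.character(reads[1:10000]), ""))
>>     m1 <- do.call(rbind, strsplit(as.character(mreads2[1:10000]), ""))
>>     round(colMeans(m0 != m1), 3)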
>> Cheers,
>> H.
>> bullard at stat.berkeley.edu wrote:
>>> Hi All, I have been trying (rather futilely) to optimize the performance
>>> of some simulations I am running. Briefly, I have a set of reads which are
>>> sampled from a genome. These reads are stored in a DNAStringSet. The next
>>> thing I want to do is apply an error model to 'mutate' the reads.
>>>
>>> Below are three implementations of the same procedure. The first is very
>>> slow and doesn't leverage vectorization. The second is faster and leverages
>>> vectorization, but is written in a way that is counter-intuitive (one thing
>>> to note is that in practice the substitution matrices might be cycle
>>> dependent, which is why I walk over the columns). The third way seemed good,
>>> but there is so much conversion that for 14,000,000 reads I use > 20 Gigs
>>> of memory. Naively, that many reads should be about 1 Gig.
>>>
>>> I am looking for low-level Biostrings functionality, something like
>>> applyString or applyBase. I cannot fathom why this should take so long or
>>> consume so much memory.
>>>
>>> thanks, jim
>>>
>>> ## the read-set (note that I want to do this for 1e7 reads)
>>> BASES <- c("A", "C", "G", "T")
>>> reads <- DNAStringSet(sapply(1:2000, function(i) {
>>>     paste(sample(BASES, replace = T, size = 75),
>>>           collapse = "")
>>> }))
>>>
>>> SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99,
>>>                     0.0033, 0.0033, 0.0033, 0.0033, 0.99, 0.0033,
>>>                     0.0033, 0.0033, 0.0033, 0.99),
>>>                   .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES))
>>>
>>> ## SM: list of per-cycle substitution matrices (assumed; here just 75
>>> ## copies of SubM so that f.lessSo() and f.withViews() below run as-is)
>>> SM <- rep(list(SubM), 75)
>>>
>>> f.naive <- function(ds) {
>>>     DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) {
>>>         paste(sapply(read, function(b) sample(BASES, prob = SubM[b, ],
>>>                                               size = 1)),
>>>               collapse = "")
>>>     }))
>>> }
>>>
>>> f.lessSo <- function(ds) {
>>>     cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>>>               'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>>
>>>     V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>>>     V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) {
>>>         sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V),
>>>                replace = TRUE)
>>>     })), 1, paste, collapse = "")
>>>
>>>     DNAStringSet(V)
>>> }
>>>
>>> f.withViews <- function(ds) {
>>>     cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>>>               'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>>
>>>     V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>>>     V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V),
>>>         function(i) {
>>>             sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V),
>>>                    replace = TRUE)
>>>         })))), collapse = ""))
>>>     DNAStringSet(Views(V, start = seq(1, length(V), by = 75),
>>>                        end = seq(75, length(V), by = 75)))
>>> }
>>>
>>>
>>> system.time(r <- f.naive(reads))
>>> ### user system elapsed
>>> ### 2.820 0.000 2.822
>>>
>>> system.time(r <- f.lessSo(reads))
>>> ### user system elapsed
>>> ### 0.190 0.000 0.189
>>>
>>> system.time(r <- f.withViews(reads))
>>> ### user system elapsed
>>> ### 0.090 0.010 0.097
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319