[Bioc-sig-seq] Optimizing the generation of reads w/Biostrings.
Hervé Pagès
hpages at fhcrc.org
Sun Mar 21 00:14:47 CET 2010
I need to correct this a little bit. Yes sample() works on DNAString
and DNAStringSet objects (and on any object with a "length" and a "["
method). So randomReads() is better implemented like this:
randomReads <- function(nread, readlength)
{
totalsize <- nread * readlength
bsize <- 500000L
blocks <- breakInBlocks(totalsize, bsize)
tmp <- lapply(width(blocks),
function(bs)
sample(DNAString(paste(DNA_BASES, collapse="")),
bs, replace=TRUE))
ans_subject <- do.call(c, tmp)
successiveViews(ans_subject, rep.int(readlength, nread))
}
It's about 3x faster than my previous version.
Cheers,
H.
Hervé Pagès wrote:
> Hi Jim,
>
> This is an interesting use case and it reveals some of the limitations
> of the current Biostrings infrastructure. For example some operations
> like sample() don't work natively on DNAString or DNAStringSet objects.
> Having this capability would certainly help here but there are other
> bits that would need to be added to make things even easier, faster,
> and more memory efficient than the workaround I propose below.
>
> Writing the code below (and get it right, I think) took me a fair
> amount of time but was a really helpful exercise because it helped
> me identify/understand the bottlenecks. It still does too many
> coercions between the XString, character, and raw representations,
> and some of them could (and should) definitely be avoided if we had
> those missing bits in Biostrings.
>
> So for now one option is to do the sampling in the character space
> which requires that you start converting your big DNAStringSet object
> back to a standard character vector. Then I see that you use
> strsplit( , NULL) on this character vector which generates a huge
> object (8G for your 14 millions 75-mers on a 64 bit machine, because
> each 1-letter string here will take 8 bytes in memory).
>
> Using a "break-process-recombine" approach i.e. breaking the big
> object into smaller pieces, processing them separately, and then
> recombining the results can help reduce the memory footprint
> dramatically without impacting too much the speed. In the end, I'm
> able to mutate your 14 million 75-mers with a substitution matrix
> that is cycle dependent, in about 3 minutes and with about 3.8Gb
> of RAM (I did this on my laptop).
>
> But first I'll start with an illustration of the
> "break-process-recombine" approach for the reads generation.
>
> library(Biostrings)
>
> breakInBlocks <- function(totalsize, bsize)
> {
> quot <- totalsize %/% bsize
> ans_width <- rep.int(bsize, quot)
> rem <- totalsize %% bsize
> if (rem > 0L)
> ans_width <- c(ans_width, rem)
> IRanges(end=cumsum(ans_width), width=ans_width)
> }
>
> randomReads <- function(nread, readlength)
> {
> totalsize <- nread * readlength
> bsize <- 500000L
> blocks <- breakInBlocks(totalsize, bsize)
> tmp <- lapply(width(blocks),
> function(bs)
> DNAString(paste(sample(DNA_BASES, replace=TRUE, size=bs),
> collapse="")))
> ans_subject <- do.call(c, tmp)
> successiveViews(ans_subject, rep.int(readlength, nread))
> }
>
> Then generating 2 million 75-mers with 'randomReads(2000000L, 75L)'
> takes 37 sec. and requires 343Mb of memory on my system whereas
> without the "break-process-recombine" approach it requires 10x
> more memory!
>
> Generate the 14 million 75-mers:
>
> > system.time(reads <- randomReads(14000000L, 75L))
> user system elapsed
> 208.817 2.356 211.534
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 924564 49.4 1590760 85.0 1590760 85.0
> Vcells 145696189 1111.6 353180113 2694.6 332717020 2538.5
>
> The same approach can be used to mutate the reads. In a first pass
> I address the simpler case where the substitution matrix is cycle
> independent, which allows for even more optimization. One additional
> optimization here is to use a raw vector instead of a vector of 1-letter
> character strings as the input and output of sample():
>
> ## Here 's' must be a single character string.
> mutateString <- function(s, SubM)
> {
> bytes <- charToRaw(s)
> colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
> for (letter in rownames(SubM)) {
> b <- charToRaw(letter)
> idx <- bytes == b
> bytes[bytes == b] <- sample(colbytes, sum(idx),
> replace=TRUE, prob=SubM[letter, ])
> }
> rawToChar(bytes)
> }
>
> ## Here 'v' must be a Views object with a DNAString subject.
> mutateReads <- function(v, SubM)
> {
> totalsize <- length(subject(v))
> bsize <- 1000000L
> blocks <- breakInBlocks(totalsize, bsize)
> tmp <- lapply(seq_len(length(blocks)),
> function(i) {
> s <- subseq(subject(v),
> start=start(blocks)[i], width=width(blocks)[i])
> DNAString(mutateString(as.character(s), SubM))
> })
> ans_subject <- do.call(c, tmp)
> successiveViews(ans_subject, width(v))
> }
>
> SubM <- matrix(0.0033, nrow=4, ncol=4)
> diag(SubM) <- 1 - sum(SubM[1, 2:4])
> rownames(SubM) <- colnames(SubM) <- DNA_BASES
>
> > system.time(mreads <- mutateReads(reads, SubM))
> user system elapsed
> 127.828 1.584 129.674
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 925420 49.5 1590760 85.0 1590760 85.0
> Vcells 290946618 2219.8 450911592 3440.2 436207956 3328.1
>
>
> Finally the more general case where the substitution matrix is cycle
> dependent:
>
> ## Here 'm' must be a raw matrix object with 1 read per row
> ## and 'SubMs' a tridimensional numeric array where 'SubMs[ , , i]'
> ## is the substitution matrix for cycle 'i'.
> mutateRawMatrix <- function(m, SubMs)
> {
> for (i in seq_len(ncol(m))) {
> bytes <- m[ , i]
> SubM <- SubMs[ , , i]
> colbytes <- charToRaw(paste(colnames(SubM), collapse=""))
> for (letter in rownames(SubM)) {
> b <- charToRaw(letter)
> idx <- bytes == b
> bytes[bytes == b] <- sample(colbytes, sum(idx),
> replace=TRUE, prob=SubM[letter, ])
> }
> m[ , i] <- bytes
> }
> m
> }
>
> mkSubM <- function(mutprob=0)
> {
> SubM <- matrix(mutprob, nrow=4, ncol=4)
> diag(SubM) <- 1 - sum(SubM[1, 2:4])
> rownames(SubM) <- colnames(SubM) <- DNA_BASES
> SubM
> }
>
> SubMs <- array(
> do.call(c, lapply(1:75, function(i) mkSubM(0.0033+(i-1)*0.002))),
> dim=c(4,4,75), dimnames=list(DNA_BASES, DNA_BASES, NULL))
>
> mutateReads2 <- function(v, SubMs)
> {
> if (!isConstant(width(v)))
> stop("'v' must be rectangular")
> vwidth <- width(v)[1L]
> if (length(dim(SubMs)) != 3L || dim(SubMs)[3L] < vwidth)
> stop("invalid 'SubMs' (wrong dimensions)")
> totalsize <- length(v)
> bsize <- 250000L
> blocks <- breakInBlocks(totalsize, bsize)
> tmp <- lapply(seq_len(length(blocks)),
> function(i) {
> vv <- v[as.integer(blocks[i])]
> s <- subseq(subject(v),
> start=start(vv)[1L], end=end(vv)[length(vv)])
> m <- matrix(charToRaw(as.character(s)),
> ncol=vwidth, byrow=TRUE)
> DNAString(rawToChar(t(mutateRawMatrix(m, SubMs))))
> })
> ans_subject <- do.call(c, tmp)
> successiveViews(ans_subject, width(v))
> }
>
> > system.time(mreads2 <- mutateReads2(reads, SubMs))
> user system elapsed
> 178.043 1.812 180.331
> > gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 955107 51.1 1590760 85.0 1590760 85.0
> Vcells 293551309 2239.7 497294029 3794.1 491180530 3747.5
>
> Cheers,
> H.
>
>
> bullard at stat.berkeley.edu wrote:
>> Hi All, I have been trying (rather futilely) to optimize the performance
>> of some simulations which I am trying to do. Briefly, I have a set of
>> reads which are sampled from a genome. These reads are stored in a
>> DNAStringSet. The next thing I want to do is apply an error model to
>> 'mutate' the reads.
>>
>> Below, are three implementations for the same procedure. The first is
>> very
>> slow and doesn't leverage vectorization. The second is faster, leverages
>> vectorization, but is done in a way that is counter-intuitive (one thing
>> to note is that in practice the substitution matrices might be cycle
>> dependent, hence why I walk over the columns). The third way seemed good,
>> but there is so much conversion that for 14,000,000 reads I use > 20 Gigs
>> of memory? Naively, that many reads should be about 1 Gig.
>>
>> I am looking for low-level Biostrings functionality. something like
>> applyString, or applyBase. I cannot fathom why this should take so
>> long or
>> consume so much memory.
>>
>> thanks, jim
>>
>> ## the read-set (note that I want to do this for 1e7 reads)
>> BASES <- c("A", "C", "G", "T")
>> reads <- DNAStringSet(sapply(1:2000, function(i) {
>> paste(sample(BASES, replace = T, size = 75),
>> collapse = "")
>> }))
>>
>> SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99,
>> 0.0033, 0.0033, 0.0033, 0.0033, 0.99, 0.0033,
>> 0.0033, 0.0033, 0.0033, 0.99),
>> .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES))
>>
>> f.naive <- function(ds) {
>> DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) {
>> paste(sapply(read, function(b) sample(BASES, prob = SubM[b,], size =
>> 1)), collapse = "")
>> }))
>> }
>>
>> f.lessSo <- function(ds) {
>> cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>> 'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>
>> V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>> V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) {
>> sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace =
>> TRUE)
>> })), 1, paste, collapse = "")
>>
>> DNAStringSet(V)
>> }
>>
>> f.withViews <- function(ds) {
>> cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
>> 'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')
>>
>> V <- do.call(rbind, strsplit(as.character(ds), split = ""))
>> V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V),
>> function(i) {
>> sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace =
>> TRUE)
>> })))), collapse = ""))
>> DNAStringSet(Views(V, start = seq(1, length(V), by = 75),
>> end = seq(75, length(V), by = 75)))
>> }
>>
>>
>> system.time(r <- f.naive(reads))
>> ### user system elapsed
>> ### 2.820 0.000 2.822
>>
>> system.time(r <- f.lessSo(reads))
>> ### user system elapsed
>> ### 0.190 0.000 0.189
>>
>> system.time(r <- f.withViews(reads))
>> ### user system elapsed
>> ### 0.090 0.010 0.097
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list