[Bioc-sig-seq] Fast inject into XStringSet objects

Hervé Pagès hpages at fhcrc.org
Tue May 19 20:17:16 CEST 2009


Hi Thomas,

Sorry for the slow response on this.

Here is a much faster solution to your problem. Depending on the
size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
faster than the sapply-based solution. Let's assume 'dset' is the
constant width DNAStringSet object containing your reads and 'myqual'
is the constant width PhredQuality object containing the qualities
of your reads.
The following solution takes advantage of the fact that 'dset' and
therefore 'myqual' are rectangular, hence 'myqual' can be turned into
a raw matrix pretty efficiently:

   myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
                        nrow=length(myqual), byrow=TRUE)

This raw matrix itself can be turned into a boolean matrix using some
cutoff value:

   at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))

Also I've just added a "replaceLetterAt" method for DNAStringSet objects
to Biostrings (in addition to the existing method for DNAString objects).
Currently, it has the limitation that 'x' and 'at' must be rectangular
i.e. 'x' must have a constant width and 'at' must be a logical matrix
with the same dimensions as 'x'. That's of course just what's needed to
handle your usecase:

   letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]), collapse=""))
   letter <- as(Views(letter_subject, start=1, end=rowSums(at)), "DNAStringSet")
   dset3 <- replaceLetterAt(dset, at, letter)

'dset3' is the same as the 'dset2' object you obtained with the sapply-based
solution.

This improved replaceLetterAt() function will be available in Biostrings
release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the public
repositories (in the next 24 hours).

Please let me know if you have any other concerns.

Cheers,
H.


Thomas Girke wrote:
> Dear Developers,
> 
> Is there an efficient method available for replacing base calls in 
> sequences by Ns (or any other character) at positions where the quality 
> score drops below some arbitrary threshold. So far we have been using 
> the following code for this purpose. In general this approach works fine. 
> However, the R loop for applying this function makes it slow for very large 
> data sets with millions of sequences. 
> 
> Is there a method available for achieving the same result more efficiently? 
> 
> Considering the available resources in Biostrings, I am wondering if it wouldn't 
> make a lot of sense of using Biostrings' sequence masking method along with the 
> injectHardMask function for this purpose. However, at this point I am only aware 
> of methods for applying masks efficiently to a single sequence in a XString object, 
> but not to many sequences stored in a XStringSet object. 
> 
> 
> #################
> ## Sample Code ##
> #################
> ## Create example data set with random data
> library(Biostrings)
> dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) 
> myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random Phred score list.
> myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
> myqual <- PhredQuality(myqual) 
> dsetq1 <- QualityScaledDNAStringSet(dset, myqual) 
> 
> ## Define in character inject function XStringSetInject     
> XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, mychar="N") { 
>         mypos <- which(as.integer(myqual)<cutoff)
>         return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
> }
> ## Apply XStringSetInject function to all sequences.
> dvec <- sapply(1:length(dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
> dset2 <- DNAStringSet(dvec)
> names(dset2) <- paste("d", 1:length(dvec), sep="")
> names(myqual) <- paste("d", 1:length(dvec), sep="")
> dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) 
> 
> ## Compare dsetq1 and dsetq2
> dsetq1[1:2] 
> dsetq2[1:2] 
> 
> 
>> sessionInfo()
> R version 2.9.0 (2009-04-17) 
> x86_64-unknown-linux-gnu 
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] Biostrings_2.12.1 IRanges_1.2.0    
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.0
> 
> 
> Thanks in advance,
> 
> Thomas
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list