[Bioc-sig-seq] Fast inject into XStringSet objects
Thomas Girke
thomas.girke at ucr.edu
Tue May 5 01:48:28 CEST 2009
Dear Developers,
Is there an efficient method available for replacing base calls in
sequences by Ns (or any other character) at positions where the quality
score drops below some arbitrary threshold. So far we have been using
the following code for this purpose. In general this approach works fine.
However, the R loop for applying this function makes it slow for very large
data sets with millions of sequences.
Is there a method available for achieving the same result more efficiently?
Considering the available resources in Biostrings, I am wondering if it wouldn't
make a lot of sense of using Biostrings' sequence masking method along with the
injectHardMask function for this purpose. However, at this point I am only aware
of methods for applying masks efficiently to a single sequence in a XString object,
but not to many sequences stored in a XStringSet object.
#################
## Sample Code ##
#################
## Create example data set with random data
library(Biostrings)
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse="")))
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random Phred score list.
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x)))
myqual <- PhredQuality(myqual)
dsetq1 <- QualityScaledDNAStringSet(dset, myqual)
## Define in character inject function XStringSetInject
XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, mychar="N") {
mypos <- which(as.integer(myqual)<cutoff)
return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
}
## Apply XStringSetInject function to all sequences.
dvec <- sapply(1:length(dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
dset2 <- DNAStringSet(dvec)
names(dset2) <- paste("d", 1:length(dvec), sep="")
names(myqual) <- paste("d", 1:length(dvec), sep="")
dsetq2 <- QualityScaledDNAStringSet(dset2, myqual)
## Compare dsetq1 and dsetq2
dsetq1[1:2]
dsetq2[1:2]
> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.12.1 IRanges_1.2.0
loaded via a namespace (and not attached):
[1] Biobase_2.4.0
Thanks in advance,
Thomas
More information about the Bioc-sig-sequencing
mailing list