[Bioc-sig-seq] Fast inject into XStringSet objects
Hervé Pagès
hpages at fhcrc.org
Wed May 20 01:06:54 CEST 2009
Thomas,
Thomas Girke wrote:
> Dear Herve,
>
> Thanks a lot for your suggestions and efforts to optimize this step!
>
> We are often using sequences of variable length (e.g. after adaptor
> trimming), but this shouldn't be a problem because we can simply loop
> over every length interval (usually less than 100 iterations!).
Yes I could extend replaceLetterAt() so it supports DNAStringSet objects
of arbitrary shape. In that case, two formats for 'at' would be natural:
"list of logical vectors" or "list of integer vectors". Then you could do
something like this:
(1) Trim the adaptor:
trimming <- trimLRPatterns(..., dset, ranges=TRUE)
trimmed_dset <- narrow(dset, start=start(trimming), end=end(trimming))
trimmed_myqual <- narrow(myqual,
start=start(trimming),
end=end(trimming))
(2) Build the 'at' object to be used in replaceLetterAt():
flat_quals <- charToRaw(as.character(unlist(trimmed_myqual)))
flat_at <- flat_quals < charToRaw(as.character(PhredQuality(20L)))
at <- split(flat_at,
rep.int(seq_len(length(trimmed_myqual)),
width(trimmed_myqual)))
(3) Build the 'letter' object to be used in replaceLetterAt():
letter_subject <-
DNAString(paste(rep.int("N", max(width(trimmed_dset))), collapse=""))
letter <- as(Views(letter_subject, start=1, end=sapply(at, sum)),
"DNAStringSet")
(4) Call replaceLetterAt():
dset4 <- replaceLetterAt(trimmed_dset, at, letter)
Also some convenience could be added to Biostrings to support this use case
by wrapping some of the steps above into utility functions e.g.
o (1) could be done in a single call if we had a "trimLRPatterns"
method for QualityScaledDNAStringSet objects (which is easy
to add),
o (2) and (3) could be wrapped too but I don't have good names for
the wrappers and am not sure about their arguments either. I would
need to think a little bit about it.
What do you think? Would that work?
>
> One short question:
> Are there any plans to support masks in the future on XStringSet
> objects? If one could define and apply masks to sequences of variable
> length in XStringSet objects (without loops) then this would be be very
> useful for many applications.
No plans to support *soft* masking of XStringSet objects in the near
future because it's a too complicated thing to put in place, at least
for now. However *hard* masking (i.e. alteration of an XStringSet object
by injection of the "-" letter at some locations) is much easier to
support because 99% of the times you don't need to touch the implementation
of the functions that operate on XStringSet objects to make them work
on a hard-masked XStringSet object. They just keep working just fine.
The downside of hard masking is that every time the hard mask needs to be
modified, you need to make a new copy of the XStringSet object. But we
first need to see use cases where this becomes really an issue before
we start working on *soft* masking of XStringSet objects.
Cheers,
H.
>
> Many thanks again,
>
> Thomas
>
> On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
>> Hi Thomas,
>>
>> Sorry for the slow response on this.
>>
>> Here is a much faster solution to your problem. Depending on the
>> size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
>> faster than the sapply-based solution. Let's assume 'dset' is the
>> constant width DNAStringSet object containing your reads and 'myqual'
>> is the constant width PhredQuality object containing the qualities
>> of your reads.
>> The following solution takes advantage of the fact that 'dset' and
>> therefore 'myqual' are rectangular, hence 'myqual' can be turned into
>> a raw matrix pretty efficiently:
>>
>> myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
>> nrow=length(myqual), byrow=TRUE)
>>
>> This raw matrix itself can be turned into a boolean matrix using some
>> cutoff value:
>>
>> at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))
>>
>> Also I've just added a "replaceLetterAt" method for DNAStringSet objects
>> to Biostrings (in addition to the existing method for DNAString objects).
>> Currently, it has the limitation that 'x' and 'at' must be rectangular
>> i.e. 'x' must have a constant width and 'at' must be a logical matrix
>> with the same dimensions as 'x'. That's of course just what's needed to
>> handle your usecase:
>>
>> letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]),
>> collapse=""))
>> letter <- as(Views(letter_subject, start=1, end=rowSums(at)),
>> "DNAStringSet")
>> dset3 <- replaceLetterAt(dset, at, letter)
>>
>> 'dset3' is the same as the 'dset2' object you obtained with the sapply-based
>> solution.
>>
>> This improved replaceLetterAt() function will be available in Biostrings
>> release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the public
>> repositories (in the next 24 hours).
>>
>> Please let me know if you have any other concerns.
>>
>> Cheers,
>> H.
>>
>>
>> Thomas Girke wrote:
>>> Dear Developers,
>>>
>>> Is there an efficient method available for replacing base calls in
>>> sequences by Ns (or any other character) at positions where the quality
>>> score drops below some arbitrary threshold. So far we have been using
>>> the following code for this purpose. In general this approach works fine.
>>> However, the R loop for applying this function makes it slow for very
>>> large data sets with millions of sequences.
>>>
>>> Is there a method available for achieving the same result more
>>> efficiently?
>>> Considering the available resources in Biostrings, I am wondering if it
>>> wouldn't make a lot of sense of using Biostrings' sequence masking method
>>> along with the injectHardMask function for this purpose. However, at this
>>> point I am only aware of methods for applying masks efficiently to a
>>> single sequence in a XString object, but not to many sequences stored in a
>>> XStringSet object.
>>>
>>> #################
>>> ## Sample Code ##
>>> #################
>>> ## Create example data set with random data
>>> library(Biostrings)
>>> dset <- DNAStringSet(sapply(1:100, function(x)
>>> paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) myqlist <-
>>> lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random
>>> Phred score list.
>>> myqual <- sapply(myqlist, function(x) toString(PhredQuality(x)))
>>> myqual <- PhredQuality(myqual)
>>> dsetq1 <- QualityScaledDNAStringSet(dset, myqual)
>>>
>>> ## Define in character inject function XStringSetInject
>>> XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20,
>>> mychar="N") { mypos <- which(as.integer(myqual)<cutoff)
>>> return(toString(replaceLetterAt(myseq,
>>> which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
>>> }
>>> ## Apply XStringSetInject function to all sequences.
>>> dvec <- sapply(1:length(dset), function(x)
>>> XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
>>> dset2 <- DNAStringSet(dvec)
>>> names(dset2) <- paste("d", 1:length(dvec), sep="")
>>> names(myqual) <- paste("d", 1:length(dvec), sep="")
>>> dsetq2 <- QualityScaledDNAStringSet(dset2, myqual)
>>>
>>> ## Compare dsetq1 and dsetq2
>>> dsetq1[1:2]
>>> dsetq2[1:2]
>>>
>>>
>>>> sessionInfo()
>>> R version 2.9.0 (2009-04-17)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.12.1 IRanges_1.2.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.4.0
>>>
>>>
>>> Thanks in advance,
>>>
>>> Thomas
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list