[Bioc-sig-seq] Fast inject into XStringSet objects
Patrick Aboyoun
paboyoun at fhcrc.org
Wed May 20 03:24:54 CEST 2009
Herve,
The trimLRPatterns already can do (1) below so you can tick that off the
TODO list (This is because there is a narrow method for
QualityScaledXStringSet):
> suppressMessages(library(Biostrings))
> data(srPhiX174)
> x <- QualityScaledDNAStringSet(srPhiX174, SolexaQuality(quPhiX174))
> head(x)
A QualityScaledDNAStringSet instance containing:
A DNAStringSet instance of length 6
width seq
[1] 35 GTTATTATACCGTCAAGGACTGTGTGACTATTGAC
[2] 35 GGTGGTTATTATACCGTCAAGGACTGTGTGACTAT
[3] 35 TACCGTCAAGGACTGTGTGACTATTGACGTCCTTC
[4] 35 GTACGCCGGGCAATAATGTTTATGTTGGTTTCATG
[5] 35 GGTTTCATGGTTTGGTCTAACTTTACCGCTACTAA
[6] 35 GGGCAATAATGTTTATGTTGGTTTCATGGTTTGGT
A SolexaQuality instance of length 6
width seq
[1] 35 ZYZZZZZZZZZYYZZYYYYYYYYYYYYYYYYYQYY
[2] 35 ZZYZZYZZZZYYYYYYYYYYYYYYYYYYYVYYYTY
[3] 35 ZZZYZYYZYYZYYZYYYYYYYYYYYYYYVYYYYYY
[4] 35 ZZYZZZZZZZZZYZTYYYYYYYYYYYYYYYYYNYT
[5] 35 ZZZZZZYZYYZZZYYYYYYYYYYYYYYYYYSYYSY
[6] 35 ZZZZZYZYYYZYYZYYYYYYYYYYYSYQVYYYASY
> head(trimLRPatterns(Lpattern = "GT", Rpattern = "AT", subject = x))
A QualityScaledDNAStringSet instance containing:
A DNAStringSet instance of length 6
width seq
[1] 33 TATTATACCGTCAAGGACTGTGTGACTATTGAC
[2] 33 GGTGGTTATTATACCGTCAAGGACTGTGTGACT
[3] 35 TACCGTCAAGGACTGTGTGACTATTGACGTCCTTC
[4] 33 ACGCCGGGCAATAATGTTTATGTTGGTTTCATG
[5] 35 GGTTTCATGGTTTGGTCTAACTTTACCGCTACTAA
[6] 35 GGGCAATAATGTTTATGTTGGTTTCATGGTTTGGT
A SolexaQuality instance of length 6
width seq
[1] 33 ZZZZZZZZZYYZZYYYYYYYYYYYYYYYYYQYY
[2] 33 ZZYZZYZZZZYYYYYYYYYYYYYYYYYYYVYYY
[3] 35 ZZZYZYYZYYZYYZYYYYYYYYYYYYYYVYYYYYY
[4] 33 YZZZZZZZZZYZTYYYYYYYYYYYYYYYYYNYT
[5] 35 ZZZZZZYZYYZZZYYYYYYYYYYYYYYYYYSYYSY
[6] 35 ZZZZZYZYYYZYYZYYYYYYYYYYYSYQVYYYASY
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-05-08 r48504)
i386-apple-darwin9.6.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.13.8 IRanges_1.3.12
loaded via a namespace (and not attached):
[1] Biobase_2.5.2 tools_2.10.0
Hervé Pagès wrote:
> Thomas,
>
> Thomas Girke wrote:
>> Dear Herve,
>>
>> Thanks a lot for your suggestions and efforts to optimize this step!
>>
>> We are often using sequences of variable length (e.g. after adaptor
>> trimming), but this shouldn't be a problem because we can simply loop
>> over every length interval (usually less than 100 iterations!).
>
> Yes I could extend replaceLetterAt() so it supports DNAStringSet objects
> of arbitrary shape. In that case, two formats for 'at' would be natural:
> "list of logical vectors" or "list of integer vectors". Then you could do
> something like this:
>
> (1) Trim the adaptor:
>
> trimming <- trimLRPatterns(..., dset, ranges=TRUE)
> trimmed_dset <- narrow(dset, start=start(trimming),
> end=end(trimming))
> trimmed_myqual <- narrow(myqual,
> start=start(trimming),
> end=end(trimming))
>
> (2) Build the 'at' object to be used in replaceLetterAt():
>
> flat_quals <- charToRaw(as.character(unlist(trimmed_myqual)))
> flat_at <- flat_quals < charToRaw(as.character(PhredQuality(20L)))
> at <- split(flat_at,
> rep.int(seq_len(length(trimmed_myqual)),
> width(trimmed_myqual)))
>
> (3) Build the 'letter' object to be used in replaceLetterAt():
>
> letter_subject <-
> DNAString(paste(rep.int("N", max(width(trimmed_dset))),
> collapse=""))
> letter <- as(Views(letter_subject, start=1, end=sapply(at, sum)),
> "DNAStringSet")
>
> (4) Call replaceLetterAt():
>
> dset4 <- replaceLetterAt(trimmed_dset, at, letter)
>
> Also some convenience could be added to Biostrings to support this use
> case
> by wrapping some of the steps above into utility functions e.g.
> o (1) could be done in a single call if we had a "trimLRPatterns"
> method for QualityScaledDNAStringSet objects (which is easy
> to add),
> o (2) and (3) could be wrapped too but I don't have good names for
> the wrappers and am not sure about their arguments either. I would
> need to think a little bit about it.
>
> What do you think? Would that work?
>
>>
>> One short question:
>> Are there any plans to support masks in the future on XStringSet
>> objects? If one could define and apply masks to sequences of variable
>> length in XStringSet objects (without loops) then this would be be very
>> useful for many applications.
>
> No plans to support *soft* masking of XStringSet objects in the near
> future because it's a too complicated thing to put in place, at least
> for now. However *hard* masking (i.e. alteration of an XStringSet object
> by injection of the "-" letter at some locations) is much easier to
> support because 99% of the times you don't need to touch the
> implementation
> of the functions that operate on XStringSet objects to make them work
> on a hard-masked XStringSet object. They just keep working just fine.
> The downside of hard masking is that every time the hard mask needs to be
> modified, you need to make a new copy of the XStringSet object. But we
> first need to see use cases where this becomes really an issue before
> we start working on *soft* masking of XStringSet objects.
>
> Cheers,
> H.
>
>
>>
>> Many thanks again,
>> Thomas
>>
>> On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
>>> Hi Thomas,
>>>
>>> Sorry for the slow response on this.
>>>
>>> Here is a much faster solution to your problem. Depending on the
>>> size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
>>> faster than the sapply-based solution. Let's assume 'dset' is the
>>> constant width DNAStringSet object containing your reads and 'myqual'
>>> is the constant width PhredQuality object containing the qualities
>>> of your reads.
>>> The following solution takes advantage of the fact that 'dset' and
>>> therefore 'myqual' are rectangular, hence 'myqual' can be turned into
>>> a raw matrix pretty efficiently:
>>>
>>> myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
>>> nrow=length(myqual), byrow=TRUE)
>>>
>>> This raw matrix itself can be turned into a boolean matrix using some
>>> cutoff value:
>>>
>>> at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))
>>>
>>> Also I've just added a "replaceLetterAt" method for DNAStringSet
>>> objects
>>> to Biostrings (in addition to the existing method for DNAString
>>> objects).
>>> Currently, it has the limitation that 'x' and 'at' must be rectangular
>>> i.e. 'x' must have a constant width and 'at' must be a logical matrix
>>> with the same dimensions as 'x'. That's of course just what's needed to
>>> handle your usecase:
>>>
>>> letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]),
>>> collapse=""))
>>> letter <- as(Views(letter_subject, start=1, end=rowSums(at)),
>>> "DNAStringSet")
>>> dset3 <- replaceLetterAt(dset, at, letter)
>>>
>>> 'dset3' is the same as the 'dset2' object you obtained with the
>>> sapply-based
>>> solution.
>>>
>>> This improved replaceLetterAt() function will be available in
>>> Biostrings
>>> release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the
>>> public
>>> repositories (in the next 24 hours).
>>>
>>> Please let me know if you have any other concerns.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> Thomas Girke wrote:
>>>> Dear Developers,
>>>>
>>>> Is there an efficient method available for replacing base calls in
>>>> sequences by Ns (or any other character) at positions where the
>>>> quality score drops below some arbitrary threshold. So far we have
>>>> been using the following code for this purpose. In general this
>>>> approach works fine. However, the R loop for applying this function
>>>> makes it slow for very large data sets with millions of sequences.
>>>> Is there a method available for achieving the same result more
>>>> efficiently? Considering the available resources in Biostrings, I
>>>> am wondering if it wouldn't make a lot of sense of using
>>>> Biostrings' sequence masking method along with the injectHardMask
>>>> function for this purpose. However, at this point I am only aware
>>>> of methods for applying masks efficiently to a single sequence in a
>>>> XString object, but not to many sequences stored in a XStringSet
>>>> object.
>>>> #################
>>>> ## Sample Code ##
>>>> #################
>>>> ## Create example data set with random data
>>>> library(Biostrings)
>>>> dset <- DNAStringSet(sapply(1:100, function(x)
>>>> paste(sample(c("A","T","G","C"), 20, replace=T), collapse="")))
>>>> myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) #
>>>> Create random Phred score list.
>>>> myqual <- sapply(myqlist, function(x) toString(PhredQuality(x)))
>>>> myqual <- PhredQuality(myqual) dsetq1 <-
>>>> QualityScaledDNAStringSet(dset, myqual)
>>>> ## Define in character inject function XStringSetInject
>>>> XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1],
>>>> cutoff=20, mychar="N") { mypos <- which(as.integer(myqual)<cutoff)
>>>> return(toString(replaceLetterAt(myseq,
>>>> which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
>>>> }
>>>> ## Apply XStringSetInject function to all sequences.
>>>> dvec <- sapply(1:length(dset), function(x)
>>>> XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
>>>> dset2 <- DNAStringSet(dvec)
>>>> names(dset2) <- paste("d", 1:length(dvec), sep="")
>>>> names(myqual) <- paste("d", 1:length(dvec), sep="")
>>>> dsetq2 <- QualityScaledDNAStringSet(dset2, myqual)
>>>> ## Compare dsetq1 and dsetq2
>>>> dsetq1[1:2] dsetq2[1:2]
>>>>
>>>>> sessionInfo()
>>>> R version 2.9.0 (2009-04-17) x86_64-unknown-linux-gnu
>>>> locale:
>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>>
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods
>>>> base
>>>> other attached packages:
>>>> [1] Biostrings_2.12.1 IRanges_1.2.0
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.4.0
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> Thomas
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>
More information about the Bioc-sig-sequencing
mailing list