[Bioc-sig-seq] trimLRPatterns question
Harris A. Jaffee
hj at jhu.edu
Sat Oct 30 04:14:35 CEST 2010
On Oct 29, 2010, at 6:17 PM, Kunbin Qu wrote:
> Is there a way to NOT trim a segment in the pattern from
> trimLRPatterns? For example:
>
> t1<-DNAString("AAAATGGTAC")
> t2<-DNAString("TGGTAC")
> t3<-trimLRPatterns(Rpattern=t2, subject=t1, max.Rmismatch = rep
> (0,4), ranges=T, with.Rindels=TRUE)
> t4<-DNAStringSet(t1, start=start(t3), end=end(t3))
>
>> t4
> A DNAStringSet instance of length 1
> width seq
> [1] 4 AAAA
>> t1
> 10-letter "DNAString" instance
> seq: AAAATGGTAC
>> t2
> 6-letter "DNAString" instance
> seq: TGGTAC
>>
>
> If I want to keep "TG" in t1, followed by the AAAA, which also
> appears in the beginning of the pattern t2, is there a way? The rep
> (0,4) does not help since it could find a perfect match at length 6
> (the first try when starting to trim). I am trying to use "TG" as
> my anchor points to avoid false over-trimming. Thanks.
If you mean to allow trimming a pattern like "TG.." in letters 7-10
or less of your subject, even
if TG occurs in letters 5-6, as in your t1, then you might want
Rpattern="TGGT", and max.Rmismatch
should have length 4 (or less, if you want to prevent trimming less
than 4 letters).
If I didn't get the point, a longer answer:
Usually, I over-trimmed when I set the mismatch limits too high
(especially with indels allowed),
probably when I used the "rate" feature for the mismatch parameter,
and that resulted in certain
mismatch elements being larger than I wanted. But you have zero
mismatches allowed, and your whole
Rpattern occurs exactly, as the final 6 letters of your subject. So
your example is a good example
of what the function does (except that you don't need to allow
indels, and you could get your t4 as
the value of trimLRPatterns by using ranges=F, the default). You can
put -1 in the mismatch vector,
for example
M = some integer
trimLRPatterns(Rpattern=t2, subject=t1, max.Rmismatch = c(rep(-1,5),
M))
This allows the 6-letter suffix of your subject to be trimmed if it
matches your Rpattern with M
or less errors, but it prevents trimming by anything shorter.
Actually, your rep(0,4) is the same
as c(rep(-1,2), rep(0,4)). That is, trimming by 6 or 5 letters is
allowed, but not by less. Thus,
> trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",
max.Rmismatch=0)
[1] "AAAA"
> trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",
max.Rmismatch=c(rep(-1,5), 0))
[1] "AAAA"
You can even put -1 as the 6th element of max.Rmismatch, but then you
might as well start with
Rpattern="TGGTA"; or -1 in positions 5 and 6, so, Rpattern="TGGT".
More information about the Bioc-sig-sequencing
mailing list