[Bioc-sig-seq] a question about trimLRPatterns
Harris A. Jaffee
hj at jhu.edu
Wed Aug 31 18:50:16 CEST 2011
Sending back to the list, since others may be confused also.
On Aug 31, 2011, at 11:48 AM, wang peter wrote:
> DEAR HARRIS:
> I am shan, thank you very much for your kindly help.
> but i am still confused about the function of trimLRPatterns.
> like the example
> if i set
> > subject = "TTTACGT"
> > Lpattern = "TTTAACGT"
> the result is :
> > trimLRPatterns(Lpattern = Lpattern, subject = subject,
> max.Lmismatch=1,with.Lindels=TRUE)
> [1] ""
>
> but if i set
> > subject = "TTTACGT"
> > Lpattern = "AAATTTAACGT"
> the result is :
> > trimLRPatterns(Lpattern = Lpattern, subject = subject,
> max.Lmismatch=1,with.Lindels=TRUE)
> [1] "TTTACGT"
> how to explain it?
The issue is that max.Lmismatch is a vector that specifies your mismatch
tolerances for the successive match tests of the Lpattern suffixes at
the beginning of the subject. The vector is expected to be of length
nchar(Lpattern), with element max.Lmismatch[i] controlling the test for
the suffix of length i. If a shorter vector is supplied, as you did here
(you gave a vector of length 1), the function expands it to length
nchar(Lpattern) by filling with -1's at the *low end*. Your 1 therefore
becomes the last element of this vector in both cases above, where it
controls the test for the full Lpattern.
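The padding rule just described can be sketched in base R. This is only
an illustration of the behavior as explained here, not the Biostrings
code itself; pad_max_mismatch is a hypothetical helper name made up for
this example.

```r
# Hypothetical helper (not part of Biostrings): mimic the padding rule
# described above by filling a short max.Lmismatch with -1's at the
# low end, up to length nchar(Lpattern).
pad_max_mismatch <- function(max.Lmismatch, Lpattern) {
  n <- nchar(Lpattern)
  stopifnot(length(max.Lmismatch) <= n)
  c(rep(-1, n - length(max.Lmismatch)), max.Lmismatch)
}

pad_max_mismatch(1, "TTTAACGT")
# [1] -1 -1 -1 -1 -1 -1 -1  1
pad_max_mismatch(1, "AAATTTAACGT")
# [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1
```

Element i of the padded vector applies to the suffix of length i. Since
no alignment can have fewer than 0 edits, a -1 entry should effectively
switch off the test for that suffix.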
This 1 is sufficient for "TTTAACGT" to match "TTTACGT" in the context of
with.Lindels=TRUE, but it is not enough for "AAATTTAACGT" to match the
same subject. You would need 4 edits (deletions of A) for that:
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
                 max.Lmismatch=3, with.Lindels=TRUE)
[1] "TTTACGT"
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
                 max.Lmismatch=4, with.Lindels=TRUE)
[1] ""
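As a rough cross-check of these edit counts, base R's adist() (in the
utils package) gives the Levenshtein distance between two whole strings.
That is an assumed stand-in for illustration, not the test Biostrings
performs internally, but for these inputs it agrees with the counts
given here. The names d_short and d_long are just for this sketch.

```r
subject <- "TTTACGT"

# Levenshtein distance (insertions, deletions, substitutions) between
# each candidate Lpattern and the whole subject, via utils::adist().
d_short <- as.vector(adist("TTTAACGT", subject))
d_long  <- as.vector(adist("AAATTTAACGT", subject))

d_short   # 1: delete one A
d_long    # 4: delete four A's
```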
On the other hand, you can trim the entire subject a different way,
allowing for only 1 edit, by employing the 4th longest suffix of
Lpattern, namely "TTTAACGT". The commands below show that 1 edit is not
enough to trim the whole subject using the *3rd longest* Lpattern
suffix, namely "ATTTAACGT" (for which you would need 2 edits!):
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
                 max.Lmismatch=rep(1,3), with.Lindels=TRUE)
[1] "TTTACGT"
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
                 max.Lmismatch=rep(1,4), with.Lindels=TRUE)
[1] ""
# allows for 2 edits, for the 3 longest pattern suffixes:
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
                 max.Lmismatch=rep(2,3), with.Lindels=TRUE)
[1] ""
# shows exactly where the 2 is needed (for the 3rd longest suffix):
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
                 max.Lmismatch=c(2,0,0), with.Lindels=TRUE)
[1] ""
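To make the "k-th longest suffix" bookkeeping concrete, here is a small
base R sketch that lists the suffixes of Lpattern and their edit
distance to the subject. The helper kth_longest is hypothetical, and
adist() measures whole-string Levenshtein distance, which is only
assumed here to be a reasonable proxy for the test trimLRPatterns runs
with with.Lindels=TRUE.

```r
Lpattern <- "AAATTTAACGT"
subject  <- "TTTACGT"
n <- nchar(Lpattern)

# suffixes[i] is the suffix of length i; suffixes[n] is the full pattern.
suffixes <- substring(Lpattern, n - seq_len(n) + 1, n)

# Hypothetical helper: the k-th longest suffix has length n - k + 1.
kth_longest <- function(k) suffixes[n - k + 1]

kth_longest(3)   # "ATTTAACGT"
kth_longest(4)   # "TTTAACGT"

# Edit distance from each of the 4 longest suffixes to the subject.
dists <- as.vector(adist(kth_longest(1:4), subject))
dists            # 4 3 2 1
```

Read against the commands above: a tolerance of 1 first becomes
sufficient at the 4th longest suffix, and a tolerance of 2 at the 3rd
longest.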
To see the R code for trimLRPatterns, do
> showMethods("trimLRPatterns", includeDefs=TRUE)
and
> Biostrings:::.XStringSet.trimLRPatterns
and (for Lpattern)
> Biostrings:::.computeTrimStart
Also see ?which.isMatchingStartingAt
> and do you know how to read the c source code of trimLRPatterns
Start with the function XString_match_pattern_at() in
Biostrings/src/lowlevel_matching.c.
It is called by .matchPatternAt() in R/lowlevel-matching.R.
> thank u very much
> shan gao