[BioC] A problem about trimLRPatterns
Harris A. Jaffee
hj at jhu.edu
Fri Mar 2 01:00:27 CET 2012
I'm forwarding back to the list so that others can check my answer.
What is going on is something to this effect.
subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA"
pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG"
sapply(15:nchar(subject), function(j) {
s = substr(subject, j, nchar(subject))
p = substr(pattern, 1, nchar(subject)-j+1)
neditEndingAt(ending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE)
})
[1] 8 7 6 5 4 3 2 1 0 2 4 6 8 10 11 11 10 9 8 8 9 8 7 7 6
[26] 5 5 4 3 2 4 3 2 1
(I started my sapply at j=15 to reduce the noise for j less than 15. You
can start at j=1 if you like. Those edit values will be larger than 8 and
thus too large given your limits.)
When you say "should be", you are focusing on the smallest number (0) in this
vector, which I admit reflects the smallest edit distance between the strings
s and p, as s and p vary with j, with j going from 15 to nchar(subject).
But trimLRPatterns is designed to perform the longest trim possible given your
mismatch limits, even if that is not the one with minimal edit distance.
I think you will find that edit distance 8 (the first element in the vector,
corresponding to j=15) is within your mismatch limits. This is why the function
trims from position 15 on, leaving the 14-character result, "TATAGTAGATATTG".
Does that explain it?
On Mar 1, 2012, at 5:48 PM, wang peter wrote:
> can you try this sample
> the adapter is PCR2rc CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG
>
> the function is
> max.mismatchs <- 0.25*1:nchar(DNAString(PCR2rc))
> trimmedCoords <- trimLRPatterns(Rpattern = PCR2rc, subject =
> sread(highQuaReads), max.Rmismatch= max.mismatchs,
> with.Rindels=T,ranges=T)
>
> highQuaReads is one read:
>
> @HWI-ST132:506:D0CNUABXX:3:1101:1576:1834 1:N:0:TTCACA
> TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA
> +
> :BDDFDHHGHHJJJIIJIIIIEIGIJJJBGHIFEEH9EGIBHEHG6:8
>
>
> and the results is
>
> @HWI-ST132:506:D0CNUABXX:3:1101:1576:1834 1:N:0:TTCACA
> TATAGTAGATATTG
> +
> :BDDFDHHGHHJJJ
>
>
> u see
> it should be
> TATAGTAGATATTGGAATAGTA
>
> i have many samples like this
>
> --
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253
More information about the Bioconductor
mailing list