[Bioc-sig-seq] trimLRPatters: inconstitency re max.[LR]mismatch
Simon Anders
anders at embl.de
Wed Feb 17 14:53:45 CET 2010
Hi Patrick
I've just tried to trim adaptors of my Solexa reads and got a bit
puzzled about trimLRPatterns's max.Rmismatch argument.
Let's say I have two sequences as follows:
> library(ShortRead)
[...]
> subject <- DNAString( "CGCCGGCCGAAATTTAA" )
> pattern <- DNAString( "AAATTTAAATTTAAATTT" )
These have a clear overlapping alignment without mismatch:
CGCCGGCCGAAATTTAA <- subject
AAATTTAAATTTAAATTT <- (right) pattern
CGCCGGCCG <- trimmed subject
So, I would expect trimLRpattern to trim it to the given "trimmed
subject" sequence. However:
> trimLRPatterns( subject=subject, Rpattern=pattern )
17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA
As you can see, the subject untrimmed.
If I now specify allow for 10% mismatch, trimming works:
> trimLRPatterns( subject=subject, Rpattern=pattern, max.Rmismatch=.1 )
9-letter "DNAString" instance
seq: CGCCGGCCG
Why do I need to allow for mismatches?
This here is even a bit stranger: COmpare the result of allowing for
very small mismatches, say 10^-10, with exactly 0:
> trimLRPatterns( subject=subject, Rpattern=pattern,
+ max.Rmismatch=1e-10 )
9-letter "DNAString" instance
seq: CGCCGGCCG
> trimLRPatterns( subject=subject, Rpattern=pattern,
+ max.Rmismatch=0 )
17-letter "DNAString" instance
seq: CGCCGGCCGAAATTTAA
This is a bit dis-continuous, isn't it? ;-)
Cheers
Simon
> sessionInfo()
R version 2.11.0 Under development (unstable) (--)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.5.15 lattice_0.17-26 BSgenome_1.15.4
Biostrings_2.15.21
[5] IRanges_1.5.46
loaded via a namespace (and not attached):
[1] Biobase_2.7.2 grid_2.11.0 hwriter_1.1 tools_2.11.0
+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de
More information about the Bioc-sig-sequencing
mailing list