[Bioc-sig-seq] trimLRPatters: inconstitency re max.[LR]mismatch

Wed Feb 17 19:04:14 CET 2010

Simon,
Harris has been spending the most time thinking and working on this 
issue. I am including him in this e-mail.

Harris,
Do you think we should change the default behavior? I know you mentioned 
this is a previous e-mail and if users are not getting the results they 
are expecting, perhaps now is the time to bite the bullet (by breaking 
backwards compatibility) and change how 0 is interpreted by the 
max.[LR]mismatch parameter.

Patrick

On 2/17/10 5:53 AM, Simon Anders wrote:
> Hi Patrick
>
> I've just tried to trim adaptors of my Solexa reads and got a bit 
> puzzled about trimLRPatterns's max.Rmismatch argument.
>
> Let's say I have two sequences as follows:
>
> > library(ShortRead)
> [...]
> > subject <- DNAString( "CGCCGGCCGAAATTTAA" )
> > pattern <- DNAString( "AAATTTAAATTTAAATTT" )
>
> These have a clear overlapping alignment without mismatch:
>
>    CGCCGGCCGAAATTTAA <- subject
>             AAATTTAAATTTAAATTT <- (right) pattern
>
>    CGCCGGCCG <- trimmed subject
>
> So, I would expect trimLRpattern to trim it to the given "trimmed 
> subject" sequence. However:
>
> > trimLRPatterns( subject=subject, Rpattern=pattern )
>   17-letter "DNAString" instance
> seq: CGCCGGCCGAAATTTAA
>
> As you can see, the subject untrimmed.
>
> If I now specify allow for 10% mismatch, trimming works:
>
> > trimLRPatterns( subject=subject, Rpattern=pattern, max.Rmismatch=.1 )
>   9-letter "DNAString" instance
> seq: CGCCGGCCG
>
> Why do I need to allow for mismatches?
>
>
> This here is even a bit stranger: COmpare the result of allowing for 
> very small mismatches, say 10^-10, with exactly 0:
>
> > trimLRPatterns( subject=subject, Rpattern=pattern,
> +    max.Rmismatch=1e-10 )
>   9-letter "DNAString" instance
> seq: CGCCGGCCG
>
> > trimLRPatterns( subject=subject, Rpattern=pattern,
> +    max.Rmismatch=0 )
>   17-letter "DNAString" instance
> seq: CGCCGGCCGAAATTTAA
>
> This is a bit dis-continuous, isn't it? ;-)
>
>
> Cheers
>   Simon
>
> > sessionInfo()
> R version 2.11.0 Under development (unstable) (--)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.5.15   lattice_0.17-26    BSgenome_1.15.4 
> Biostrings_2.15.21
> [5] IRanges_1.5.46
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.2 grid_2.11.0   hwriter_1.1   tools_2.11.0
>
>
> +---
> | Dr. Simon Anders, Dipl.-Phys.
> | European Molecular Biology Laboratory (EMBL), Heidelberg
> | office phone +49-6221-387-8632
> | preferred (permanent) e-mail: sanders at fs.tum.de