[Bioc-sig-seq] trimLRPatters: inconstitency re max.[LR]mismatch
Harris A. Jaffee
hj at jhu.edu
Wed Feb 17 19:56:28 CET 2010
Absolutely!
On Feb 3, 2010, at 2:57 PM, Harris A. Jaffee wrote:
> <patch_for_trimLRPatterns>
> This is my solution. It vastly simplifies things, supports indels
> on both sides,
> and corrects the behavior of trimming based on non-flanking
> matches. All matching
> is now tied to the edge of the subject (i.e., starting.at=1 and
> going forward, or
> ending.at = "the end" and extending back). Trimming is now by the
> length of the
> longest acceptably matching pattern rather than based on any match,
> of which there
> can of course be many within the edit tolerance. Setting the
> tolerances with indels
> allowed is now clear enough from the man page that even I
> understand it, whereas
> before I had no chance! ...
> I did NOT change the default R/L-mismatch defaults of 0, but I
> wanted to. I
> believe 0 should come under the realm of an "error rate" (currently
> required to be
> >0 and <1). So, I suggest that 0 becomes an all zero vector of the
> length of the
> R/L-pattern, rather than a bunch of -1's plus one 0 at the end.
So, I would change the man page (and its supporting code) to
something like
A single numeric value in '[0, 1)' is taken as a "rate" ...
This would break backward compatibility, but I view it more as fixing
a buggy
spec/implementation.
The indels story is a little different. I view that as a faulty
implementation,
which I believe I fixed, at least partially, of a good spec (trimming
*flanking*
patterns). "partially" gets complicated, so I won't explain here...
-Harris
On Feb 17, 2010, at 1:04 PM, Patrick Aboyoun wrote:
> Simon,
> Harris has been spending the most time thinking and working on this
> issue. I am including him in this e-mail.
>
> Harris,
> Do you think we should change the default behavior? I know you
> mentioned this is a previous e-mail and if users are not getting
> the results they are expecting, perhaps now is the time to bite the
> bullet (by breaking backwards compatibility) and change how 0 is
> interpreted by the max.[LR]mismatch parameter.
>
>
> Patrick
>
>
> On 2/17/10 5:53 AM, Simon Anders wrote:
>> Hi Patrick
>>
>> I've just tried to trim adaptors of my Solexa reads and got a bit
>> puzzled about trimLRPatterns's max.Rmismatch argument.
>>
>> Let's say I have two sequences as follows:
>>
>> > library(ShortRead)
>> [...]
>> > subject <- DNAString( "CGCCGGCCGAAATTTAA" )
>> > pattern <- DNAString( "AAATTTAAATTTAAATTT" )
>>
>> These have a clear overlapping alignment without mismatch:
>>
>> CGCCGGCCGAAATTTAA <- subject
>> AAATTTAAATTTAAATTT <- (right) pattern
>>
>> CGCCGGCCG <- trimmed subject
>>
>> So, I would expect trimLRpattern to trim it to the given "trimmed
>> subject" sequence. However:
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern )
>> 17-letter "DNAString" instance
>> seq: CGCCGGCCGAAATTTAA
>>
>> As you can see, the subject untrimmed.
>>
>> If I now specify allow for 10% mismatch, trimming works:
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern,
>> max.Rmismatch=.1 )
>> 9-letter "DNAString" instance
>> seq: CGCCGGCCG
>>
>> Why do I need to allow for mismatches?
>>
>>
>> This here is even a bit stranger: COmpare the result of allowing
>> for very small mismatches, say 10^-10, with exactly 0:
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern,
>> + max.Rmismatch=1e-10 )
>> 9-letter "DNAString" instance
>> seq: CGCCGGCCG
>>
>> > trimLRPatterns( subject=subject, Rpattern=pattern,
>> + max.Rmismatch=0 )
>> 17-letter "DNAString" instance
>> seq: CGCCGGCCGAAATTTAA
>>
>> This is a bit dis-continuous, isn't it? ;-)
>>
>>
>> Cheers
>> Simon
>>
>> > sessionInfo()
>> R version 2.11.0 Under development (unstable) (--)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] ShortRead_1.5.15 lattice_0.17-26 BSgenome_1.15.4
>> Biostrings_2.15.21
>> [5] IRanges_1.5.46
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.7.2 grid_2.11.0 hwriter_1.1 tools_2.11.0
>>
>>
>> +---
>> | Dr. Simon Anders, Dipl.-Phys.
>> | European Molecular Biology Laboratory (EMBL), Heidelberg
>> | office phone +49-6221-387-8632
>> | preferred (permanent) e-mail: sanders at fs.tum.de
>
More information about the Bioc-sig-sequencing
mailing list