[Bioc-sig-seq] trimLRPatterns vs vmatchPattern

Tue Jan 26 15:55:49 CET 2010

You can get the with.Lindels=TRUE behavior you want by reversing your  
data
and calling trimLRPatterns with with.Rindels=TRUE and an Rpattern  
equal to
the reverse of your Lpattern, and then reversing the results.

But the behavior you like with indels might be considered a bug, or  
at least
something that trimLRPatterns shouldn't do, in the sense that the  
match you
like isn't flanking, so might be artificial.  Also, setting the  
max.mismatch
values is difficult because one can't know in advance how many  
letters of the
current pattern will extend off the edge of the subject, which  
depends on the
match, i.e. how many inserts into the pattern will be required.   
Worse, "the"
match-with-indels is not at all unique.  In my mind, it also isn't  
clear that
Herve's 'best local match' (shortest one with the minimal edit  
distance) is
what trimLRPatterns wants.  It might want the longest match with the  
minimal
edit distance, the longest match with maximal allowed edit distance,  
etc.

I have a version that demands that matches-with-indels start at the  
beginning
of the subject, for Lpattern, and end at the end of the subject, for  
Rpattern.
[The ending part would also not be implemented yet, if I needed  
Proffset, but
I do it by the reversing trick.]  I trim by the length of the current  
pattern
rather than the length of the match.  This is a one-size-fits-all  
solution,
since the match could be shorter or longer.  Maybe I should trim by  
the length
of the pattern plus the number of allowed edits.  The max.mismatch  
limits are
less circular, so easier to set.

I was about to submit a patch, but now it's open to debate.

On Jan 25, 2010, at 9:54 PM, Marcus Davy wrote:

> Hi,
> I have some 454 data which is expected to contain a primer at the  
> beginning.
>
> I noticed that using trimLRPatterns to trim an Lpattern at the 5'  
> start of a
> sequence with 1 mismatch does not allow matches to the subject  
> sequence up
> to the number of mismatches directly to the right of the sequence  
> start (and
> also to the left of an Rpattern at the 3' end).
>
> So if my example primer Lpattern is 23 bases in length,  
> trimLRPatterns with
> 1 mismatch will match all sequences at positions 0 to 22 and 1 to  
> 23, but
> not 2 to 24 whereas if you use vmatchPattern, it will find all  
> three of
> these positions.
>
> My question is should an option/arg in trimLRPatterns be made  
> available that
> allows matches to the right of the Lpattern up to the number of  
> mismatches,
> and to the left of the Rpattern up to the number of mismatches?
>
> I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not  
> been
> enabled yet in Biostrings_2.14.8) may partially resolve this if the
> insertion is somewhere at the beginning of a subject sequence   
> (within the
> number of mismatches allowed) which can then be matched by the  
> Lpattern.
>
>
>> mismatches <- 1
>> pattern    <- "AAGCAGTGGTATCAACGCAGAGT"
>> w          <- width(pattern)
>> mm         <- rep(mismatches, w)
>> base       <- "G"
>> n          <- 20
>>
>> subjectList <- list(
> +                     "Primer0+poly(T)" = paste(substring(pattern, 
> 2,w),
> polyn(base, n), sep=""),
> +                     "Primer1+poly(T)"  = paste(pattern, polyn 
> (base, n),
> sep=""),
> +                     "Primer2+poly(T)" = paste("A", substring 
> (pattern,1,w),
> polyn(base,n), sep="")
> +                     )
>>
>> subjectSet <- DNAStringSet(unlist(subjectList))
>>
>> print(subjectSet)
>   A DNAStringSet instance of length 3
>     width seq                                               names
>
> [1]    42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG        Primer0 
> +poly(T)
> [2]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG       Primer1 
> +poly(T)
> [3]    44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG      Primer2 
> +poly(T)
>> cat("Primer:  ", pattern, "\n")
> Primer:  AAGCAGTGGTATCAACGCAGAGT
>>
>> ## trimLRPatterns -LPattern
>> LRcoords     <- trimLRPatterns(Lpattern = pattern, subject =  
>> subjectSet,
>  max.Lmismatch=mm, ranges=TRUE, with.Lindels=FALSE)
>> ## Fails to match subjectSet[3:4] - trimming with mismatches is to  
>> the
> left of Lpattern (and to the right of Rpattern)
>> DNAStringSet(subjectSet,start(LRcoords), end(LRcoords))
>   A DNAStringSet instance of length 3
>     width seq                                               names
>
> [1]    20 GGGGGGGGGGGGGGGGGGGG                              Primer0 
> +poly(T)
> [2]    20 GGGGGGGGGGGGGGGGGGGG                              Primer1 
> +poly(T)
> [3]    43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG       Primer2 
> +poly(T)
>>
>> ## vmatchPattern
>> tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
>> matchIndex <- which(as.logical(countIndex(tmp)))
>> ## only one match per sequence so indices preserved with unlist
>> VMcoords <- unlist(tmp)
>> ## Matches to subjectSet[3]
>> DNAStringSet(subjectSet[matchIndex],end(VMcoords)+1,
> width(subjectSet)[matchIndex])
>   A DNAStringSet instance of length 3
>     width seq                                               names
>
> [1]    20 GGGGGGGGGGGGGGGGGGGG                              Primer0 
> +poly(T)
> [2]    20 GGGGGGGGGGGGGGGGGGGG                              Primer1 
> +poly(T)
> [3]    20 GGGGGGGGGGGGGGGGGGGG                              Primer2 
> +poly(T)
>
> cheers,
>
>
>  Marcus
>
>
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> powerpc-apple-darwin8.11.1
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
> [4] ShortRead_1.4.0     lattice_0.17-26     BSgenome_1.14.2
> [7] Biostrings_2.14.8   IRanges_1.4.9
>
> loaded via a namespace (and not attached):
> [1] DBI_0.2-4     RSQLite_0.7-3 grid_2.10.0   hwriter_1.1    
> tools_2.10.0
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing