[Bioc-sig-seq] trimLRPatterns vs vmatchPattern
Harris A. Jaffee
hj at jhu.edu
Tue Jan 26 15:55:49 CET 2010
You can get the with.Lindels=TRUE behavior you want by reversing your
data
and calling trimLRPatterns with with.Rindels=TRUE and an Rpattern
equal to
the reverse of your Lpattern, and then reversing the results.
But the behavior you like with indels might be considered a bug, or
at least
something that trimLRPatterns shouldn't do, in the sense that the
match you
like isn't flanking, so might be artificial. Also, setting the
max.mismatch
values is difficult because one can't know in advance how many
letters of the
current pattern will extend off the edge of the subject, which
depends on the
match, i.e. how many inserts into the pattern will be required.
Worse, "the"
match-with-indels is not at all unique. In my mind, it also isn't
clear that
Herve's 'best local match' (shortest one with the minimal edit
distance) is
what trimLRPatterns wants. It might want the longest match with the
minimal
edit distance, the longest match with maximal allowed edit distance,
etc.
I have a version that demands that matches-with-indels start at the
beginning
of the subject, for Lpattern, and end at the end of the subject, for
Rpattern.
[The ending part would also not be implemented yet, if I needed
Proffset, but
I do it by the reversing trick.] I trim by the length of the current
pattern
rather than the length of the match. This is a one-size-fits-all
solution,
since the match could be shorter or longer. Maybe I should trim by
the length
of the pattern plus the number of allowed edits. The max.mismatch
limits are
less circular, so easier to set.
I was about to submit a patch, but now it's open to debate.
On Jan 25, 2010, at 9:54 PM, Marcus Davy wrote:
> Hi,
> I have some 454 data which is expected to contain a primer at the
> beginning.
>
> I noticed that using trimLRPatterns to trim an Lpattern at the 5'
> start of a
> sequence with 1 mismatch does not allow matches to the subject
> sequence up
> to the number of mismatches directly to the right of the sequence
> start (and
> also to the left of an Rpattern at the 3' end).
>
> So if my example primer Lpattern is 23 bases in length,
> trimLRPatterns with
> 1 mismatch will match all sequences at positions 0 to 22 and 1 to
> 23, but
> not 2 to 24 whereas if you use vmatchPattern, it will find all
> three of
> these positions.
>
> My question is should an option/arg in trimLRPatterns be made
> available that
> allows matches to the right of the Lpattern up to the number of
> mismatches,
> and to the left of the Rpattern up to the number of mismatches?
>
> I think the arg e.g. 'with.Lindels=TRUE' (which appears to have not
> been
> enabled yet in Biostrings_2.14.8) may partially resolve this if the
> insertion is somewhere at the beginning of a subject sequence
> (within the
> number of mismatches allowed) which can then be matched by the
> Lpattern.
>
>
>> mismatches <- 1
>> pattern <- "AAGCAGTGGTATCAACGCAGAGT"
>> w <- width(pattern)
>> mm <- rep(mismatches, w)
>> base <- "G"
>> n <- 20
>>
>> subjectList <- list(
> + "Primer0+poly(T)" = paste(substring(pattern,
> 2,w),
> polyn(base, n), sep=""),
> + "Primer1+poly(T)" = paste(pattern, polyn
> (base, n),
> sep=""),
> + "Primer2+poly(T)" = paste("A", substring
> (pattern,1,w),
> polyn(base,n), sep="")
> + )
>>
>> subjectSet <- DNAStringSet(unlist(subjectList))
>>
>> print(subjectSet)
> A DNAStringSet instance of length 3
> width seq names
>
> [1] 42 AGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer0
> +poly(T)
> [2] 43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer1
> +poly(T)
> [3] 44 AAAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer2
> +poly(T)
>> cat("Primer: ", pattern, "\n")
> Primer: AAGCAGTGGTATCAACGCAGAGT
>>
>> ## trimLRPatterns -LPattern
>> LRcoords <- trimLRPatterns(Lpattern = pattern, subject =
>> subjectSet,
> max.Lmismatch=mm, ranges=TRUE, with.Lindels=FALSE)
>> ## Fails to match subjectSet[3:4] - trimming with mismatches is to
>> the
> left of Lpattern (and to the right of Rpattern)
>> DNAStringSet(subjectSet,start(LRcoords), end(LRcoords))
> A DNAStringSet instance of length 3
> width seq names
>
> [1] 20 GGGGGGGGGGGGGGGGGGGG Primer0
> +poly(T)
> [2] 20 GGGGGGGGGGGGGGGGGGGG Primer1
> +poly(T)
> [3] 43 AAGCAGTGGTATCAACGCAGAGTGGGGGGGGGGGGGGGGGGGG Primer2
> +poly(T)
>>
>> ## vmatchPattern
>> tmp <- vmatchPattern(pattern, subjectSet, max.mismatch=mismatches)
>> matchIndex <- which(as.logical(countIndex(tmp)))
>> ## only one match per sequence so indices preserved with unlist
>> VMcoords <- unlist(tmp)
>> ## Matches to subjectSet[3]
>> DNAStringSet(subjectSet[matchIndex],end(VMcoords)+1,
> width(subjectSet)[matchIndex])
> A DNAStringSet instance of length 3
> width seq names
>
> [1] 20 GGGGGGGGGGGGGGGGGGGG Primer0
> +poly(T)
> [2] 20 GGGGGGGGGGGGGGGGGGGG Primer1
> +poly(T)
> [3] 20 GGGGGGGGGGGGGGGGGGGG Primer2
> +poly(T)
>
> cheers,
>
>
> Marcus
>
>
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> powerpc-apple-darwin8.11.1
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] hgu95av2probe_2.5.0 AnnotationDbi_1.8.1 Biobase_2.6.1
> [4] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.2
> [7] Biostrings_2.14.8 IRanges_1.4.9
>
> loaded via a namespace (and not attached):
> [1] DBI_0.2-4 RSQLite_0.7-3 grid_2.10.0 hwriter_1.1
> tools_2.10.0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list