[Bioc-sig-seq] Counting mismatches with vmatchPattern

Hervé Pagès hpages at fhcrc.org
Thu Aug 12 22:16:58 CEST 2010


Hi Marcus,

On 08/09/2010 09:14 PM, Marcus Davy wrote:
> Hi,
>
> I am using vmatchPattern, to match a pattern against many subjects, I would
> like to count the actual numbers
> of mismatched bases for each successful match, where max.mismatch is set to
> be>0.
> I have found the functions 'nmatch', and 'nmismatch' which work with
> matchPattern,
>
> e.g.
> pattern<- "AATTCGGCACGAGG"
> subject<- DNAString("TTCGGCACGAGGTT")
> tmp<- matchPattern(pattern, subject, max.mismatch=2)
> ## Two mismatches
> print(nmismatch(toString(pattern), tmp))
>
> I have had a go at extending this for unique matches with vmatchPattern,
> storing the mismatches in a vector,
> but a vector does not capture multiple matches between a pattern and a
> subject, the output needs to be a list.
>
> pattern<- DNAStringSet("AATTCGGCACGAGG")

Betters to make pattern a DNAString object:

   pattern <- DNAString("AATTCGGCACGAGG")

> subjects<- DNAStringSet(c(
>                               "match-1" = "TTCGGCACGAGGTT",
>                               "match 0" = "ATTCGGCACGAGGT",
>                               "match+1" = "AATTCGGCACGAGG",
>                               "match+2" = "TAATTCGGCACGAG",
>                               "match+3" = "TTAATTCGGCACGA",
>                               "match+1 and +15" =
> "AATTCGGCACGAGGAATTCGGCACGAGG"
>                               )
>                             )
> print(subjects)
>
> tmp<- vmatchPattern(toString(pattern), subjects, max.mismatch=2)

toString() not needed anymore because pattern is a DNAString.

> print(startIndex(tmp))
> ind<- which(countIndex(tmp)==1)
> mismatches<- rep(NA, length(subjects))
>
> for(i in seq(subjects)) {
>    viewsList<- extractAllMatches(subjects[[i]], tmp)[i]
>    mismatches[i]<- nmismatch(toString(pattern), viewsList)
> }

It looks to me that

    viewsList<- extractAllMatches(subjects[[i]], tmp)[i]

won't do what you want in the general case. This call to
extractAllMatches():

   extractAllMatches(subjects[[i]], tmp)

converts *all* the matches stored in 'tmp' into views on
subjects[[i]], even the matches that don't hit 'subjects[[i]]'.
In fact, extractAllMatches() is not intended to be used on
an MIndex object returned by vmatchPattern() but only on
an MIndex object returned by matchPDict() (here all the matches
belong to the same subject).

Because of this, when you extract the i-th view from this Views
object with

   extractAllMatches(subjects[[i]], tmp)[i]

you are not guaranteed to get a valid view (i.e. a view that
corresponds to a match that actually hits the i-th subject).
It works only by chance in your case because all the subjects
before the i-th subject have exactly 1 hit.

A better way to generate the Views object for the i-th subject
is:

   Views(subjects[[i]], tmp[[i]])

So:

   for(i in seq(subjects)) {
      viewsList<- Views(subjects[[i]], tmp[[i]])
      mismatches[i] <- nmismatch(pattern, viewsList)
   }

>
> ## Second match at +15 in subjects[[6]] is never taken into account
> print(mismatches)

'mismatches' needs to be a list:

   mismatches <- vector("list", length(subjects))
   for(i in seq(subjects)) {
      viewsList<- Views(subjects[[i]], tmp[[i]])
      mismatches[[i]] <- nmismatch(pattern, viewsList)
   }

Note the use of [[ instead of [ in mismatches[[i]]
Then:

   > mismatches
   [[1]]
   [1] 2

   [[2]]
   [1] 1

   [[3]]
   [1] 0

   [[4]]
   [1] 1

   [[5]]
   [1] 2

   [[6]]
   [1] 0 0

>
> Incidentally this sapply equivalent was failing for me, I haven't yet solved
> why.
> ## sapply fails...
> sapply(seq(subjects), function(y){ nmismatch(toString(pattern),
> extractAllMatches(subjects[[y]], tmp)[y])})

Better to use lapply() because you want the result in a list, not in
an integer vector. But it seems that nmismatch() can't be used in that
context:

   > lapply(seq(subjects),
            function(i)
              nmismatch(pattern, Views(subjects[[i]], tmp[[i]])))
   Error in checkAndTranslateDbleBracketSubscript(x, i) :
     object 'i' not found

This is clearly a bug that has to do with how nmismatch() is
implemented:

   > selectMethod("nmismatch", c("ANY", "XStringViews"))
   Method Definition:

   function (pattern, x, fixed = TRUE)
   {
     funCall <- match.call()
     funCall[[1]] <- as.name("mismatch")
     mismatches <- eval(funCall, sys.parent())
     elementLengths(mismatches)
   }

I'll fix it.

>
> Is there a obvious way to extend these functions to vmatchPattern output of
> class MIndex, ideally a list in the
> same structural form as startIndex/endIndex? Is this an additional measure
> that vmatchPattern could also return?

Yes, that is a fair request. I'll add something like this. Maybe an
nmismatchIndex() function that would return a list with the same
shape as the list returned by startIndex/endIndex. I can't really
add a "nmismatch" method for the c(pattern"ANY", x="MIndex") because
the reference sequences need to be supplied but they are not stored
in the MIndex object so an extra argument would be needed.

I'm open to suggestions.

Cheers,
H.

>
>
> cheers,
>
>
> Marcus
>
>
> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.16.9 IRanges_1.6.11    NGS_0.9.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0 tools_2.11.1
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list