[Bioc-sig-seq] matchPWM() without PWMscoreStartingAt()

Fri Feb 25 23:46:07 CET 2011

Ivan,

On 02/24/2011 01:47 PM, Ivan Gregoretti wrote:
> Hi everybody,
>
> The function matchPWM() that is invoked like this
>
> matchPWM(pwm, subject, min.score="80%", ...)
>
> provides a very fast way of identifying the positions in 'subject' that
> match a Position Weight Matrix with a minimum score of 80 percent (relative
> to the maximum score).
>
> The result looks like this
>
> matchPWM(pwmMa,Mmusculus[['chr19']],min.score="90%")
>   Views on a 61342430-letter DNAString subject
> subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNN...TATCTCCGAGATGCACCCGCCTGCTTGC
> views:
>           start      end width
>   [1]  3499390  3499407    18 [TTTTGGTCATGGTGCTTC]
>    [2]  3847991  3848008    18 [TTGAATTCCTGATCCTTC]
>   [3]  4433615  4433632    18 [TTTGATTCCTGATTCTTC]
>    [4]  5200625  5200642    18 [TCTTAGTCATGGTGCTTT]
>   [5]  6475713  6475730    18 [TCTTGTTTCTGCTGCTTC]
>    [6]  8899734  8899751    18 [TTTCATTCCTGTTTCTTT]
>   [7]  9435046  9435063    18 [TTGTGTTCCTGCTGCTTC]
>    [8]  9490297  9490314    18 [TTTTATTCATGCTGTTTC]
>   [9] 10148754 10148771    18 [TCATGGTCATGGTGCTTC]
>    ...      ...      ...   ... ...
> [103] 53870063 53870080    18 [TCAAAGTCATGCTCCTTC]
> [104] 55886021 55886038    18 [TCTTTTTCCTGGTACTTC]
> [105] 56158096 56158113    18 [GCTGATTCTTGCTGCTTC]
> [106] 56585172 56585189    18 [TTTCATTCCAGTTGCTTA]
> [107] 56905278 56905295    18 [TTCTGTTCCTGGTACTTC]
> [108] 57831404 57831421    18 [TTTTATTTATGTTACTTC]
> [109] 58794481 58794498    18 [TCGAATTCCTGCTGCTTT]
> [110] 59452136 59452153    18 [TTGCATTTGTGTTACTTC]
> [111] 59597303 59597320    18 [TTTTAGTCATGGTGTTTC]
>
>
> My question:
>
> How do you get matchPWM to report the score?
>
>
> Why that is important:
>
> The function PWMscoreStartingAt() was created to answer that question. Now,
> for some PWMs, running matchPWM with a permissive min.score like "80%"
> results in half a million matches. The problem is that PWMscoreStartingAt()
> can compute the scores one match at a time. In other words, it takes a
> DNAString but not a DNAStringSet.

Computing the scores of all the matches is easy and very very fast,
even when there are millions of matches:

 > library(Biostrings)
 > data(HNF4alpha)
 > pwm <- PWM(HNF4alpha)
 > library(BSgenome.Mmusculus.UCSC.mm9)
 > subject <- Mmusculus[["chr19"]]
 > m <- matchPWM(pwm, subject, min.score="60%")
 > length(m)
[1] 4924385
 > system.time(scores <- PWMscoreStartingAt(pwm, unmasked(subject), 
starting.at=start(m)))
    user  system elapsed
   0.100   0.012   0.114
 > length(scores)
[1] 4924385
 > head(scores)
[1] 0.6609854 0.6421621 0.6924616 0.6075919 0.6024951 0.6281836
 >

Note that you don't need to use a DNAStringSet for this.

But I take your point that having the scores automatically attached to
the matches would be convenient.

More on your other matchPWM's related request in the next email.

Cheers,
H.

>
> Rather than hoping for a vercotised version of PWMscoreStartingAt(), I guess
> that it is more practical to try to retrieve the score computed by
> matchPWM().
>
> Any suggestions?
>
> Thank you,
>
> Ivan
>
>> sessionInfo()
> R version 2.13.0 Under development (unstable) (2010-12-02 r53747)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C              LC_TIME=en_US.utf8
>
>   [4] LC_COLLATE=en_US.utf8     LC_MONETARY=C
> LC_MESSAGES=en_US.utf8
> [7] LC_PAPER=en_US.utf8       LC_NAME=C                 LC_ADDRESS=C
>
> [10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> base
>
> other attached packages:
> [1] multicore_0.1-4                    lattice_0.19-17
> [3] BSgenome.Mmusculus.UCSC.mm9_1.3.17 BSgenome_1.19.3
> [5] cosmo_1.17.0                       seqLogo_1.17.0
> [7] GenomicRanges_1.3.18               Biostrings_2.19.9
> [9] IRanges_1.9.22
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.11.8 tools_2.13.0
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319