[Bioc-sig-seq] matchPWM() without PWMscoreStartingAt()
Hervé Pagès
hpages at fhcrc.org
Fri Feb 25 23:46:07 CET 2011
Ivan,
On 02/24/2011 01:47 PM, Ivan Gregoretti wrote:
> Hi everybody,
>
> The function matchPWM() that is invoked like this
>
> matchPWM(pwm, subject, min.score="80%", ...)
>
> provides a very fast way of identifying the positions in 'subject' that
> match a Position Weight Matrix with a minimum score of 80 percent (relative
> to the maximum score).
>
> The result looks like this
>
> matchPWM(pwmMa,Mmusculus[['chr19']],min.score="90%")
> Views on a 61342430-letter DNAString subject
> subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNN...TATCTCCGAGATGCACCCGCCTGCTTGC
> views:
> start end width
> [1] 3499390 3499407 18 [TTTTGGTCATGGTGCTTC]
> [2] 3847991 3848008 18 [TTGAATTCCTGATCCTTC]
> [3] 4433615 4433632 18 [TTTGATTCCTGATTCTTC]
> [4] 5200625 5200642 18 [TCTTAGTCATGGTGCTTT]
> [5] 6475713 6475730 18 [TCTTGTTTCTGCTGCTTC]
> [6] 8899734 8899751 18 [TTTCATTCCTGTTTCTTT]
> [7] 9435046 9435063 18 [TTGTGTTCCTGCTGCTTC]
> [8] 9490297 9490314 18 [TTTTATTCATGCTGTTTC]
> [9] 10148754 10148771 18 [TCATGGTCATGGTGCTTC]
> ... ... ... ... ...
> [103] 53870063 53870080 18 [TCAAAGTCATGCTCCTTC]
> [104] 55886021 55886038 18 [TCTTTTTCCTGGTACTTC]
> [105] 56158096 56158113 18 [GCTGATTCTTGCTGCTTC]
> [106] 56585172 56585189 18 [TTTCATTCCAGTTGCTTA]
> [107] 56905278 56905295 18 [TTCTGTTCCTGGTACTTC]
> [108] 57831404 57831421 18 [TTTTATTTATGTTACTTC]
> [109] 58794481 58794498 18 [TCGAATTCCTGCTGCTTT]
> [110] 59452136 59452153 18 [TTGCATTTGTGTTACTTC]
> [111] 59597303 59597320 18 [TTTTAGTCATGGTGTTTC]
>
>
> My question:
>
> How do you get matchPWM to report the score?
>
>
> Why that is important:
>
> The function PWMscoreStartingAt() was created to answer that question. Now,
> for some PWMs, running matchPWM with a permissive min.score like "80%"
> results in half a million matches. The problem is that PWMscoreStartingAt()
> can compute the scores one match at a time. In other words, it takes a
> DNAString but not a DNAStringSet.
Computing the scores of all the matches is easy and very very fast,
even when there are millions of matches:
> library(Biostrings)
> data(HNF4alpha)
> pwm <- PWM(HNF4alpha)
> library(BSgenome.Mmusculus.UCSC.mm9)
> subject <- Mmusculus[["chr19"]]
> m <- matchPWM(pwm, subject, min.score="60%")
> length(m)
[1] 4924385
> system.time(scores <- PWMscoreStartingAt(pwm, unmasked(subject),
starting.at=start(m)))
user system elapsed
0.100 0.012 0.114
> length(scores)
[1] 4924385
> head(scores)
[1] 0.6609854 0.6421621 0.6924616 0.6075919 0.6024951 0.6281836
>
Note that you don't need to use a DNAStringSet for this.
But I take your point that having the scores automatically attached to
the matches would be convenient.
More on your other matchPWM's related request in the next email.
Cheers,
H.
>
> Rather than hoping for a vercotised version of PWMscoreStartingAt(), I guess
> that it is more practical to try to retrieve the score computed by
> matchPWM().
>
> Any suggestions?
>
> Thank you,
>
> Ivan
>
>> sessionInfo()
> R version 2.13.0 Under development (unstable) (2010-12-02 r53747)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C LC_TIME=en_US.utf8
>
> [4] LC_COLLATE=en_US.utf8 LC_MONETARY=C
> LC_MESSAGES=en_US.utf8
> [7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C
>
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] multicore_0.1-4 lattice_0.19-17
> [3] BSgenome.Mmusculus.UCSC.mm9_1.3.17 BSgenome_1.19.3
> [5] cosmo_1.17.0 seqLogo_1.17.0
> [7] GenomicRanges_1.3.18 Biostrings_2.19.9
> [9] IRanges_1.9.22
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.11.8 tools_2.13.0
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list