[Bioc-sig-seq] finding matches to a motif

Thu Jan 27 01:46:39 CET 2011

Hello Paul,

First, thank you for sharing your code.

Right now, I am using a work flow of my own creation that is
philosophically identical to your solution.

However, I still have a question:

How did you handle true and false positives? For instance, by lowering
min.score you can get matchPWM to report thousands of matches. There
is a point at which the naked eye can tell that some matches are false
positives.

Have you or anybody designed a statistically defensible strategy to
call reliable genomic targets starting from a PWM?

By defensible I mean publishable.

Thank you,

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

On Wed, Jan 26, 2011 at 6:55 PM, Paul Leo <p.leo at uq.edu.au> wrote:
>
> Hi Ivan,
> Here is just a straight cut and paste for something I did some time back if gets the location and the counts of alignments with a PWM  that pass a certain score for a set of regions . If you have a lot of regions using XstringViews will speed things up. This is only rough code ...
>
> here "the.chrom" "start" and "end" are large vectors containing the regions that you wnat to search:
> the.chrom=c("chr1","chr1","chr3"...)
> starts=c(1000,2000,2000...)
> end=c(2000,3000,4000....)
>
> library("BSgenome.Mmusculus.UCSC.mm9")
> Mmusculus
> all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)
> length(all.genomic)
>
>
> system.time(x<- XStringViews(all.genomic, "DNAString"))
>
> ## $transMat$order0
> ##            A         C         G         T
> ## -- 0.2547153 0.2453616 0.2451361 0.2547870
>
>
> x.labels<-paste(the.chrom,starts,ends,sep=":")
> names(x)<-x.labels
>
>
>
> #################################
>        ## ## ---------------------------------------------------------------------
>        ## ## C. SEARCHING THE MINUS STRAND OF A CHROMOSOME
>        ## ## ---------------------------------------------------------------------
>        ## ## Applying reverseComplement() to the pattern before calling
>        ## ## matchPattern() is the recommended way of searching hits on the
>        ## ## minus strand of a chromosome.
>
>
>
> mime.ev<-c( 0.129955,  0.433398,  0.207277,  0.229370,
> 0.196881,  0.480832,  0.009747,  0.312541,
> 0.978558,  0.020143,  0.000650,  0.000650,
> 0.988304,  0.000650,  0.011046 , 0.000000,
> 0.000000,  1.000000,  0.000000,  0.000000,
> 0.181936,  0.071475,  0.175439,  0.571150,
> 0.000000,  0.004548,  0.990903,  0.004548,
> 0.232619,  0.421702,  0.045484,  0.300195,
> 0.250162,  0.576998,  0.016244,  0.156595,
> 0.452242,  0.178687,  0.133853,  0.235218,
> 0.198181,  0.338532,  0.190383,  0.272904)  ### this was the one used
>
> a.pwm<-mime.ev
>
>
>
> min.score="85.0%" #85 % for mime 87.5 for crank
> all.matches<-matchPWM(a.pwm,x,min.score=min.score) # needs a stringView to vectorize
> the.cov<-coverage(all.matches)
> counts<-aggregate(the.cov,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2]
> sum(counts!=0)
>
> all.matches.comp<-matchPWM(reverseComplement(a.pwm),x,min.score=min.score)  #verified using seqlogo as ok
> the.cov.comp<-coverage(all.matches.comp)
> counts.comp<-aggregate(the.cov.comp,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2]
> sum(counts.comp!=0)
>
> ############################################### Check frequency at a position
> ###############################################
>
> counts.t<-counts+counts.comp
> ###################### forward posns #################
> loc.f=rep(NA,times=dim(all.data80)[1])
> loc.hit<-start(all.matches)
> region.start<-start(x)
> hits<-grep(TRUE,counts>0) # places where a hits is found
> k<-1
> no.hits<-counts[counts>0]
> for(i in 1:length(hits)){
>   loc.f[hits[i]]<-toString((loc.hit[k:(k+no.hits[i]-1)]-region.start[hits[i]])+1)
>   k<-k+no.hits[i]
> }
> ######################################################
>
> ###################### revearse posns #################
> loc.c=rep(NA,times=dim(all.data80)[1])
> loc.hit<-start(all.matches.comp)
> region.start<-start(x)
> hits<-grep(TRUE,counts.comp>0) # places where a hits is found
> k<-1
> no.hits<-counts.comp[counts.comp>0]
> for(i in 1:length(hits)){
>   loc.c[hits[i]]<-toString((loc.hit[k:(k+no.hits[i]-1)]-region.start[hits[i]])+1)
>   k<-k+no.hits[i]
> }
> ######################################################
>
> ######combine loc.f and loc.c all w.r.t forward strand posn see doc after ?reverseComplement
> has.c<-!is.na(loc.c)
> has.f<-!is.na(loc.f)
> loc.all<-loc.c
> loc.all[has.c & has.f]<-paste(loc.all[has.c & has.f],loc.f[has.c & has.f],sep=", ") # both present
> loc.all[!has.c & has.f]<-loc.f[!has.c & has.f] # new in loc.f
>
> ## cbind(loc.c,loc.f,loc.all)[1:50,] # a check
>
>
> you can replace some of the for loops above with a vmatch ...
>
>
>
>
>
>
> -----Original Message-----
> From: Ivan Gregoretti <ivangreg at gmail.com>
> To: bioc-sig-sequencing at r-project.org
> Subject: [Bioc-sig-seq] finding matches to a motif
> Date: Tue, 25 Jan 2011 10:36:14 -0500
>
> Hello everybody,
>
> Introduction:
> I have a Position Weigh Matrix that characterises the binding DNA
> sequence of a novel transcription factor.
>
> Goal:
> I want to find the genomic positions that are good matches for my PWM.
>
> Question:
> What are my options among the Bioconductor packages?
>
> Thank you,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing