[Bioc-sig-seq] finding matches to a motif
Ivan Gregoretti
ivangreg at gmail.com
Thu Jan 27 01:46:39 CET 2011
Hello Paul,
First, thank you for sharing your code.
Right now, I am using a work flow of my own creation that is
philosophically identical to your solution.
However, I still have a question:
How did you handle true and false positives? For instance, by lowering
min.score you can get matchPWM to report thousands of matches. There
is a point at which the naked eye can tell that some matches are false
positives.
Have you or anybody designed a statistically defensible strategy to
call reliable genomic targets starting from a PWM?
By defensible I mean publishable.
Thank you,
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878
On Wed, Jan 26, 2011 at 6:55 PM, Paul Leo <p.leo at uq.edu.au> wrote:
>
> Hi Ivan,
> Here is just a straight cut and paste for something I did some time back if gets the location and the counts of alignments with a PWM that pass a certain score for a set of regions . If you have a lot of regions using XstringViews will speed things up. This is only rough code ...
>
> here "the.chrom" "start" and "end" are large vectors containing the regions that you wnat to search:
> the.chrom=c("chr1","chr1","chr3"...)
> starts=c(1000,2000,2000...)
> end=c(2000,3000,4000....)
>
> library("BSgenome.Mmusculus.UCSC.mm9")
> Mmusculus
> all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)
> length(all.genomic)
>
>
> system.time(x<- XStringViews(all.genomic, "DNAString"))
>
> ## $transMat$order0
> ## A C G T
> ## -- 0.2547153 0.2453616 0.2451361 0.2547870
>
>
> x.labels<-paste(the.chrom,starts,ends,sep=":")
> names(x)<-x.labels
>
>
>
> #################################
> ## ## ---------------------------------------------------------------------
> ## ## C. SEARCHING THE MINUS STRAND OF A CHROMOSOME
> ## ## ---------------------------------------------------------------------
> ## ## Applying reverseComplement() to the pattern before calling
> ## ## matchPattern() is the recommended way of searching hits on the
> ## ## minus strand of a chromosome.
>
>
>
> mime.ev<-c( 0.129955, 0.433398, 0.207277, 0.229370,
> 0.196881, 0.480832, 0.009747, 0.312541,
> 0.978558, 0.020143, 0.000650, 0.000650,
> 0.988304, 0.000650, 0.011046 , 0.000000,
> 0.000000, 1.000000, 0.000000, 0.000000,
> 0.181936, 0.071475, 0.175439, 0.571150,
> 0.000000, 0.004548, 0.990903, 0.004548,
> 0.232619, 0.421702, 0.045484, 0.300195,
> 0.250162, 0.576998, 0.016244, 0.156595,
> 0.452242, 0.178687, 0.133853, 0.235218,
> 0.198181, 0.338532, 0.190383, 0.272904) ### this was the one used
>
> a.pwm<-mime.ev
>
>
>
> min.score="85.0%" #85 % for mime 87.5 for crank
> all.matches<-matchPWM(a.pwm,x,min.score=min.score) # needs a stringView to vectorize
> the.cov<-coverage(all.matches)
> counts<-aggregate(the.cov,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2]
> sum(counts!=0)
>
> all.matches.comp<-matchPWM(reverseComplement(a.pwm),x,min.score=min.score) #verified using seqlogo as ok
> the.cov.comp<-coverage(all.matches.comp)
> counts.comp<-aggregate(the.cov.comp,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2]
> sum(counts.comp!=0)
>
> ############################################### Check frequency at a position
> ###############################################
>
> counts.t<-counts+counts.comp
> ###################### forward posns #################
> loc.f=rep(NA,times=dim(all.data80)[1])
> loc.hit<-start(all.matches)
> region.start<-start(x)
> hits<-grep(TRUE,counts>0) # places where a hits is found
> k<-1
> no.hits<-counts[counts>0]
> for(i in 1:length(hits)){
> loc.f[hits[i]]<-toString((loc.hit[k:(k+no.hits[i]-1)]-region.start[hits[i]])+1)
> k<-k+no.hits[i]
> }
> ######################################################
>
> ###################### revearse posns #################
> loc.c=rep(NA,times=dim(all.data80)[1])
> loc.hit<-start(all.matches.comp)
> region.start<-start(x)
> hits<-grep(TRUE,counts.comp>0) # places where a hits is found
> k<-1
> no.hits<-counts.comp[counts.comp>0]
> for(i in 1:length(hits)){
> loc.c[hits[i]]<-toString((loc.hit[k:(k+no.hits[i]-1)]-region.start[hits[i]])+1)
> k<-k+no.hits[i]
> }
> ######################################################
>
> ######combine loc.f and loc.c all w.r.t forward strand posn see doc after ?reverseComplement
> has.c<-!is.na(loc.c)
> has.f<-!is.na(loc.f)
> loc.all<-loc.c
> loc.all[has.c & has.f]<-paste(loc.all[has.c & has.f],loc.f[has.c & has.f],sep=", ") # both present
> loc.all[!has.c & has.f]<-loc.f[!has.c & has.f] # new in loc.f
>
> ## cbind(loc.c,loc.f,loc.all)[1:50,] # a check
>
>
> you can replace some of the for loops above with a vmatch ...
>
>
>
>
>
>
> -----Original Message-----
> From: Ivan Gregoretti <ivangreg at gmail.com>
> To: bioc-sig-sequencing at r-project.org
> Subject: [Bioc-sig-seq] finding matches to a motif
> Date: Tue, 25 Jan 2011 10:36:14 -0500
>
> Hello everybody,
>
> Introduction:
> I have a Position Weigh Matrix that characterises the binding DNA
> sequence of a novel transcription factor.
>
> Goal:
> I want to find the genomic positions that are good matches for my PWM.
>
> Question:
> What are my options among the Bioconductor packages?
>
> Thank you,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list