[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Hervé Pagès hpages at fhcrc.org
Mon Jun 1 20:16:31 CEST 2009


Hi Paul,

Thanks for the feedback! I just set up a script that does something similar
to your code and was able to trigger the spooky/infamous bug. The first time
after 140 iterations (55 sec.), the second time after 73, etc... Seems like
I always get it after a few dozens of iterations. This is great because
that's going to make troubleshooting this issue a little bit easier!

BTW, if you are interested in speeding up your code (and at the same time, you
might get rid of the spooky bug), you can now take advantage of the fact that
getSeq() is vectorized (starting with BSgenome >= 1.12.2). Given that there is
also a vectorized version of countPattern() (it's the vcountPattern() function),
then you don't need the "for" loop anymore:

   myb.consen.count <- character(nrow(peaks))
   all.genomic <- DNAStringSet(getSeq(Mmusculus, peaks$chr,
                                      peaks$start, peaks$end))
   c.1 <- vcountPattern(myb.dna, all.genomic, fixed=FALSE)
   c.2 <- vcountPattern(myb.dna.comp, all.genomic, fixed=FALSE)
   c.3 <- vcountPattern(myb.dna.rev, all.genomic, fixed=FALSE)
   c.4 <- vcountPattern(myb.dna.comp.rev, all.genomic, fixed=FALSE)
   ## Now c.1, c.2, c.3 and c.4 are integer vectors of the same length
   ## as 'all.genomic'
   the.sum <- c.1 + c.2 + c.3 + c.4
   the.consen.count <- paste("F:", c.1, "FC:", c.2, "FRev:", c.3, "FCRev:", c.4)
   myb.consen.count[the.sum > 0] <- the.consen.count[the.sum > 0]

This is thousands of times faster than calling getSeq() and countPattern()
inside a "for" loop. Also, it doesn't seem to trigger the spooky/infamous
bug :-)

Thanks again,
H.


Paul Leo wrote:
> Sorry if this is not on topic but I think I can generate *this* spooky
> bug with this simple code, I post in case it helps:
> 
> library(BSgenome.Mmusculus.UCSC.mm9)
> all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)
> 
> myb<-"YAACKG"        
> myb.dna<-DNAString(myb)
> myb.dna.rev<-reverse(myb.dna)
> myb.dna.comp<-complement(myb.dna)
> myb.dna.comp.rev<-reverseComplement(myb.dna)
> 
> ### peaks is just a set of regions - not large but lots of them 
> #> peaks[1:5,1:4]
>  #     chr     start       end length
> #7568   14 119073660 119074474    815
> #11980  19  41922512  41923422    911
> #9375   16  70351968  70352514    547
> #158     1  33635080  33636130   1051
> #15576   3 151996276 151997236    961
> 
> for(i in 1:dim(peaks)[1]){
> 
> genomic<-getSeq(Mmusculus,the.chrom[i],start=starts[i],end=ends[i],as.character=FALSE)
> print(length(genomic)  )
> 
> c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
> c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE) 
> c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
> c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0, fixed=FALSE)
> the.sum<-(c.1+c.2+c.3+c.4)
> print(paste("counter:",i,"sites:",the.sum,sep=" "))
> if( the.sum >
> 0){myb.consen.count[i,]<-paste("F:",c.1,"FC:",c.2,"FRev:",c.3,"FCRev:",c.4,sep=" ")}
> }
> 
> This loop fails randomly ever few hundred iterations with no apparent
> reason all the variables are fine , you can restart at the error point
> and it goes fine for another few hundred then dies again....
> 
> eg
> 
> first dies at 
> counter 103, 
> .......
> [1] "counter: 102 sites: 5"
> [1] 729
> Error in assign(".nextMethod", nextMethod, envir = callEnv) : 
>   formal argument "envir" matched by multiple actual arguments
> Error in isEmpty(xx) : 
>   error in evaluating the argument 'x' in selecting a method for
> function 'isEmpty'
> Error in countPattern(pattern, toXStringViewsOrXString(subject),
> algorithm = algorithm,  : 
>   error in evaluating the argument 'subject' in selecting a method for
> function 'countPattern'
>> genomic
>   729-letter "MaskedDNAString" instance (# for masking)
> seq:
> ACCAGGCTCCGGACGCCTTCTCCCCAGCGTTTTGCA...ATTGGGTTGGAAACTTTGTAGGAAGGAAAGGAAAAT
> masks:
> 
> #### Genomic is the corect length and is ok..straight away can do:
> 
>> c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
>> c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE)
> #SAME matchPattern(myb.dna,complement(genomic),max.mismatch=0,
> fixed=FALSE)  
>> c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
>> c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0,
> fixed=FALSE)
>> the.sum<-(c.1+c.2+c.3+c.4)
>> print(paste("counter:",i,"sites:",the.sum,sep=" "))
> [1] "counter: 103 sites: 3"
> ### see no problem what bug???
> 
> 
> ###### restart at i=103...it dies again at 604
> 
> [1] 737
> [1] "counter: 604 sites: 3"
> [1] 834
> Error in assign(".defined", method at defined, envir = envir) : 
>   formal argument "envir" matched by multiple actual arguments
> 
> 
> ## I will use Herves last post to get about this.....Can run on the ##
> development version too if that helps??
> 
>> sessionInfo()
> R version 2.9.0 (2009-04-17) 
> x86_64-pc-linux-gnu 
> 
> locale:
> LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods
> base     
> 
> other attached packages:
>  [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11
> BSgenome_1.12.0                   
>  [3] Biostrings_2.12.0
> IRanges_1.2.1                     
>  [5] KEGG.db_2.2.11
> RSQLite_0.7-1                     
>  [7] DBI_0.2-4
> AnnotationDbi_1.6.0               
>  [9] Biobase_2.4.0
> biomaRt_2.0.0                     
> 
> loaded via a namespace (and not attached):
> [1] RCurl_0.94-1 tcltk_2.9.0  tools_2.9.0  XML_1.95-3  
> 
> 
> -  
> Dr Paul Leo
> Bioinformatician
> Diamantina Institute for Cancer, Immunology and Metabolic Medicine
> University of Queensland
> --------------------------------------------------------------------------------------
> Research Wing, Bldg 1
> Princess Alexandria Hospital 
> Woolloongabba, QLD, 4102
> Tel: +61 7 3240 7740  Mob: 041 303 8691  Fax: +61 7 3240 5946
> Email: p.leo at uq.edu.au   Web: http://www.di.uq.edu.au
> 
> 
> -----Original Message-----
> From: Hervé Pagès <hpages at fhcrc.org>
> To: Ivan Gregoretti <ivangreg at gmail.com>
> Cc: bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] Extracting DNA sequences from
> BSgenome.Mmusculus.UCSC.mm9_1.3.11
> Date: Thu, 28 May 2009 17:21:15 -0700
> 
> Ivan,
> 
> With BSgenome 1.12.1 (release) and 1.13.5 (devel) you can now do:
> 
>    myseqs <- data.frame(
>      chr=c("chrY", "chr1", "chr2", "chr3", "chrY", "chr3", "chr1", "chr1"),
>      start=c(NA, -40, 8510201, 4920301, 30001, 9220500, -2804, -30),
>      end=c(50, NA, 8510220, 4920330, 30011, 9220555, -2801, -11)
>    )
> 
>    library(BSgenome.Mmusculus.UCSC.mm9)
> 
>    > getSeq(Mmusculus, myseqs$chr, myseqs$start, myseqs$end)
>    [1] "GATCCAAAACACATTCTCCCTGGTAGCATGGACAAGCAACATTTTGGGAG"
>    [2] "TTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC"
>    [3] "ACGACTATAAAAACCTTTAG"
>    [4] "CATACAATAATTGTGGGGGAACTTCAAAAC"
>    [5] "ATCTTAATCAC"
>    [6] "CAGTAGTGGCGTACACCTTTAATCCCAGCACGTGGTAGGCAGAGGCAGATGGATTT"
>    [7] "ATGA"
>    [8] "AATTTGGTATTAAACTTAAA"
> 
> to extract multiple subsequences from multiple chromosomes at once.
> (Note support for NAs and negative start or end.)
> 
> It takes about 35 sec. to extract 1 million short subsequences so it's
> much faster than the sapply() solution I proposed earlier but there is
> still room for improvement. Also, the extracted sequences are currently
> returned as a character vector, not a DNAStringSet.
> But addressing these 2 issues (speed + DNAStringSet) will require that
> we first put in place an efficient implementation for combining several
> DNAStringSet objects (this is an important piece of the infrastructure).
> So that will take a few weeks.
> 
> Hopefully this time you won't get hit by the infamous bug you reported
> earlier (BTW anything new on that front? Were you able to reproduce it?
> Thanks).
> 
> Cheers,
> H.
> 
> PS: The latest BSgenome software packages will propagate to the public
> repos in the next 24 hours.
> 
> 
> Ivan Gregoretti wrote:
>> I didn't notice that I was emailing the individuals rather than the
>> list!!!!!!
>>
>> Briefly, THANKS TO EVERYONE.
>>
>> Second, I managed to call and display all my sequences like this
>>
>> sapply(1:dim(A)[1], function(i) paste(A$chr[i],' ',A$start[i],' ',A$end[i],'
>> tags: ',A$tags[i],' ',getSeq(Mmusculus, as.character(A$chr[i]), A$start[i],
>> A$end[i])))
>>
>> As you see, it's made up of ideas from all three respondents. (Michael's
>> solution actually gave me an idea to solve some other problem not mentioned
>> here. So, nothing gets wasted.)
>>
>> Sorry for the off-list emailing issue.
>>
>> Ivan
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list