[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Hervé Pagès hpages at fhcrc.org
Thu May 28 23:39:51 CEST 2009


Again, can you please keep this discussion on the list? Especially
if you are commenting on Kasper's or Michael's suggestions, you should
at least include them. None of the solutions we gave you is totally
satisfying, some even don't work, but that's OK, we'll try to fix this ;-)
  Thanks!

more below...

Ivan Gregoretti wrote:
> Hi Hervé,
> 
> Any idea why it complains?
> 
>> DNAStringSet(sapply(seq_len(nrow(A)),
> +                      function(i)
> +                        getSeq(Mmusculus,
> +                               as.vector(A$chr[i]),
> +                               start=A$start[i], end=A$end[i])))
> Error in assign(".defined", method at defined, envir = envir) :
>   formal argument "envir" matched by multiple actual arguments

This is an infamous bug that has already hit other people before but in
situations that have (apparently) nothing to do with the above code.
Sometimes the IRanges/Biostrings/BSgenome packages are not even involved:

   https://stat.ethz.ch/pipermail/r-devel/2006-April/037199.html

Some people (including me) have already tried to look at it and suspect
that it could be related with some buffer overflow issue in R itself or
in the methods package. It seems to happen when big data is involved but it's
hard to reproduce (it won't necessarily happen again even when using exactly
twice the same data). The fact that this bug occurred when running code as
simple as the above code is new though. Can you reproduce it? Can you reduce
the size of A and still get the bug. That would be great to have some simple
code + small data that allow us to reproduce the bug (or at least have the
bug very likely to occur) so we could report this to the R-devel mailing list.

In the meantime, I'm working on parallelizing getSeq(). The sapply()
solution above is too slow anyway if you have hundreds of thousands of
sequences to extract. I'll post here again when it's ready.

H.

> Error in XStringSet("DNA", x, start = start, end = end, width = width,  :
>   error in evaluating the argument 'x' in selecting a method for
> function 'XStringSet'
>> sessionInfo()
> R version 2.9.0 (2009-04-17)
> x86_64-redhat-linux-gnu
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.12.0
> [3] Biostrings_2.12.1                  IRanges_1.2.2
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.0 tools_2.9.0
> 
> 
> By the way, Kasper's "by hand" method gets the sequences although with
> some display problems (large blank spaces between sequences).
> 
> Michael's suggestion also sounded good. I converted my data.frame to a
> RangedData object, let's call it rdA. Then I tried it
> 
> # peek at the data
>> head(as.data.frame(rdA))
>   space   start     end width length summit tags Tpvalue fold_enrichment  FDR
> 1  chr1 3644952 3649720  4769   4769   2228  449 3100.00           49.63 0.00
> 2  chr1 4599146 4601342  2197   2197    944   34   82.95            8.03 0.03
> 3  chr1 5015865 5018830  2966   2966   1663   58   92.32            6.69 0.03
> 4  chr1 5072928 5076881  3954   3954   1898  160  755.56           22.67 0.00
> 5  chr1 5504220 5507065  2846   2846   1575   68  249.56           11.85 0.04
> 6  chr1 5513886 5516391  2506   2506   1561   67  273.03           11.85 0.02
> 
> then I do
> 
>> rdA$sequence <- do.call("c", lapply(names(rdA), function(chr) {DNAStringSet(Views(Mmusculus[[chr]], ranges(rdA)[[chr]]))}))
> Error in `[[<-`(`*tmp*`, name, value = list(<S4 object of class
> "DNAStringSet">,  :
>   invalid replacement length
> In addition: There were 21 warnings (use warnings() to see them)
> 
> 
> So, I can't call it a success yet.
> 
> Thanks,
> 
> Ivan

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list