[Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11
Hervé Pagès
hpages at fhcrc.org
Thu May 28 23:39:51 CEST 2009
Again, can you please keep this discussion on the list? Especially
if you are commenting on Kasper's or Michael's suggestions, you should
at least include them. None of the solutions we gave you is totally
satisfying, some even don't work, but that's OK, we'll try to fix this ;-)
Thanks!
more below...
Ivan Gregoretti wrote:
> Hi Hervé,
>
> Any idea why it complains?
>
>> DNAStringSet(sapply(seq_len(nrow(A)),
> + function(i)
> + getSeq(Mmusculus,
> + as.vector(A$chr[i]),
> + start=A$start[i], end=A$end[i])))
> Error in assign(".defined", method at defined, envir = envir) :
> formal argument "envir" matched by multiple actual arguments
This is an infamous bug that has already hit other people before but in
situations that have (apparently) nothing to do with the above code.
Sometimes the IRanges/Biostrings/BSgenome packages are not even involved:
https://stat.ethz.ch/pipermail/r-devel/2006-April/037199.html
Some people (including me) have already tried to look at it and suspect
that it could be related with some buffer overflow issue in R itself or
in the methods package. It seems to happen when big data is involved but it's
hard to reproduce (it won't necessarily happen again even when using exactly
twice the same data). The fact that this bug occurred when running code as
simple as the above code is new though. Can you reproduce it? Can you reduce
the size of A and still get the bug. That would be great to have some simple
code + small data that allow us to reproduce the bug (or at least have the
bug very likely to occur) so we could report this to the R-devel mailing list.
In the meantime, I'm working on parallelizing getSeq(). The sapply()
solution above is too slow anyway if you have hundreds of thousands of
sequences to extract. I'll post here again when it's ready.
H.
> Error in XStringSet("DNA", x, start = start, end = end, width = width, :
> error in evaluating the argument 'x' in selecting a method for
> function 'XStringSet'
>> sessionInfo()
> R version 2.9.0 (2009-04-17)
> x86_64-redhat-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.12.0
> [3] Biostrings_2.12.1 IRanges_1.2.2
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.0 tools_2.9.0
>
>
> By the way, Kasper's "by hand" method gets the sequences although with
> some display problems (large blank spaces between sequences).
>
> Michael's suggestion also sounded good. I converted my data.frame to a
> RangedData object, let's call it rdA. Then I tried it
>
> # peek at the data
>> head(as.data.frame(rdA))
> space start end width length summit tags Tpvalue fold_enrichment FDR
> 1 chr1 3644952 3649720 4769 4769 2228 449 3100.00 49.63 0.00
> 2 chr1 4599146 4601342 2197 2197 944 34 82.95 8.03 0.03
> 3 chr1 5015865 5018830 2966 2966 1663 58 92.32 6.69 0.03
> 4 chr1 5072928 5076881 3954 3954 1898 160 755.56 22.67 0.00
> 5 chr1 5504220 5507065 2846 2846 1575 68 249.56 11.85 0.04
> 6 chr1 5513886 5516391 2506 2506 1561 67 273.03 11.85 0.02
>
> then I do
>
>> rdA$sequence <- do.call("c", lapply(names(rdA), function(chr) {DNAStringSet(Views(Mmusculus[[chr]], ranges(rdA)[[chr]]))}))
> Error in `[[<-`(`*tmp*`, name, value = list(<S4 object of class
> "DNAStringSet">, :
> invalid replacement length
> In addition: There were 21 warnings (use warnings() to see them)
>
>
> So, I can't call it a success yet.
>
> Thanks,
>
> Ivan
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list