[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)
Ivan Gregoretti
ivangreg at gmail.com
Fri Jul 23 16:45:31 CEST 2010
Hello Patrick,
The idea of vcountPattern is good, but it does not quite work, for two reasons:
1) mySeq is ~40 kb. That is quite big, and vcountPattern() throws this error:
> vcountPattern(mySeq, sread(A))
Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels, :
patterns with more than 20000 letters are not supported
2) vcountPattern is designed to find a (small) motif contained in a
(large) genome, like this:
vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
In my case, I have millions of motifs (the 36 bp tags), and I have to
find out whether they are contained in my single ~40 kb sequence. It's
like the reverse scenario. So, if I try reversing the arguments, I also
get an error:
> vcountPattern(sread(A), mySeq)
Error in normargPattern(pattern, subject) :
'pattern' must be a single string or an XString object
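Spelled out on toy data, the operation I am after looks roughly like the sketch below. The sequences are made up and are plain character strings standing in for my real mySeq and sread(A); the names tags, ref and contained are only for illustration. It uses base R only, one fixed-string grepl() call per tag.

```r
# Hypothetical toy data: three short "tags" and one longer "reference".
# These are stand-ins, not the real 36 bp reads or the ~40 kb sequence.
tags <- c("ACGT", "TTTT", "ACGA")
ref  <- "ACGTACGATT"

# Many patterns, one subject: test each tag for exact containment in ref.
# fixed = TRUE treats the tag as a literal string, not a regular expression.
contained <- vapply(
  tags,
  function(tag) grepl(tag, ref, fixed = TRUE),
  logical(1),
  USE.NAMES = FALSE
)

contained
```

This is the many-patterns-against-one-subject shape, which is the opposite of what vcountPattern() expects.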
Any more suggestions?
Thank you,
Ivan
> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] annotate_1.27.1 AnnotationDbi_1.11.4 Biobase_2.9.0 ShortRead_1.7.9
[5] Rsamtools_1.1.8 lattice_0.18-8 Biostrings_2.17.24 GenomicRanges_1.1.17
[9] IRanges_1.7.12
loaded via a namespace (and not attached):
[1] DBI_0.2-5 grid_2.12.0 hwriter_1.2 RSQLite_0.9-1 xtable_1.5-6