[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Fri Jul 23 17:52:10 CEST 2010

Ivan,
Sorry for the evolving answer, but this may prove to be faster

length(whichPDict(PDict(sread(A)), as(mySeq, "DNAString")))


Patrick


On 7/23/10 8:28 AM, Ivan Gregoretti wrote:
> It works. It produces and answer in under 2 minutes. I will flesh it
> out a bit for posterity.
>
> Some slight modifications must be applied. First, mySeq cannot be of
> class character. So, it works if I do
>
> countPDict(PDict(sread(A)), as(mySeq, "DNAString"))
>
> Now, that function outputs how many times a tag is found in mySeq. To
> compute how many tags match mySeq once or more, I have to do
>
> sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0)
>
>
> By the way, this could have been done with perl or python or any other
> tools. However, it helps to learn how to do it efficiently from within
> the Bioconductor.
>
> Thank you.
>
> Ivan
>
>
>
> On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun<paboyoun at fhcrc.org>  wrote:
>    
>> Ivan,
>> How about
>>
>> countPDict(PDict(sread(A)), mySeq)
>>
>>
>> Patrick
>>
>>
>> On 7/23/10 7:45 AM, Ivan Gregoretti wrote:
>>      
>>> Hello Patrick,
>>>
>>> The idea of vcountPattern is good but it does not quite work for two
>>> reasons
>>>
>>> 1) mySeq is ~40kb. That is quite big and vcountPattern() throws the error
>>>
>>>
>>>        
>>>> vcountPattern(mySeq, sread(A))
>>>>
>>>>          
>>> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels,  :
>>>    patterns with more than 20000 letters are not supported
>>>
>>> 2) vcountPattern is designed to find a motif (small) contained in a
>>> genome (large), like this
>>> vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
>>>
>>> In my case, I have millions of motifs (the 36 bp tags) that I have to
>>> find if they are contained in my single ~40kb. Its like a reverse
>>> scenario. So, if I try reversing the arguments, I also get an error:
>>>
>>>
>>>        
>>>> vcountPattern(sread(A), mySeq)
>>>>
>>>>          
>>> Error in normargPattern(pattern, subject) :
>>>    'pattern' must be a single string or an XString object
>>>
>>> Any more suggestions?
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>>
>>>        
>>>> sessionInfo()
>>>>
>>>>          
>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> LC_TIME=en_US.UTF-8
>>>   [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>>> LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>>> LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] annotate_1.27.1      AnnotationDbi_1.11.4 Biobase_2.9.0
>>> ShortRead_1.7.9
>>> [5] Rsamtools_1.1.8      lattice_0.18-8       Biostrings_2.17.24
>>> GenomicRanges_1.1.17
>>> [9] IRanges_1.7.12
>>>
>>> loaded via a namespace (and not attached):
>>> [1] DBI_0.2-5     grid_2.12.0   hwriter_1.2   RSQLite_0.9-1 xtable_1.5-6
>>> an
>>>
>>>        
>>
>>