[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)
Patrick Aboyoun
paboyoun at fhcrc.org
Fri Jul 23 16:09:32 CEST 2010
Actually Biostrings::vcountPattern would be better. Then you can just
find the non-zero elements.
Patrick
On 7/23/10 7:06 AM, Patrick Aboyoun wrote:
> Ivan,
> Have you tried Biostrings::vmatchPattern?
>
>
> Patrick
>
>
> On 7/23/10 7:02 AM, Ivan Gregoretti wrote:
>> Hello listers,
>>
>> How do you count how many of your Illumina tags match a particular
>> sequence?
>>
>>
>>
>> Well, actually, I know at least two ways of doing this but they are
>> both slow. This is perhaps the fastest one:
>>
>> library(ShortRead)
>>
>> # my sequence of interest (in my real case it is several kbs)
>> mySeq<-
>> 'AGTGAAGAAGCAAAGAGGCCTCTGCAAGTCACTCAAGAAGTCTACGATTTACATAGCTGACATA'
>>
>> # load my tags keeping only the ones that pass the filter
>> A<- readAligned("/my/path", "s_5_export.txt.gz", "SolexaExport",
>> filter=compose(alignDataFilter(expression(filtering=="Y"))))
>>
>> # keep only the tags that do not contain Ns
>> A<- clean(A)
>>
>> # count how many tags match exactly some portion of mySeq
>> sum_A<- sum(unlist(sapply(sread(A), function(x)grep(x, mySeq))))
>>
>>
>> This has been running since yesterday and a similar strategy using
>> srdistance has been running for 4 days. Can anybody suggest a more
>> time-efficient calculation?
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list