[Bioc-sig-seq] matchPDict, fixed=FALSE; "walk_tb_nonfixed_subject(): implement me"
Hervé Pagès
hpages at fhcrc.org
Fri Jun 25 20:03:16 CEST 2010
Hi Ludo,
Yes, matchPDict() used to support fixed=FALSE. It still does, but only
when the PDict object is built with the old implementation of the
Aho-Corasick algorithm (algo="ACtree"):
> pdict <- PDict(c("ACCT", "GACC", "CCCT", "CCCA"), algo="ACtree")
> matchPDict(pdict, DNAString("GNCCT"), fixed="pattern")[[3]]
IRanges of length 1
    start end width
[1]     2   5     4
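To make explicit what fixed="pattern" does in the example above, here is a
minimal base-R sketch. It is only an illustration, not how Biostrings
implements it: there is no Aho-Corasick tree, no mismatch support, and only
the N ambiguity code is handled. match_nonfixed_subject is a hypothetical
helper name, not a Biostrings function.

```r
# Hypothetical helper (not part of Biostrings): naive scan in which an "N"
# in the subject matches any letter of the pattern, while the letters of the
# pattern are taken literally (the fixed="pattern" behaviour, N code only).
match_nonfixed_subject <- function(pattern, subject) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_windows <- length(s) - length(p) + 1L
  if (n_windows < 1L) return(integer(0))
  starts <- seq_len(n_windows)
  offsets <- seq_along(p) - 1L
  hit <- vapply(starts, function(i) {
    window <- s[i + offsets]
    # A position agrees if the letters are equal or the subject has an N.
    all(window == p | window == "N")
  }, logical(1))
  starts[hit]
}

patterns <- c("ACCT", "GACC", "CCCT", "CCCA")
hits <- lapply(patterns, match_nonfixed_subject, subject = "GNCCT")
hits[[3]]  # start position(s) of "CCCT" in "GNCCT"
```

This scans every window of the subject for every pattern, so it is far too
slow for a dictionary of a million reads; it is only meant to show which
matches the option is expected to report.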
The "ACtree" algo has been superseded by the "ACtree2" algo, a faster
and more memory efficient implementation of the same algo that uses a
different internal representation than "ACtree" for the Aho-Corasick
tree.
The 'fixed=FALSE' (or 'fixed="pattern"') option is not yet supported
for PDict objects built with the new algo. I'll add this ASAP. Thanks
for the reminder!
Cheers,
H.
On 06/25/2010 03:46 AM, Ludo Pagie wrote:
>
> hi all,
>
> I'm trying to match 80bp reads to a construct, a sequence of +/-
> 550bp. The construct contains a stretch of N's, representing a
> stretch of 20 random nucleotides.
>
> I constructed a pdict from the reads, and a DNAString from the
> construct. When I run matchPDict with fixed=TRUE, all goes fine
> and I get 1.2M matches.
>
>> construct_mindex<- matchPDict(pdict, DNAString(construct), max.mismatch=3)
>> sum(countIndex(construct_mindex))
> [1] 1280283
>
>
> With fixed=FALSE I get the following error:
>
>> construct_mindex<- matchPDict(pdict, DNAString(construct), max.mismatch=3, fixed=FALSE)
> Error in .match.PDict3Parts.XString(pdict@threeparts, subject, max.mismatch, :
> walk_tb_nonfixed_subject(): implement me
>
> Is there a way around this non-implemented function? Or any
> chance it will be implemented soon? Or am I missing something?
>
> If you need more background let me know.
>
> Ludo
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-06-17 r52313)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ShortRead_1.7.7 Rsamtools_1.1.7 lattice_0.18-8
> [4] GenomicRanges_1.1.12 Biostrings_2.17.7 IRanges_1.7.7
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 grid_2.12.0 hwriter_1.2 tools_2.12.0
>