[Bioc-sig-seq] Bioc short read directions
Herve Pages
hpages at fhcrc.org
Thu Apr 3 21:46:31 CEST 2008
Hi Stephen,
Can we please keep all this discussion on the list? Thanks!
Stephen Henderson wrote:
> Hi
> I don't understand why you need to rebuild the trie??/suffix
> tree??/aho-corasick structure-- which as you say is slow.
>
> Can you not just reverse and reverse complement the reference and stream
> it past the trie again (which is faster). Everything is then 5' to 3'
> reads/trie, and reference and the trusted band is in the right place? Or
> have I misunderstood this?
I don't _need_ to rebuild the tree. Martin already suggested what you are
suggesting. Rebuilding the tree is an alternative that, in some cases, could
be faster (and more memory efficient) when the input dictionary is not too big
compared to the subject. It also has the little advantage to preserve the
chromosome locations, which will then be reported relatively to the plus strand
with both PDict(reads) and PDict(RC(reads)).
Hey, building the tree for 265400 25-mers takes about 1 second on my machine:
> length(dict0)
[1] 265400
> unique(nchar(dict0))
[1] 25
> dict0[1:5]
[1] "CCTGAATCCTGGCAATGTCATCATC" "ATCCTGGCAATGTCATCATCAATGG"
[3] "ATCAGTTGTCAACGGCTAATACGCG" "ATCAATGGCGATTGCCGCGTCTGCA"
[5] "CCGCGTCTGCAATGTGAGGGCCTAA"
> system.time({pdict1 <- PDict(dict0, tb.start=1, tb.end=12)}, gcFirst=TRUE)
user system elapsed
0.888 0.028 0.916
> object.size(pdict1)
[1] 36334128
and the size is way smaller than Human chr1!
>
> Ps Is the whole reads in the trie? Or just the first 12?
Just the first 12.
> And what is
> stored in the nodes of this
From src/Biostrings.h in Biostrings:
#define MAX_CHILDREN_PER_ACNODE 4
typedef struct acnode {
int parent_id;
int depth;
int child_id[MAX_CHILDREN_PER_ACNODE];
int flink;
int P_id;
} ACNode;
So the size of a node is 32 bytes.
I'll add some comments in the file to explain a little bit what those slots are.
You can see the nodes at the R level:
> pdict1 at actree
743661-node 4-ary Aho-Corasick tree
> pdict1 at actree[0:4]
parent_id depth 84 71 67 65 flink P_id
0 0 0 1 13 56 121 -1 -1
1 0 1 78 2 100 246 -1 -1
2 1 2 1033 3 111 46 -1 -1
3 2 3 3199 3502 4 842 -1 -1
4 3 4 5 6085 1540 5524 -1 -1
84, 71, 67 and 65 are the ASCII codes for T, G, C and A.
Note that the failure links (flink slot) are initialized with -1 and they are set
with their real values on-the-fly when the PDict object is used. So the more the
PDict object is used, the more failure links are known, and the fastest
the PDict object will be.
Cheers,
H.
More information about the Bioc-sig-sequencing
mailing list