[R] How to find frequent sequences.
Petr Savicky
savicky at cs.cas.cz
Fri Jul 13 10:36:55 CEST 2012
On Thu, Jul 12, 2012 at 03:51:54PM -0500, Vineet Shukla wrote:
> I have independent event sequences for example as follows :
>
> Independent event sequence 1: A, B, C, D
> Independent event sequence 2: A, C, B
> Independent event sequence 3: D, A, B, X, Y, Z
> Independent event sequence 4: C, A, A, B
> Independent event sequence 5: B, A, D
>
> I want to be able to find the most common sequence patterns, such as
>
> {A, B} => 3
> from lines 1,3,5.
>
> Please note that A,C,B must not be counted because C comes in between,
> and line 5 must not be counted either because the order of A,B is reversed.
Hi.
If I understand correctly, the first sequence contains the patterns
AB, BC, CD, that is, the pairs of adjacent elements.
Under this interpretation, AB occurs in lines 1, 3, 4 and not 1, 3, 5.
Is this correct?
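This reading can be checked directly by testing each sequence for an A
immediately followed by a B. The following is only a small sketch; the
helper name hasAdjacent is made up for this illustration.

```r
# the five sequences from the question
lst <- list(
    c("A", "B", "C", "D"),
    c("A", "C", "B"),
    c("D", "A", "B", "X", "Y", "Z"),
    c("C", "A", "A", "B"),
    c("B", "A", "D"))

# hypothetical helper: TRUE if 'first' is immediately followed by 'second' in x
hasAdjacent <- function(x, first, second)
{
    n <- length(x)
    if (n < 2) return(FALSE)
    # compare every element with its successor
    any(x[-n] == first & x[-1] == second)
}

# indices of the sequences that contain A directly followed by B
hits <- which(sapply(lst, hasAdjacent, first="A", second="B"))
hits
```

Under the adjacency interpretation, hits contains 1, 3 and 4.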
If a sequence contains several occurrences of a pattern, for example,
the sequence
A, B, A, B
contains AB twice, should the pattern be counted only once for that sequence?
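The difference between the two ways of counting can be seen on this
sequence. This is a sketch for illustration; the variable names pairs,
countAll and countOnce are chosen only for this example.

```r
x <- c("A", "B", "A", "B")

# all pairs of adjacent elements, in order of occurrence
pairs <- paste(x[-length(x)], x[-1], sep="")

# count every occurrence of each pattern
countAll <- table(pairs)

# count each pattern at most once for this sequence
countOnce <- table(unique(pairs))

countAll
countOnce
```

Here AB is counted twice in countAll and once in countOnce.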
If both assumptions are correct, try the following:
# your input list
lst <- list(
    c("A", "B", "C", "D"),
    c("A", "C", "B"),
    c("D", "A", "B", "X", "Y", "Z"),
    c("C", "A", "A", "B"),
    c("B", "A", "D"))
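# extract the unique patterns from a single sequence as rows of a matrix;
# lpattern is the length of the patterns
singleSeq <- function(x, lpattern)
{
    # embed() returns each window in reverse order, so reversing x first
    # yields rows that read in the original order
    unique(embed(rev(x), lpattern))
}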
lst1 <- lapply(lst, singleSeq, lpattern=2)
# combine the matrices into a single matrix
mat <- do.call(rbind, lst1)
# convert the patterns to strings
pat <- do.call(paste, c(data.frame(mat), sep=""))
out <- table(pat)
out

pat
AA AB AC AD BA BC BX CA CB CD DA XY YZ
 1  3  1  1  1  1  1  1  1  1  1  1  1

names(out)[which.max(out)]

[1] "AB"
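Since the question also asks which lines a pattern comes from, the same
approach can be extended to record the sequence index of every pattern.
This is only a sketch under the assumptions above; the names patList, idx
and where are introduced here for illustration.

```r
lst <- list(
    c("A", "B", "C", "D"),
    c("A", "C", "B"),
    c("D", "A", "B", "X", "Y", "Z"),
    c("C", "A", "A", "B"),
    c("B", "A", "D"))

singleSeq <- function(x, lpattern)
{
    unique(embed(rev(x), lpattern))
}

# unique pattern strings, kept separately for each sequence
patList <- lapply(lst, function(x) {
    m <- singleSeq(x, lpattern=2)
    apply(m, 1, paste, collapse="")
})

# the sequence index belonging to every pattern string
idx <- rep(seq_along(patList), times=sapply(patList, length))

# group the sequence indices by pattern
where <- split(idx, unlist(patList))
where[["AB"]]
```

For the pattern AB, this gives the sequences 1, 3 and 4. Note that embed()
stops with an error if a sequence is shorter than lpattern, so such
sequences would have to be removed beforehand.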
Hope this helps.
Petr Savicky.