[R] Maximum number of patterns and speed in grep
Gabor Grothendieck
ggrothendieck at gmail.com
Sun Jul 15 14:27:31 CEST 2012
On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <mathijsdevaan at gmail.com> wrote:
> Here's some data (which should give you the error messages):
>
> # read in data
> data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T, sep = ",")
>
> # first paste all data
> data1 <- paste(data[,1], collapse = "|")
>
> # second paste subsets of the data
> data2a <- paste(data[1:750,1], collapse = "|")
> data2b <- paste(data[751:1500,1], collapse = "|")
>
> # define the object to be searched
> text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")
>
> # match
> strapplyc(text, data1)
> strapplyc(text, data2a)
> strapplyc(text, data2b)
>
> Thanks in advance!
>
Although strapplyc seems to handle larger regular expressions than
grep in R, neither appears able to handle as many alternatives as in
your example, so process the patterns in chunks:
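library(gsubfn)  # provides strapply

k <- 3000 # chunk size (number of patterns per regular expression)

# match text against the patterns in rows from, ..., from + k - 1 of data
f <- function(from, text) {
  to <- min(from + k - 1, nrow(data))
  r <- paste(data[seq(from, to), 1], collapse = "|")
  r <- gsub("[().*?+{}]", "", r)  # drop regex metacharacters from the names
  strapply(text, r)
}

ix <- seq(1, nrow(data), k)  # first row of each chunk
# for each text, combine the matches found in every chunk
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))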
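
If you want to try the chunking idea without downloading the data or loading gsubfn, here is a rough base R sketch of the same approach. Note the assumptions: names_vec is a made-up stand-in for the first column of data, match_chunks is a hypothetical helper name, k is set to 2 only so that more than one chunk is built, and regmatches/gregexpr are used in place of strapply.

```r
# Hypothetical stand-in for data[, 1]; the real names come from data.csv
names_vec <- c("Santa Fe Gold Corp", "Starpharma Holdings",
               "Acme (Holdings) Ltd", "Foo+Bar Inc")
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings")

k <- 2  # chunk size; tiny here only so that chunking is exercised

# drop the same regex metacharacters as in the code above
clean <- gsub("[().*?+{}]", "", names_vec)

# split the names into consecutive chunks of at most k names and
# build one alternation pattern per chunk
chunks <- split(clean, ceiling(seq_along(clean) / k))
patterns <- vapply(chunks, paste, character(1), collapse = "|")

# hypothetical helper: collect the matches of one text against every chunk
match_chunks <- function(txt, patterns) {
  unlist(lapply(patterns,
                function(p) regmatches(txt, gregexpr(p, txt))[[1]]),
         use.names = FALSE)
}

out <- lapply(text, match_chunks, patterns = patterns)
```

Here out is a list with one character vector of matched names per element of text. Keep in mind that stripping the metacharacters changes the names themselves, so a name containing parentheses in the searched text would no longer match.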
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com