[R] Maximum number of patterns and speed in grep
mdvaan
mathijsdevaan at gmail.com
Mon Jul 23 17:17:39 CEST 2012
Hi,
I have a minor follow-up question:
In the example below, "ann" and "nn" are matched in the third element of
text. I would like to ignore every match that is immediately followed by a
letter (a character in [:alpha:]). How do I do this while keeping the
"ignore.case = TRUE" argument of the strapply function?
So the output should be:
[[1]]
[1] "Santa Fe Gold Corp"
[[2]]
[1] "Starpharma Holdings"
[[3]]
NULL
Rather than:
[[1]]
[1] "Santa Fe Gold Corp"
[[2]]
[1] "Starpharma Holdings"
[[3]]
[1] "ann" "nn"
Thanks!
library(gsubfn)
# read in data (the first column holds the patterns to search for)
data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = TRUE, sep = ",")
# define the object to be searched
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")
k <- 3000 # chunk size
# match one chunk of at most k patterns, starting at row 'from', against text
f <- function(from, text) {
  to <- min(from + k - 1, nrow(data))
  r <- paste(data[seq(from, to), 1], collapse = "|")
  r <- gsub("[().*?+{}]", "", r) # drop regex metacharacters from the patterns
  strapply(text, r, ignore.case = TRUE)
}
ix <- seq(1, nrow(data), k) # first row of each chunk
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))
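One possible way to get the desired output, offered as a sketch rather than a tested solution for the full data: wrap the alternation in a non-capturing group and append a negative lookahead, so that a match is rejected when a letter follows it. The patterns vector below is a hypothetical stand-in for one chunk of the first column of data, and base R's gregexpr/regmatches is used in place of strapply so that the example runs without gsubfn.

```r
# Hypothetical stand-in for one chunk of patterns from the first column of data
patterns <- c("Santa Fe Gold Corp", "Starpharma Holdings", "ann", "nn")

text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")

# Group the alternatives, then require that no letter follows the match.
# (?:...) is a non-capturing group, (?!...) a negative lookahead.
r <- paste0("(?:", paste(patterns, collapse = "|"), ")(?![[:alpha:]])")

# perl = TRUE is needed for the lookahead; ignore.case = TRUE is kept
out <- regmatches(text, gregexpr(r, text, ignore.case = TRUE, perl = TRUE))
print(out)
```

With this approach an element without matches comes back as character(0) rather than NULL. The same pattern may also work in strapply with perl = TRUE added to the call, but that is an assumption and has not been checked against the full pattern list.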
--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4637458.html
Sent from the R help mailing list archive at Nabble.com.