[R] Maximum number of patterns and speed in grep
mdvaan
mathijsdevaan at gmail.com
Mon Jul 16 17:27:02 CEST 2012
Thanks! That worked like a charm.
Math
Gabor Grothendieck wrote
>
> On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <mathijsdevaan@> wrote:
>> Here's some data (which should give you the error messages):
>>
>> # read in data
>> data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = TRUE, sep = ",")
>>
>> # first paste all data
>> data1 <- paste(data[,1], collapse = "|")
>>
>> # second paste subsets of the data
>> data2a <- paste(data[1:750,1], collapse = "|")
>> data2b <- paste(data[751:1500,1], collapse = "|")
>>
>> # define the object to be searched
>> text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")
>>
>> # match (strapplyc is provided by the gsubfn package)
>> library(gsubfn)
>> strapplyc(text, data1)
>> strapplyc(text, data2a)
>> strapplyc(text, data2b)
>>
>> Thanks in advance!
>>
>
> Although strapplyc seems to handle larger regular expressions than grep
> in R, neither can handle as many alternatives as in your example, so
> process the patterns in chunks:
>
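> k <- 3000 # chunk size
>
> # match text against the alternation built from rows from..(from + k - 1)
> f <- function(from, text) {
>   to <- min(from + k - 1, nrow(data))
>   r <- paste(data[seq(from, to), 1], collapse = "|")
>   r <- gsub("[().*?+{}]", "", r) # strip regex metacharacters from the pattern
>   strapply(text, r)
> }
> ix <- seq(1, nrow(data), k) # first row of each chunk
> # for each string, pool the matches found in every chunk
> out <- lapply(text, function(text) unlist(lapply(ix, f, text)))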
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
>
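For reference, the chunking idea above can also be sketched in base R without gsubfn. This is only an illustration under stated assumptions, not the code from the thread: `company` and the three test strings are made-up stand-ins for `data[, 1]` and `text`, the chunk size is kept tiny so that more than one chunk is used, and each name is wrapped in `\Q...\E` (with `perl = TRUE`) so that it is matched literally rather than having its metacharacters stripped.

```r
# Hypothetical stand-ins for data[, 1] and text from the thread.
company <- c("Santa Fe Gold Corp", "Starpharma Holdings",
             "Acme (UK) Ltd.", "Foo+Bar Inc")
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "nothing from Acme UK Ltd here")

k <- 2  # chunk size; deliberately tiny so that several chunks are exercised

# Split the names into chunks of at most k entries.
chunks <- split(company, ceiling(seq_along(company) / k))

# Build one alternation per chunk. \Q...\E asks PCRE to treat each name
# literally (assumption: no name contains the sequence \E itself).
patterns <- vapply(chunks,
                   function(x) paste0("\\Q", x, "\\E", collapse = "|"),
                   character(1))

# For one string, collect the matches found by every chunk's pattern.
match_one <- function(s) {
  hits <- lapply(patterns,
                 function(p) regmatches(s, gregexpr(p, s, perl = TRUE))[[1]])
  unlist(hits, use.names = FALSE)
}

out <- lapply(text, match_one)
```

Because the names are quoted rather than stripped, the third string is not reported as a match for "Acme (UK) Ltd." here. Whether that stricter behaviour is wanted depends on the data.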