[R] Maximum number of patterns and speed in grep

mdvaan mathijsdevaan at gmail.com
Mon Jul 16 17:27:02 CEST 2012


Thanks! That worked like a charm.

Math


Gabor Grothendieck wrote
> 
> On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <mathijsdevaan@> wrote:
>> Here's some data (which should give you the error messages):
>>
>>     # read in data
>>     data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = TRUE, sep = ",")
>>
>>     # first paste all data
>>     data1 <- paste(data[,1], collapse = "|")
>>
>>     # second paste subsets of the data
>>     data2a <- paste(data[1:750,1], collapse = "|")
>>     data2b <- paste(data[751:1500,1], collapse = "|")
>>
>>     # define the object to be searched
>>     text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")
>>
>>     # match
>>     strapplyc(text, data1)
>>     strapplyc(text, data2a)
>>     strapplyc(text, data2b)
>>
>> Thanks in advance!
>>
> 
> Although strapplyc seems to handle larger regular expressions than grep
> in R, neither appears able to handle as many alternatives as your example
> requires, so process the patterns in chunks:
> 
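> library(gsubfn) # provides strapply and strapplyc
> 
> k <- 3000 # chunk size: number of patterns per regular expression
> 
> # match one chunk of patterns, starting at row `from`, against text
> f <- function(from, text) {
> 	to <- min(from + k - 1, nrow(data))
> 	r <- paste(data[seq(from, to), 1], collapse = "|")
> 	r <- gsub("[().*?+{}]", "", r) # drop regex metacharacters from the names
> 	strapply(text, r)
> }
> ix <- seq(1, nrow(data), k) # first row of each chunk
> # for each element of text, collect the matches from all chunks
> out <- lapply(text, function(text) unlist(lapply(ix, f, text)))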
> 
> 
> 
> 
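
The chunking idea in the quoted reply can also be sketched in base R, without gsubfn. This is only an illustration under stated assumptions: the `patterns` vector is a made-up stand-in for the first column of data.csv, the helper names `escape_regex` and `match_in_chunks` are hypothetical, and metacharacters are escaped rather than deleted (a different choice from the quoted code, which strips them).

```r
# Hypothetical stand-in for data[, 1]; the real file is not reproduced here.
patterns <- c("Santa Fe Gold Corp", "Starpharma Holdings", "Acme (Holdings) Inc.")
text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")

# Escape regex metacharacters so each name is matched literally.
escape_regex <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Build one alternation per chunk of k patterns and collect all matches
# for each element of text. Assumes patterns is non-empty.
match_in_chunks <- function(text, patterns, k = 2) {
  starts <- seq(1, length(patterns), by = k)
  lapply(text, function(s) {
    unlist(lapply(starts, function(from) {
      to <- min(from + k - 1, length(patterns))
      rx <- paste(escape_regex(patterns[from:to]), collapse = "|")
      regmatches(s, gregexpr(rx, s, perl = TRUE))[[1]]
    }))
  })
}

out <- match_in_chunks(text, patterns)
```

Here `k = 2` is chosen only so that the toy data spans more than one chunk; for real data a much larger chunk size, such as the 3000 used in the quoted reply, would be the natural starting point.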

