[R] How should I improve the following R code?

Seung Jun seungwjun at gmail.com
Tue Jan 8 00:49:33 CET 2008


I'm looking for a way to improve code that's proven to be inefficient.

Suppose that a data source generates the following table every minute:

  Index  Count
  ------------
  0      234
  1      120
  7      11
  30     1

I save the tables in the following CSV format:

  time,index,count
  0,0:1:7:30,234:120:11:1
  1,0:2:3:19,199:110:87:9

That is, each line represents a table, and I have N lines for N minutes of
data collection.
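To make the packing concrete, here is a minimal base-R sketch that unpacks one CSV line back into the table it came from. The names `line`, `fields`, and `tbl` are illustrative only and are not part of my actual script.

```r
# One data line from the CSV format above (time, index, count).
line <- "0,0:1:7:30,234:120:11:1"

# Split the line into its three comma-separated fields.
fields <- strsplit(line, ",", fixed = TRUE)[[1]]

# Unpack the colon-separated fields and convert them to numbers.
tbl <- data.frame(
  index = as.numeric(strsplit(fields[2], ":", fixed = TRUE)[[1]]),
  count = as.numeric(strsplit(fields[3], ":", fixed = TRUE)[[1]])
)
print(tbl)
```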

Now, I wrote the following code to get quantiles for each time period:

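  library(Hmisc)  # provides wtd.quantile()
  # Read the packed columns as character so that strsplit() accepts them
  stbl  <- read.csv("data.csv", stringsAsFactors = FALSE)
  # Unpack each "a:b:c" string into one numeric vector per row
  index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
  count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
  len   <- length(index)
  # Quantiles of index, weighted by count, one time period per iteration
  for (i in seq_len(len)) {
    v <- wtd.quantile(index[[i]], count[[i]], c(0, 0.2, 0.5, 0.8, 1))
    stbl$q0[i] <- v[1]
    stbl$q2[i] <- v[2]
    stbl$q5[i] <- v[3]
    stbl$q8[i] <- v[4]
    stbl$q10[i] <- v[5]
  }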

It works fine for small N, but it quickly becomes slow as N grows.  The
for-loop takes too long.  How could I improve the code or the data
representation so that it runs fast?
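For reference, one possible restructuring is sketched below. It is not a measured fix. It assumes that part of the cost comes from assigning into `stbl` once per row, so it collects all quantiles in a matrix with `vapply()` and attaches them to the data frame in a single step. The two sample rows and the names `lines`, `probs`, and `q` are illustrative; whether this is actually faster for large N would need to be timed.

```r
library(Hmisc)  # provides wtd.quantile()

# Two sample rows in the CSV layout described above (illustrative data).
lines <- c("time,index,count",
           "0,0:1:7:30,234:120:11:1",
           "1,0:2:3:19,199:110:87:9")
stbl <- read.csv(text = lines, stringsAsFactors = FALSE)

index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
probs <- c(0, 0.2, 0.5, 0.8, 1)

# Each call returns five quantiles, so vapply() builds a 5 x N matrix
# without touching the data frame inside the loop.
q <- vapply(seq_along(index),
            function(i) wtd.quantile(index[[i]], count[[i]], probs),
            numeric(length(probs)))

# Transpose to N x 5, name the columns, and attach them all at once.
q <- t(q)
colnames(q) <- c("q0", "q2", "q5", "q8", "q10")
stbl <- cbind(stbl, q)
```

Comparing `system.time()` for both versions on the real data would show whether the per-row assignment or `wtd.quantile()` itself dominates the run time.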

Thanks,
Seung



