[Rd] reshape scaling with large numbers of times/rows

Thu Aug 24 12:16:58 CEST 2006

Your example does not correspond to your description.  You have taken a 
random number of loci for each subject and measured each a random number 
of times:

> with(test, table(table(subject, locus)))

     0      1      2      3      4      5      6      7      8
118021 114340  54963  17848   4288    814    136     16      5

Why would you want to make that into a wide data frame?  'testWide' only 
contains a subset of the original data frame since you are violating the 
assumptions of reshape().

Also, subject and locus are archetypal factors, and forcing them to be 
character vectors is just making efficiency problems for yourself.

I have an R-level solution that takes 0.2 s on my machine, and involves no 
changes to R.

However, you did not give your affiliation and I do not like giving free 
consultancy to undisclosed commercial organizations.  Please in future use 
a proper signature block so that helpers are aware of your provenance.

On Wed, 23 Aug 2006, Mitch Skinner wrote:

> Hello,
> 
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C.  I've been reading the
> "writing R extensions" documentation and I have some questions about
> they best way to write the C bits, but first let me describe my problem:
> 
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> a wide data frame with one column per locus ("time").  On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
> 
> > version
>                _
> platform       i686-redhat-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status
> major          2
> minor          3.1
> year           2006
> month          06
> day            01
> svn rev        38247
> language       R
> version.string Version 2.3.1 (2006-06-01)
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506  120.755 1267.568    0.000    0.000
> 
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
> 
>         for(i in seq(length = length(times))) {
>             thistime <- data[data[,timevar] %in% times[i],]
>             rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> v.names]
>         }
> 
> I don't really understand everything going on under the hood here, but I
> believe the complexity of this loop is something like
> O(length(times)*(nrow(data)+(nrow(rval)*length(varying))).  The profile
> shows the lion's share (90%) of the time being spent in [.data.frame.
> 
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into the
> destination rows/columns using hash tables to look up the right
> row/column.
> 
> Assuming the hashing is constant-time bounded, then the reshape becomes
> O(nrow(data)*length(varying)).
> 
> I'd like to use the abitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it.  I think that functionality is
> useful to other C code, but the functions that I'm interested in are not
> part of the R api (they're defined with "static").  Assuming copy/paste
> is out, I can see two options: 1. to remove "static" from the
> declarations of some of the functions, and add prototypes for those
> functions to a new src/library/stats/src/reshape.c file (or to one of
> the header files in src/include), or 2. to add C functions to do the
> reshaping to src/main/unique.c and call those from
> src/library/stats/R/reshape.R.
> 
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more.  I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
> 
> I've written a proof-of-concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than the
> R version of reshape.  It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.
> 
> The change I made to the reshape R function looks like this:
> =====================
>          for(i in seq(length = length(times))) {
> -            thistime <- data[data[,timevar] %in% times[i],]
> -            rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> -                                           v.names]
> +          for (j in seq(length(v.names))) {
> +            rval[,varying[j,i]] <-
> rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> +                                       length.out=nrow(rval))
> +          }
>          }
> 
> +        colMap <- match(varying, names(rval))
> +        .Call("do_reshape_wide",
> +              data[[idvar]], data[[timevar]], data[v.names],
> +              rval, colMap,
> +              v.names, times, rval[[idvar]])
> +
>          if (!is.null(new.row.names))
>              row.names(rval) <- new.row.names
> =====================
> 
> This part:
> rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
> is to initialize the output with appropriately-typed vectors full of
> NAs; if there's a better/right way to do this please let me know.
> 
> The do_reshape_wide C function looks like this:
> =====================
> SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData, 
>                      SEXP wideFrame, SEXP colMap,
>                      SEXP vnames, SEXP times, SEXP wideIds) {
>     HashData idHash, timeHash;
>     int longRows, numVarying;
>     int rowNum, varying;
>     int wideRow, curTime;
>     SEXP wideCol, longCol;
>     
>     HashTableSetup(wideIds, &idHash);
>     PROTECT(idHash.HashTable);
>     DoHashing(wideIds, &idHash);
> 
>     HashTableSetup(times, &timeHash);
>     PROTECT(timeHash.HashTable);
>     DoHashing(times, &timeHash);
> 
>     longRows = length(longIds);
>     numVarying = length(vnames);
> 
>     for (rowNum = 0; rowNum < longRows; rowNum++) {
>         /* Lookup returns 1-based answers */
>         wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
>         curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;
>         
>         for (varying = 0; varying < numVarying; varying++) {
>             /* colMap is 1-based */
>             wideCol = VECTOR_ELT(wideFrame, 
>                                  INTEGER(colMap)[(numVarying * curTime)
>                                                  + varying] - 1);
>             longCol = VECTOR_ELT(longData, varying);
> 
>             SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
> rowNum));
>         }
>     }
>     UNPROTECT(2);
> }
> =====================
> 
> None examples I recall from "Writing R extensions" had void C function
> return types; is that a rule?
> 
> I've put the code up here:
> http://arctur.us/r/creshape.tar.gz
> 
> I've never tried to make a package before, but you should be able to
> just do a R CMD INSTALL on it, if you want to try it out.  Here's an R
> session using that code:
> > require(creshape)
> Loading required package: creshape
> [1] TRUE
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- creshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 60.756  1.540 62.598  0.000  0.000
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1278.231   78.389 1387.739    0.000    0.000
> 
> Any comments are appreciated,
> Mitch
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595