[Rd] reshape scaling with large numbers of times/rows
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Aug 24 12:16:58 CEST 2006
Your example does not correspond to your description. You have taken a
random number of loci for each subject and measured each a random number
of times:
> with(test, table(table(subject, locus)))
0 1 2 3 4 5 6 7 8
118021 114340 54963 17848 4288 814 136 16 5
Why would you want to make that into a wide data frame? 'testWide' only
contains a subset of the original data frame since you are violating the
assumptions of reshape().
Also, subject and locus are archetypal factors, and forcing them to be
character vectors is just making efficiency problems for yourself.
I have an R-level solution that takes 0.2 s on my machine, and involves no
changes to R.
However, you did not give your affiliation and I do not like giving free
consultancy to undisclosed commercial organizations. Please in future use
a proper signature block so that helpers are aware of your provenance.
On Wed, 23 Aug 2006, Mitch Skinner wrote:
> Hello,
>
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C. I've been reading the
> "writing R extensions" documentation and I have some questions about
> they best way to write the C bits, but first let me describe my problem:
>
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> a wide data frame with one column per locus ("time"). On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4. Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
>
> > version
> _
> platform i686-redhat-linux-gnu
> arch i686
> os linux-gnu
> system i686, linux-gnu
> status
> major 2
> minor 3.1
> year 2006
> month 06
> day 01
> svn rev 38247
> language R
> version.string Version 2.3.1 (2006-06-01)
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506 120.755 1267.568 0.000 0.000
>
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
>
> for(i in seq(length = length(times))) {
> thistime <- data[data[,timevar] %in% times[i],]
> rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> v.names]
> }
>
> I don't really understand everything going on under the hood here, but I
> believe the complexity of this loop is something like
> O(length(times)*(nrow(data)+(nrow(rval)*length(varying))). The profile
> shows the lion's share (90%) of the time being spent in [.data.frame.
>
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into the
> destination rows/columns using hash tables to look up the right
> row/column.
>
> Assuming the hashing is constant-time bounded, then the reshape becomes
> O(nrow(data)*length(varying)).
>
> I'd like to use the abitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it. I think that functionality is
> useful to other C code, but the functions that I'm interested in are not
> part of the R api (they're defined with "static"). Assuming copy/paste
> is out, I can see two options: 1. to remove "static" from the
> declarations of some of the functions, and add prototypes for those
> functions to a new src/library/stats/src/reshape.c file (or to one of
> the header files in src/include), or 2. to add C functions to do the
> reshaping to src/main/unique.c and call those from
> src/library/stats/R/reshape.R.
>
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more. I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
>
> I've written a proof-of-concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than the
> R version of reshape. It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.
>
> The change I made to the reshape R function looks like this:
> =====================
> for(i in seq(length = length(times))) {
> - thistime <- data[data[,timevar] %in% times[i],]
> - rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> - v.names]
> + for (j in seq(length(v.names))) {
> + rval[,varying[j,i]] <-
> rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> + length.out=nrow(rval))
> + }
> }
>
> + colMap <- match(varying, names(rval))
> + .Call("do_reshape_wide",
> + data[[idvar]], data[[timevar]], data[v.names],
> + rval, colMap,
> + v.names, times, rval[[idvar]])
> +
> if (!is.null(new.row.names))
> row.names(rval) <- new.row.names
> =====================
>
> This part:
> rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
> is to initialize the output with appropriately-typed vectors full of
> NAs; if there's a better/right way to do this please let me know.
>
> The do_reshape_wide C function looks like this:
> =====================
> SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData,
> SEXP wideFrame, SEXP colMap,
> SEXP vnames, SEXP times, SEXP wideIds) {
> HashData idHash, timeHash;
> int longRows, numVarying;
> int rowNum, varying;
> int wideRow, curTime;
> SEXP wideCol, longCol;
>
> HashTableSetup(wideIds, &idHash);
> PROTECT(idHash.HashTable);
> DoHashing(wideIds, &idHash);
>
> HashTableSetup(times, &timeHash);
> PROTECT(timeHash.HashTable);
> DoHashing(times, &timeHash);
>
> longRows = length(longIds);
> numVarying = length(vnames);
>
> for (rowNum = 0; rowNum < longRows; rowNum++) {
> /* Lookup returns 1-based answers */
> wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
> curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;
>
> for (varying = 0; varying < numVarying; varying++) {
> /* colMap is 1-based */
> wideCol = VECTOR_ELT(wideFrame,
> INTEGER(colMap)[(numVarying * curTime)
> + varying] - 1);
> longCol = VECTOR_ELT(longData, varying);
>
> SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
> rowNum));
> }
> }
> UNPROTECT(2);
> }
> =====================
>
> None examples I recall from "Writing R extensions" had void C function
> return types; is that a rule?
>
> I've put the code up here:
> http://arctur.us/r/creshape.tar.gz
>
> I've never tried to make a package before, but you should be able to
> just do a R CMD INSTALL on it, if you want to try it out. Here's an R
> session using that code:
> > require(creshape)
> Loading required package: creshape
> [1] TRUE
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- creshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 60.756 1.540 62.598 0.000 0.000
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1278.231 78.389 1387.739 0.000 0.000
>
> Any comments are appreciated,
> Mitch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-devel
mailing list