[Rd] reshape scaling with large numbers of times/rows

Thu Aug 24 16:00:13 CEST 2006

Here's the essence of a solution (0.14 sec for this bit):

res <- with(betterTest, {
    subjects <- levels(subject)
    loci <- levels(locus)
    ## one row per subject, one column per locus
    ## replace "" by as.character(NA) if you prefer
    res <- matrix("", length(subjects), length(loci),
                  dimnames = list(subjects, loci))
    ## two-column index matrix: (row, column) for each observation
    ind <- cbind(as.integer(subject), as.integer(locus))
    res[ind] <- as.character(genotype)
    res
})

This produces a character matrix, mainly because that is what I had 
before.  I think a matrix is probably a better data structure than a 
wide data frame here, but it can easily be converted (which will take a 
little longer: 1.5s if you use as.data.frame, though there are much 
faster ways).  It is certainly possible to do the same thing with a 
factor matrix result, but there seems to be a problem with the 
subsetting method for such objects.
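
As a minimal sketch of the same idea on toy data (the names toy, wide 
and wideDF are made up for illustration, and one subject/locus 
combination is deliberately missing), something like this shows the 
matrix indexing and one simple conversion to a data frame:

```r
## Toy long-format data: 3 subjects, 2 loci, (s2, b) is missing.
## All object names here are hypothetical, not from the original data.
toy <- data.frame(subject  = factor(c("s1", "s1", "s2", "s3", "s3")),
                  locus    = factor(c("a",  "b",  "a",  "a",  "b")),
                  genotype = c("AA", "AB", "BB", "AB", "AA"),
                  stringsAsFactors = FALSE)

wide <- with(toy, {
    ## start from an all-NA matrix so missing combinations stay NA
    m <- matrix(NA_character_, nlevels(subject), nlevels(locus),
                dimnames = list(levels(subject), levels(locus)))
    ## each row of the two-column index matrix addresses one cell of m
    m[cbind(as.integer(subject), as.integer(locus))] <- genotype
    m
})

## one way to get a data frame, if one is really needed
wideDF <- as.data.frame(wide, stringsAsFactors = FALSE)
```

The missing combination is simply left at its initial value, so no 
special handling is needed for incomplete data.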

I would suggest generally avoiding data frames where all the columns are 
of one type and you are concerned with efficiency.
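
The code above relies on subject and locus being factors.  If they 
arrive as character vectors (e.g. from a database interface), a sketch 
of the conversion, on a made-up data frame called raw, might be:

```r
## Hypothetical data as it might arrive from a database: all character.
raw <- data.frame(subject  = c("s2", "s1", "s2"),
                  locus    = c("b",  "a",  "a"),
                  genotype = c("AB", "AA", "BB"),
                  stringsAsFactors = FALSE)

## convert the two classifying columns to factors; genotype stays character
raw$subject <- factor(raw$subject)
raw$locus   <- factor(raw$locus)

## the integer codes index into the sorted levels, whatever the row order
codes <- cbind(as.integer(raw$subject), as.integer(raw$locus))
```

Because the codes refer to the levels rather than to row positions, the 
long data need not be sorted for the matrix indexing to work.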

On Thu, 24 Aug 2006, Mitch Skinner wrote:

> I'd like to thank everyone that's replied so far--more inline:
> 
> On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> > Your example does not correspond to your description.  You have taken a 
> > random number of loci for each subject and measured each a random number 
> > of times:
> 
> You're right.  I was trying to come up with an example that didn't
> require sending out a big hunk of data.  The overall number of
> rows/columns and the data types/sizes in the example were true to life
> but the relationship between columns was not.  Also, in my testing the
> run time of the random example was pretty close to (actually faster
> than) the run time on my real data.
> 
> In the real data, there's about one row per subject/locus pair (some
> combinations are missing).  The genotype data does have character type;
> I'd have to think a bit to see if I could make it into an integer
> vector.  Aside from just making it a factor, of course.
> 
> Thanks to Gabor Grothendieck for demonstrating gl():
> 
> > betterTest=data.frame(subject=as.character(1:70),
> +     locus=as.character(gl(4500, 70)),
> +     genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> > sapply(betterTest, is.factor)
>   subject    locus genotype
>     TRUE     TRUE     TRUE
> > system.time(wideTest <- reshape(betterTest, v.names="genotype",
> +     timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1356.209  178.867 2071.640    0.000    0.000
> > dim(wideTest)
> [1]   70 4501
> > dim(betterTest)
> [1] 315000      3
> 
> This was on a different machine (a 2.2 GHz Athlon 64).  The only
> difference I can think of between betterTest and my actual data is that
> betterTest is ordered.
> 
> > Also, subject and locus are archetypal factors, and forcing them to be 
> > character vectors is just making efficiency problems for yourself.
> 
> Hmmmm, that's the way they're coming out of the database.  I'm using
> RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
> database interface wasn't turning things into factors.  Given my (low)
> level of R knowledge, I'd have to think for a while to convince myself
> that doing so wouldn't make a difference aside from being faster.  Of
> course, if you're asserting that that's the case I'll take your word for
> it.
> 
> > I have an R-level solution that takes 0.2 s on my machine, and involves no 
> > changes to R.
> > 
> > However, you did not give your affiliation and I do not like giving free 
> > consultancy to undisclosed commercial organizations.  Please in future use 
> > a proper signature block so that helpers are aware of your provenance.
> 
> Ah, I hadn't really thought about this, but I see where you're coming
> from.  I work here (my name and this email address are on the page):
> http://egcrc.org/pis/white-c.htm
> Please forgive my r-devel-newbieness; this is less of an issue on the
> other mailing lists I follow.
> 
> When there's a chance (however slim, in this case) that something I
> write will end up getting used by someone else, I usually use my
> personal email address and general identity, because I know it'll follow
> me if I change jobs.  The concern, of course, being that someone using
> it will want to get in touch with me sometime in the far future.  I
> don't exactly have a tenured position.
> 
> I really am trying to give at least as much as I'm taking; hopefully my
> first email shows that I did a healthy bit of
> thinking/reading/googling/coding before posting (maybe too much).
> Apparently the C solution isn't necessary, but doing this in 0.2s is
> pretty amazing.  On the same size data frame?
> 
> Thanks,
> Mitch Skinner                            Tel: 510-985-3192
> Programmer/Analyst
> Ernest Gallo Clinic & Research Center
> University of California, San Francisco
> 

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595