[Rd] reshape scaling with large numbers of times/rows
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Aug 24 16:00:13 CEST 2006
Here's the essence of a solution (0.14 sec for this bit):
res <- with(betterTest, {
    subjects <- levels(subject)
    loci <- levels(locus)
    ## replace "" by as.character(NA) if you prefer
    res <- matrix("", length(subjects), length(loci),
                  dimnames = list(subjects, loci))
    ind <- cbind(as.integer(subject), as.integer(locus))
    res[ind] <- as.character(genotype)
    res
})
This produces a character matrix, mainly because that is what I had
before. I think a matrix is probably a better data structure than a
wide data frame, and it can easily be converted to one (which takes a
little longer: 1.5s if you use as.data.frame, but there are much faster
ways). It is certainly possible to do the same thing with a factor matrix
result, but there seems to be a problem with the subsetting method for
such objects.
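
A small self-contained sketch of the same matrix-indexing idea, followed by
one way to convert the result to a data frame. The toy data frame `long` and
its values are invented for illustration; only the technique (assigning into
a matrix through a two-column integer index matrix) comes from the code
above. It uses NA rather than "" for missing subject/locus pairs.

```r
## Toy long-format data (hypothetical values); some subject/locus pairs are missing
long <- data.frame(
  subject  = factor(c("s1", "s1", "s2", "s2", "s3")),
  locus    = factor(c("a",  "b",  "a",  "c",  "b")),
  genotype = c("AA", "AB", "BB", "AA", "AB"),
  stringsAsFactors = FALSE
)

wide <- with(long, {
  res <- matrix(NA_character_, nlevels(subject), nlevels(locus),
                dimnames = list(levels(subject), levels(locus)))
  ## each row of ind is one (row, column) position in res
  ind <- cbind(as.integer(subject), as.integer(locus))
  res[ind] <- as.character(genotype)
  res
})

## Convert to a data frame, keeping the columns as character vectors
wide_df <- as.data.frame(wide, stringsAsFactors = FALSE)
```

Here `wide` has one row per subject and one column per locus, and any pair
absent from `long` stays NA.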
I would suggest generally avoiding data frames where all the columns are
of one type and you are concerned with efficiency.
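
The solution at the top of this message relies on subject and locus being
factors, since it calls levels() and as.integer() on them. If they arrive as
character vectors (for example from a database interface, or from
data.frame() in R 4.0.0 and later, where stringsAsFactors defaults to FALSE),
they can be converted first. A minimal sketch, assuming a hypothetical data
frame `raw` with made-up contents:

```r
## Hypothetical input with character columns
raw <- data.frame(
  subject  = c("10", "2", "10"),
  locus    = c("x", "x", "y"),
  genotype = c("AA", "AB", "BB"),
  stringsAsFactors = FALSE
)

raw$subject <- factor(raw$subject)
raw$locus   <- factor(raw$locus)

## as.integer() on a factor gives the level codes (positions in levels()),
## not the original labels, which is what the index matrix needs
codes <- as.integer(raw$subject)
```

Note that the levels are sorted as strings, so "10" comes before "2"; the
codes are only row positions and the labels are kept in the dimnames.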
On Thu, 24 Aug 2006, Mitch Skinner wrote:
> I'd like to thank everyone that's replied so far--more inline:
>
> On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> > Your example does not correspond to your description. You have taken a
> > random number of loci for each subject and measured each a random number
> > of times:
>
> You're right. I was trying to come up with an example that didn't
> require sending out a big hunk of data. The overall number of
> rows/columns and the data types/sizes in the example were true to life
> but the relationship between columns was not. Also, in my testing the
> run time of the random example was pretty close to (actually faster
> than) the run time on my real data.
>
> In the real data, there's about one row per subject/locus pair (some
> combinations are missing). The genotype data does have character type;
> I'd have to think a bit to see if I could make it into an integer
> vector. Aside from just making it a factor, of course.
>
> Thanks to Gabor Grothendieck for demonstrating gl():
>
> > betterTest=data.frame(subject=as.character(1:70),
> locus=as.character(gl(4500, 70)),
> genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> > sapply(betterTest, is.factor)
>  subject    locus genotype
>     TRUE     TRUE     TRUE
> > system.time(wideTest <- reshape(betterTest, v.names="genotype",
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1356.209 178.867 2071.640 0.000 0.000
> > dim(wideTest)
> [1] 70 4501
> > dim(betterTest)
> [1] 315000 3
>
> This was on a different machine (a 2.2 GHz Athlon 64). The only
> difference I can think of between betterTest and my actual data is that
> betterTest is ordered.
>
> > Also, subject and locus are archetypal factors, and forcing them to be
> > character vectors is just making efficiency problems for yourself.
>
> Hmmmm, that's the way they're coming out of the database. I'm using
> RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
> database interface wasn't turning things into factors. Given my (low)
> level of R knowledge, I'd have to think for a while to convince myself
> that doing so wouldn't make a difference aside from being faster. Of
> course, if you're asserting that that's the case I'll take your word for
> it.
>
> > I have an R-level solution that takes 0.2 s on my machine, and involves no
> > changes to R.
> >
> > However, you did not give your affiliation and I do not like giving free
> > consultancy to undisclosed commercial organizations. Please in future use
> > a proper signature block so that helpers are aware of your provenance.
>
> Ah, I hadn't really thought about this, but I see where you're coming
> from. I work here (my name and this email address are on the page):
> http://egcrc.org/pis/white-c.htm
> Please forgive my r-devel-newbieness; this is less of an issue on the
> other mailing lists I follow.
>
> When there's a chance (however slim, in this case) that something I
> write will end up getting used by someone else, I usually use my
> personal email address and general identity, because I know it'll follow
> me if I change jobs. The concern, of course, being that someone using
> it will want to get in touch with me sometime in the far future. I
> don't exactly have a tenured position.
>
> I really am trying to give at least as much as I'm taking; hopefully my
> first email shows that I did a healthy bit of
> thinking/reading/googling/coding before posting (maybe too much).
> Apparently the C solution isn't necessary, but doing this in 0.2s is
> pretty amazing. On the same size data frame?
>
> Thanks,
> Mitch Skinner Tel: 510-985-3192
> Programmer/Analyst
> Ernest Gallo Clinic & Research Center
> University of California, San Francisco
>
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595