[R-sig-ME] alternative prediction methods for MCMCglmm

Jonathan Judge bachlaw01 at outlook.com
Sun Jan 31 06:04:24 CET 2016


I have a data set with 190k observations, over 100 predictors (including interactions), and four grouping factors, each with between 100 and 1000 levels.  

lme4 fits it in about 3 minutes, but lme4 is of course fast at such problems because it works from maximum likelihood.  

What I'd prefer to use is MCMC.  Stan is overwhelmed by a model of this size, but MCMCglmm, to its credit, can run its standard 13,000 iterations in about 2 hours.  (I suspect MCMCglmm's block sampling is well suited to working with large groups.)  The sampling is relatively stable after the 3,000 burn-in iterations, so I have set thin=1 for the last 10,000 iterations to reach reasonable effective sample sizes in a reasonable amount of time.  This gives me 10,000 saved iterations over the 190k observations.

The problem is that while the model runs, the chains are saved, and the parameter estimates in the model summary look good, the amount of saved data overwhelms the built-in predict function when I try to use posterior = 'all'.  Whether I use a machine with 8 or 64 GB of RAM, on Windows 10 or Linux, a prediction from this model uses up virtually all of the available memory and then reports that it cannot allocate a vector of ___ size (the figure varies).  Sometimes the model refuses to do predictions at all; sometimes it will do predictions but chokes when asked for intervals along with predictions; and sometimes it will do predictions with no marginalization but object when I attempt to marginalize only certain groups.  There is no problem if I collapse the posterior to a particular point (mean, median, etc.), but for me that defeats much of the point of using (and waiting for) MCMC.  

In reviewing the traceback on the failed predictions, the process seems to choke on the calls the prediction function makes to coda for its HPD calculation, via apply.  I recall Jarrod saying that he considered the predict function to be somewhat crude, and I therefore wonder whether anyone can suggest either (1) revisions to the predict function that might cut down on memory usage, or (2) alternative / manual ways to marginalize over the predictors in the MCMCglmm fit that would not consume so much memory.  I doubt that I am the first person to face this challenge.  
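To make (2) concrete, here is the kind of chunked manual computation I have in mind.  This is only a sketch: chunked_predict and its arguments are hypothetical, W stands for the combined fixed+random design matrix (which predict.MCMCglmm builds internally), and I use simple quantile intervals rather than coda's HPD intervals just to keep it self-contained:

```r
## Sketch: chunked posterior prediction that avoids materializing the full
## niter x nobs matrix at once.  Assumes the model was fitted with pr=TRUE,
## so Sol holds samples for both fixed and random effects, and that W is
## the matching combined design matrix (hypothetical here).
chunked_predict <- function(Sol, W, chunk = 1000, level = 0.95) {
  Sol  <- as.matrix(Sol)
  nobs <- nrow(W)
  out  <- matrix(NA_real_, nobs, 3,
                 dimnames = list(NULL, c("fit", "lwr", "upr")))
  probs <- c((1 - level) / 2, 1 - (1 - level) / 2)
  for (start in seq(1, nobs, by = chunk)) {
    idx <- start:min(start + chunk - 1, nobs)
    ## niter x length(idx) block: small enough to hold in memory
    post <- Sol %*% t(W[idx, , drop = FALSE])
    out[idx, "fit"] <- colMeans(post)
    ## quantile intervals; substitute coda::HPDinterval() per block
    ## if highest-density intervals are wanted
    out[idx, c("lwr", "upr")] <- t(apply(post, 2, quantile, probs = probs))
  }
  as.data.frame(out)
}
```

At, say, 1,000 observations per block, the largest intermediate is 10,000 x 1,000 doubles (~76 MB) instead of 10,000 x 190,000 (~14 GB).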

A dummy set of data that will substantially replicate the problem is provided below.  I am using the current version of lme4, MCMCglmm 2.22.1, and R 3.2.3.  

Thanks very much.

Jonathan

***********************************************************

### Prepare Environment ###

rm(list=ls(all=TRUE))

set.seed(1234)

### Create vectors and df ###

y <- scale(c(rep(0,125000), rep(.4,5000), rep(.65,45000), rep(1.1,10000), rep(1.6,5000)), center=T, scale=T)
b <- as.factor(c(rep(1,100000), rep(2,90000)))
c <- as.factor(c(rep_len(1:30, 190000)))
d <- as.factor(c(rep(1,60000), rep(2,50000), rep(3,40000), rep(4,40000)))
e <- scale(rnorm(190000,0,1), center=T, scale=T)
f <- scale(rnorm(190000,0,1), center=T, scale=T)
g <- scale(log(rpois(190000,100)), center=T, scale=T)
h <- as.factor(c(rep(1,25000), rep(2,22000), rep(3,26000), rep(4,30000), rep(5,17000), rep(6,15000), rep(7,30000), rep(8,25000)))
j <- as.factor(rpois(190000,20000))
k <- as.factor(rpois(190000,7500))
m <- as.factor(rpois(190000,150))

df <- data.frame(y,b,c,d,e,f,g,h,j,k,m)

### lme4 fit ###

require(lme4)
fit.lme4 <- lmer(y ~ b*c +
                   d*e +   # d*e expands to d + e + d:e, so no separate e term needed
                   f +
                   g +
                   h +
                   (1|j) +
                   (1|k) +
                   (1|m),
                 data=df)

### MCMCglmm fit ###
require(MCMCglmm)
fit.MCMCglmm <- MCMCglmm(y ~ b*c +
                           d*e +   # d*e expands to d + e + d:e, so no separate e term needed
                           f +
                           g +
                           h,
                         random = ~ j + k + m,
                         data=df,
                         pr=T,     # save random-effect samples for prediction
                         thin=1)   # defaults: nitt=13000, burnin=3000

******************************************************

In terms of error codes, they are not always the same, but here is the traceback from an attempt to predict the response with a confidence interval:


> df$preds <- predict(fit.MCMCglmm, type='response', interval='confidence', posterior='all')
There were 14 warnings (use warnings() to see them)
Error in t(apply(object$Sol, 1, function(x) { : 
  error in evaluating the argument 'x' in selecting a method for function 't': Error: cannot allocate vector of size 14.2 Gb
> traceback()
3: t(apply(object$Sol, 1, function(x) {
       (W %*% x)@x
   }))
2: predict.MCMCglmm(fit.MCMCglmm, type = "response", interval = "confidence", 
       posterior = "all")
1: predict(fit.MCMCglmm, type = "response", interval = "confidence", 
       posterior = "all")
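Incidentally, the 14.2 Gb in the error message matches exactly the dense niter x nobs matrix that the t(apply(...)) step tries to build:

```r
## the failing line materializes all 10,000 saved iterations x 190,000
## observations as a dense matrix of 8-byte doubles
niter <- 10000
nobs  <- 190000
niter * nobs * 8 / 2^30   # bytes -> GiB: about 14.2
```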


