[R-sig-ME] Errors message from glmmADMB package: Zero Inflated, Negative Binomial for large panel dataset

Tue Sep 9 20:16:20 CEST 2014

On 14-09-07 09:28 PM, Leo Yanes wrote:
> Colleagues,
> 
> I have a large panel dataset (7.8 million observations, of which 7.2
> million are zeroes), and when I try to estimate a zero-inflated, negative
> binomial using the glmmADMB package, I get an error message and am at a
> loss.
> 
> The dataset has been declared as panel using the 'plm' package

  I'm not sure what that means, but I'm pretty sure that it's ignored by
glmmADMB (and lme4).

  I think this data set is simply not going to work with glmmADMB, which
tries to construct a *dense* Z model matrix.  I hope you realize that
this is an ambitious project!

, and is
> called 'pdat' (in use below). The time variable is 'month' (29 months of
> data) and the panel identifier is 'studentbin' (~270k studentbins). The
> estimation is about counts for student commencements ('commence') as a
> function of subsidy rates for each student bin over time, amongst other
> independent variables. All up, the .RData file is about ~200Mb of hard
> drive space.
> Here is the code:
> *> fit_zinb <- glmmadmb ( commence ~ subsidy + month + (1|studentbin),
> data=pdat, zeroInflation=TRUE, family="nbinom")*
> 

  Some questions and suggestions:

(1) run a hurdle model in glmer (i.e. two-stage: binomial 0 vs. positive
followed by a negative binomial analysis of the positive data only),
ignoring the fact that the second (NB) stage will be truncated (use
glmer.nb for the second stage). Run simulations etc. to figure out how
much that matters.

(2) Use expectation-maximization code (zipme at
https://groups.nceas.ucsb.edu/non-linear-modeling/projects/owls/R/owls_R_funs.R
) to fit a zero-inflated Poisson model, then correct for overdispersion
in the Poisson model in some reasonable way (e.g. compute the
overdispersion parameter phi via (sum(dev resids^2)/df.residual) and
inflate the standard errors by the square root of phi)

(3) hand-code your model in TMB (see https://github.com/adcomp) or AD
Model Builder, using the fact that you have a simple/nested design to
avoid constructing a gigantic random-effects model matrix (Z), i.e. your
code would be of the form

  eta[i] <- beta0+beta1*subsidy[i]+beta2*month[i]+u[studentbin[i]]
...

However, you'll still have to deal with the fact that you're doing
nonlinear optimization of a model with 270,000 (!) parameters ...

I don't know if anyone has come up with an iterative map/reduce
formulation for simple (single scalar random effect, or at least nested)
mixed models, but I would certainly be interested/that seems like a good
thing to do. (For this case, it should be possible to break the PWRSS
calculation up into independent blocks ...)