[R-sig-ME] Troublesome example of lmer fitting an overidentified model; Stata bad too.

Wed Feb 24 18:12:00 CET 2016

I'm teaching mixed models, working through a lot of homework problems
in the 2 volume set Multilevel and Longitudinal Modeling using Stata,
by S Rave-Hesketh and A. Skrondal. If you did not look at it yet, I
expect you'll find lots of useful, interesting exercises.

Their Exercise 4.4 is about yield from 10 wheat varieties.  The variables are:

"plot"     "variety"  "yield"    "moist"

variety is the grouping variable in the exercise, but it comes in as a
floating point number. I thought that was the root of the surprise
described here, but now I don't think so.

I was reading one student's homework in Stata and I was
stunned/surprised that Stata's "mixed" function would fit a random
effects model that included the grouping variable as a fixed effect
predictor as well.  This appeared to be grossly overidentified to me.

Here's the stata input and output,

. mixed yield c.moist##i.variety || variety:moist, mle stddev

Performing EM optimization:

Performing gradient-based optimization:

Iteration 0:   log likelihood = -42.123671
Iteration 1:   log likelihood = -41.540417
Iteration 2:   log likelihood = -41.520012
Iteration 3:   log likelihood = -41.520012

Computing standard errors:

Mixed-effects ML regression                     Number of obs     =         60
Group variable: variety                         Number of groups  =         10

                                                Obs per group:
                                                              min =          6
                                                              avg =        6.0
                                                              max =          6

                                                Wald chi2(19)     =   37795.42
Log likelihood = -41.520012                     Prob > chi2       =     0.0000

---------------------------------------------------------------------------------
          yield |      Coef.   Std. Err.      z    P>|z|     [95%
Conf. Interval]
----------------+----------------------------------------------------------------
          moist |   .6077128   .0124644    48.76   0.000      .583283
  .6321425
                |
        variety |
             2  |  -2.844581   .8429087    -3.37   0.001    -4.496652
 -1.192511
             3  |  -1.871277   .7903544    -2.37   0.018    -3.420343
 -.3222105
             4  |  -.3432833   .7323671    -0.47   0.639    -1.778696
   1.09213
             5  |   .2294764   1.121912     0.20   0.838     -1.96943
  2.428383
             6  |    3.46207    .690504     5.01   0.000     2.108707
  4.815433
             7  |  -12.05622   .6729943   -17.91   0.000    -13.37527
 -10.73718
             8  |   1.175528   .7302294     1.61   0.107    -.2556957
  2.606751
             9  |  -1.434985   .7917805    -1.81   0.070    -2.986846
  .1168767
            10  |   3.494313   1.392236     2.51   0.012     .7655807
  6.223044
                |
variety#c.moist |
             2  |  -.0354326   .0260328    -1.36   0.173    -.0864559
  .0155908
             3  |   .1297872   .0190785     6.80   0.000     .0923941
  .1671804
             4  |   .0264085   .0210603     1.25   0.210    -.0148689
   .067686
             5  |   .0284318   .0240342     1.18   0.237    -.0186744
   .075538
             6  |   .0790988   .0161154     4.91   0.000     .0475132
  .1106844
             7  |   .1164752   .0189384     6.15   0.000     .0793566
  .1535938
             8  |   .0791545   .0188487     4.20   0.000     .0422118
  .1160972
             9  |    .080266   .0190236     4.22   0.000     .0429804
  .1175516
            10  |   .0027604   .0282729     0.10   0.922    -.0526535
  .0581742
                |
          _cons |   34.58378   .5478187    63.13   0.000     33.51007
  35.65748
---------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
variety: Independent         |
                   sd(moist) |   3.46e-15   2.27e-14      8.96e-21    1.34e-09
                   sd(_cons) |   1.18e-11   1.34e-07             0           .
-----------------------------+------------------------------------------------
                sd(Residual) |   .4833867   .0441285      .4041925    .5780976
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 0.00                  Prob > chi2 = 1.0000

Note: LR test is conservative and provided only for reference.

As you can see, Stata deals with the overidentification by reporting
miniscule estimates for the random effects.

I think its a bug, wondered what lme4 would do. I predicted it would
refuse to fit at all, as soon as we declare variety as a factor
(varietyf).  However, that does not happen. The full R runable example
is below, but here's the bad part:

> m.wrong <- lmer(yield ~ moist*varietyf + (moist|varietyf), data = dat)
> summary(m.wrong)
Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ moist * varietyf + (moist | varietyf)
   Data: dat

REML criterion at convergence: 157.7

Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.74078 -0.52638 -0.00039  0.56385  1.79231

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 varietyf (Intercept) 0.3505   0.592
          moist       0.3505   0.592    0.00
 Residual             0.3505   0.592
Number of obs: 60, groups:  varietyf, 10

Fixed effects:
                  Estimate Std. Error t value
(Intercept)       34.58378    0.89479   38.65
moist              0.60771    0.59222    1.03
varietyf2         -2.84458    1.32918   -2.14
varietyf3         -1.87128    1.27984   -1.46
varietyf4         -0.34328    1.22700   -0.28
varietyf5          0.22948    1.60904    0.14
varietyf6          3.46207    1.19003    2.91
varietyf7        -12.05622    1.17489  -10.26
varietyf8          1.17553    1.22509    0.96
varietyf9         -1.43498    1.28116   -1.12
varietyf10         3.49431    1.89960    1.84
moist:varietyf2   -0.03543    0.83786   -0.04
moist:varietyf3    0.12979    0.83758    0.15
moist:varietyf4    0.02641    0.83765    0.03
moist:varietyf5    0.02843    0.83777    0.03
moist:varietyf6    0.07910    0.83748    0.09
moist:varietyf7    0.11647    0.83757    0.14
moist:varietyf8    0.07915    0.83757    0.09
moist:varietyf9    0.08027    0.83757    0.10
moist:varietyf10   0.00276    0.83797    0.00

Here' s my full R

## Paul Johnson <pauljohn at ku.edu>
## 20160224
library(foreign)
dat <- read.dta("http://www.stata-press.com/data/mlmus3/wheat.dta")
## write.dta(dat, file = "whead.dta12")
## dat <- read.dta("wheat.dta12")
summary(dat)
## Variety should be an integer, so should plot, but we don't use it
dat$varietyf <- as.factor(dat$variety)

library(lme4)

## This is obviously overidentified/incoherent.
## lmer should bounce the user out.
m.wrong <- lmer(yield ~ moist*varietyf + (moist|varietyf), data = dat)
summary(m.wrong)

## Continue with the homework
m1 <-  lmer(yield ~ moist + (1|varietyf), data = dat)

m2 <- lmer(yield ~ moist + (moist|varietyf), data = dat)
library(lattice)
dotplot(ranef(m2, condVar = TRUE))

## Here's a shocker
anova(m2, m1)
## Really? Look at the pictures!

## Here's a spaghetti plot.
plot(m2)
m2coef <- coef(m2)[["varietyf"]]
m2coef$variety <- rownames(m2coef)
plot(yield ~ moist, data = dat, col = dat$variety)
apply(m2coef, 1, function(x){ browser();  abline(a = x[1], b = x[2],
col = x["variety"])})

## If Var(e) is small enough, even trivial slope differences are
## "statistically significant".

-- 
Paul E. Johnson
Professor, Political Science        Director
1541 Lilac Lane, Room 504      Center for Research Methods
University of Kansas                 University of Kansas
http://pj.freefaculty.org              http://crmda.ku.edu