[R] gam (mgcv), multiple imputation, f-stats/p-values, and summary(gam)
Andrew Crane-Droesch
andrewcd at gmail.com
Wed May 8 16:12:03 CEST 2013
Dear All,
I'm using gam for a project that involves multiple imputation (MI), and it
has led me to a question about how F-statistics and p-values work in gam.
Specifically, how do the values reported by summary(gam) get generated? As
the dumb example below makes clear, I'm manipulating gam objects to
reflect the MI procedure that I'm using. I don't trust the
F-statistics/p-values that I'm getting, but I'd like to know how to
further manipulate these objects to get trustworthy values. Part of
that involves knowing how the output of summary(gam) is generated, and
from which elements of a gam object.
Here is the example:
library(mgcv)
par(mfrow = c(2, 2))

#replace missing values of a with the corresponding values of a.impute
impute <- function(a, a.impute){
  ifelse(is.na(a), a.impute, a)
}
#make some dumb fake data
x1.knots = c(-1,-.5,0,.5,1)
x2.knots = c(-1,-.5,0,.5,1)
K=5
x1 = rnorm(100)
x2 = rnorm(100)
y = .05*exp(x1)-.5*x1 + .1*x2 + x1*x2 + rnorm(100)
#some of it is missing
x1[81:100] = NA
x2[71:90] = NA
#do a few dumb imputations, and fit models to the partially-imputed data
x1.imp = impute(x1, rnorm(100))
x2.imp = impute(x2, rnorm(100))
m1 = gam(y ~ te(x1.imp, x2.imp, k = c(K, K)),
         knots = list(x1.imp = x1.knots, x2.imp = x2.knots))
plot(m1,plot.type="contour",scheme=2,main="Imp 1")
x1.imp = impute(x1, rnorm(100))
x2.imp = impute(x2, rnorm(100))
m2 = gam(y ~ te(x1.imp, x2.imp, k = c(K, K)),
         knots = list(x1.imp = x1.knots, x2.imp = x2.knots))
plot(m2,plot.type="contour",scheme=2,main="Imp 2")
x1.imp = impute(x1, rnorm(100))
x2.imp = impute(x2, rnorm(100))
m3 = gam(y ~ te(x1.imp, x2.imp, k = c(K, K)),
         knots = list(x1.imp = x1.knots, x2.imp = x2.knots))
plot(m3,plot.type="contour",scheme=2,main="Imp 3")
results = list(m1,m2,m3)
#Combine the results according to Rubin's rules
reps = 3

#pooled coefficient estimate: average of the coefficients across imputations
bhat = results[[1]]$coefficients
for (i in 2:reps){
  bhat = bhat + results[[i]]$coefficients
}
bhat = bhat/reps

#within-imputation covariance: average of the Bayesian covariance matrices (Vp)
W = results[[1]]$Vp
for (i in 2:reps){
  W = W + results[[i]]$Vp
}
W = W/reps

#between-imputation covariance: outer products of deviations from the pooled estimate
B = (results[[1]]$coefficients - bhat) %*% t(results[[1]]$coefficients - bhat)
for (i in 2:reps){
  B = B + (results[[i]]$coefficients - bhat) %*% t(results[[i]]$coefficients - bhat)
}
B = B/(reps - 1)

#total covariance of the pooled coefficients
Vb = W + (1 + 1/reps)*B
#Put the results into a convenient gam object
MI = results[[1]]
MI$coefficients=bhat
MI$Vp = Vb
#and summarize
summary(MI)
plot(MI,plot.type="contour",scheme=2,main="MI")
When I do something similar with real (non-fake) data I get wacky F-statistics
and p-values: for example, F can be greater than 100 while p equals 1. This
probably has something to do with degrees of freedom.
My easy question: what goes on under the hood when summary(gam) generates
these values? Which parts of a gam object does it draw on?
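(The only digging I've done so far is to print the summary method and poke
at the pieces of the fitted object that I'd guess are involved; this is just
a sketch of how I've been looking, not a claim about what summary.gam
actually uses.)

getAnywhere("summary.gam")  #print the source of the summary method
names(m1)                   #components of the fitted gam object
m1$edf                      #effective degrees of freedom per coefficient
m1$Vp                       #Bayesian covariance matrix of the coefficients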
My harder question: how might one construct principled analogs of these
statistics in an MI context, where the degrees of freedom will vary across
models depending on the imputed data? Has anybody thought about this,
or do I have some serious pencil-and-paper work ahead of me?
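(The closest thing I can think of is the usual scalar Rubin correction
applied coefficient-by-coefficient, roughly as below. This is only a sketch
with made-up variable names, and I don't see how it would carry over to a
Wald-type test on a whole smooth term.)

r = (1 + 1/reps)*diag(B)/diag(W)  #relative increase in variance, per coefficient
nu = (reps - 1)*(1 + 1/r)^2       #Rubin-style degrees of freedom, per coefficient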
Thanks,
Andrew