[Rd] nobs() with glm(family="poisson")
Ravi Varadhan
ravi.varadhan at jhu.edu
Wed Feb 27 23:56:40 CET 2013
This is getting further away from typical R-devel issues, but let me add another perspective: the `n' in BIC reflects the rate at which the information in the log-likelihood grows.
Ravi
-----Original Message-----
From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Steven McKinney
Sent: Wednesday, February 27, 2013 5:26 PM
To: 'Milan Bouchet-Valat'
Cc: r-devel
Subject: Re: [Rd] nobs() with glm(family="poisson")
> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org]
> On Behalf Of Milan Bouchet-Valat
> Sent: February-27-13 12:56 PM
> To: peter dalgaard
> Cc: r-devel
> Subject: Re: [Rd] nobs() with glm(family="poisson")
>
> Thanks for the (critical, indeed) answer!
>
> Le mercredi 27 février 2013 à 20:48 +0100, peter dalgaard a écrit :
> > On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:
> >
> > > I cannot believes nobody cares about this -- or I'm completely
> > > wrong
> and
> > > in that case everybody should rush to put the shame on me... :-p
> >
> > Well, nobs() is the number of observations. If you have 5 Poisson
> > distributed counts, you have 5 observations.
> Well, say that to the statistical offices that spend millions to
> survey thousands of people with correct (but complex) sampling
> designs, they'll be happy to know that the collected data only
> provides an information equivalent to 5 independent outcomes. ;-)
Milan:
It seems to me you are mixing up Binomial and Poisson situations, and not assessing independence appropriately.
The above example discusses Bernoulli outcomes which are sometimes aggregated into Binomial "cases" depending on the study design.
Now if the survey samples people in the same household or even neighbourhood, those Bernoulli outcomes will not be independent (hence clustered survey techniques) and summing the Binomial denominators would not be appropriate, for the survey analysis or for BIC calculations. The "n" in the BIC calculation should reflect independent observations. If you knock on the same door 1000 times and ask the person who they will vote for, you do not have 1000 independent observations, even though your Binomial denominator is 1000.
The example you show from ?glm is a Poisson example showing
9 independent Poisson counts. If I count the number of cars passing through an intersection during non-overlapping one minute intervals (say 9 such intervals), then the number of observations I have is the number of non-overlapping one minute interval car count totals (e.g. the nine counts c(18, 17, 15, 20, 10, 20, 25, 13, 12)), not the number of cars I saw in total.
A piece of software that adds things up can not know the context from which the numbers were derived, so you have to figure out the level of independence appropriate to your study design and work out the BIC count accordingly.
Raftery alludes to this in a preceding section:
"When the data have been collected using a complex survey design with resulting weights, it is not yet clear what n should be, and this issue awaits further study. However, it seems reasonable that if the model is based on an assumption of simple random sampling but the sampling design is less efficient, then n should be reduced to reflect the efficiency of the sampling design relative to simple random sampling."
Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program British Columbia Cancer Research Centre
>
> > If the number of observations is not the right thing to use in some
> > context, use the right thing instead. Changing the definition of
> > nobs() surely leads to madness.
> It is common usage in the literature using log-linear models to report
> the sum of counts as the number of observations. I think this indeed
> makes sense, but I'm not particularly attached to the choice of words
> -- let's call it as you please.
>
> The root issue is that nobs() was precisely introduced to be the basis
> for the BIC() function, as ?nobs states explicitly:
> > Extract the number of ‘observations’ from a model fit. This is
> > principally intended to be used in computing BIC (see ‘AIC’)
>
> So it's OK to say that the number of observations is the number of
> cells (even if I think this is not very user-friendly), but then the
> documentation is misleading, and the BIC() function returns incorrect
> values for the very first example provided in ?glm.
>
> > (I suppose that the fact that n is so obviously the wrong thing for
> > one particularly well-digested family of distribution functions
> > could be taken to indicate a generic weakness with the BIC.)
> I'm sure we can agree on the fact that BIC has its weaknesses (and I'm
> not the best person able to judge), but the point at stake is IMHO not
> one of them. After all, usual statistics for the Poisson family, such
> as deviance or residuals, are based on the sum of counts, not on the
> number of cells, and nobody objects.
>
> The fact that the AIC is perfectly integrated into S/R and BIC seems
> to be merely an historical detail to me. Computing the AIC itself
> requires
> glm.fit() to add twice the equivalent degrees of freedom to the value
> returned by the family function, so why would an equivalent
> special-casing of BIC be the sign of an intrinsic statistical
> deficiency? Maybe the BIC is not a good indicator, but technical
> arguments are somewhat secondary in that debate.
>
>
> Of course, if BIC is a bad indicator, BIC() should probably be
> discouraged in the documentation, and print a warning when the
> returned value is known to be incorrect.
>
>
> Regards
>
> > > In the meantime, I have come up with an alternative way of fixing this:
> > > when modeling count data, glm() could allow users to pass a table
> > > as
> the
> > > data argument, and convert it to a data frame using
> > > as.data.frame.table() instead of requiring the user to do it
> > > beforehand[1]. This would become the recommended way of fitting
> > > models for count data, and the fact that a table is passed could
> > > be used as
> the
> > > sign that nobs() should return the sum of cell counts instead of
> > > the number of rows in the data.frame.
> > >
> > >
> > > Regards
> > >
> > >
> > > 1: gnm already supports this pattern, with the additional
> > > advantage
> that
> > > e.g. fitted(), predict(), residuals() and weights() return an
> > > object of the same dimensions and dimnames as the original table.
> > >
> > >
> > > Le lundi 18 février 2013 à 12:22 +0100, Milan Bouchet-Valat a écrit :
> > >> Hi!
> > >>
> > >> The nobs() method for glm objects always returns the number of
> > >> cases with non-null weights in the data, which does not
> > >> correspond to the number of observations for Poisson regression/log-linear models, i.e.
> > >> when family="poisson" or family="quasipoisson".
> > >>
> > >> This sounds dangerous since nobs() is, as the documentation
> > >> states, primarily aimed at computing the Bayesian information criterion.
> Raftery
> > >> (1995:20) warned against this:
> > >>> What should n be? Once again, it is best to use the actual
> > >>> number of individuals, i.e. the sum of the cell counts, and not
> > >>> the number of cells (Raftery, 1986a).
> > >>
> > >> Is there a reason why this should not/cannot be done that way?
> > >>
> > >> This behavior can be reproduced with with R 3.0.0 from SVN, using
> > >> the example from ?glm:
> > >> counts <- c(18,17,15,20,10,20,25,13,12) outcome <- gl(3,1,9)
> > >> treatment <- gl(3,3)
> > >> glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
> > >> nobs(glm.D93)
> > >> # 9 == length(counts)
> > >> # Should be 150 == sum(counts)
> > >>
> > >> FWIW, stats:::nobs.glm is currently defined as:
> > >> nobs.glm <- function (object, ...)
> > >> if (!is.null(w <- object$prior.weights)) sum(w != 0) else
> length(object$residuals)
> > >>
> > >>
> > >> Thanks!
> > >>
> > >>
> > >> Raftery, Adrian E. 1995. “Bayesian Model Selection in Social
> Research.”
> > >> Sociological methodology 25:111–96.
> > >>
> > >> ______________________________________________
> > >> R-devel at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >
> > > ______________________________________________
> > > R-devel at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
More information about the R-devel
mailing list