[R] Behavior of ordered factors in glm

David Winsemius dwinsemius at comcast.net
Sun Jan 6 16:43:24 CET 2008


Thank you, Dr Ripley. After some false starts and consulting MASS2, 
Chambers&Hastie and the help files, this worked acceptably.

> xxx$issuecat2<-C(xxx$issuecat2,poly,1)
> attr(xxx$issuecat2,"contrasts")
                 .L
0-39  -6.324555e-01
40-49 -3.162278e-01
50-59 -3.287978e-17
60-69  3.162278e-01
70+    6.324555e-01

> exp.mdl<-glm(actual~gendercat+issuecat2+smokecat,   
                data=xxx,family="poisson",offset=expected)
> summary(exp.mdl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5596  -0.2327  -0.1671  -0.1199   5.2386  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.57125    0.06650 -68.743  < 2e-16 ***
gendercatMale    0.29660    0.06426   4.615 3.92e-06 ***
issuecat2.L      2.09161    0.09354  22.360  < 2e-16 ***
smokecatSmoker   0.22178    0.07870   2.818  0.00483 ** 
smokecatUnknown  0.02378    0.08607   0.276  0.78233 

The reference category is different, but the effect of a one category 
increase in age-decade on the log(rate) is(2.09*0.316) = 0.6604 which 
seems acceptable agreement with my earlier as.numeric(factor) estimate 
of 0.6614.

-- 
David Winsemius

Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote in
news:Pine.LNX.4.64.0801060706220.23958 at gannet.stats.ox.ac.uk: 

> Further to Duncan's comments, you can control factor codings via 
> options(contrasts=), by setting contrasts() on the factor and via
> C(). This does enable you to code an ordered factor as a linear
> term, for example.
> 
> The only place I know that this is discussed in any detail is in
> Bill Venables' account in MASS chapter 6.
> 
> On Sat, 5 Jan 2008, Duncan Murdoch wrote:
> 
>> On 05/01/2008 7:16 PM, David Winsemius wrote:
>>> David Winsemius <dwinsemius at comcast.net> wrote in
>>> news:Xns9A1CC05755274dNOTwinscomcast at 80.91.229.13:
>>>
>>>> I have a variable which is roughly age categories in decades. In
>>>> the original data, it came in coded:
>>>>> str(xxx)
>>>> 'data.frame':   58271 obs. of  29 variables:
>>>>  $ issuecat   : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1  1
>>>>  1...
>>>> snip
>>>>
>>>> I then defined issuecat as ordered:
>>>>> xxx$issuecat<-as.ordered(xxx$issuecat)
>>>> When I include issuecat in a glm model, the result makes me think
>>>> I have asked R for a linear+quadratic+cubic+quartic polynomial
>>>> fit. The results are not terribly surprising under that
>>>> interpretation, but I was hoping for only a linear term (which I
>>>> was taught to call a "test of trend"), at least as a starting
>>>> point. 
>>>>
>>>>> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson")
>>>>> summary(age.mdl)
>>>> Call:
>>>> glm(formula = actual ~ issuecat, family = "poisson", data = xxx)
>>>>
>>>> Deviance Residuals:
>>>>     Min       1Q   Median       3Q      Max
>>>> -0.3190  -0.2262  -0.1649  -0.1221   5.4776
>>>>
>>>> Coefficients:
>>>>             Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept) -4.31321    0.04865 -88.665   <2e-16 ***
>>>> issuecat.L   2.12717    0.13328  15.960   <2e-16 ***
>>>> issuecat.Q  -0.06568    0.11842  -0.555    0.579
>>>> issuecat.C   0.08838    0.09737   0.908    0.364
>>>> issuecat^4  -0.02701    0.07786  -0.347    0.729
>>>>
>>>> This also means my advice to a another poster this morning may
>>>> have been misleading. I have tried puzzling out what I don't
>>>> understand by looking at indices or searching in MASSv2, the Blue
>>>> Book, Thompson's application of R to Agresti's text, and the FAQ,
>>>> so far without success. What I would like to achieve is having
>>>> the lowest age category be a reference category (with the
>>>> intercept being the log-rate) and each succeeding age category 
>>>> be incremented by 1. The linear estimate would be the
>>>> log(risk-ratio) for increasing ages. I don't want the higher
>>>> order polynomial estimates. Am I hoping for too much?
>>>>
>>>
>>> I acheived what I needed by:
>>>
>>>> xxx$agecat<-as.numeric(xxx$issuecat)
>>>> xxx$agecat<-xxx$agecat-1
>>>
>>> The results look quite sensible:
>>>> exp.mdl<-glm(actual~gendercat+agecat+smokecat, data=xxx,
>>> family="poisson", offset=expected)
>>>> summary(exp.mdl)
>>>
>>> Call:
>>> glm(formula = actual ~ gendercat + agecat + smokecat, family =
>>> "poisson",
>>>     data = xxx, offset = expected)
>>>
>>> Deviance Residuals:
>>>     Min       1Q   Median       3Q      Max
>>> -0.5596  -0.2327  -0.1671  -0.1199   5.2386
>>>
>>> Coefficients:
>>>                 Estimate Std. Error z value Pr(>|z|)
>>> (Intercept)     -5.89410    0.11009 -53.539  < 2e-16 ***
>>> gendercatMale    0.29660    0.06426   4.615 3.92e-06 ***
>>> agecat           0.66143    0.02958  22.360  < 2e-16 ***
>>> smokecatSmoker   0.22178    0.07870   2.818  0.00483 **
>>> smokecatUnknown  0.02378    0.08607   0.276  0.78233
>>>
>>> I remain curious about how to correctly control ordered factors,
>>> or I should just simply avoid them.
>>
>> If you're using a factor, R generally assumes you mean each level
>> is a different category, so you get levels-1 parameters.  If you
>> don't want this, you shouldn't use a factor:  convert to a numeric
>> scale, just as you did.
>>
>> Duncan Murdoch
>>




More information about the R-help mailing list