[R] R-squared value for linear regression passing through origin using lm()
Berwin A Turlach
berwin at maths.uwa.edu.au
Fri Oct 19 11:32:34 CEST 2007
G'day Ralf,
On Fri, 19 Oct 2007 09:51:37 +0200
Ralf Goertz <R_Goertz at web.de> wrote:
> Thanks to Thomas Lumley there is another convincing example. But still
> I've got a problem with it:
>
> > x<-c(2,3,4);y<-c(2,3,3)
>
> [...]
> That's okay, but neither [...] nor [...]
> give the result of summary(lm(y~x+0)), which is 0.9796.
Why should either of those formulas yield the output of
summary(lm(y~x+0))? The R-squared reported by that command is
documented in help(summary.lm):
r.squared: R^2, the 'fraction of variance explained by the model',
R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2),
where y* is the mean of y[i] if there is an intercept and
zero otherwise.
And, indeed:
> 1-sum(residuals(lm(y~x+0))^2)/sum((y-0)^2)
[1] 0.9796238
confirms this.
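For comparison, the same documented formula with y* = mean(y)
reproduces the R-squared of the model *with* an intercept. A quick
check on the same toy data:

```r
x <- c(2, 3, 4); y <- c(2, 3, 3)
fit <- lm(y ~ x)  # model with intercept
# help(summary.lm) formula, with y* = mean(y) because there is an intercept:
rsq_doc <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
all.equal(rsq_doc, summary(fit)$r.squared)  # TRUE; both equal 0.75 here
```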
Note: if you do not have an intercept in your model, the residuals do
not have to sum to zero; and, typically, they will not. Hence,
var(residuals(lm(y~x+0))) is not proportional to the residual sum of
squares: var() subtracts the (non-zero) mean of the residuals first.
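A quick illustration with the same toy data: the residuals of the
no-intercept fit have a clearly non-zero sum, so var(), which centres
them at their own mean first, understates the residual sum of squares:

```r
x <- c(2, 3, 4); y <- c(2, 3, 3)
fit0 <- lm(y ~ x + 0)
sum(residuals(fit0))                    # non-zero: 7/29, about 0.241
sum(residuals(fit0)^2)                  # the RSS: 13/29, about 0.448
(length(y) - 1) * var(residuals(fit0))  # smaller, since var() centres first
```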
> In order to save the role of R^2 as a goodness-of-fit indicator
R^2 is not a goodness-of-fit indicator, neither in models with an
intercept nor in models without one. So I do not see how you could
save its role as a goodness-of-fit indicator. :)
Since you are posting from a .de domain, I assume you will understand
the following quote from Tutz (2000), "Die Analyse kategorialer Daten",
page 18 (translated here from the German):
R^2 does *not* measure the goodness of fit of the linear model; it
says nothing about whether the linear specification is true or false,
but only about whether individual observations are predictable from
the linear specification. R^2 is determined essentially by the
design, i.e. by the values that x takes (cf. Kockelkorn (1998)).
The latter reference is:
Kockelkorn, U. (1998). Lineare Modelle. Skript, TU Berlin.
> in zero-intercept models one could use the same formula as in models
> with a constant. I mean, if R^2 is the proportion of variance
> explained by the model, we should use the a priori variance of y[i].
>
> > 1-var(residuals(lm(y~x+0)))/var(y)
> [1] 0.3567182
>
> But I assume that this has probably been discussed at length somewhere
> more appropriate than r-help.
I am sure it has been, but it was also discussed here on r-help (long
ago). The problem is that this compares two models that are not nested
in each other, which is quite a controversial thing to do; some might
even go so far as to say that it makes no sense at all. The other
problem with this approach is illustrated by the following example:
> set.seed(20070807)
> x <- runif(100)*2+10
> y <- 4+rnorm(x, sd=1)
> 1-var(residuals(lm(y~x+0)))/var(y)
[1] -0.04848273
How do you explain that a quantity called R-squared, implying that it
is the square of something and hence always non-negative, can become
negative?
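The negative value is no accident: the mixed formula compares the
residuals of the no-intercept fit against the variance of y about its
*mean*, and nothing forces the former to be smaller. In this example
the no-intercept fit does worse than the constant mean(y), so the
ratio exceeds 1 and the "R-squared" turns negative:

```r
set.seed(20070807)
x <- runif(100) * 2 + 10
y <- 4 + rnorm(x, sd = 1)
fit0 <- lm(y ~ x + 0)
# var() of the no-intercept residuals exceeds var(y), so
# 1 - var(resid)/var(y) drops below zero:
var(residuals(fit0)) > var(y)       # TRUE
1 - var(residuals(fit0)) / var(y)   # -0.04848273, as above
```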
Cheers,
Berwin
=========================== Full address =============================
Berwin A Turlach Tel.: +65 6515 4416 (secr)
Dept of Statistics and Applied Probability +65 6515 6650 (self)
Faculty of Science FAX : +65 6872 3919
National University of Singapore
6 Science Drive 2, Blk S16, Level 7 e-mail: statba at nus.edu.sg
Singapore 117546 http://www.stat.nus.edu.sg/~statba