[R] Maximum number of variables allowed in a multiple linear regression model
Tony Plate
tplate at acm.org
Wed Feb 6 18:28:34 CET 2008
Bert Gunter wrote:
> I strongly suggest you collaborate with a local statistician. I can think of
> no circumstance where multiple regression on "hundreds of thousands of
> variables" is anything more than a fancy random number generator.
That sounds like a challenge! What is the largest regression problem (in
terms of numbers of variables) that people have encountered where it made
sense to do some sort of linear regression (and gave useful results)?
(Including multilevel and Bayesian techniques.)
However, the original poster did say "hundreds to thousands", which is
smaller than "hundreds of thousands". When I try a regression problem with
3,000 coefficients in R running under 64-bit Windows XP with 8 GB of memory
on the machine and the /3GB option active (i.e., R can get up to 3 GB), R
2.6.1 runs out of memory (apparently trying to duplicate the model matrix):
R version 2.6.1 (2007-11-26)
Copyright (C) 2007 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
> m <- 3000
> n <- m * 10
> x <- matrix(rnorm(n*m), ncol=m, nrow=n,
+             dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> dim(x)
[1] 30000 3000
> k <- sample(m, 10)
> y <- rowSums(x[,k]) + 10 * rnorm(n)
> fit <- lm.fit(y=y, x=x)
Error: cannot allocate vector of size 686.6 Mb
> object.size(x)/2^20
[1] 687.7787
> memory.size()
[1] -2022.552
>
and the Windows process monitor shows the peak memory usage for Rgui.exe at
2,137,923K. But in a 64 bit version of R, I would be surprised if it was
not possible to run this (given sufficient memory).
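The 686.6 Mb in the error message is what one more copy of the model matrix would need, which fits the guess that R was trying to duplicate it. A rough check, assuming 8 bytes per double and ignoring dimnames and object headers (matrix_mb is a made-up helper name, not an R function):

```r
# Rough size of an n-by-m matrix of doubles, in Mb (2^20 bytes).
# Assumption: 8 bytes per element; dimnames and headers are ignored.
matrix_mb <- function(n, m) 8 * n * m / 2^20

matrix_mb(30000, 3000)  # about 686.6, the size in the allocation error
matrix_mb(10000, 1000)  # about 76.3, for a 10000 x 1000 problem
```

The small gap between this figure and the object.size() result above is presumably the row and column names.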
However, R easily handles a slightly smaller problem:
> m <- 1000 # of variables
> n <- m * 10 # of rows
> k <- sample(m, 10)
> x <- matrix(rnorm(n*m), ncol=m, nrow=n,
+             dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> y <- rowSums(x[,k]) + 10 * rnorm(n)
> fit <- lm.fit(y=y, x=x)
> # distribution of coefs that should be one vs zero
> round(rbind(one=quantile(fit$coefficients[k]),
+             zero=quantile(fit$coefficients[-k])), digits=2)
        0%   25%   50%  75% 100%
one   0.94  0.98  1.04 1.10 1.18
zero -0.30 -0.08 -0.01 0.06 0.29
>
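If the model matrix fits in memory but extra copies of it do not, one possible workaround is to solve the normal equations, where the working matrix is only m by m. This is not what lm.fit does (it uses a QR decomposition), and it is numerically less stable when the columns are nearly collinear, so treat the following as a sketch only. The sizes are reduced so it runs quickly, and the names xtx, xty, b_ne and b_qr are just illustrative:

```r
set.seed(42)
m <- 100                      # number of variables (reduced for speed)
n <- m * 10                   # number of rows
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
k <- sample(m, 10)            # the 10 columns that truly matter
y <- rowSums(x[, k]) + 10 * rnorm(n)

# Normal equations: solve (X'X) b = X'y using only m-by-m and m-by-1 objects
xtx <- crossprod(x)           # t(x) %*% x
xty <- crossprod(x, y)        # t(x) %*% y
b_ne <- drop(solve(xtx, xty))

# Compare with the QR-based fit from lm.fit
b_qr <- lm.fit(x = x, y = y)$coefficients
all.equal(b_ne, unname(b_qr))
```

For well-conditioned simulated data like this the two sets of coefficients should agree to within rounding error; with real, correlated predictors the QR route is the safer default.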
To echo Bert Gunter's cautions, one must be careful doing ordinary linear
regression with large numbers of coefficients. It does seem a little
unlikely that there is sufficient data to get useful estimates of three
thousand coefficients using linear regression in data managed in Excel
(though I guess it could be possible using Excel 12.0, which can handle up
to 1 million rows - recent versions prior to 2008 could handle only 64K rows
- see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So, the
suggestion to consult a local statistician is good advice - there may be
other more suitable approaches, and if some form of linear regression is an
appropriate approach, there are things to do to gain confidence that the
results of the linear regression convey useful information.
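One simple check of that kind is to hold out part of the data and compare in-sample and out-of-sample fit. A minimal sketch on simulated data of the same form as in the examples here (the sizes, the 50/50 split, and the names train, r2_in and r2_out are arbitrary choices for illustration):

```r
set.seed(1)
m <- 200                      # number of variables
n <- m * 10                   # number of rows
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
k <- sample(m, 10)            # columns with a true coefficient of one
y <- rowSums(x[, k]) + 10 * rnorm(n)

# Fit on a random half of the rows, predict the other half
train <- sample(n, n / 2)
fit <- lm.fit(x = x[train, ], y = y[train])
pred <- drop(x[-train, ] %*% fit$coefficients)

# R-squared on the rows used for fitting vs. on the held-out rows
r2_in  <- 1 - sum(fit$residuals^2) /
              sum((y[train] - mean(y[train]))^2)
r2_out <- 1 - sum((y[-train] - pred)^2) /
              sum((y[-train] - mean(y[-train]))^2)
c(in_sample = r2_in, held_out = r2_out)
```

A held-out R-squared far below the in-sample value (it can even be negative) suggests the fit is mostly noise, which is the risk Bert Gunter's comment points at.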
-- Tony Plate
>
> -- Bert Gunter
> Genentech Nonclinical Statistics
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Michelle Chu
> Sent: Tuesday, February 05, 2008 9:00 AM
> To: R-help at r-project.org
> Subject: [R] Maximum number of variables allowed in a multiple
> linear regression model
>
> Hi,
>
> I appreciate it if someone can confirm the maximum number of variables
> allowed in a multiple linear regression model. Currently, I am looking for
> a software with the capacity of handling approximately 3,000 variables. I
> am using Excel to process the results. Any information for processing a
> matrix from Excel with hundreds to thousands of variables will be helpful.
>
> Best Regards,
> Michelle
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>