[R] lean and mean lm/glm?
    Thomas Lumley 
    tlumley at u.washington.edu
       
    Wed Aug 23 17:25:54 CEST 2006
    
    
  
On Wed, 23 Aug 2006, Damien Moore wrote:
>
> Thomas Lumley wrote:
>
>> No, it is quite straightforward if you are willing to make multiple passes
>> through the data. It is hard with a single pass and may not be possible
>> unless the data are in random order.
>>
>> Fisher scoring for glms is just an iterative weighted least squares
>> calculation using a set of 'working' weights and 'working' response. These
>> can be defined chunk by chunk and fed to biglm. Three iterations should
>> be sufficient.
>
> (NB: Although not stated clearly I was referring to single pass when I 
> wrote "impossible"). Doing as you suggest with multiple passes would 
> entail either sticking the database input calls into the main iterative 
> loop of a lookalike glm.fit or lumping the user with a very unattractive 
> sequence of calls:

I have written most of a bigglm() function where the data= argument is a 
function with a single argument 'reset'. When called with reset=FALSE the 
function should return another chunk of data, or NULL if no data are 
available, and when called with reset=TRUE it should go back to the 
beginning of the data.  I don't think this is too inelegant.
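
A minimal sketch of that interface and of the multi-pass fit it allows,
written in Python rather than R and not taken from bigglm() itself. The
names make_data, solve, irls_pass and fit_logistic are hypothetical. It
assumes a logistic model, rows stored as (x_vector, y) pairs, and it
accumulates X'WX and X'Wz directly instead of using an incremental QR
update, which is simpler but numerically less stable.

```python
import math

def make_data(rows, chunk_size):
    """Return a data(reset) function: reset=False yields the next chunk
    (None when exhausted), reset=True rewinds to the start."""
    pos = 0
    def data(reset=False):
        nonlocal pos
        if reset:
            pos = 0
            return None
        if pos >= len(rows):
            return None
        chunk = rows[pos:pos + chunk_size]
        pos += chunk_size
        return chunk
    return data

def solve(a, b):
    """Solve a small dense system by Gaussian elimination with partial
    pivoting (assumes the matrix is nonsingular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def irls_pass(data, beta):
    """One Fisher scoring step: a full pass over the chunks, building the
    weighted least squares problem from working weights and responses."""
    p = len(beta)
    xtwx = [[0.0] * p for _ in range(p)]
    xtwz = [0.0] * p
    data(reset=True)
    while True:
        chunk = data(reset=False)
        if chunk is None:
            break
        for x, y in chunk:
            eta = sum(xi * bi for xi, bi in zip(x, beta))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)          # working weight
            z = eta + (y - mu) / w       # working response
            for i in range(p):
                xtwz[i] += x[i] * w * z
                for j in range(p):
                    xtwx[i][j] += x[i] * w * x[j]
    return solve(xtwx, xtwz)

def fit_logistic(data, p, passes=3):
    """Run a fixed number of passes, starting from beta = 0."""
    beta = [0.0] * p
    for _ in range(passes):
        beta = irls_pass(data, beta)
    return beta
```

Only one chunk is held by the fitting loop at a time, and each pass costs
one complete read of the data, so the number of passes is the number of
times the database is scanned.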

In general I don't think a one-pass algorithm is possible. If the data are 
in random order then you could read one chunk, fit a glm, and set up a 
grid of coefficient values around the estimate.  You then read the rest of 
the data, computing the loglikelihood and score function at each point in 
the grid.  After reading all the data you can then fit a suitable smooth 
surface to the loglikelihood.  I don't know whether this will give 
sufficient accuracy, though.
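
A rough sketch of the grid idea, again in Python and under strong
simplifying assumptions: a logistic model with a single coefficient and no
intercept, a grid spanning three first-chunk standard errors either side of
the first-chunk estimate, and a least-squares quadratic as the "smooth
surface". Only the loglikelihood is accumulated, not the score. The names
loglik, newton_1d and one_pass_grid_fit are hypothetical, and nothing here
shows that the accuracy is adequate in general.

```python
import math

def loglik(rows, beta):
    """Bernoulli loglikelihood for a one-coefficient logistic model."""
    total = 0.0
    for x, y in rows:
        eta = beta * x
        # log(1 + exp(eta)), arranged to avoid overflow for large |eta|
        total += y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def newton_1d(rows, beta=0.0, iters=10):
    """Fisher scoring for one coefficient. Returns the estimate and the
    information evaluated just before the final update."""
    info = 0.0
    for _ in range(iters):
        score = info = 0.0
        for x, y in rows:
            mu = 1.0 / (1.0 + math.exp(-beta * x))
            score += x * (y - mu)
            info += x * x * mu * (1.0 - mu)
        beta += score / info
    return beta, info

def one_pass_grid_fit(chunks, half_width_se=3.0, m=5):
    """Fit on the first chunk, then read each later chunk once, adding its
    loglikelihood at every grid point; finally maximise a fitted quadratic."""
    chunks = iter(chunks)
    first = next(chunks)
    beta0, info = newton_1d(first)
    h = half_width_se / math.sqrt(info) / m       # grid spacing
    offsets = [k * h for k in range(-m, m + 1)]   # symmetric about beta0
    ll = [loglik(first, beta0 + d) for d in offsets]
    for chunk in chunks:
        for i, d in enumerate(offsets):
            ll[i] += loglik(chunk, beta0 + d)
    # Least-squares fit of a + b*d + c*d^2; odd moments of d vanish
    # because the grid is symmetric, so b and c have closed forms.
    n = len(offsets)
    s2 = sum(d ** 2 for d in offsets)
    s4 = sum(d ** 4 for d in offsets)
    b = sum(d * l for d, l in zip(offsets, ll)) / s2
    c = (n * sum(d * d * l for d, l in zip(offsets, ll)) - s2 * sum(ll)) / (n * s4 - s2 ** 2)
    return beta0 - b / (2.0 * c)                  # vertex of the quadratic
```

With p coefficients the grid grows exponentially in p, so a design with far
fewer points than a full grid would be needed in practice.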

For really big data sets you are probably better off with the approach 
that Brian Ripley and Fei Chen used -- they have shown that it works and 
there is unlikely to be anything much simpler that also works that they 
missed.

 	-thomas
Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
    
    