[R] Possible overfitting of a GAM
Thomas L Jones, PhD
jones3745 at verizon.net
Sat Feb 16 23:25:03 CET 2008
The subject is a Generalized Additive Model. Experts caution us against
overfitting the data, which can cause inaccurate results. I am not a
statistician (my background is in Computer Science). Perhaps some kind soul
would take a look and vet the model for overfitting the data.
The study estimated the ebb and flow of traffic through a voting place. Just
one voting place was studied; the election was the U.S. mid-term election
about a year ago. Procedure: The voting day was divided into five-minute
bins, and the number of voters arriving in each bin was recorded. The voting
day was 13 hours long, giving 156 bins (13 hours x 12 bins per hour).
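The binning step can be sketched in R as follows. This is only an illustration: the arrival times are simulated (the real observations are not part of this post), and the names `arrivals`, `edges`, and `counts` are made up for the example.

```r
# Hypothetical arrival times, in minutes since the polls opened.
# Simulated for illustration; these are NOT the data from the study.
set.seed(1)
day_minutes <- 13 * 60                         # 13-hour voting day = 780 minutes
arrivals <- runif(2000, min = 0, max = day_minutes)

# Five-minute bin edges 0, 5, ..., 780 define 156 bins.
edges <- seq(0, day_minutes, by = 5)
bin <- cut(arrivals, breaks = edges, right = FALSE, include.lowest = TRUE)

# Count arrivals per bin; table() keeps empty bins as zero counts.
counts <- as.vector(table(bin))
length(counts)
```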
See http://tinyurl.com/36vzop for the scatterplot. The random variation is
rather high, due in part to the fact that the bin width was intentionally
kept narrow in order to capture more timing information.
http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used, with
the loess smoothing algorithm (locally weighted regression). The default
span was used. http://tinyurl.com/34av6l gives the scatterplot and the
fitted curve. The two seem to match reasonably well.
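For readers who want to reproduce this kind of fit, here is a minimal sketch. It uses base R's `loess()` rather than the GAM code behind the plots, which is not shown in this post, so it is a stand-in and not the actual model. The counts are simulated from a made-up arrival rate with a morning and an evening peak. Note that the default span of `loess()` is 0.75; the default span of a loess term inside a GAM package may differ.

```r
# Simulated five-minute counts (hypothetical; not the study data).
set.seed(2)
bin_mid <- seq(2.5, 13 * 60 - 2.5, by = 5)     # midpoints of the 156 bins
rate <- 8 + 6 * exp(-((bin_mid - 60) / 60)^2) +
            6 * exp(-((bin_mid - 660) / 90)^2) # made-up arrival rate
counts <- rpois(length(bin_mid), lambda = rate)

# Locally weighted regression with the default span (0.75 in stats::loess).
fit <- loess(counts ~ bin_mid)
fitted_curve <- predict(fit)

# Overlay the fitted curve on the scatterplot when run interactively.
if (interactive()) {
  plot(bin_mid, counts, xlab = "Minutes since opening", ylab = "Arrivals per bin")
  lines(bin_mid, fitted_curve, lwd = 2)
}
```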
However, when I tried to generate the standard errors, things went awry
(please see http://tinyurl.com/38ej2t ). The plot shows three curves,
seemingly the fitted curve and the curves for plus and minus two standard
errors. The shapes look right, but the y values are badly off.
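As a point of comparison, the sketch below shows one way to compute bands of plus and minus two standard errors around a loess fit in base R. It again uses simulated counts and `stats::loess()`, which is an assumption standing in for the GAM fit that is not shown in this post.

```r
# Simulated five-minute counts (hypothetical; not the study data).
set.seed(3)
bin_mid <- seq(2.5, 13 * 60 - 2.5, by = 5)
rate <- 8 + 6 * exp(-((bin_mid - 60) / 60)^2) +
            6 * exp(-((bin_mid - 660) / 90)^2)
counts <- rpois(length(bin_mid), lambda = rate)

fit <- loess(counts ~ bin_mid)

# se = TRUE makes predict() return a list with $fit and $se.fit.
pred <- predict(fit, newdata = data.frame(bin_mid = bin_mid), se = TRUE)
upper <- pred$fit + 2 * pred$se.fit
lower <- pred$fit - 2 * pred$se.fit

# Sanity check: the fitted values should sit on the same scale as the counts.
range(counts)
range(pred$fit)

if (interactive()) {
  plot(bin_mid, counts, xlab = "Minutes since opening", ylab = "Arrivals per bin")
  lines(bin_mid, pred$fit, lwd = 2)
  lines(bin_mid, upper, lty = 2)
  lines(bin_mid, lower, lty = 2)
}
```

One thing that may be worth checking, offered only as a possibility: if the model was fitted with a non-Gaussian family such as Poisson, `predict.glm()` returns values on the link scale by default, which would put the curves on a different y scale from the counts. The post does not say which family was used.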
Question: Have I overfitted the data?
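One common way to probe this question is to compare several spans by cross-validation: a span that fits the training bins well but predicts held-out bins poorly would suggest overfitting. The sketch below is a rough illustration under stated assumptions. The data are simulated, `cv_mse` is a helper written for this example, and `stats::loess()` stands in for the GAM that was actually used.

```r
# Simulated five-minute counts (hypothetical; not the study data).
set.seed(4)
bin_mid <- seq(2.5, 13 * 60 - 2.5, by = 5)
rate <- 8 + 6 * exp(-((bin_mid - 60) / 60)^2) +
            6 * exp(-((bin_mid - 660) / 90)^2)
counts <- rpois(length(bin_mid), lambda = rate)

# Assign each bin to one of k folds, once, so every span sees the same split.
k <- 10
fold <- sample(rep(seq_len(k), length.out = length(bin_mid)))

# k-fold cross-validated mean squared prediction error for a given span.
cv_mse <- function(span, x, y, fold) {
  err <- numeric(length(x))
  for (i in sort(unique(fold))) {
    held_out <- fold == i
    train <- data.frame(x = x[!held_out], y = y[!held_out])
    # surface = "direct" lets predict() work at held-out end points.
    fit <- loess(y ~ x, data = train, span = span,
                 control = loess.control(surface = "direct"))
    err[held_out] <- y[held_out] -
      predict(fit, newdata = data.frame(x = x[held_out]))
  }
  mean(err^2)
}

spans <- c(0.15, 0.25, 0.5, 0.75, 1)
cv <- sapply(spans, cv_mse, x = bin_mid, y = counts, fold = fold)
data.frame(span = spans, cv_mse = cv)

# Equivalent number of parameters used by the default-span fit.
loess(counts ~ bin_mid)$enp
```

A small equivalent number of parameters relative to the 156 bins, together with a cross-validated error that does not improve at smaller spans, would argue against overfitting; the numbers from simulated data say nothing about the real fit.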
Feedback?
Tom
Thomas L. Jones, PhD, Computer Science