[R] How to avoid overfitting in gam(mgcv)

Ariyo Kanno 10dimensioner at gmail.com
Wed Oct 3 15:46:58 CEST 2007


Thank you for your advice.

I will try even larger "gamma" values, and full cross-validation.
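
A minimal sketch of the three options mentioned in the quoted thread below (a larger "gamma", a lower bound via "min.sp", and a fixed "sp"). The simulated data, the seed, and the values min.sp = 1 and sp = 10 are assumptions chosen only for illustration; they are not taken from the analysis discussed in this thread.

```r
library(mgcv)

set.seed(1)
n  <- 40                                   # small, noisy sample (hypothetical)
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 0.5 * x1 + sin(2 * pi * x2) + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

# Default GCV-based smoothing parameter selection
fit_default <- gam(y ~ x1 + s(x2), data = dat, method = "GCV.Cp")

# gamma > 1 makes each effective degree of freedom cost more in the GCV score
fit_gamma <- gam(y ~ x1 + s(x2), data = dat, method = "GCV.Cp", gamma = 1.4)

# Lower bound on the smoothing parameter (one value per penalty; arbitrary here)
fit_minsp <- gam(y ~ x1 + s(x2), data = dat, method = "GCV.Cp", min.sp = 1)

# Smoothing parameter fixed by hand instead of estimated (arbitrary value)
fit_sp <- gam(y ~ x1 + s(x2), data = dat, sp = 10)

# Compare total effective degrees of freedom across the four fits
edf <- c(default = sum(fit_default$edf),
         gamma   = sum(fit_gamma$edf),
         min_sp  = sum(fit_minsp$edf),
         fixed   = sum(fit_sp$edf))
print(round(edf, 2))
```

Comparing the total edf is only a quick check of how much each setting smooths; it says nothing by itself about prediction error on new data.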

2007/10/3, Frank E Harrell Jr <f.harrell at vanderbilt.edu>:
> Ariyo Kanno wrote:
> > Sorry, let me correct one sentence.
> >
> > "By "overfitting" I mean that the GCV score was significantly SMALLER
> > than the mean squared prediction error on the validation data, which
> > were randomly selected and not used in fitting."
> >
> >> Thank you for the valuable advice.
>
> If your test sample includes fewer than 10,000 cases and your
> signal-to-noise ratio is not large, your estimate of cross-validation
> accuracy may be unreliable.  Often 50 repeats of 10-fold cross-validation
> are required, without setting aside a single "test" sample.
>
> Frank
>
> >> I'm sorry, Dr. Wood: by mistake I first sent this reply to
> >> your personal e-mail address.
> >>
> >> I will use the "min.sp" argument when the data size is very small. I'd
> >> like to know whether there are any criteria for selecting "min.sp".
> >>
> >> I compared gamma=1.0 and 1.4, and I could see the smoothing effect of
> >> increasing gamma by comparing the edf and the smoothing parameter. But
> >> it was not enough to suppress the overfitting when the data size was small.
> >>
> >> By "overfitting" I mean that the GCV score was significantly larger
> >> than the mean squared prediction error on the validation data, which
> >> were randomly selected and not used in fitting.
> >>
> >> Best Wishes,
> >> Ariyo
> >>
> >> 2007/10/3, Simon Wood <s.wood at bath.ac.uk>:
> >>> On Wednesday 03 October 2007 10:49, Ariyo Kanno wrote:
> >>>> I appreciate your quick reply.
> >>>> I am using a model of the following structure:
> >>>>
> >>>> fit <- gam(y~x1+s(x2))
> >>>>
> >>>> where y, x1, and x2 are quantitative variables,
> >>>> so the response distribution is assumed to be Gaussian (the default).
> >>>>
> >>>> Now I understand that the data size was too small.
> >>> -- Well, the 10 end is definitely too small, but you can get quite reasonable
> >>> estimates of a single smoothing parameter from 30+ Gaussian data.
> >>> -- You can force smoother models by either setting the smoothing parameter
> >>> yourself using the `sp' argument to `gam', or by using the `min.sp' argument
> >>> to set a lower bound on the smoothing parameter.
> >>> -- I'm surprised that `gamma' had no effect - how high did you try?
> >>>
> >>> best,
> >>> Simon
> >>>
> >>>
> >>>
> >>>> Thank you.
> >>>>
> >>>> Best Wishes,
> >>>>
> >>>> Ariyo
> >>>>
> >>>> 2007/10/3, Simon Wood <s.wood at bath.ac.uk>:
> >>>>> What sort of model structure are you using? In particular, what is the
> >>>>> response distribution? For Poisson and binomial responses, overfitting
> >>>>> can be a sign of overdispersion, and quasipoisson or quasibinomial may
> >>>>> be better. Also, I would not expect to get useful smoothing parameter
> >>>>> estimates from 10 data!
> >>>>>
> >>>>> best,
> >>>>> Simon
> >>>>>
> >>>>> On Wednesday 03 October 2007 06:55, Ariyo Kanno wrote:
> >>>>>> Dear listers,
> >>>>>>
> >>>>>> I'm using gam (from mgcv) for semi-parametric regression on small and
> >>>>>> noisy datasets (10 to 200 observations), and I am facing a problem of
> >>>>>> overfitting.
> >>>>>>
> >>>>>> The book (Simon N. Wood, Generalized Additive Models: An Introduction
> >>>>>> with R) suggests avoiding overfitting by inflating the effective
> >>>>>> degrees of freedom in the GCV score with an increased "gamma" value
> >>>>>> (e.g. 1.4). But in my case, it didn't make a significant change in the
> >>>>>> results.
> >>>>>>
> >>>>>> The only way I've found to suppress overfitting is to set the basis
> >>>>>> dimension "k" to very low values (3 to 5). However, I don't think this
> >>>>>> is reasonable, because knot selection will then be an important issue.
> >>>>>>
> >>>>>> Is there any other means to avoid overfitting when analyzing small
> >>>>>> datasets?
> >>>>>>
> >>>>>> Thank you for your help in advance,
> >>>>>> Ariyo Kanno
> >>>>>>
> >>>>>> --
> >>>>>> Ariyo Kanno
> >>>>>> 1st-year doctor's degree student at
> >>>>>> Institute of Environmental Studies,
> >>>>>> The University of Tokyo
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help at r-project.org mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide
> >>>>>> http://www.R-project.org/posting-guide.html and provide commented,
> >>>>>> minimal, self-contained, reproducible code.
> >>>>> --
> >>>>>
> >>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
> >>>>>> +44 1225 386603  www.maths.bath.ac.uk/~sw283
> >>>
>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>

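The repeated 10-fold cross-validation suggested in the quoted reply above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, and the helper name cv_mse, the seed, the sample size, and the two gamma values compared are hypothetical choices, not part of the original analysis.

```r
library(mgcv)

set.seed(2)
n   <- 60                                  # hypothetical small sample
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 1 + 0.5 * dat$x1 + sin(2 * pi * dat$x2) + rnorm(n, sd = 0.5)

# One round of k-fold cross-validation (hypothetical helper).
# Returns the mean squared prediction error over all held-out observations.
cv_mse <- function(dat, k = 10, gamma = 1) {
  fold   <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels
  sq_err <- numeric(nrow(dat))
  for (i in seq_len(k)) {
    test <- fold == i
    fit  <- gam(y ~ x1 + s(x2), data = dat[!test, ], gamma = gamma)
    pred <- predict(fit, newdata = dat[test, ])
    sq_err[test] <- (dat$y[test] - pred)^2
  }
  mean(sq_err)
}

# 50 repeats of 10-fold CV, each repeat with a fresh random fold assignment
reps    <- 50
mse_g10 <- replicate(reps, cv_mse(dat, gamma = 1.0))
mse_g14 <- replicate(reps, cv_mse(dat, gamma = 1.4))

print(c(gamma_1.0 = mean(mse_g10), gamma_1.4 = mean(mse_g14)))
```

Averaging over repeats reduces the dependence of the estimate on any single random split, which is the concern with one small validation sample.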