Linear model selection after bootstrapping without overfitting

Question

I am trying to develop a model relating a 19-year record of climate data to a 19-year record of ice-off dates on rivers. The two variables are linearly correlated. The goal is to build a linear model so that we can use the climate data to predict ice-off dates in future years when we have climate data but no ice data.

What I have done thus far is bootstrapping: I randomly select 14 years as training data and the remaining 5 years as testing data. I build the linear model on the 14-year training dataset, then apply it to the remaining 5 years, and evaluate the model performance using the Nash-Sutcliffe coefficient (https://en.wikipedia.org/wiki/Nash–Sutcliffe_model_efficiency_coefficient#targetText=The%20Nash–Sutcliffe%20model%20efficiency,Qm%20is%20modeled%20discharge.). I then repeat that 1000 more times, randomly sampling the 14 years of training data each time.

Now that I have done this, I want to pick the best model of the bunch. Should I take the model with the median Nash-Sutcliffe coefficient, or the one with the best Nash-Sutcliffe coefficient? What is the best next step here that avoids overfitting?

I'm a statistics beginner, so your help is greatly appreciated!

Thank you for your help! That makes sense--forgive my ignorance in terms of terminology. Would something like nested cross validation (https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9) be a better option for time series data? Specifically the 'day forward chaining' method — Ana, Oct 16 '19 at 23:35
Whoops, I had not yet read your answer below--I will take a look at that! — Ana, Oct 16 '19 at 23:47

score 1 · Accepted Answer · edited Nov 28 '22 at 21:16

AFAIK, bootstrapping is for getting the standard error estimates. If you want to validate models, look at N-fold cross-validation, or leave-one-out cross-validation (jackknife).

EDIT: since you said that the years are not serially correlated, this simplifies your situation a lot! You can then treat the years as if they were just data points, and select the testing sets completely at random. You can do e.g. 3-fold cross-validation by splitting the data into 3 randomly selected sets of years and fitting the model 3 times, each time holding out one set for testing (see how N-fold cross-validation works). Then pool the 3 independently predicted testing sets from those 3 fits and evaluate them with the Nash-Sutcliffe coefficient, which seems to be a good measure of the model efficiency.
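A minimal sketch of the pooled 3-fold procedure described above, using only the Python standard library. The data are synthetic stand-ins (not the asker's records), and the helper names `fit_line`, `nash_sutcliffe`, and `kfold_predictions` are hypothetical, chosen for illustration.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def nash_sutcliffe(obs, pred):
    """1 - (sum of squared errors) / (sum of squares around the observed mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def kfold_predictions(x, y, k=3, seed=0):
    """Return one out-of-sample prediction per year, pooled over k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # 19 years -> folds of 7, 6 and 6
    pred = [None] * len(x)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            # Each year is predicted by a fit that never saw that year.
            pred[i] = intercept + slope * x[i]
    return pred

# Synthetic stand-in for 19 years of data (an assumption, for illustration only).
rng = random.Random(42)
climate = [rng.gauss(0, 1) for _ in range(19)]
ice_off = [120 - 5 * c + rng.gauss(0, 2) for c in climate]

pooled = kfold_predictions(climate, ice_off, k=3)
print(f"cross-validated NSE: {nash_sutcliffe(ice_off, pooled):.2f}")
```

The single Nash-Sutcliffe value is computed once on the pooled predictions rather than averaged over per-fold scores, which matches the description above and avoids noisy scores from folds of only 6 or 7 years.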

Choose the model with the best coefficient. Cross-validation guards against overfitting within each individual model tested. However, since the model selection procedure itself uses the whole dataset, there is a risk of overfitting in the selection step if you overdo it, as fittingly pictured in the chart below:

So, be careful not to select among too many model variants. Another option is to set aside yet another validation set to monitor how you are doing in the model selection procedure itself :)

Also see the answer here and this reference:

Cawley and Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", JMLR, vol. 11, pp. 2079-2107, 2010

Great, thank you! So just to make sure I am doing this correctly--would I create a linear model based on all of my data (19 yrs of PDO and 19 yrs of breakup), and then use 3 fold cross validation as a metric to tell how well that model might do. Or, do I do 3-fold cross validation first (which sort of gives 3 different models with different splits of training and testing data), and then somehow pick the best of the three? — Ana, Oct 17 '19 at 19:05
@Ana, you almost got it right :) So first you split the data randomly into N sets (say N=3, so you will have 6, 6 and 7 years). Then you run 3 models, where one set is the testing one and the remaining two are training. You let each of those 3 models predict its own testing set, and then you put those 3 predictions together - so you get a complete independent prediction for the whole dataset. Independent meaning that it's free of overfitting :-) Then you validate all of this against the observed data (get your NS coefficient). — Tomas, Oct 18 '19 at 07:42
@Ana, then you repeat this procedure for all models you want to test, and you pick the best model. To show its result, you run that model using all of your available data (19 years) as training dataset. Clearer now? :-) Feel free to ask. — Tomas, Oct 18 '19 at 07:44
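The workflow in these comments (score each candidate by cross-validation, pick the best, then refit it on all 19 years) might look roughly like the sketch below. The two candidates, the synthetic data, and the names `fit_mean`, `fit_linear`, and `cv_nse` are assumptions made for illustration, not anything from the thread.

```python
import random

def fit_mean(x, y):
    """Baseline candidate: ignore the predictor, predict the training mean."""
    m = sum(y) / len(y)
    return lambda xi: m

def fit_linear(x, y):
    """Candidate: simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return lambda xi: my + slope * (xi - mx)

def cv_nse(fit, x, y, folds):
    """Nash-Sutcliffe coefficient of the pooled out-of-fold predictions."""
    pred = [None] * len(x)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            pred[i] = model(x[i])
    mean_y = sum(y) / len(y)
    sse = sum((o - p) ** 2 for o, p in zip(y, pred))
    sst = sum((o - mean_y) ** 2 for o in y)
    return 1.0 - sse / sst

# Synthetic stand-in data (assumption, for illustration only).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(19)]
y = [120 - 5 * xi + rng.gauss(0, 2) for xi in x]

# Use the same folds for every candidate so the scores are comparable.
idx = list(range(19))
rng.shuffle(idx)
folds = [idx[i::3] for i in range(3)]

candidates = {"mean only": fit_mean, "linear": fit_linear}
scores = {name: cv_nse(fit, x, y, folds) for name, fit in candidates.items()}
best = max(scores, key=scores.get)

# The model that gets reported is the winning candidate refit on all 19 years.
final_model = candidates[best](x, y)
print(scores, "->", best)
```

The fold-wise fits are only used for scoring and are then discarded; none of them is the final model.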

score 1 · Answer 2 · answered Oct 16 '19 at 21:56

In the end, you pick a model which is estimated on the whole data. This is not a model which is a weighted combination of inferior models estimated on only a subset of the data (which you seem to be implying).

In the final, "optimal" model you need to have only statistically significant terms. For example, if the slope coefficient is non-significant according to the bootstrap-based p-value, the predictor has to be dropped from the model. This means that only the intercept remains provided it is statistically significant.

Finally, you are not performing bootstrap correctly. Each bootstrap sample must be created by sampling 19 data points with replacement. You, on the other hand, are sampling 14 data points without replacement. What you are doing is a randomized cross-validation, not bootstrap. Cross-validation is suitable for comparing different models but it is not suitable for estimating standard errors (and p-values) of the coefficients.
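To make the distinction concrete, here is a rough sketch of a bootstrap as described above: each resample draws 19 years with replacement, and the spread of the refitted slopes estimates the slope's standard error. The data are synthetic and the names `ols_slope` and `bootstrap_slopes` are hypothetical.

```python
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Refit the slope on n_boot resamples of size n drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        draw = [rng.randrange(n) for _ in range(n)]  # n indices, repeats allowed
        if len(set(draw)) < 2:
            continue  # degenerate resample (one year repeated n times): no slope
        slopes.append(ols_slope([x[i] for i in draw], [y[i] for i in draw]))
    return slopes

# Synthetic stand-in data with distinct x values (assumption, illustration only).
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(19)]
y = [120 - 5 * xi + rng.gauss(0, 2) for xi in x]

slopes = bootstrap_slopes(x, y)
se = statistics.stdev(slopes)              # bootstrap standard error of the slope
cuts = statistics.quantiles(slopes, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
lo, hi = cuts[0], cuts[-1]                 # rough 95% percentile interval
print(f"slope SE: {se:.2f}, 95% interval: ({lo:.2f}, {hi:.2f})")
```

By contrast, drawing 14 of the 19 years without replacement and testing on the other 5 is a train/test split, which is why it behaves like cross-validation rather than a bootstrap.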

Yeah, this perfectly complements what I forgot in my answer, thanks :) — Tomas, Oct 16 '19 at 21:59
Thanks for the help! I got bootstrapping and cross-validation confused. While both datasets are correlated timeseries, they are not serially correlated (a high value one year does not cause a high value the next). Would this impact what sort of cross-validation that you do? — Ana, Oct 16 '19 at 23:45
@Ana great, this simplifies your situation a lot! I expanded on it in my answer. — Tomas, Oct 17 '19 at 09:02

score 0 · Answer 3 · answered Mar 23 '23 at 08:43

I like the answers from Tomas and stans, except I must strongly disagree with the idea that "you need to have only statistically significant terms" and that "if the slope coefficient is non-significant... the predictor has to be dropped from the model." Several predictors that seem insignificant individually may still exert a significant effect in combination, or several predictors may be spuriously insignificant in combination due to collinearity, etc, etc... I do not wish to get into a long rant about how p-values are widely misunderstood and are not intended for this purpose. Suffice to say that choosing what variables to include in your model purely on the basis of p-values is essentially stepwise regression. Stepwise regression has been widely used in the literature, but is now established to be an invalid methodology and is no longer recommended.
