
I am trying to develop a model relating a 19-year record of climate data to a 19-year record of ice-off dates on rivers. The two variables are linearly correlated. The goal is to build a linear model so that we can use climate data to predict ice-off dates in future years when we have climate data but no ice data.

What I have done thus far is bootstrapping: I randomly select 14 years as training data and the remaining 5 years as testing data. I build the linear model on the 14-year training dataset, apply it to the remaining 5 years, and evaluate the model's performance using the Nash–Sutcliffe coefficient (https://en.wikipedia.org/wiki/Nash–Sutcliffe_model_efficiency_coefficient#targetText=The%20Nash–Sutcliffe%20model%20efficiency,Qm%20is%20modeled%20discharge.). I then repeat that 1000 more times, randomly sampling the 14 years of training data each time.
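For concreteness, here is a minimal sketch of the procedure described above. It is not my actual script: the function names (`nash_sutcliffe`, `repeated_holdout`), the use of `np.polyfit` for the linear fit, and the synthetic `climate` / `ice_off` arrays are all placeholders chosen for illustration.

```python
import numpy as np


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / (spread of observations about their mean)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = np.sum((observed - predicted) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / spread


def repeated_holdout(climate, ice_off, n_train=14, n_repeats=1000, seed=0):
    """Repeat a random train/test split and return the test-set NSE of each repeat."""
    rng = np.random.default_rng(seed)
    climate = np.asarray(climate, dtype=float)
    ice_off = np.asarray(ice_off, dtype=float)
    scores = np.empty(n_repeats)
    for i in range(n_repeats):
        order = rng.permutation(len(climate))
        train, test = order[:n_train], order[n_train:]
        # Straight-line fit on the training years only.
        slope, intercept = np.polyfit(climate[train], ice_off[train], deg=1)
        predicted = slope * climate[test] + intercept
        scores[i] = nash_sutcliffe(ice_off[test], predicted)
    return scores


# Synthetic stand-in data (hypothetical, NOT the real records): 19 "years".
rng = np.random.default_rng(42)
climate = rng.normal(0.0, 2.0, size=19)
ice_off = 120.0 - 3.0 * climate + rng.normal(0.0, 2.0, size=19)

scores = repeated_holdout(climate, ice_off)
print("median NSE:", np.median(scores), "best NSE:", scores.max())
```

Each repeat yields a different fitted line and a different score, which is why I end up with a whole collection of candidate models.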

Now that I have done this, I want to pick the best model of the bunch. Should I take the model with the median Nash–Sutcliffe coefficient, or the one with the best Nash–Sutcliffe coefficient? What is the best next step here that avoids overfitting?

I'm a statistics beginner, so your help is greatly appreciated!

Tomas
Ana
  • 1) The process you described isn't bootstrapping. Since you want to perform model selection, proper cross validation would be more appropriate anyway. 2) Since you have time series data, it's important to perform resampling in a way that respects temporal dependence. There are specialized versions of cross validation and the bootstrap designed for time series. Using the ordinary versions will give biased estimates that make it seem like your model is performing better than it really is. This is a tricky issue that needs a lot of care.
  • – user20160 Oct 16 '19 at 23:25
  • Thank you for your help! That makes sense--forgive my ignorance in terms of terminology. Would something like nested cross validation (https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9) be a better option for time series data? Specifically the 'day forward chaining' method – Ana Oct 16 '19 at 23:35
  • Whoops, I had not yet read your answer below--I will take a look at that! – Ana Oct 16 '19 at 23:47
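The forward-chaining idea raised in the comments can be sketched roughly as follows, adapted to annual data: each model is fit only on years that come before the year it predicts. This is an illustrative sketch, not a recommendation from the thread; the function names, the `min_train=10` starting window, and the synthetic data are assumptions.

```python
import numpy as np


def forward_chaining_splits(n_years, min_train=10):
    """Yield (train_idx, test_idx) pairs in which every training year precedes the test year."""
    for end in range(min_train, n_years):
        yield np.arange(end), np.array([end])


def forward_chaining_predictions(climate, ice_off, min_train=10):
    """Fit a line on the years before each test year and predict that one year."""
    climate = np.asarray(climate, dtype=float)
    ice_off = np.asarray(ice_off, dtype=float)
    observed, predicted = [], []
    for train, test in forward_chaining_splits(len(climate), min_train):
        slope, intercept = np.polyfit(climate[train], ice_off[train], deg=1)
        predicted.append(slope * climate[test[0]] + intercept)
        observed.append(ice_off[test[0]])
    return np.array(observed), np.array(predicted)


# Synthetic stand-in data in chronological order (hypothetical, NOT real records).
rng = np.random.default_rng(0)
climate = rng.normal(0.0, 2.0, size=19)
ice_off = 120.0 - 3.0 * climate + rng.normal(0.0, 2.0, size=19)

observed, predicted = forward_chaining_predictions(climate, ice_off)
print("mean absolute error (days):", np.mean(np.abs(observed - predicted)))
```

With 19 years and a 10-year starting window this gives 9 out-of-sample predictions, which can be pooled and scored with whatever metric is preferred.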