
I have a small dataset of 800 data points on which I need to perform a regression task. I randomly chose 10% of the dataset to be used as the validation set.

The problem is that I am not sure whether I am overfitting. I can see that the RMSE and MAE for the validation set are worse than for the training set (as expected), but I cannot tell whether they are too much worse.

How can I understand if I am overfitting? How can I solve it?

# Define the parameters of the model
params = list(
  objective = "regression",
  metric = "l1"
)

# Define LightGBM model
model_lgbm_base = lgb.train(
  params = params,
  nrounds = 50,
  data = train_lgbm
)

# Predict on the training data
yhat_fit_base = predict(model_lgbm_base, as.matrix(train_model_x[, 2:12]))

# Predict on the validation data
yhat_predict_base <- predict(model_lgbm_base, as.matrix(val_x[, 2:12]))

# RMSE
rmse_fit_base = RMSE(as.numeric(unlist(train_model_y)), yhat_fit_base)    # 2.101565 RMSE train
rmse_predict_base = RMSE(as.numeric(unlist(val_y)), yhat_predict_base)    # 3.329543 RMSE val

# MAE
mae_fit_base = MAE(as.numeric(unlist(train_model_y)), yhat_fit_base)      # 1.601823 MAE train
mae_predict_base = MAE(as.numeric(unlist(val_y)), yhat_predict_base)      # 2.384942 MAE val

Rods2292

1 Answer


Yes, you are likely overfitting, because you get "45%+ more error" moving from the training set to the validation set. That said, overfitting is properly assessed by using a training, a validation and a testing set, because we can still overfit the validation set; CV.SE has a very enlightening thread on overfitting the validation set if one wants to look at this further.
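To make the three-way setup concrete, here is a minimal sketch of a random 70/15/15 split in R. The object name dataset and the proportions are illustrative assumptions, not something taken from your post.

# Sketch: random train/validation/test split (assumes a data.frame called `dataset`)
set.seed(42)
n   <- nrow(dataset)
idx <- sample(seq_len(n))                              # shuffle row indices

train_idx <- idx[1:floor(0.70 * n)]                    # first 70% for training
val_idx   <- idx[(floor(0.70 * n) + 1):floor(0.85 * n)]# next 15% for validation
test_idx  <- idx[(floor(0.85 * n) + 1):n]              # final 15% held out for testing

train_set <- dataset[train_idx, ]
val_set   <- dataset[val_idx, ]
test_set  <- dataset[test_idx, ]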

In addition to the point above, and more specific to this post: what we are missing is information about the variability of the performance metric used. For example, if our validation set performance is 80% and our test performance is 78% but our performance metric has a variability of $\pm 6\%$, then we are not overfitting; if the variability is $\pm 0.6\%$, we are. To get some idea of this variability we can use a resampling technique ($k$-fold cross-validation is the simplest form). Simply put, overfitting is "learning the noise", and resampling lets us gauge, and partly average out, the noise's influence.
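As a sketch of that idea with your own setup: lgb.cv from the R lightgbm package reports the per-fold mean and standard deviation of the chosen metric at every boosting round. It reuses the train_model_x / train_model_y objects and the 2:12 feature columns from your code; the choice of 5 folds and 50 rounds is only an assumption mirroring your original call.

# Sketch: estimate the variability of the l1 (MAE) metric via 5-fold CV
library(lightgbm)

dtrain <- lgb.Dataset(
  data  = as.matrix(train_model_x[, 2:12]),
  label = as.numeric(unlist(train_model_y))
)

cv_res <- lgb.cv(
  params  = list(objective = "regression", metric = "l1"),
  data    = dtrain,
  nrounds = 50,
  nfold   = 5
)

# Mean and standard deviation of the validation MAE at the final round
last    <- length(cv_res$record_evals$valid$l1$eval)
mean_l1 <- cv_res$record_evals$valid$l1$eval[[last]]
sd_l1   <- cv_res$record_evals$valid$l1$eval_err[[last]]
cat(sprintf("CV MAE: %.3f +/- %.3f\n", mean_l1, sd_l1))

If the gap between your training and validation MAE is large relative to that standard deviation, that is much stronger evidence of overfitting than the single split alone.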

In terms of references, a very accessible yet thorough discussion can be found in Roberts et al. (2016), "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure"; it deals with ecological data, but the lessons transfer across domains. More specific to ML, Raschka (2018), "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning", is a great short monograph on comparing ML algorithms and on the variance introduced by data sampling.

usεr11852