
I have a dataset with 84 rows and 18 predictors, and I need to predict Annual_salary from the data provided.


I created a model after converting all the categorical variables to factors. Then I used the olsrr package to run best-subset selection, which settled on about 7 predictors. Using those predictors I fit a model and got an RMSE of around 240,000.
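For reference, here is a minimal sketch of that workflow, assuming the training file is called train.csv and that Annual_salary is the response column; the refit formula at the end is only a placeholder for whichever 7 predictors the search actually selects:

```r
library(olsrr)

# Assumed file/column names: train.csv with Annual_salary as the response.
train <- read.csv("train.csv", stringsAsFactors = FALSE)

# Turn every character column into a factor so lm() builds dummy variables.
train[] <- lapply(train, function(x) if (is.character(x)) factor(x) else x)

# Full model, then an exhaustive subset search (this gets slow with 18 predictors).
full_model <- lm(Annual_salary ~ ., data = train)
subsets <- ols_step_best_subset(full_model)
print(subsets)  # adjusted R^2, Mallows' Cp and BIC for each subset size

# Refit with the 7 predictors the search reports; the formula below is a
# placeholder -- substitute the actual column names.
seven_pred <- lm(Annual_salary ~ ., data = train)
sqrt(mean(residuals(seven_pred)^2))  # in-sample RMSE (optimistic)
```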

How can I reduce this RMSE? I'm new to R programming and need some help. What steps should I take with this data, and in what order?

What other models (algorithms) could I use for prediction? I have tried lasso, ridge, and multiple linear regression, and got the lowest RMSE with MLR. I may not have implemented any of these correctly, so I'm not sure that statement is valid.
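In case the lasso/ridge part is where things went wrong, here is a hedged sketch with glmnet, reusing the `train` data frame from above (column names are again assumptions). cv.glmnet reports cross-validated mean squared error, so the square root of its minimum gives a held-out RMSE rather than a training-set fit:

```r
library(glmnet)

# glmnet needs a numeric design matrix, so expand the factors into dummies.
x <- model.matrix(Annual_salary ~ ., data = train)[, -1]  # drop the intercept column
y <- train$Annual_salary

set.seed(1)
lasso_cv <- cv.glmnet(x, y, alpha = 1, nfolds = 5)  # alpha = 1: lasso
ridge_cv <- cv.glmnet(x, y, alpha = 0, nfolds = 5)  # alpha = 0: ridge

# Cross-validated RMSE at the best lambda (cvm holds the CV mean squared error).
sqrt(min(lasso_cv$cvm))
sqrt(min(ridge_cv$cvm))
```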

Link to the data: GDrive

Update: I tried random forests with the 7 predictors and got an RMSE of around 106k on the placement_test data (using the entire dataset for training). I can get these scores because it's a Kaggle competition that returns an RMSE score when I submit predictions.
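For completeness, a sketch of a random-forest fit with the randomForest package; placement_test.csv comes from the question, but the preprocessing and the use of all columns rather than only the 7 selected ones are my assumptions. The out-of-bag error gives a rough check without touching the test file:

```r
library(randomForest)

set.seed(1)
rf_fit <- randomForest(Annual_salary ~ ., data = train, ntree = 500)

# Out-of-bag RMSE: each tree is scored on the rows it did not see,
# so this is not just a training-set fit.
sqrt(tail(rf_fit$mse, 1))

# Predictions for the competition test file. Factor levels in the test data
# must match those seen in training, or predict() will error.
test <- read.csv("placement_test.csv", stringsAsFactors = FALSE)
for (col in names(test)) {
  if (is.character(test[[col]]) && col %in% names(train)) {
    test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
  }
}
pred <- predict(rf_fit, newdata = test)
```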

  • 84 rows and 18 predictors (some of which are categorical with multiple levels) looks like a textbook case for overfitting, even if you do "best subset" variable selection. There is not very much you can do. Best would be to understand your data, which in your situation is probably not a realistic next step. Be aware that the longer you tweak your data or try other algorithms to improve on the test set, the more likely you are to overfit to this specific set, with poor generalization. – Stephan Kolassa Nov 27 '22 at 09:35
  • @StephanKolassa I believe what you're describing is exactly what I'm running into. But how do I know whether I'm overfitting the data? I ran ols_step_all_possible and found that the 7-predictor model had the highest adjusted R², lowest Cp, and lowest BIC. Could I still be overfitting? How do I stop myself from doing so? – Sharan Shetty Nov 27 '22 at 18:31
  • You would have to assess your model on another, new dataset that was never used in training. Compare it to very simple models, e.g., all the models with no more than one or two predictors (there is a small cross-validation sketch after these comments). Which you probably can't do, this being a Kaggle competition. – Stephan Kolassa Nov 27 '22 at 20:20
  • Compare the recent M5 forecasting competition, also on Kaggle: the leaderboards changed drastically when they evaluated the final testing data, which implies to me a strong degree of overfitting to the "unknown" validation set (which was not shown to contestants, but they could evaluate their models five times a day on this "unseen" data - this was obviously enough for people to overfit). – Stephan Kolassa Nov 27 '22 at 20:20
  • @StephanKolassa When you say I need to assess my model on a new dataset: I do assess it on unseen data, since placement_test.csv does not include the y_test values. Is there still a chance of overfitting when they evaluate it on the final test data? – Sharan Shetty Nov 28 '22 at 03:48
  • Yes, that is precisely what happened at the M5. People submitted predictions for unseen test data to the server and got back evaluations. They then used these to "improve" their models. After a while, these test data were revealed, people tuned their models further and finally submitted a single prediction for completely unseen validation data. The leaderboard shifted dramatically between the leaders on the test data and those on the validation data - because you can easily overfit to test data even if you see your evaluation only. So yes, to avoid this you need completely new data. – Stephan Kolassa Nov 28 '22 at 07:32
  • Hey! Then what do you suggest I do to lower the overall RMSE in the right way? I've also sent a connection request on LinkedIn if you're further interested. – Sharan Shetty Nov 28 '22 at 07:46
  • Honestly, with just 84 rows, there is simply very little you can do reliably, especially with complex data-driven methods. I would try theory-driven approaches and time-box the analysis, and in the final write-up draw attention to the fact that there is a clear danger of overfitting. – Stephan Kolassa Nov 28 '22 at 08:09
  • Alright! Thanks. – Sharan Shetty Nov 28 '22 at 14:42
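To make the suggestion in the comments concrete, here is a small cross-validation sketch (my own, not from the comments) that compares a larger model against a one-predictor baseline on held-out folds. The formulas are placeholders, and factors with rare levels may need to be dropped or the folds stratified so that predict() does not hit unseen levels:

```r
set.seed(1)
# Held-out RMSE for a given formula, using k random folds.
rmse_cv <- function(formula, data, k = 7) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    sqrt(mean((data$Annual_salary[folds == i] - pred)^2))
  })
}

# If the 7-predictor model is not clearly better than a one- or two-predictor
# baseline on held-out folds, the extra terms are probably overfitting.
mean(rmse_cv(Annual_salary ~ ., train))               # stand-in for the 7-predictor model
# mean(rmse_cv(Annual_salary ~ some_predictor, train))  # one-predictor baseline (placeholder name)
```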

0 Answers