I have a dataset with a target variable and multiple independent variables. Some of the independent variables are highly correlated with each other (sometimes $r > 0.9$). At first I thought I'd build a linear regression model using forward selection regardless of that fact, and I ended up with a very good adjusted $R^2$ (0.82). But when I try to remove one of the highly correlated variables from the model, the $R^2$ gets worse, as does the RMSE. For example, arvi, savi, and rvi are strongly correlated with each other, but when I remove one of them, the $R^2$ gets worse. I want to know whether I should leave them in the model or replace the group of three independent variables with just one of them.
Stepwise regression is fraught with problems. You might be starting off by making a problematic model. // If your stepwise algorithm that you trust (or trusted before you read the link) said to include those variables, and your measure of performance $(R^2_{adj})$ is worse with those variables excluded, what would be the argument for excluding those variables? – Dave Nov 01 '21 at 17:39
I don't really have an argument for excluding those variables apart from the $R^2$ and RMSE values. If I only keep savi (the variable most correlated with the target), the p-values of the other variables stay significant, but $R^2$ and RMSE get worse, though not by much ($R^2$ drops from 0.82 to 0.79). I've heard that stepwise regression is bad, but I don't know which feature selection method I should use instead for this linear regression. – Ashraf.R Nov 01 '21 at 18:03
1 Answer
There might be reasons to want a model with fewer parameters. For instance, you might want to fit a model based only on the variables that make theoretical sense.
However, a common reason for wanting to remove variables is that a high variable count risks overfitting, and it is true that including many variables carries that risk. However, checking the adjusted $R^2$ to make sure the fit still improves after being penalized for the higher parameter count, or checking RMSE (particularly an out-of-sample RMSE), helps you guard against overfitting. In that sense, you have acknowledged the possibility that including additional parameters risks overfitting, and you have done some due diligence to protect against it; a sketch of such an out-of-sample check follows.
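As a minimal sketch of that kind of check, assuming your data sit in a pandas DataFrame with the column names from your question (arvi, savi, rvi, and a target column; the file name and the choice of 5-fold cross-validation are my own assumptions), you could compare cross-validated RMSE for the full model and for a reduced model that keeps only one of the correlated indices:

```python
# Sketch: compare out-of-sample RMSE (5-fold CV) for the full model
# versus a reduced model that keeps only savi from the correlated trio.
# Column names come from the question; everything else is assumed.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("vegetation_indices.csv")  # hypothetical file name
y = df["target"]

full_features = ["arvi", "savi", "rvi"]   # add whatever else forward selection kept
reduced_features = ["savi"]               # keep only one of the correlated trio

def cv_rmse(features):
    """Average RMSE over 5 cross-validation folds (out-of-sample estimate)."""
    scores = cross_val_score(
        LinearRegression(), df[features], y,
        scoring="neg_root_mean_squared_error", cv=5,
    )
    return -scores.mean()

print("Full model CV RMSE:   ", cv_rmse(full_features))
print("Reduced model CV RMSE:", cv_rmse(reduced_features))
```

If the reduced model's cross-validated RMSE is not better, that is evidence the extra correlated variables are not merely overfitting noise.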
Belief in this kind of variable removal is particularly common when variables are correlated. The thinking seems to be that the modeler can reduce the variable count to guard against overfitting while retaining most of the information contained in the variables (since much of it is duplicated). I sympathize with this viewpoint and discuss here why it need not work out so cleanly. A quick way to quantify how much the indices overlap is sketched below.
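To put a number on how much the indices duplicate each other, one rough sketch (reusing the assumed DataFrame from the previous snippet) is to look at their pairwise correlations and variance inflation factors:

```python
# Sketch: pairwise correlations and VIFs for the three correlated indices.
# The DataFrame df is assumed to be loaded as in the earlier snippet.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

indices = ["arvi", "savi", "rvi"]
print(df[indices].corr())  # the question reports r > 0.9 for some pairs

X = sm.add_constant(df[indices])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, "VIF:", variance_inflation_factor(X.values, i))
```

High VIFs confirm the redundancy, but redundancy by itself does not mean the variables must be dropped; the out-of-sample checks above are the more direct test.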
Overall, "but they were correlated" seems like a weak justification for removing variables if your checks of overfitting (such as adjusted $R^2$ or an out-of-sample RMSE) indicate a better fit before you removed that correlated variable.