I have some questions about using the Residual Sum of Squares (RSS) as a criterion for selecting the best regression model. Suppose we have two models $M_1$ and $M_2$ with $M_1 \subset M_2$, meaning that the variables used in $M_1$ are a subset of the variables used in $M_2$. Then we know that $RSS_2 \leq RSS_1$. Is this because the method we use to estimate the models (e.g. least squares) itself minimizes the RSS? What if we fitted the models with another method that is not based on minimizing the RSS?
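To convince myself of the inequality, I wrote a small simulation (a minimal sketch in Python/NumPy; the data and variable names are made up purely for illustration): adding a column to the design matrix, even a pure-noise column, never increases the RSS of the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                # a predictor unrelated to y
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

def rss(X, y):
    """RSS of the least-squares fit of y on X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X1 = np.column_stack([ones, x1])          # M1: intercept + x1
X2 = np.column_stack([ones, x1, noise])   # M2: M1 plus the irrelevant column

print(rss(X1, y), rss(X2, y))             # RSS_2 <= RSS_1 always, since least squares
                                          # minimizes the RSS over a strictly larger
                                          # space of coefficient vectors for M2
```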
Since the above holds, we cannot compare such nested models using the RSS as a criterion. Can we use it for models $M_1$ and $M_2$ when neither is a subset of the other? Here there are two sub-cases: the models have either the same number of variables or a different number. If they have the same number of variables, I understand that using the RSS is reasonable, since the problem above does not arise, so a smaller RSS indicates a better model. But what about the case of a different number of variables? Why can't we compare these models now that one is not a subset of the other? Is it because we want to hold the number of variables fixed so that neither model has an unfair advantage, in the sense that the model with more variables would have an advantage purely because of the number of its variables?
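Here is a rough sketch of what I mean (again Python/NumPy with made-up data, just to illustrate the concern): a model built from many pure-noise predictors can usually attain a smaller RSS than a one-variable model containing the true signal, simply because of its size.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x_true = rng.normal(size=n)
y = 2.0 + 3.0 * x_true + rng.normal(size=n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X_small = np.column_stack([ones, x_true])                  # 1 predictor: the true signal
X_big = np.column_stack([ones, rng.normal(size=(n, 95))])  # 95 pure-noise predictors

print("small model RSS:", rss(X_small, y))
print("big model RSS:  ", rss(X_big, y))
# On most draws the big model has the smaller RSS even though none of its
# predictors is related to y: with 95 columns it can soak up noise, so a raw
# RSS comparison rewards it for its size rather than its quality.
```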