I have some questions about using the Residual Sum of Squares (RSS) as a criterion for selecting the best regression model. Suppose we have two models $M_1$ and $M_2$ with $M_1 \subset M_2$, meaning that the variables used in $M_1$ are a subset of the variables used in $M_2$. Then we know that $RSS_2 \leq RSS_1$. Is this because the method we use to estimate the models (e.g. least squares) itself minimizes the RSS? What if we fitted the models with another method that is not based on minimizing the RSS?
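To convince myself of the inequality, I wrote a small simulation (a minimal sketch in Python/NumPy; the data and variable names are made up purely for illustration): adding a column to the design matrix, even a pure-noise column, never increases the RSS of the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                # a predictor unrelated to y
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

def rss(X, y):
    """RSS of the least-squares fit of y on X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X1 = np.column_stack([ones, x1])          # M1: intercept + x1
X2 = np.column_stack([ones, x1, noise])   # M2: M1 plus the irrelevant column

print(rss(X1, y), rss(X2, y))             # RSS_2 <= RSS_1 always, since least squares
                                          # minimizes the RSS over a strictly larger
                                          # space of coefficient vectors for M2
```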
Since the above holds, we cannot compare such nested models using the RSS as a criterion. Can we use it for models $M_1$ and $M_2$ when neither is a subset of the other? Here there are two sub-cases: the models have either the same number of variables or a different number. If they have the same number of variables, I understand that using the RSS is reasonable, since the problem above does not arise, so a smaller RSS indicates a better model. But what about the case of a different number of variables? Why can't we compare these models now that one is not a subset of the other? Is it because we want to hold the number of variables fixed so that neither model has an unfair advantage, in the sense that the model with more variables would have an advantage purely because of the number of its variables?
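Here is a rough sketch of what I mean (again Python/NumPy with made-up data, just to illustrate the concern): a model built from many pure-noise predictors can usually attain a smaller RSS than a one-variable model containing the true signal, simply because of its size.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x_true = rng.normal(size=n)
y = 2.0 + 3.0 * x_true + rng.normal(size=n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X_small = np.column_stack([ones, x_true])                  # 1 predictor: the true signal
X_big = np.column_stack([ones, rng.normal(size=(n, 95))])  # 95 pure-noise predictors

print("small model RSS:", rss(X_small, y))
print("big model RSS:  ", rss(X_big, y))
# On most draws the big model has the smaller RSS even though none of its
# predictors is related to y: with 95 columns it can soak up noise, so a raw
# RSS comparison rewards it for its size rather than its quality.
```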