I have 5 features in my data. The R squared value when I use features 1, 2, and 3 is $x$, and the R squared value when I use features 1, 3, and 4 is $x + 0.1$.
Does this mean my second model is better than the first model?
The answer comes down to what you mean by "better." $R^2$ is an appropriate measure of goodness of fit in an ordinary least squares (OLS) regression, provided you are confident that all the conditions needed for its application hold.
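For reference, the usual definition is

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

the fraction of the variance in $y$ explained by the fitted model.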
Here is a simple example to illustrate the point.
A response variable $y$ is plotted on the vertical axes against two (uncorrelated) explanatory variables (both of which exhibit the same range from $1$ through $8$). The univariate least-squares fits and $R^2$ values are shown. You decide which is the better model. Is the issue really settled by a mere comparison of the $R^2$ values?
SuperUser, here are a couple of examples along the lines of what Galen suggested.
For example, one model may be chasing (fitting) the noise more closely than the other. One could assess predictive R squared (a form of leave-one-out cross-validation), and it is possible that the model with the higher R squared has a lower predictive R squared. R squared is highly dependent on the particular dataset and may not reflect the model's ability to predict new data.
Here is a general introductory discussion of R squared vs Predictive R squared with some good links: https://www.datasciencecentral.com/alternatives-to-r-squared-with-pluses-and-minuses
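For a quick illustration (using R's built-in mtcars data purely as an example, not your data), a leave-one-out predictive R squared can be computed from the PRESS statistic:

# Leave-one-out (PRESS-based) predictive R squared, illustrated on mtcars
data(mtcars)
fit <- lm(mpg ~ disp + hp, data = mtcars)
# The leave-one-out residual equals the ordinary residual / (1 - leverage)
press <- sum((residuals(fit) / (1 - lm.influence(fit)$hat))^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
pred_r2 <- 1 - press / tss
pred_r2

If pred_r2 is much lower than the ordinary R squared, that is a sign the model is fitting noise rather than signal.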
Another example: let's say the cost of obtaining feature 4 is high (it is difficult to get), while feature 2 is easier (less costly) to obtain. The first model could then be "better" even though its R squared is not as high.
I would say you can't conclude that the second model is better based on R squared alone. For example, if we look at the mtcars dataset in R, we can see that even including a random variable with no relation to the response increases the R squared.
set.seed(1)
data("mtcars")

# Model with two real predictors
two_pred <- lm(mpg ~ disp + hp, data = mtcars)
summary(two_pred)$r.squared  # 0.748

# Add a predictor that is pure noise, unrelated to mpg
random <- rnorm(nrow(mtcars))
two_pred_random <- lm(mpg ~ disp + hp + random, data = mtcars)
summary(two_pred_random)$r.squared  # 0.752
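As a related sanity check on the same two models, adjusted R squared (also reported by summary) penalizes extra terms and need not rise when the noise variable is added:

summary(two_pred)$adj.r.squared         # penalized for number of predictors
summary(two_pred_random)$adj.r.squared  # typically lower or barely changed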