
I'm quite new to this, and learning R-squared and Machine Learning side by side. The problem I'm having may be my programming, or it may be that my stats knowledge is a little off.

I'm finding that after adding more predictors to a linear model fit with the caret library, the R-squared reported by the postResample function decreases. It also differs from the R-squared reported by summary(model).

When adding multiple variables - carat, depth, z, and x:

  • summary(model) R-squared is .95
  • postResample R-squared is .79

If I only use one variable - carat:

  • summary(model) R-squared is .91
  • postResample R-squared is .85

I'm not sure why postResample is giving a totally different R-squared. It may be that I've misunderstood its purpose. I'm also not sure why it decreases as I add more variables to the model.

My code:

library(caret)
library(ggplot2)   # provides the diamonds data set

# Target
y <- diamonds$price

# Create model
model <- train(y ~ carat + depth + z + x, data = diamonds, method = "lm")
summary(model)

# Extract predicted values from the model
ypred <- fitted(model)

postResample(ypred, y)

1 Answer


I'm not 100% sure whether this source settles it, but I found the following note in Chapter 17 of the caret documentation (http://topepo.github.io/caret/index.html):

"A note about how R2 is calculated by caret: it takes the straightforward approach of computing the correlation between the observed and predicted values (i.e. R) and squaring the value. When the model is poor, this can lead to differences between this estimator and the more widely known estimate derived from linear regression models." (My own addition: postResample is a function from the caret package.)

The latter is your summary() approach. So postResample is simply squaring the correlation between the observed and predicted values. You could also try the rSquared approach from the miscTools package (https://rdrr.io/rforge/miscTools/man/rSquared.html); it is calculated the same way as the one in summary().
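
To make the difference concrete, here is a minimal sketch (my own illustration, not taken from the caret documentation) of the two estimators written as plain R functions that you can apply to any pair of observed and predicted vectors:

# Squared correlation between observed and predicted values,
# the approach described in the caret note quoted above
r2_cor <- function(obs, pred) {
  cor(obs, pred)^2
}

# 1 - SSE/SST, the textbook definition reported by summary(lm)
r2_ss <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

Calling r2_cor(y, ypred) and r2_ss(y, ypred) on your own vectors shows directly which definition each of your reported numbers corresponds to.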

  • Using correlation between predicted and observed values is a completely invalid approach, as it allows for automatic recalibration: you can be off by a factor of 2 and have a perfect correlation (see the sketch after these comments). The formula to use for $R^2$ must be the primal estimator $1 - \frac{SSE}{SST}$, where $SSE$ is the sum of squared errors and $SST$ is the total sum of squares (the variance of Y multiplied by $n-1$). – Frank Harrell Feb 12 '22 at 16:26
  • Thanks for answering, but could you please highlight how your answer is related to the OP's question or to my answer? The OP asked why his R² approaches differ when he adds more predictors, and I recommended looking at the method he used and comparing it to the rSquared method in R, which is different from the postResample function. He didn't mention anything about deviation between predicted values and observed values. Maybe I'm too inexperienced :-) but I do not see the connection between your annotation and everything posted here; please help me. – Patrick Bormann Feb 12 '22 at 17:20
  • My comment was concerning your answer and was not a comment directed to the OP. You stated that $R^2$ was computed using a correlation coefficient. That is how you can compute the apparent $R^2$ in the training sample (because the automatic recalibration doesn't matter, i.e., the recalibration slope and intercept are 1.0 and 0.0). It's not how you compute $R^2$ in a validation sample. For that you need the "raw unfixed" $R^2$. – Frank Harrell Feb 13 '22 at 00:44
  • I stated nothing, to be honest; I quoted a source which claimed to do a correlation. The OP was concerned about a differing R2 depending on the method he chose, and I tried to tell him that the function he uses is different from a standard R-squared in R. Anyway, do you have any source on calculating an apparent R2 vs. a validation R2? I'm curious what you mean by that. Maybe you are referring to something I know, but using different terms? – Patrick Bormann Feb 13 '22 at 10:47
  • Calculations will disagree if one method uses a secretly-recalibrated correlation coefficient and the other computes sum of squared errors uncalibrated. Or I don't understand your question. – Frank Harrell Feb 13 '22 at 13:13
  • I give some Python simulations here that show what happens when you measure out-of-sample squared correlation between predictions and observations. Spoiler alert: it's bad news! (The rest of the post besides the Python code is more-or-less unrelated to the matter at hand.) – Dave Oct 23 '22 at 02:40
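
To illustrate Frank Harrell's point about recalibration, here is a small sketch with made-up numbers (my own example, not code from this thread): predictions that are systematically off by a factor of 2 still have a perfect squared correlation with the observations, while the $1 - \frac{SSE}{SST}$ estimator exposes the miscalibration.

# Simulated observations, and predictions that are off by a factor of 2
set.seed(42)
obs  <- rnorm(100, mean = 10, sd = 2)
pred <- 2 * obs

# Squared correlation: exactly 1, because correlation ignores the rescaling
cor(obs, pred)^2

# 1 - SSE/SST: far below 1 (strongly negative here), because the squared
# errors are huge relative to the variance of the observations
1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)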