
This page discusses why Minitab does not compute $R^2$ for nonlinear regression. I understand that calculating $R^2$ between the response and the predictor ($y$ vs $x$) is not justified. However, is there any reason why calculating $R^2$ between the response and the predicted response ($y$ vs $\hat{y}$) is invalid?

(I know there are other goodness-of-fit metrics that may be better suited for nonlinear regression, but in this case I'm interested in $R^2$.)

Pete
  • @Sal You must have dropped several words from that comment: I'm having trouble finding any interpretation that is correct. – whuber Jan 14 '20 at 23:13
  • Yikes. Thanks. That comment wasn't meant to be published yet. :) Anyway, my intended point was: if you calculate an r-squared between y and y-hat, that may indicate that e.g. the linear relationship between y and y-hat is strong, but it doesn't necessarily indicate that the y and y-hat values are similar in value. You might look at measures of "accuracy". – Sal Mangiafico Jan 14 '20 at 23:32
  • great answer, thanks @SalMangiafico! – Pete Jan 14 '20 at 23:56
  • Possible dups: https://stats.stackexchange.com/questions/79225/is-r2-valid-in-a-nonlinear-model, https://stats.stackexchange.com/questions/168599/should-we-report-r-squared-or-adjusted-r-squared-in-non-linear-regression, https://stats.stackexchange.com/questions/359906/is-r-squared-truly-an-invalid-metric-for-non-linear-models – kjetil b halvorsen Jan 15 '20 at 01:51
  • While I concede that this is shameful self-promotion, you might want to look at the first half of my post here: https://stats.stackexchange.com/questions/427390/neural-net-regression-sse-loss. When you expand the total sum of squares, there’s a term that drops out in linear regression but remains for nonlinear models. – Dave Jan 15 '20 at 01:57
  • @kjetilbhalvorsen I think these are all related, but not quite the same question. If they're deemed similar enough feel free to close this. – Pete Jan 15 '20 at 06:21
  • @Dave thanks for sharing, although I think it also doesn't quite get at my question – Pete Jan 15 '20 at 06:22
  • In linear regression, $R^2$ tells you how much of the variability in $y$ is explained by the model. Because that “other” term in my post is zero, that tells you how much variability remains unaccounted for. When “other” is not zero, you get no such insight. – Dave Jan 15 '20 at 10:36
  • One thing to look at is Efron's pseudo r-square. It is based on the difference between predicted y values and observed y values, so its explanation is pretty intuitive. – Sal Mangiafico Jan 18 '20 at 16:31
  • @SalMangiafico "but doesn't necessarily indicate that the y and y-hat values are similar in value" this is also true for linear regression, right? – Alexis Mar 30 '23 at 17:51
  • @Alexis In-sample, for OLS estimation of a linear model that has an intercept, the squared correlation between predicted and observed values is a function of the residual sum of squares. – Dave Mar 30 '23 at 17:55
  • @Dave Can you more directly link your comment to my previous one? (I did follow the link to you self-promoted lovely question and answer, BTW. :) – Alexis Mar 30 '23 at 17:56
  • The residual sum of squares is a measure of how similar the $y$ and $\hat y$ values are: $RSS = \sum\left(y_i - \hat y_i\right)^2$. – Dave Mar 30 '23 at 17:57
  • Yes, but I believe my comment still remains salient in linear regression, both for a specific value of $x$ (e.g., $x_i$ or $x^*$) and for small ranges of $x$, particularly when the linearity assumption is strictly violated. – Alexis Mar 30 '23 at 17:59
  • @Alexis That makes it sound like the model would have a poor fit, which sounds like a situation where $y$ and $\hat y$ are rather dissimilar. I'm not sure I'm following what you're trying to convey. – Dave Mar 30 '23 at 18:00
  • I read the portion of @SalMangiafico's comment "but doesn't necessarily indicate that the y and y-hat values are similar in value" as connoting that this only applies to nonlinear regression, and hence my question (which I thought was straightforward, can I improve it somehow?). – Alexis Mar 30 '23 at 18:03
  • @Alexis I think the reconciliation is that the squared correlation between predictions and observations does relate to the residual sum of squares in OLS linear regression (with an intercept), while that need not be the case in other settings. – Dave Mar 30 '23 at 18:05
  • I think my comment has been sufficiently explored through the additional comments, and the posted answers. But just to give an example, given X = c(1,2,3,4,5,6); Y = c(4,5,6,8,7,9), and the model, model1 = lm(Y ~ X), cor(Y, predict(model1))^2 = 0.889. But if we use, say, a linear model without the intercept, model2 = lm(Y ~ X + 0), the correlation between y and y-hat is still 0.889, even though the fit is relatively poor. On the other hand, pseudo r-squares, a la # 3 in @Dave 's answer, are 0.889 and 0.214. – Sal Mangiafico Mar 31 '23 at 23:36
  • It came to my attention the existence of "Forecast skill scores", which are to scoring rules what deviance is to log-likelihoods. So there's also another avenue for R-squared there. – Firebug Feb 10 '24 at 21:27

2 Answers


It depends on what is meant by $R^2$. In simple settings, multiple definitions give equal values.

  1. Squared correlation between the feature and outcome, $(\text{corr}(x,y))^2$, at least for simple linear regression with just one feature

  2. Squared correlation between the true and predicted outcomes, $(\text{corr}(y,\hat y))^2$

  3. A comparison of model performance, in terms of square loss (sum of squared errors), to the performance of a model that predicts $\bar y$ every time

  4. The proportion of variance in $y$ that is explained by the regression

In more complicated settings, these are not all equal. Thus, it is not clear what constitutes the calculation of $R^2$ in such a situation.

I would say that #1 does not make sense unless we are interested in a linear model between two variables, which leaves the second option as viable. Unfortunately, this correlation need not have much to do with how close the predictions are to the true values. For instance, if you always predict high (or low) by the same amount, the correlation will still be perfect: $y = (1,2,3)$ and $\hat y = (101, 102, 103)$ have $\text{corr}(y, \hat y) = 1$. That such egregiously poor performance can be missed by this statistic makes it of questionable utility for model evaluation (though it might be useful for flagging a model as having some kind of systematic bias that could be corrected). When we use a linear model fit with OLS (and include an intercept), such pathological in-sample predictions cannot happen. When we deviate from that setting, all bets are off.

However, Minitab appears to take the stance that $R^2$ is calculated according to idea #3.

$$ R^2=1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $$

(This could be argued to be the Efron pseudo $R^2$ mentioned in the comments.)

This means that Minitab takes the stance, with which I agree, that $R^2$ is a function of the sum of squared errors, which is a typical optimization criterion for fitting the parameters of a nonlinear regression. Consequently, any criticism of $R^2$ is also a criticism of SSE, MSE, and RMSE.
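Since the formula above makes $R^2$ a fixed decreasing transform of SSE (the data determine SST once and for all), the two must rank candidate models identically. A small Python sketch with made-up numbers, just to illustrate:

```python
# Toy data (hypothetical values, purely for illustration)
y = [2.0, 4.0, 6.0, 8.0]
y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares, fixed by the data

def sse(y_hat):
    """Sum of squared errors of predictions y_hat against y."""
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))

def r2(y_hat):
    """Definition #3: R^2 = 1 - SSE/SST; no linearity assumed anywhere."""
    return 1 - sse(y_hat) / sst

fit_a = [2.1, 3.9, 6.2, 7.8]  # a close fit
fit_b = [3.0, 3.0, 7.0, 7.0]  # a coarser fit

# Lower SSE <=> higher R^2, since SST does not depend on the model
print(sse(fit_a), r2(fit_a))
print(sse(fit_b), r2(fit_b))
```

Whichever model has the smaller SSE has the larger $R^2$, with no assumptions about how the predictions were produced.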

I totally disagree with the following Minitab comment.

“As you can see, the underlying assumptions for R-squared aren’t true for nonlinear regression.”

To arrive at the above formula, I assumed nothing except that we are interested in estimating conditional means and use square loss to measure the pain of missing. You can go through the decomposition of the total sum of squares (the denominator) to get the "proportion of variance explained" interpretation in the linear OLS setting (with an intercept), sure, but you do not have to.

Consequently, I totally disagree with Minitab on this.

Dave
  • +1 Is there a name for $\text{corr}(y,\widehat{y})$? It seems so intuitive, and "Duh!", and yet I am not recognizing it (although, obvs, I recognize $y_i - \widehat{y}_i$). – Alexis Mar 30 '23 at 18:02
  • @Alexis Unfortunately, the best I think I can do is to say that $r^2 = \left(\text{corr}\left(y,\hat y\right)\right)^2$. Yikes! – Dave Mar 30 '23 at 18:03
  • Well that is just fine. :) – Alexis Mar 30 '23 at 18:05

Taking the other side: the $R^2$ in OLS has a number of definitions and interpretations that are endemic to OLS. For instance, a "perfect" fit has $R^2 = 1$ and, conversely, a "worthless" fit has $R^2 = 0$. In OLS the $R^2$ is interpreted as a "proportion of 'explained' variance" in the response. It also has the formula $R^2 = 1 - SSR/SST$.

You say "non-linear regression", but I think you mean generalized linear models. These are heteroscedastic models that, rather than transforming the response variable itself, model a transformation of its mean (via the link function) and express the mean-variance relationship explicitly, such as in a Poisson regression, where the variance of the response is proportional to its mean. Contrast this with non-linear least squares, where the $R^2$ continues to be a very useful metric.

So if we consider GLMs, none of the interpretations we enjoy about the $R^2$ are valid.

  • A "perfect" fit will not necessarily predict every observation exactly at every observed level, so the theoretical upper bound on $R^2$ may be some value less than 1. And adding a predictor to a model does not improve the $R^2$ optimally in terms of that predictor's contribution: non-linear least squares would do that.
  • The probability model for a GLM does not invoke a "residual" per se (and methods that do use residuals do not treat them as normally distributed). So the formula neither makes sense nor can it be interpreted as a fraction of "explained" variance.
  • While incremental increases in the $R^2$ indicate improved predictiveness, you have no guarantee about the scales or unit differences involved. For instance, if two candidate predictors $u, v$ each increase $R^2$ by 5% when added as separate regressors in separate models, the first, $u$, may predict variance really well in the tails but overall be a very lousy predictor with disappointingly non-significant results, whereas the second, $v$, may not appear to improve predictions much, but when accounting for areas of low variance, its overall contribution is substantially better and corroborates statistical significance.
  • Applying $R^2$ in a GLM regardless is called a pseudo $R^2$.

In that regard, the GLM has a much more useful statistic, the deviance, which R even reports as a default model summary statistic. The deviance generalizes the residual sum of squares of an OLS model, which has an identity link and Gaussian variance structure. For a model such as Poisson regression, the expression is:

$$ D = 2 \overset{N}{\underset{i=1}{\sum}}\left( y_i \log\frac{y_i}{\hat y_i} - \left( y_i - \hat y_i \right) \right)$$
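Written out per observation, the Poisson deviance is $2\sum_i \left(y_i \log(y_i/\hat y_i) - (y_i - \hat y_i)\right)$, with the $y_i \log(y_i/\hat y_i)$ term taken as 0 when $y_i = 0$ (its limiting value). A minimal Python sketch with toy numbers (the function name and data are mine, not from any package):

```python
import math

def poisson_deviance(y, y_hat):
    """Poisson deviance: D = 2 * sum( y*log(y/y_hat) - (y - y_hat) ).
    The y*log(y/y_hat) term is defined as 0 when y == 0 (its limit)."""
    total = 0.0
    for yi, mi in zip(y, y_hat):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total

# A perfect fit has zero deviance; any miss makes it positive
print(poisson_deviance([1, 2, 3], [1, 2, 3]))  # 0.0
print(poisson_deviance([0, 1, 3, 5], [0.5, 1.2, 2.8, 4.9]))
```

This plays the role that the residual sum of squares plays in OLS: zero for a perfect fit, growing as the fit worsens, with no appeal to a "proportion of variance explained".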

AdamO
  • +1 I like the generalizations of $R^2$ to comparisons on whatever metric of interest, such as McFadden’s pseudo $R^2$ comparing log loss of a logistic regression (or whatever) to the log loss of a model that predicts the prior probability every time. I find it useful to think in terms of a comparison to such baseline, “must-beat” performance. – Dave Mar 30 '23 at 19:07
  • @Dave agreed! And it should be emphasized that, while benchmarking objectives should be a part of every analysis, you can't universally apply "cutoffs" to every analysis. Not the 0.05 level for significance, and definitely not thresholds such as 0.50 or 0.80 for AUCs, R^2, etc. – AdamO Mar 30 '23 at 19:09
  • Not clear to me from the original question why nonlinear regression means GLMs, rather than, say, nonlinear least squares regression or similar? – Alexis Mar 30 '23 at 21:49
  • @Alexis I poked through the Spiess article and I just completely disagree with their point. They discuss NLS for qPCR data without a log transform, so it's not clear - and I'm leaning toward no - that they accounted for heteroscedasticity. What are we even supposed to do with this? https://bmcpharma.biomedcentral.com/articles/10.1186/1471-2210-10-6/figures/1 Minitab and Spiess are wrong. – AdamO Mar 30 '23 at 22:06