
Given that GLMs are generally fit using iteratively reweighted least squares (based on a Fisher scoring algorithm to maximize the likelihood, which is a variant of Newton-Raphson, see this post), and that the last iteration hence is just a weighted LS fit, why is it that GLMs do not return, as a measure of the quality of the fit, the R2 based on the weighted MSE derived from this weighted LS fit (as in the get_r2_ss function given here, https://gist.github.com/dhimmel/588d64a73fa4fef02c8f)? After all, the Wald-type p values of a GLM are based on this same weighted LS fit anyway. And given that in each iteration the IRLS algorithm works with a Taylor approximation of the full maximum likelihood objective, I would think that this comes down to basing the R2 on a pointwise quadratic approximation around the maximum likelihood optimum, on a transformed scale that makes the errors approximately normal, and with optimized weights that take into account the mean/variance relationship on that transformed scale. It would seem that one could regard it as a pointwise version of the deviance-based pseudo R squared measure 1 - DRES/DTOT = (LLM - LL0)/(LLS - LL0), where DRES = residual deviance, DTOT = null deviance, and LLM, LL0 and LLS = the maximized log-likelihoods of your model, of an intercept-only null model and of a fully saturated, perfectly fitting model. This measure is also known as D squared or the explained deviance, I think due to Cameron 1993, and is evaluated around the maximum likelihood optimum of your model; it differs from McFadden's pseudo R2 (as it is usually interpreted) as well as from Cox & Snell's, as both of those ignore the log-likelihood of the saturated model LLS (or assume it is 0).

To give a concrete example, a GLM would be fit using iteratively reweighted least squares as follows:

glm.irls = function(X, y, family=binomial, maxit=25, tol=1e-08, beta.start=rep(0,ncol(X))) {
  if (is.function(family)) family <- family()
  beta = beta.start
  for (j in 1:maxit) {
    eta    = X %*% beta                                # linear predictor
    g      = family$linkinv(eta)                       # expected response mu
    gprime = family$mu.eta(eta)                        # derivative dmu/deta
    z      = as.vector(eta + (y - g) / gprime)         # = "observed adjusted responses"
    W      = as.vector(gprime^2 / family$variance(g))  # IRLS weights
    betaold = beta
    wlmfit  = lm.wfit(x=X, y=z, w=W)       # weighted LS fit of "adjusted response" z on X
    beta    = matrix(coef(wlmfit), ncol=1) # weighted LS fit = solve(crossprod(X,W*X), crossprod(X,W*z))
    z_pred  = wlmfit$fitted.values         # = "fitted adjusted responses"
    if (sqrt(crossprod(beta-betaold)) < tol) break
  }
  # refit the final iteration with lm() so that summary()$r.squared etc. are available
  if (ncol(X)>1) { wlmfit = lm(z~1+X[,-1], weights=W, x=T, y=T) } else { wlmfit = lm(z~1, weights=W, x=T, y=T) }
  list(coefficients=beta, iterations=j, z=z, z_pred=z_pred, weights=W, X=X, y=y, wlmfit=wlmfit)
}

Here I also return, along with the coefficients, the weighted least squares fit of the last IRLS iteration. So why can we not just use the R2 of this weighted LS fit as a measure of the quality of the GLM fit? It would correctly take into account both the fact that we are working on a transformed link scale AND the correct mean/variance relationship on that scale. So I don't quite buy the concerns or issues with R2 values for GLMs raised here. Or am I missing something?

For example for logistic regression:

data("Contraception",package="mlmRev")
R_GLM = glm(use ~ age + I(age^2) + urban + livch,
            family=binomial, x=T, data=Contraception)
R_GLM0 = glm(use ~ 1,
            family=binomial, x=T, data=Contraception) # intercept only model

Some commonly reported pseudo R2 values:

library(DescTools)
PseudoR2(R_GLM, which="all")
# McFadden        McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann          Effron McKelveyZavoina 
# 6.686859e-02    6.146508e-02    8.568618e-02    1.160955e-01    2.650769e-02    9.160375e-02    8.541285e-02    1.164111e-01 
# Tjur             AIC             BIC          logLik         logLik0              G2 
# 8.565253e-02    2.431659e+03    2.470630e+03   -1.208829e+03   -1.295455e+03    1.732505e+02
1-logLik(R_GLM)/logLik(R_GLM0) # McFadden's pseudoR2=0.06686859
DTOT = summary(R_GLM)$null.deviance # null deviance 
DRES = summary(R_GLM)$deviance # residual deviance
1-DRES/DTOT # deviance-based R2=D2=explained deviance, here equal to McFadden's pseudoR2 = 0.06686859 since LLS, the LL(saturated model) for logistic model=0

Now the GLM estimated with our IRLS implementation, with the R2 calculated from the final weighted LS iteration of the GLM fit, gives us:

IRLS_GLM = glm.irls(X=R_GLM$x, y=R_GLM$y, family=binomial)
print(data.frame(R_GLM=coef(R_GLM), IRLS_GLM=coef(IRLS_GLM))) # coefficients match with glm output
summary(IRLS_GLM$wlmfit)$r.squared # 0.07305641

The returned R2 (0.07305641) is close to McFadden's (which here for the binomial case equals the deviance-based R2 value 1-DRES/DTOT, i.e. explained deviance, since LLS=LL(saturated model)=0 for the logistic case), Efron's and Cox & Snell's but not quite identical. The GLM R2 thus calculated would be derived as

z = IRLS_GLM$z # observed data on "adjusted response" scale
z_pred = IRLS_GLM$z_pred # fitted data on "adjusted response" scale
w = IRLS_GLM$weights # 1/variance weights on "adjusted response" scale
ss_residual = sum(w * (z - z_pred) ^ 2)
ss_total = sum(w * (z - weighted.mean(z, w)) ^ 2)
r.squared = 1 - ss_residual / ss_total # = 0.07305641

which is also equivalent to

  • the R2 of a weighted OLS regression of the predicted on the observed adjusted response,

summary(lm(IRLS_GLM$z_pred~IRLS_GLM$z, w=IRLS_GLM$weights))$r.squared # = 0.07305641

  • the squared weighted Pearson correlation between the observed and predicted adjusted response (provided the regression of z_pred on z is positive)

boot::corr(d=cbind(IRLS_GLM$z, IRLS_GLM$z_pred), w=IRLS_GLM$weights) ^ 2 # = 0.07305641

  • the explained deviance D2 of a weighted OLS regression of the predicted on the observed adjusted response. The difference with the traditional D2 measure is then merely what one takes as the null model and as the fully saturated, perfectly fitting model: here it is, respectively, an intercept-only model drawn through the observed adjusted responses and a perfect fit to the adjusted responses, whereas in the traditional D2 measure for GLMs the null model is an intercept-only GLM fitted to the original responses (which in terms of variance assumptions would always be very far from the mark) and the saturated model is a GLM with as many covariates as observations (which would of course be massively overfitted and very unstable):

GLM_z_pred_z = glm(IRLS_GLM$z_pred~IRLS_GLM$z, w=IRLS_GLM$weights)
1-summary(GLM_z_pred_z)$deviance/summary(GLM_z_pred_z)$null.deviance # = 0.07305641

For a family with an identity link, the "adjusted response scale" z = eta + (y - g) / gprime is just equal to the observed response y (since g = eta and gprime = 1), and the weights W = gprime^2 / family$variance(g) reduce to 1/family$variance(X %*% beta), i.e. one over the variance of the expected response yhat. Our GLM R2 measure would then simply be the explained variance in the response, taking into account the mean/variance relationship on the response scale. For an identity link Poisson GLM, for example, the adjusted responses would equal the raw responses and the weights would be 1/variance = 1/yhat. The R2 would then be calculated exactly as for any weighted least squares analysis, and giving less weight in the R2 measure to the squared deviations with greater variance is of course the logical thing to do. But in general, the R2 would work for any exponential family and link function; the weights would then be 1/variance weights on the adjusted response scale. One could also calculate it whether or not the GLM itself had been estimated via IRLS. If one wanted to, one could also develop an R2 measure that works on the original response scale, by checking how the 1/variance weights on the transformed adjusted response scale translate back to 1/variance weights on the original response scale using the delta method. But this would be less attractive because of the asymmetry in the errors in GLMs due to the bounded responses, which the link scale resolves. A similar recipe should also apply to GLS models.
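To make the identity-link case concrete, here is a minimal sketch (the simulated data and starting values are illustrative, not part of the examples below):

set.seed(123)
x <- runif(200, 1, 10)
y <- rpois(200, 2 + 3 * x)
fit <- glm(y ~ x, family = poisson(identity), start = c(2, 3))
mu <- fitted(fit)                              # = yhat
w <- 1 / mu                                    # 1/variance weights (Poisson: variance = mean)
ss_res <- sum(w * (y - mu)^2)                  # z = y when the link is the identity
ss_tot <- sum(w * (y - weighted.mean(y, w))^2)
1 - ss_res / ss_tot                            # the proposed GLM R2 for this fit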

My immediate question is whether anybody in the statistics community has published the use of an R2 / goodness-of-fit value defined like this for GLMs. The R2 as defined like this obviously reduces to the correct R2 expected for a regular OLS fit if Gaussian errors are used. Note that based on this requirement alone, only my GLM R2, the explained deviance D2, Cox & Snell's R2 and Efron's R2 would be OK. McFadden's, for example, is not, as it ignores the saturated-model log-likelihood, making it only valid for logistic regression. Cox & Snell's is also not valid in general, as it likewise ignores the saturated-model log-likelihood and merely rescales its value to make it match the R2 of a regular OLS fit with Gaussian noise. Finally, Efron's is not valid in general because, even though it is based on 1-RSS/TSS like my GLM R2, it works on the original response scale, which has asymmetrical non-Gaussian errors, and it ignores the variance weights (at best it should use weights calculated from the weights on the adjusted response scale via the delta method).

Numerical demonstration, first the OLS R2=0.01382265:

data(iris)
R_OLS = lm(Sepal.Length ~ Sepal.Width, data=iris)
summary(R_OLS)$r.squared # OLS R2 = 0.01382265
R_OLS_GLM = glm(Sepal.Length ~ Sepal.Width, family=gaussian(identity), x=T, data=iris)

Various pseudo R2 measures and the explained deviance D2 (only the explained deviance D2, Cox & Snell's R2 and Efron's R2 match the OLS R2):

library(DescTools)
PseudoR2(R_OLS_GLM, which="all")
# McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann          Effron McKelveyZavoina 
# 5.672311e-03   -5.194906e-03    1.382265e-02    1.512261e-02              NA              NA    1.382265e-02              NA 
# Tjur             AIC             BIC          logLik         logLik0              G2 
# NA    3.719917e+02    3.810236e+02   -1.829958e+02   -1.840398e+02    2.087861e+00
DTOT = summary(R_OLS_GLM)$null.deviance # null deviance 
DRES = summary(R_OLS_GLM)$deviance # residual deviance
1-DRES/DTOT # deviance-based R2 = D2 = explained deviance = 0.01382265, here equal to the regular OLS R2, Efron's and Cox & Snell's, but not McFadden's (since McFadden's assumes the log-likelihood of the saturated model is 0, which here is not the case)!

My GLM R2 obviously matches the expected OLS R2, since z = y and all weights are equal to 1 with family=gaussian(identity), which makes the IRLS algorithm converge in one iteration, as there are no weights to optimize:

IRLS_OLS = glm.irls(X=R_OLS_GLM$x, y=R_OLS_GLM$y, family=gaussian(identity), beta.start=coef(R_OLS_GLM))
print(data.frame(R_OLS=coef(R_OLS_GLM), IRLS_OLS=coef(IRLS_OLS))) # coefficients match
summary(IRLS_OLS$wlmfit)$r.squared # 0.01382265 - matches R2 of regular OLS model and Cox & Snell's pseudo R2 (1-(LL_intercept_model / LL_full_model) ^ (2 / n)) & Efron's pseudo R2 & 1-DRES/DTOT, but not McFadden's (1-LL_full_model/LL_intercept_model) !

So why is no one seemingly using this GLM R2 value as defined above, even though it would seem to have some nice properties: (1) it reduces to the expected value for regular OLS regression / Gaussian GLMs, (2) it works on a transformed scale where the errors are asymptotically normal, so that working with sums of squares makes sense, (3) it takes into account the heteroscedasticity on this scale, i.e. the expected mean/variance relationship on the adjusted response scale, (4) it is maximized by the maximum likelihood objective of the GLM fit (this is also true for the explained deviance D2), and (5) it requires fitting only a single model, as opposed to your model, a null model and a saturated model for the deviance-based R2? What would be the objection to using it as a measure of the goodness of fit of a GLM? And can one see it as a pointwise estimate of the deviance-based pseudo R squared measure 1 - DRES/DTOT = (LLM - LL0)/(LLS - LL0), also known as D squared or the explained deviance (I think due to Cameron 1993), where DRES = residual deviance, DTOT = null deviance, and LLM, LL0 and LLS = the maximized log-likelihoods of your model, an intercept-only null model and a fully saturated, perfectly fitting model?

My GLM R2 measure would have the interpretation of the weighted explained variance on the adjusted response scale, which is equivalent to measuring how much better one can predict the adjusted responses knowing your covariates than with an intercept-only model. This is directly analogous to a regular OLS R2, with the only difference that we are working on a transformed "adjusted response" (z) scale on which the errors are approximately Gaussian and symmetric, and that we take into account any heteroscedasticity on that scale. The deviance-based R2 measure 1-DRES/DTOT, by contrast, measures how much the log-likelihood of your model improves by knowing your covariates compared to fitting an intercept-only model to your data (which in the case of a GLM makes variance assumptions that are very far from the mark), proportionally to a fully saturated model (which would be massively overfitted, as it fits one covariate per observation).
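As a quick sanity check of this intercept-only interpretation, reusing the logistic regression objects from above (fit0, ss0 and ssm are just illustrative names):

z <- as.vector(IRLS_GLM$z)
w <- IRLS_GLM$weights
fit0 <- lm(z ~ 1, weights = w)           # intercept-only model on the adjusted responses
ss0  <- sum(w * residuals(fit0)^2)       # baseline: weighted SS around the weighted mean
ssm  <- sum(w * (z - IRLS_GLM$z_pred)^2) # weighted SS around the model fit
1 - ssm / ss0                            # = 0.07305641, the proposed GLM R2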

Based on the three interpretations that an ideal R2 measure for GLMs should possess, as outlined here, namely (1) R-squared as explained variability, (2) R-squared as the improvement from an intercept-only model to your fitted model and (3) R-squared as the square of a correlation, plus (4) the requirement that the R2 be maximized by the maximum likelihood objective of your fit, my GLM R2 measure would have all four interpretations or properties, which is probably the only measure I have seen so far for which this holds: (1) it is the explained variability in your response on a transformed "adjusted response" (z) scale on which the errors are asymptotically normal, (2) it measures how much better these adjusted responses can be predicted than by just taking the (weighted) average of the adjusted responses, which is the value you would get with an intercept-only model fitted to these adjusted responses, (3) it is equal to the square of the (weighted) Pearson correlation between the observed and predicted adjusted responses, and (4) it is maximized by the ML objective of the GLM (since, like the explained deviance D2, it is just a linearly rescaled version of the log-likelihood of your model). None of the other proposed measures have all these features. Based on this, I would call this the R2 of the model, and not just yet another pseudo R2 measure... And given that it would make intuitive sense, I find it extra weird that R, Stata and SAS can all output various pseudo R2 measures for GLMs (e.g. McFadden's, but using a definition that ignores the log-likelihood of the saturated model), except, seemingly, the only two sensible ones, which I would think would be either the explained deviance D2 or the measure I describe above...

One thing that some people might find troubling is that, e.g. for logistic regression, the maximum of our GLM R2 would not be 1, but personally I feel that this is logical: the binarized scale on which the measurements are taken in logistic regression obviously implies that your points can never coincide with your model predictions (either on the adjusted response scale or on the original response scale). Only if one had knowledge of the true underlying model (e.g. in the case of simulated data) could one calculate the explained variance of a weighted OLS regression of the true, noise-free adjusted responses on the predicted (fitted) adjusted responses, using the true known weights, i.e. using summary(lm(z_true~z_pred, w=w_true))$r.squared with w_true = 1/true_variance; that would give a measure where the R2 could truly reach 1. Again, I was wondering if anyone has done this in the literature?
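A minimal simulation sketch of this point for logistic regression (the data-generating model and all variable names here are illustrative):

set.seed(1)
n <- 5000
x <- rnorm(n)
eta_true <- -0.5 + 1.2 * x                  # true linear predictor
p_true   <- plogis(eta_true)
y <- rbinom(n, 1, p_true)
fit <- glm(y ~ x, family = binomial)
z_pred <- predict(fit)                      # fitted adjusted responses = fitted linear predictor
w_true <- p_true * (1 - p_true)             # true IRLS weights for the logit link
# against the noisy observed adjusted responses: bounded well away from 1
z_obs <- eta_true + (y - p_true) / (p_true * (1 - p_true))
summary(lm(z_obs ~ z_pred, weights = w_true))$r.squared
# against the noise-free truth: essentially 1 for this well-specified model
summary(lm(eta_true ~ z_pred, weights = w_true))$r.squared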

In general, what I am asking is: given that any GLM is fit using iteratively reweighted least squares, why can one not use that weighted least squares fit to produce a valid and useful R2 value? It seems such an obvious thing to do that it would be hard to believe no one has done it before.

EDIT: For a concrete example of where I thought an R2 would be useful for GLMs and would make sense: I was fitting nonnegative identity link Poisson GLMs with inbuilt regularisation to do signal deconvolution. The method uses an iterative adaptive ridge regression procedure, which allows one to choose the level of regularisation lambda so that BIC ends up being optimised and one can approximate best subset selection. My simulated signal is a spike train blurred with a given point spread function, with Poisson noise on it. I do the deconvolution by regressing the observed signal onto a banded covariate matrix that contains time-shifted copies of the known point spread function, using nonnegativity constraints on the coefficients.

Here an example:

library(remotes)
remotes::install_github("tomwenseleers/L0glm/L0glm")
library(L0glm)
library(export)

# simulated dataset: blurred superimposed spike train (spike train convolved
# with a given point spread function, with Poisson noise on it)
set.seed(1)
sim <- simulate_spike_train(n = 100,        # nr of observations
                            p = 100,        # nr of variables
                            k = 10,         # true nr of nonzero coefficients
                            family = "poisson",
                            mean_beta = 30, # true nonzero coefficients have lognormal distribution
                            sd_logbeta = 1)
X <- sim$X                   # design matrix = temporally shifted copies of known point spread function
y <- sim$y                   # observed signal (with Poisson noise)
beta_true <- sim$beta_true   # true coefficients used in simulation
y_true <- sim$y_true         # true underlying signal without Poisson noise
weights_true <- 1/y_true     # true 1/variance weights for identity link Poisson case

# coefficients estimated by L0 glm (approximating the GLM that optimises BIC, i.e. best subset)
crit <- "bic" # IC to optimise; options aic, bic, gic, ebic, mbic
coefs <- L0glm.fit(X, y,
                   family = poisson(identity),
                   lambda = preset.lambda(IC = crit, # use lambda that optimises the given IC
                                          y = y, X = X,
                                          family = poisson(identity)),
                   # lambda = 100,
                   nonzero.centered = TRUE, # nonzero-centered ridge regression for iterative adaptive ridge regression updates to approximate best subset
                   nonnegative = TRUE       # enforce coefficients to be nonnegative
                   )$coefficients

# fitted values
y_pred <- as.vector(coefs %*% sim$X)

# estimated 1/variance weights
weights_est <- 1/y_pred

plot(sim$x, y, type = "l")                      # observed signal (black)
points(sim$x, beta_true, pch = 16, col = "red") # true coefficients
points(sim$x, coefs, pch = 16, col = "blue")    # estimated coefficients
graph2png(file="..//benchmarks/plots/single channel deconvolution Poisson.png", width=7, height=5, dpi=150)

Signal (black), true coefficients (red) & estimated coefficients (blue) then look like this :


I would define the weighted R2 and adjusted weighted R2, using 1/variance weights (1/expected value for the identity link Poisson case), like this:

# define functions to calculate weighted R2 and adjusted weighted R2 :
get.R2 = function (z_obs, # observed data on adjusted response scale
                   z_pred, # fitted data on adjusted response scale
                   weights, # weights from last IRLS GLM iteration, e.g. 1/variance for identity link GLMs
                   intercept=FALSE) { 
  good = as.vector(weights<1) # to remove infinite & unreliable weights
  ss_residual = sum(weights[good] * (z_obs[good] - z_pred[good]) ^ 2)
  if (intercept) { wmean = weighted.mean(z_obs[good], weights[good]) } else { wmean = 0 }
  ss_total = sum(weights[good] * (z_obs[good] - wmean) ^ 2)
  r.squared = 1 - ss_residual / ss_total 
  return(r.squared) }

get.adjR2 = function (z_obs,   # observed data on adjusted response scale
                      z_pred,  # fitted data on adjusted response scale
                      weights, # weights from last IRLS GLM iteration, e.g. 1/variance for identity link GLMs
                      intercept=FALSE,
                      p        # nr of variables in the model (nr of nonzero coefficients, incl intercept)
                      ) { 
  good = as.vector(weights<1) # to remove infinite & unreliable weights
  n = sum(good) # nr of observations
  ss_residual = sum(weights[good] * (z_obs[good] - z_pred[good]) ^ 2)
  if (intercept) { wmean = weighted.mean(z_obs[good], weights[good]) } else { wmean = 0 }
  ss_total = sum(weights[good] * (z_obs[good] - wmean) ^ 2)
  df_residual = n-p-1
  df_total = n-1
  r.squared = 1 - (ss_residual/df_residual) / (ss_total/df_total)
  return(r.squared) }

For identity link functions the adjusted response scale z is the same as the response scale y, which means that the adjusted weighted R2 between the predicted and the observed signal, using the estimated 1/variance weights, gives us an adjusted R2 = 0.96794:

get.adjR2(z_obs = y,
          z_pred = y_pred,
          weights = weights_est,
          p = sum(coefs!=0),
          intercept = FALSE) 
# 0.96794

Ideally, what I would like to have is an unbiased estimator of the weighted R2 between the predicted signal and the true underlying signal without noise. Here this would give a weighted R2 of 0.9967505. The question is whether the adjusted weighted R2 between the predicted and the observed signal is an unbiased estimator of this. For a wide range of lambda values (both low and very high lambdas) it seems to be in the right ballpark, but for this particular lambda, and when the R2 gets really close to 1, it is higher than the estimate above. I presume this is because for the observed signal sim$y we only have access to one realisation of the Poisson noise, and the discrete nature of the noise causes the R2 to be underestimated?

# weighted R2 between predicted signal & true underlying signal without noise,
# using true 1/variance weights from simulation
get.R2(z_obs = y_true, z_pred = y_pred, weights = weights_true, intercept = FALSE)
# 0.9967505
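To probe this suspected downward bias, one could do a small Monte Carlo check with a perfect "model" (a sketch with a toy signal; y_true_toy and w_toy are illustrative names, not the simulated spike train above):

set.seed(2)
y_true_toy <- 50 * exp(-((1:100) - 50)^2 / 50) + 2 # toy noise-free signal
w_toy <- 1 / y_true_toy                            # true 1/variance weights
r2 <- replicate(1000, {
  y_obs <- rpois(length(y_true_toy), y_true_toy)   # one Poisson realisation
  get.R2(z_obs = y_obs, z_pred = y_true_toy, weights = w_toy, intercept = FALSE)
})
mean(r2) # sits clearly below 1, even though the "model" here equals the truth exactly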

  • I don't know about in general, but for logistic regression I use a thing called Tjur's $R^2$. It was mentioned in Hosmer's book and it's available in the R package PseudoR2. That package lists several closely related definitions. – olooney Jun 11 '19 at 22:38
  • Yes, Tjur's is also listed above in the output of DescTools' PseudoR2; but my question is more general, namely: given that any GLM is fit using iteratively reweighted least squares, why can one not use that weighted least squares fit to produce an R2 value? – Tom Wenseleers Jun 11 '19 at 22:40
  • @Tom Wenseleers: The accepted answer for the linked question gives an answer that pertains to GLMs. Have a look at that answer and see if it resolves your question. If not, please let us know. – Ben Jun 12 '19 at 09:42
  • "Kvalseth (1985) argued that it’s actually preferable that R2 not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods." https://statisticalhorizons.com/wp-content/uploads/GOFForLogisticRegression-Paper.pdf – jsk Jun 12 '19 at 10:35
  • @jsk Yes, but I am sure that the R2 measure above could still be derived in different ways, even if one fitted the model using different methods. In the same way that say Wald confidence intervals can be derived from the variance-covariance matrix of the weighted least squares fit used to fit a GLM or as a quadratic approximation to the actual maximum likelihood function. – Tom Wenseleers Jun 12 '19 at 11:48
  • Yes, but confidence intervals are evaluated based on coverage probability, not on how they are derived. R2 starts with the total variability around $\bar{Y}$ as the baseline amount of variability. However, your version changes the baseline total variability, because the weights change due to the assumed mean/variance relationship and the estimated means. You are changing how much weight is placed on predicting different points, which means you couldn't really compare your R2glm across models in terms of predictive power. – jsk Jun 12 '19 at 15:37
  • If we could agree on what makes a pseudo R2 measure useful and valid, then we could address whether or not the proposed method satisfies those criteria. – jsk Jun 14 '19 at 23:53
  • Edited my question to clarify that - I think what we would have here would not just be merely a "pseudo R2" btw, rather it would be the only R2 that would make sense for a GLM and would be the direct analog of the R2 of an OLS regression. It would also have all the traditional required properties of an R2. – Tom Wenseleers Jun 15 '19 at 07:21
  • Thanks for the edits and the additional links. If we accept the desired properties in the links, then perhaps it does satisfy some of those properties. One of the weird things though is "improvement from null model". In your case though, the weights are applied both to the SS for the fitted model as well as the null model. So, as you change your model, you are changing the baseline measure that you are comparing it to, which seems very bizarre. The likelihood based methods don't have this inconsistency nor does Efron's. – jsk Jun 15 '19 at 10:13
  • Have you reviewed papers that discuss the calculation of R2 in the case of weighted least squares? http://gseacademic.harvard.edu/~willetjo/pdf%20files/Willett_Singer_AS88.pdf It looks like there are potentially some arguments in the literature against using the proposed method even in that scenario. – jsk Jun 15 '19 at 10:21
  • Well, that paper doesn't make much sense to me: all those R2s proposed in Kvalseth (1985) just seem to drop out of the air without any logical argumentation whatsoever... Starting from Kvalseth's R2, which no one uses and which doesn't have any generality, Willett & Singer then plugged weights into Kvalseth's formula, which they feel doesn't behave as it should, and then go on to use something that doesn't use X*sqrt(W) and Y*sqrt(W), which clearly can't be right... (that pseudo R2 isn't even at a maximum at the WLS-estimated beta coefficients) – Tom Wenseleers Jun 15 '19 at 17:12
  • Here are at least several generally defined R2s with strong theoretical foundations, which are maximized by the original WLS objective, and which all give the same result for WLS: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f The deviance-based R2 formula of course also works, and gives the same result. – Tom Wenseleers Jun 15 '19 at 17:15
  • @SextusEmpiricus Well, R2 is the explained variance of a linear model under asymptotic normality. But in a GLM we are still doing a quadratic approximation of the maximum likelihood function & assuming asymptotic normality on the adjusted z scale. So my question is why not also report the explained variance on that scale & use that as a fit metric? (Or the adjusted R2, correcting for the nr of parameters in the model.) And would that not make more sense than all those other suggested pseudo R2 values for GLMs? – Tom Wenseleers Jun 20 '23 at 15:08
  • @SextusEmpiricus I think the R2 value calculated like this would remain the same if one fitted a binomial GLM as Bernoulli, yes... – Tom Wenseleers Jun 20 '23 at 15:10
  • @SextusEmpiricus Well for the moment I was fitting nonnegative identity link Poisson GLMs, and there I found such an R2 or adjusted R2 measure at least much more intuitive than deviance, log likelihood or AIC or BIC... Wouldn't you agree? At least it allows for a clear interpretation of goodness of fit in absolute terms... – Tom Wenseleers Jun 20 '23 at 15:13
  • @SextusEmpiricus OK, fine, but I mean: given that an identity link Poisson GLM is like a regular weighted least squares regression (or GLS if you like) with 1/variance weights, would it not make just as much sense to use a (weighted) R2 there as for data with Gaussian noise? – Tom Wenseleers Jun 20 '23 at 15:20
  • @SextusEmpiricus Well, but none of this addresses my question. I am asking: accepting that the (adjusted) R2 is considered by most a useful fit metric for linear models with Gaussian errors, what is the objection to extending the concept to GLMs by working with the explained variance on the adjusted z scale instead? An answer of "R2 is actually never of any use" is not the answer I'm looking for then... – Tom Wenseleers Jun 20 '23 at 15:31
  • @SextusEmpiricus But why not, given that, due to the central limit theorem, the data on that adjusted z scale are also asymptotically normal? In fact, that's why one can get away with testing the significance of GLM coefficients using regular z tests (with the test statistic calculated as the coefficient over its SE)? – Tom Wenseleers Jun 20 '23 at 15:43
  • @TomWenseleers You might be making a common incorrect claim about the CLT. The data do not tend toward normality as the sample size increases. – Dave Jun 20 '23 at 15:48
  • @Dave OK, maybe I'm not putting this correctly. But can anyone explain to me, given that R2 makes sense for linear models with Gaussian noise, why a weighted R2 would not make sense for an identity link Poisson GLM with 1/variance (= 1/the expected value) weights? – Tom Wenseleers Jun 20 '23 at 15:57
  • @TomWenseleers it is not so much the Poisson distribution, but more the type of experiments that create Poisson distributed data for which correlation makes less sense. (In a way Poisson distributions are actually not that bad and people do use $R^2$ with them a lot. Although the usefulness can be debated.) But for something like binomial distributed data the interpretation becomes a lot less clear. For example when we have a correlation between gender (100 men 100 women) and disease (10 sick and 190 healthy) then the squared correlation can be at most 0.05 – Sextus Empiricus Jun 20 '23 at 19:02
  • More a qualitative attempt: I would compare goodness of fit with the assumption of equal variances, which means you can have a good fit with a low R2 and the other way around. I think the chi-square can be compared to a check on equal variances. – user390540 Jun 16 '23 at 22:04
  • @SextusEmpiricus I have added a concrete case of an identity link Poisson GLM I was working with where I thought some R2 measure would be useful - do you think what I wrote makes sense? – Tom Wenseleers Jun 21 '23 at 11:59
  • No offense intended, but this question is too long in my opinion. – Firebug Jun 21 '23 at 12:12
  • @TomWenseleers in that application that you added in your latest edit, one might wish to reduce the least squares error (which is indirectly an increase of the $R^2$) and one might aim to have a model that 'explains' the variance in the signal. But the interpretation is not always so clear. If you sample a Poisson distribution, then you will always have some variance due to noise, a variance that the deterministic model is not required (or not even supposed) to explain. ... – Sextus Empiricus Jun 21 '23 at 12:21
  • ... This can become tricky in the comparison of different experiments with different S/N ratios. If you increase the duration of measurement (to get a stronger signal), then your signal increases and so does the noise. But say your model scales with a factor $a$: then the variance of the noise scales with a factor $a$, but the variance of the signal scales with a factor $a^2$, and so the $R^2$ value will change without modifying the model. The $R^2$ is then not only a reflection of the quality of the model, but also of the signal-to-noise ratio. That can be confusing. – Sextus Empiricus Jun 21 '23 at 12:21

2 Answers

  1. It is not obvious what should be considered $R^2$ in a GLM. For instance, in their discussion about just logistic regression, UCLA gives eight possibilities, all of which mimic the usual $R^2$ in some desirable way. If you do a Poisson regression or some other GLM, there are additional possibilities, and it is not obvious what you should calculate. My preference would be to use a comparison of your model performance to the performance of a baseline model, but it is not clear on what criteria the models should be evaluated. Log-likelihood makes a lot of sense to me, but so does square loss (at least if we call the value $R^2$).

  2. The use is not clear. Yes, at first, $R^2$ seems like a way of assigning a letter grade like in school where $R^2=0.95$ is an $A$ that makes us happy while $R^2=0.41$ is an $F$ that makes us sad. However, assessments out of context are fraught with problems. The answer linked here does an outstanding job, I believe, of discussing this and even does it in the context of $R^2$ and not some other measure of performance. Therefore, an $R^2$-style statistic has the potential to confuse matters by leading someone to dismiss a model with $R^2=0.41$ as being an $F$-quality model when it might be among the best scores anyone has ever achieved.

  3. It would be an in-sample assessment of performance that can be made very high by fitting to the training data at the expense of generalizability. While an adjusted $R^2$ has a reasonable motivation in the OLS linear model setting, it is not even clear what should be compared, let alone how to calculate unbiased estimators.

  4. If you just calculate the squared correlation between the predicted and true values, you can wind up seeing a perfect fit according to this correlation despite the fitted values being awful. I give simulations here where this happens (squared correlation is $1$, graph shows terrible predictions), and there can be examples that are less extreme.
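A toy illustration of point 4 (illustrative numbers, not the linked simulation):

set.seed(3)
y_obs <- rnorm(100)
pred  <- 100 * y_obs + 50 # terrible predictions, yet perfectly correlated
cor(y_obs, pred)^2        # = 1
mean((y_obs - pred)^2)    # enormous squared error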

Keeping in mind the above, an $R^2$ value for a GLM seems to be a rather advanced idea. If someone had a reason for wanting such a value, they must know statistics in some detail, and such a person likely has the ability to program a function to do the exact calculation they want.

Dave
  • Thanks for this - I added a concrete case where I was working with an identity link Poisson GLM and where I thought some R2 measure would make intuitive sense. I used adjusted R2 there, to also take care of your concern of overfitting, and also used a weighted R2 to compare the fitted values with the true underlying signal used in the simulation (without noise), which is ideally what I would like to estimate from my data. – Tom Wenseleers Jun 21 '23 at 12:02
  • @TomWenseleers That all sounds like it could be very wonderful work. It's not clear that this is so straightforward to understand that it should be included as a default output in software under the name $R^2$, however. If you want to consider this to be something like "Wenseleers's pseudo $R^2$" for a Poisson regression, that's fine. I see value in McFadden's pseudo $R^2$ in a logistic regression, so why can't you find yours valuable for a Poisson regression? – Dave Jun 21 '23 at 13:06
  • Agreed - can't quite make up my mind myself... And sorry my question now became ridiculously long. :-) – Tom Wenseleers Jun 21 '23 at 13:09

An answer to 'why no $R^2$ for glm?' may be found in questions like 'what is actually the use of $R^2$?'.

I can see $R^2$ being used as (1) a coefficient of determination, quantifying how much of the variability in the data a model explains, and (2) a measure of the correlation between individual cases in a population. Neither case is an argument that it should also be used for GLMs.


Tricky issue with glm 1

If you want the coefficient of determination, then you can always compute it.

However, for most glm models it makes little sense to compute it, and the resulting $R^2$ is not even always the same: it depends on how the data are aggregated.

Say I have coin flips, coin 1 HHTTH and coin 2 HTTHT, which I model with equal probability of heads and tails as the null model, or with p = 0.6 for coin 1 and p = 0.4 for coin 2. Then, computed per flip, my $R^2$ can be nearly zero, $$1-\frac{6 \times 0.4^2+4 \times 0.6^2}{10 \times 0.5^2} = 0.04,$$ while computed per coin it can be one: $$1-\frac{(3-3)^2+(2-2)^2}{(3-2.5)^2+(2-2.5)^2} = 1.$$
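The same arithmetic in R, aggregated both ways:

# per-flip R2: 10 flips, fitted probabilities 0.4/0.6, null probability 0.5
1 - (6 * 0.4^2 + 4 * 0.6^2) / (10 * 0.5^2)                # 0.04
# per-coin R2: observed heads (3, 2), fitted (3, 2), null (2.5, 2.5)
1 - ((3 - 3)^2 + (2 - 2)^2) / ((3 - 2.5)^2 + (2 - 2.5)^2) # 1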

One can argue that this is also true for normally distributed variables: if one takes sums or averages of multiple cases, then the $R^2$ will increase as the noise levels reduce due to the averaging.

But for the coefficient of determination one is often interested in the relationship/correlation between individual cases from the population, e.g. the correlation between the height of mom and dad and that of their children, and not in averages of such variables. Problems described with glm models are often not of that type (comparing correlations).

Tricky issue with glm 2

For cases that are tackled with a GLM, the relationship with the residual variance is often of less importance, or at least complicated, as there is already an intrinsic variance (like in the coin flips above). The correlation between the data might be less important when there is some intrinsic variance in the model that cannot be eliminated.

In a GLM the variance is even explicitly modelled, with the variance being a function of the expected value. When we apply a GLM we are working with problems of a different nature. For example, if you have Poisson distributed data, then you won't be aiming for a high $R^2$, as there is an intrinsic unexplained variance in the model. This is why people may rather look at something like the deviance of the tested model compared with the deviance difference between the full model and the null model.

  • But let's take the case of an identity link Poisson GLM, for example: that would be identical to a generalised least squares analysis with 1/variance = 1/expected value weights. It would seem entirely reasonable there to use a weighted R2 as explained variance. What would be wrong with doing so? – Tom Wenseleers Jun 20 '23 at 16:04
  • @TomWenseleers Watch out for calling it "explained variance" when you do something other than linear regression that minimizes square loss, as the decomposition of the total sum of squares gets funky. This is related, even if not identical. // You can calculate whatever you want; if you motivate your calculation, it might be quite useful. I hope, however, that my answer explains enough about why determining the desired calculation is not so straightforward as to warrant a particular calculation by default. – Dave Jun 20 '23 at 16:11
  • Thanks for this - I added a concrete case where I was working with an identity link Poisson GLM and where I thought some R2 measure would make intuitive sense. I used adjusted R2 there, to also take care of the concern of overfitting, and also used a weighted R2 to compare the fitted values with the true underlying signal used in the simulation (without noise), which is ideally what I would like to estimate from my data in an unbiased way. – Tom Wenseleers Jun 21 '23 at 12:03