I've done some reading here in the past, and my basic assumption is that for a generalized additive model (GAM) or a quantile regression (QR), the following is generally true:
- For a Gaussian distributed error term, the adjusted $R^2$ in GAMs functions similarly to its analogue in linear regression, and thus approximates the proportion of variance in the outcome explained by the predictors in the model.
- For the generalized case that extends to families like binomial, Poisson, etc. in GAMs, from what I gather, it is better to report the deviance explained, as the adjusted $R^2$ doesn't typically make sense in this setting.
- For QR, there seem to be alternative pseudo-$R^2$ values, such as $R^1$, which approximate the adjusted $R^2$ for a specific fitted quantile $\tau$. However, it appears that the creator of the quantreg package, Roger Koenker, does not have a high opinion of either $R^1$ or $R^2$, as noted here. (Both the deviance explained and $R^1$ are written out after this list for reference.)
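For reference, my understanding of the two measures mentioned above is that the deviance explained compares the model deviance to the null (intercept-only) deviance, while Koenker and Machado's $R^1$ does the analogous thing with the pinball (check) loss at quantile $\tau$:
$$ D^2 = 1 - \frac{D_{\text{model}}}{D_{\text{null}}}, \qquad R^1(\tau) = 1 - \frac{\sum_i \rho_\tau\big(y_i - \hat{y}_i(\tau)\big)}{\sum_i \rho_\tau\big(y_i - \hat{y}_{\text{null}}(\tau)\big)}, $$
where $\rho_\tau(z) = z\big(\tau - \mathbb{1}(z < 0)\big)$ and $\hat{y}_{\text{null}}(\tau)$ is the intercept-only fit.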
For a fitted quantile generalized additive model (QGAM), this complicates things slightly, as qgam estimates GAMs with a quantile-based approach (see here for details). The model output generally gives both the adjusted $R^2$ and the deviance explained, but none of the articles I have seen from the package creators mention whether these are meaningful for quantile fits or how to interpret them in the first place. For now, I plan on reporting both estimates from the qgam package and leaving others to interpret them, but it would be helpful to know what they actually measure, if anything at all, and which to use. My thinking is that because the fitting is non-parametric, perhaps the deviance explained is better, but I'm not sure mathematically how that works. It also appears that for some models I have fit in the past, the adjusted $R^2$ values can be tragically low, which I fear will unnecessarily bias reviewers against interpreting my models meaningfully.
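For context, both statistics appear directly in the printed model summary (a qgam fit inherits from mgcv's gam class, so this is just the usual summary.gam output); a minimal example:
#### Where the Two Statistics Appear ####
library(qgam)
library(MASS)

fit0 <- qgam(accel ~ s(times), data = mcycle, qu = .5)
summary(fit0) # prints "R-sq.(adj)" and "Deviance explained"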
Edit
In the comments it was suggested that perhaps $D^2$ may be useful. However, there are two issues with that. First, I don't know whether it captures nonlinear data well. Second, I'm not sure how to obtain it from a qgam fit. As an example of a fit using multiple quantiles, I provide the following below:
#### Load Libraries ####
library(qgam)
library(MASS)
#### Fit Multiple Quantile Model ####
fit <- mqgam(
  accel ~ s(times),
  data = mcycle,
  qu = c(.25, .5, .75)
)

#### Plot Pinball Loss ####
check.learnFast(fit$calibr)
This shows the pinball loss for each quantile of the QGAM, but I can't determine how that loss is calculated or how to obtain a $D^2$ statistic from it.
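If I'm reading the qgam docs right, qdo() applies a function to the fitted model for a single quantile of an mqgam object, so a hand-rolled sketch like the following (my own code, not the package's) should at least recover the raw pinball loss per quantile:
#### Sketch: Per-Quantile Pinball Loss from the mqgam Fit ####
y <- mcycle$accel
sapply(c(.25, .5, .75), function(q){
  res <- y - qdo(fit, q, fitted) # residuals at quantile q
  sum(ifelse(res >= 0, q * res, (q - 1) * res)) # pinball (check) loss
})
A per-quantile $D^2$ would then divide each of these by the corresponding intercept-only loss, as in Edit 3 below.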
Edit 2
The article from the authors of the package discusses the loss function in some sections, but it doesn't appear to be exactly the standard pinball loss, given that they use a Bayesian credible interval approach to estimating the splines that involves a "smoothed" version. See the quote below:
In particular, QGAMs are based on the pinball loss function (Koenker and Bassett 1978), rather than on a probabilistic model of the response distribution, and the absence of a likelihood function impedes direct application of the Bayes’ rule when updating the prior distribution on the regression coefficients given the observed responses. While this problem can be overcome by adopting the coherent Bayesian belief updating framework of Bissiri, Holmes, and Walker (2016), which effectively leads to the application of Bayes’ rule using a loss-based pseudo-likelihood, naïvely plugging such a pseudo-likelihood into standard Bayesian fitting methods can lead to inaccurate quantile estimates and inadequate coverage of the corresponding credible intervals as discussed, for instance, in Yang, Wang, and He (2016) and Sriram (2015). To avoid such issues, qgam implements the calibrated Bayesian methods of Fasiolo et al. (2021b), which explicitly aim at selecting the “learning rate” tuning parameter of the loss so as to achieve near-nominal frequentist coverage of the quantile credible intervals. Furthermore, qgam bases quantile regression on a smoothed version of the pinball loss, which enables the adoption of fast maximum a posteriori (MAP) and empirical Bayes methods to estimate the regression coefficients and select their prior variance hyper-parameters, respectively. The smoothness of the new loss is determined by minimizing the asymptotic mean squared error (MSE) of the estimated regression coefficients, approximated using a location-scale GAM.
In the paper they reference, they go into mathematical detail on the smoothed loss that is beyond my understanding of mathematical statistics (pp. 1404-1406), but they summarize it a bit here:
The results in this section proved that the pinball loss $(\lambda = 0)$ is not optimal, and that we should use the ELF loss with smoothness determined by $\tilde{h^*}$. This substitution of the smoothed loss greatly simplifies computation as it permits the use of smooth optimization methods for estimation.
So, at least on the surface, the loss appears to differ in some way, but how that affects something like $R^2$, $R^1$, or $D^2$ is beyond my understanding.
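For what it's worth, my (possibly oversimplified) reading of the paper is that the two losses compare roughly as follows, with the smoothed ELF loss recovering the pinball loss as $\lambda \to 0$; I have dropped the paper's scale parameter $\sigma$ for readability, so treat this as a sketch of the idea rather than their exact definition:
$$ \rho_\tau(z) = z\,\big(\tau - \mathbb{1}(z < 0)\big), \qquad \tilde{\rho}_{\tau,\lambda}(z) = (\tau - 1)\,z + \lambda \log\!\big(1 + e^{z/\lambda}\big). $$
If that reading is right, the smoothing mainly matters for estimation, and a pinball-loss-based ratio computed afterwards from the fitted values (as in Edit 3 below) should still be well defined.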
Edit 3
I find Dave's answer to be very useful. For my own sanity, I wanted to make sure my application of what he said makes sense, so I provide some code here which I believe takes what he said and applies it to the QGAM I added earlier (though here I just use a single quantile to keep things simple).
#### Load Libraries ####
library(qgam)
library(MASS)
library(quantreg)
#### Fit Single Quantile QGAM ####
fit <- qgam(
  accel ~ s(times),
  data = mcycle,
  qu = .25
)

#### Fit Quantile-Only QREG ####
fit.null <- rq(
  accel ~ 1,
  data = mcycle,
  tau = .25
)
#### Pinball Loss Function ####
pinball_loss <- function(y, yhat, tau = 0.5){
  if (length(y) != length(yhat)){
    stop("y and yhat must have equal lengths")
  }
  res <- y - yhat
  # tau * |residual| above the fitted quantile, (1 - tau) * |residual| below
  sum(ifelse(res >= 0, tau * res, (tau - 1) * res))
}
#### Save Y and Y Hats ####
y <- mcycle$accel
yhat.null <- fitted(fit.null)
yhat.fit <- fitted(fit)

#### Calculate Pinball Loss ####
mod.interest <- pinball_loss(y, yhat.fit, .25)
mod.null <- pinball_loss(y, yhat.null, .25)
mod.interest # much lower loss (more predictive power)
mod.null # high loss (less predictive power)

#### Get D2 Value ####
d2 <- 1 - (mod.interest / mod.null)
d2
It appears this gets exactly what I want, but I want to make sure that this is correct before accepting any answers.
As a side note, it appears you cannot estimate an intercept-only model in qgam. That probably doesn't matter here, but I found it an interesting difference between quantreg and qgam.
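One workaround, if I'm reasoning correctly: the intercept-only fit at quantile $\tau$ is just a constant minimizing the pinball loss, i.e. a sample $\tau$-quantile, so the quantreg baseline can be approximated by hand (up to how quantile() interpolates between order statistics):
#### Baseline Without quantreg (Sketch) ####
yhat.null2 <- rep(quantile(mcycle$accel, .25), nrow(mcycle))
pinball_loss(mcycle$accel, yhat.null2, .25) # should be close to mod.null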

Comments
$$ R^2 = 1 - \dfrac{\text{Pinball loss of model of interest}}{\text{Pinball loss of baseline model}} $$
My main concerns beyond what I stated are somewhat specific to what you said: in a multiple-predictor case, I am not sure how to calculate the $D^2$ statistic, particularly while accounting for things like spline complexity (I'm not sure if that matters or not, just my thoughts). – Shawn Hemelstrand Oct 24 '23 at 03:28
quantreg or qrnn can calculate pinball loss for your model (and an intercept-only model to give the $D^2$ denominator), and there's always the option to write your own pinball loss function. I plan to summarize these comments in an answer I'll post tomorrow. – Dave Oct 25 '23 at 02:42
There is a pinLoss function within the qgam package, but it's not immediately clear how to use it with respect to models estimated by QGAMs. See pp. 18-19 of the package doc here: https://cran.r-project.org/web/packages/qgam/qgam.pdf – Shawn Hemelstrand Oct 25 '23 at 05:03
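Following up on that last comment, here is a minimal sketch of how I'd expect pinLoss to slot in, assuming (per pp. 18-19 of the manual) it takes the response, the fitted quantiles, and the target quantile; I haven't verified the argument order beyond that:
#### Sketch: D2 via qgam::pinLoss ####
loss.fit <- pinLoss(y, yhat.fit, qu = .25)
loss.null <- pinLoss(y, yhat.null, qu = .25)
1 - loss.fit / loss.null # should match d2 from Edit 3 if the conventions agree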