
If we take the R-squared for a GLM to be a comparison of deviances between models (the model of interest, the saturated model, and the constant model), we can write it as (see this answer, CC BY-SA 4.0):

$$R_{GLM}^2 = 1-\frac{D_{RES}}{D_{TOT}} = \frac{D_{REG}}{D_{TOT}}.$$
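As a concrete illustration, here is a minimal sketch of this computation for a Gaussian GLM; the synthetic data, variable names, and use of `statsmodels` are my own assumptions, not part of the original derivation:

```python
# Sketch: deviance-based R^2 for a Gaussian GLM (illustrative data only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = sm.add_constant(x)                      # intercept + one predictor (p = 1)
fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()

D_res = fit.deviance                        # residual deviance, D_RES
D_tot = fit.null_deviance                   # null deviance, D_TOT
r2_glm = 1 - D_res / D_tot                  # = D_REG / D_TOT

# For the Gaussian family this coincides with the usual OLS R^2
r2_ols = 1 - np.sum((y - fit.fittedvalues) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2_glm, r2_ols)
```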

Deviance is expressed as comparison of log-likelihoods like:

$$\begin{aligned} \text{Null Deviance:} &\quad D_{TOT} = 2(\hat{\ell}_{S} - \hat{\ell}_0), \\[6pt] \text{Explained Deviance:} &\quad D_{REG} = 2(\hat{\ell}_{p} - \hat{\ell}_0), \\[6pt] \text{Residual Deviance:} &\quad D_{RES} = 2(\hat{\ell}_{S} - \hat{\ell}_{p}). \end{aligned}$$

In these expressions, $\hat{\ell}_S$ is the maximised log-likelihood under the saturated model (one parameter per data point), $\hat{\ell}_0$ is the maximised log-likelihood under the null model (intercept only), and $\hat{\ell}_{p}$ is the maximised log-likelihood under the model of interest (intercept plus $p$ coefficients).
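These three quantities are consistent with the ratio above: adding the pieces makes the saturated-model term cancel,

$$D_{REG} + D_{RES} = 2(\hat{\ell}_{p} - \hat{\ell}_0) + 2(\hat{\ell}_{S} - \hat{\ell}_{p}) = 2(\hat{\ell}_{S} - \hat{\ell}_0) = D_{TOT},$$

so that $1 - D_{RES}/D_{TOT} = D_{REG}/D_{TOT}$ as stated.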

The Gaussian log-likelihoods, assuming the variance of the response is constant across observations but explicitly allowed to differ between models, are as follows.

$$\begin{cases} \hat{\ell}_{p} = - \frac{n}{2} \log{\left(2 \pi\sigma_p^2\right)} - \frac{ \sum_{i=1}^{n}{\left(y_i - \hat y_i\right)}^2}{2 \sigma_p^2}, \\[6pt] \hat{\ell}_{0} = - \frac{n}{2} \log{\left(2 \pi\sigma_0^2\right)} - \frac{ \sum_{i=1}^{n}{\left(y_i - \bar y\right)}^2}{2 \sigma_0^2}, \\[6pt] \hat{\ell}_{S} = - \frac{n}{2} \log{\left(2 \pi\sigma_S^2\right)}. \end{cases}$$

(The saturated model fits each observation exactly, so its residual sum of squares is zero and only the normalising term remains.)
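The role of the per-model variances can be checked numerically; the sketch below (illustrative data of my own, with plug-in MLE variances $\mathrm{RSS}_k/n$ for each model and one arbitrarily chosen shared variance) previews the cancellation discussed next:

```python
# Sketch: Gaussian log-likelihoods for the fitted and null models,
# with per-model MLE variances vs. one shared variance (illustrative data).
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Model of interest (intercept + slope) and null model (intercept only)
beta = np.polyfit(x, y, deg=1)
y_hat = np.polyval(beta, x)
rss_p = np.sum((y - y_hat) ** 2)          # residual SS under the fitted model
rss_0 = np.sum((y - y.mean()) ** 2)       # residual SS under the null model

def gaussian_loglik(rss, sigma2, n):
    """Maximised Gaussian log-likelihood given a residual sum of squares."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

# Per-model plug-in MLE variances: sigma_p^2 and sigma_0^2 differ in general
l_p_own = gaussian_loglik(rss_p, rss_p / n, n)
l_0_own = gaussian_loglik(rss_0, rss_0 / n, n)

# One shared variance for both models (here the fitted model's MLE, arbitrarily)
sigma2 = rss_p / n
l_p_shared = gaussian_loglik(rss_p, sigma2, n)
l_0_shared = gaussian_loglik(rss_0, sigma2, n)

# The log-variance terms cancel only when the variance is shared:
print(2 * (l_p_shared - l_0_shared), (rss_0 - rss_p) / sigma2)  # equal
print(2 * (l_p_own - l_0_own), (rss_0 - rss_p) / sigma2)        # not equal
```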

If we assume a common variance $\sigma_p^2=\sigma_0^2=\sigma_S^2=\sigma^2$ across the three models, we retrieve the usual definition of $R^2$, because the log-variance terms cancel out in the subtraction for each deviance term:

$$\begin{aligned} D_{TOT} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{\sigma^2} \, SS_{TOT}, \\[6pt] D_{REG} &= \frac{1}{\sigma^2} \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \frac{1}{\sigma^2} \, SS_{REG}, \\[6pt] D_{RES} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{\sigma^2} \, SS_{RES}. \end{aligned}$$
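Dividing, the common variance cancels as well, so the deviance-based ratio reduces to the familiar sums-of-squares form:

$$R_{GLM}^2 = 1 - \frac{D_{RES}}{D_{TOT}} = 1 - \frac{SS_{RES}/\sigma^2}{SS_{TOT}/\sigma^2} = 1 - \frac{SS_{RES}}{SS_{TOT}}.$$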

As outlined above, the only way to make the variances vanish from the deviances is to assume they are the same across models, i.e., that the models estimate expectations only, with the variance treated as independent of those expectations.

Can this assumption be justified, however? Why should we assume the variance is the same? For the constant model, for example, in realistic scenarios we are almost guaranteed to obtain a different variance estimate than under the model of interest. I understand that in this class of models we are mostly interested in the location parameters, but I can see this becoming a problem for other classes of models.

This can be very problematic when computing R-squared on new samples (see, for example, the discussion around this question and other linked questions), and the issue should also arise for other distribution choices in GLMs, with single-parameter distributions (Bernoulli, Binomial with known $n$, Multinomial with known $n$, etc.) being the exception.

Firebug
  • I don't understand what you refer to by "assumption," because in analyses of deviance there might not even be any well-defined variances. So what does "variance be the same" even refer to? And "same" compared to what? – whuber Nov 28 '23 at 14:58
  • I believe we are assuming that the variance term for the likelihood of each model is the same @whuber, otherwise they would not cancel/factorize out in the Deviance – Firebug Nov 28 '23 at 15:10
  • By breaking down the deviance in this way, aren't we assuming there is just one (common) variance to estimate? The likelihood would be different if we allowed for multiple variances, right? – Dave Nov 28 '23 at 15:12
  • Only special models even have any kind of "variance term." I cannot see anything that would qualify as a "variance" in your first collection of expressions involving deviances. If you're only trying to ask about the Gaussian case, then why introduce deviances at all? Your question seems to be about what the standard assumptions are in linear regression with Gaussian conditional responses. – whuber Nov 28 '23 at 15:18
  • As far as I understand, even Gaussian GLMs have a variance term, but due to the orthogonality between parameters we often choose not to estimate it, though we can if we'd like. The only way for the variance to vanish is if we assume the variance in the three models is the same. – Firebug Nov 28 '23 at 15:31
  • Your use of "even" is puzzling. Consider, say, a logistic regression. Where is the "variance term" you refer to? – whuber Nov 28 '23 at 16:11
  • In the Bernoulli/binomial distribution a single parameter encodes both location and scale, so the likelihood is fully specified and thus not affected by the topic of this question. This is not the case for Gaussian, quasi-binomial, quasi-Poisson, lognormal, etc., and also not the case for Beta-Binomial, Beta, Dirichlet, Dirichlet-Multinomial. – Firebug Nov 28 '23 at 16:24
  • Your question remains unclear, because so far you haven't offered any response to my queries about what you might mean by "variance term" and how it's related, if at all, to your mentions of deviance. – whuber Nov 28 '23 at 17:26
  • I think it was clear enough: deviance comes from a comparison of log likelihoods; likelihoods in location-scale distributions will have a scale parameter analogue; in the example of the Gaussian likelihood it's the variance; the deviance ignores that term – Firebug Nov 29 '23 at 08:42
  • Your latest comment contains the very first occurrence of "location-scale" in this thread. If that's what you're interested in, then please make that explicit in your question. – whuber Nov 29 '23 at 15:32
  • I'm interested in the variance for each model in the computation of the Deviance for the case of the Gaussian likelihood. – Firebug Nov 29 '23 at 16:01
  • "By breaking down the deviance in this way, aren't we assuming there is just one (common) variance to estimate? The likelihood would be different if we allowed for multiple variances, right?", yes, and that's the crux of the question @Dave . How to justify that assumption when we already know the variances are almost guaranteed to be different? I can see how that assumption might be harmless for the saturated model, which is a idealized model anyways, but what about the null model, which we do actually estimate? – Firebug Nov 30 '23 at 10:39
  • I may have misread which variances are assumed equal, and I have deleted my answer until I wrap my head around that. $//$ Since an argument of mine is referenced in the OP, I do think I should respond to that in some way (which the second paragraph of my answer did). – Dave Mar 29 '24 at 14:52

0 Answers