
Reading "An Introduction to Statistical Learning" (by James, Witten, Hastie and Tibshirani), on p.211 I came across the following formula for BIC in case of linear regression:

$ BIC = \frac{1}{n \hat{\sigma}^2} \left[ RSS + (\log{n}) d \hat{\sigma}^2 \right] $

up to a constant, and similarly for AIC. Here, $n$ is the sample size and $d$ is the number of parameters. This seems contrary to the more popular formulation, where

$BIC = n \log(\hat{\sigma}^2) + d \log{n}$

The most obvious difference is that RSS in the first formula is not logged. Is the first formula wrong or am I missing something?

PA6OTA

1 Answer


There is no error, but there is a subtlety. Note: in the second edition of ISLR, model selection is discussed on pages 232-235 [1].

Let's start by deriving the log-likelihood for linear regression as it's at the heart of this question.

The likelihood is a product of Normal densities. Evaluated at the MLE: $$ \hat{L} = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\hat{\sigma}^2}}\exp\left\{-\frac{(y_i - \hat{y}_i)^2}{2\hat{\sigma}^2}\right\} $$ where $n$ is the number of data points and $\hat{y}_i$ is the prediction, so $y_i - \hat{y}_i$ is the residual.

We take the log and keep track of constants as they are important later on. $$ \log(\hat{L}) = -\frac{n}{2}\log(2\pi\hat{\sigma}^2) - \sum_{i=1}^n \frac{(y_i - \hat{y}_i)^2}{2\hat{\sigma}^2} = -\frac{n}{2}\log(2\pi\hat{\sigma}^2) - \frac{RSS}{2\hat{\sigma}^2} $$ where RSS is the residual sum of squares.
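As a quick numerical check (a minimal sketch with simulated data; the data and variable names are mine, not from ISLR), the closed-form expression above agrees with summing the Normal log-densities directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS / ML estimates of the coefficients
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / n                              # MLE of the error variance (derived just below)

# closed form: -n/2 * log(2*pi*sigma2) - RSS / (2*sigma2)
loglik_closed = -n / 2 * np.log(2 * np.pi * sigma2_hat) - rss / (2 * sigma2_hat)
# direct sum of Normal log-densities evaluated at the fitted values
loglik_direct = stats.norm.logpdf(y, loc=X @ beta_hat, scale=np.sqrt(sigma2_hat)).sum()

print(np.isclose(loglik_closed, loglik_direct))   # True
```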

What about the MLE $\hat{\sigma}^2$ of the error variance $\sigma^2$? It's also a function of the RSS.

$$ \hat{\sigma}^2 = \frac{RSS}{n} $$
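For completeness, here is the intermediate step (not spelled out above): differentiate the log-likelihood with respect to $\sigma^2$ and set the derivative to zero.

$$ \frac{\partial \log(L)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{RSS}{2\sigma^4} = 0 \quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{RSS}{n} $$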

And here is the subtle point. For model selection with AIC and BIC ISLR uses the $\hat{\sigma}^2$ from the full model to compare all nested models. Let's call this residual variance $\hat{\sigma}^2_{full}$ for clarity.

Finally, we write down the Bayesian information criterion (BIC). Here $d$ is the number of fixed effects, i.e. the regression coefficients.

$$ BIC = -2 \log(\hat{L}) + \log(n)d = n\log(2\pi\hat{\sigma}^2_{full}) + \frac{RSS}{\hat{\sigma}^2_{full}} + \log(n)d \\ = c_0 + c_1\left(RSS + \log(n)d\hat{\sigma}^2_{full}\right) $$

This is Equation (6.3) in ISLR up to two constants, $c_0$ and $c_1=\hat{\sigma}^{-2}_{full}$, that are the same for all models under consideration. ISLR also divides the BIC by the sample size $n$.
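To see the effect of the shared $\hat{\sigma}^2_{full}$ concretely, here is a minimal sketch with simulated data (the data, the nested-model sequence and the variable names are mine, not from ISLR). The two quantities differ only by constants across the candidate models, so they rank the models identically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def rss_of(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

# sigma^2 estimated once, from the full model, and shared by all candidate models
sigma2_full = rss_of(X_full) / n

for d in range(1, p + 2):              # nested models built from the first d columns
    rss = rss_of(X_full[:, :d])
    # likelihood-based BIC with sigma^2_full plugged in
    bic_lik = n * np.log(2 * np.pi * sigma2_full) + rss / sigma2_full + np.log(n) * d
    # ISLR (6.3)-style criterion: (RSS + log(n) * d * sigma^2_full) / (n * sigma^2_full)
    bic_islr = (rss + np.log(n) * d * sigma2_full) / (n * sigma2_full)
    print(d, round(bic_lik, 2), round(bic_islr, 4))

# n * bic_islr = bic_lik - n * log(2*pi*sigma2_full), so the two criteria differ
# by a constant and always select the same model.
```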

What if we want to estimate $\sigma^2$ separately for each model? Then we plug in the MLE $\hat{\sigma}^2 = RSS/n$ and we get the "more popular" formulation. We add 1 to the number of parameters because we estimate the error variance in addition to the $d$ fixed effects.

$$ BIC = n\log(2\pi\hat{\sigma}^2) + \frac{RSS}{\hat{\sigma}^2} + \log(n)(d+1)\\ = n\log(2\pi RSS/n) + \frac{RSS}{RSS/n} + \log(n)(d+1)\\ = c^*_0 + n\log(RSS) + \log(n)(d+1) $$

The residual sum of squares RSS is the same in both versions of the BIC. [This is because the coefficient estimates $\hat{\beta} = (X'X)^{-1}X'Y$ and the predictions $X(X'X)^{-1}X'Y$ don't depend on $\sigma^2$.]
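And a short check of the per-model-variance version (again a sketch with simulated data of my own, not from the book): the BIC computed from the log-likelihood equals $n\log(RSS) + \log(n)(d+1)$ plus the constant $c^*_0 = n\log(2\pi/n) + n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.5, 1.0, 0.0, -2.0]) + rng.normal(size=n)
d = X.shape[1]                                 # number of mean parameters (incl. intercept)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta) ** 2)
sigma2_hat = rss / n                           # per-model MLE of the error variance

# BIC from the log-likelihood, counting the variance as one extra parameter
loglik = -n / 2 * np.log(2 * np.pi * sigma2_hat) - rss / (2 * sigma2_hat)
bic_lik = -2 * loglik + np.log(n) * (d + 1)

# "more popular" formulation plus the constant c0* = n*log(2*pi/n) + n
bic_popular = n * np.log(rss) + np.log(n) * (d + 1)
c0_star = n * np.log(2 * np.pi / n) + n

print(np.isclose(bic_lik, bic_popular + c0_star))   # True
```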

[1] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2nd edition, 2021. Available online.

dipetkov
  • Thank you a lot for the detailed answer! To summarize the main point for the quick reader: $\sigma^2 \approx \hat{\sigma}^2 = RSS/n$ is inserted into the formula for the log-likelihood of the model. Then the term $\frac{n}{2}\ln(\hat{\sigma}^2) = \frac{n}{2}\ln(RSS/n)$ remains, and the term $\frac{1}{\hat{\sigma}^2}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = n$ is a constant. Therefore a term $n\ln(RSS)$ remains. – Ggjj11 Jul 28 '22 at 09:03
  • The error variance $\sigma^2$ is a parameter just like the $\beta$s. We can either: (a) assume $\sigma^2$ is known and plug in a specific value, or (b) estimate $\sigma^2$ simultaneously with the mean structure parameters, the $\beta$s. The likelihood (and hence the AIC) is different in cases (a) and (b). – dipetkov Jul 28 '22 at 11:27