Context: This question is about concepts discussed in Chapter 7: "Model Assessment and Selection" in The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2008) by T. Hastie, R. Tibshirani and J. Friedman.
ESL discusses whether model selection according to AIC is the same as model selection according to predictive performance on new data. The relationship holds exactly for additive models with squared error loss; it holds only approximately for other models, including logistic regression.
Eq. (7.29) leads to Eq. (7.30) only in the case of squared error loss and Gaussian likelihood with known variance. The $\hat{\sigma}^2_\epsilon$ term is the error variance of the Gaussian model and has no counterpart in logistic regression.
For an additive error model $Y = f(X) + \epsilon$ with $d$ parameters ($d$ inputs and/or basis functions) fit under squared error loss (basically, linear regression), the $C_p$ statistic is given by:
$$
C_p = \overline{\text{err}} + 2\frac{d}{N}\sigma^2_{\epsilon}
$$
where $\overline{\text{err}}$ is the training error averaged over $N$ training examples.
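To make the formula concrete, here is a minimal NumPy sketch (my own illustration, not from the book) that computes $C_p$ for a single least-squares fit; the simulated data-generating setup and all variable names are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated additive-error data y = f(x) + eps; the setup is illustrative only
N, d = 100, 5
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
sigma_eps = 1.0  # true noise sd, known here by construction
y = X @ beta + rng.normal(scale=sigma_eps, size=N)

# Least-squares fit and the average training error err-bar
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
err_bar = np.mean((y - X @ beta_hat) ** 2)

# C_p = err-bar + 2 * (d / N) * sigma^2_eps
C_p = err_bar + 2 * (d / N) * sigma_eps**2
print(f"err-bar = {err_bar:.3f}, C_p = {C_p:.3f}")
```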
Now say we are considering a set of Gaussian models, all of the form $Y = f_\alpha(X) + \epsilon$, where $\alpha$ is a tuning/hyper-parameter and the errors are iid Normal with mean 0 and variance $\sigma^2_\epsilon$. Each model has effective number of parameters $d(\alpha)$. So a model with splines has a larger $d(\alpha)$ than a model that is linear in all variables, even if both models take exactly the same inputs.
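For instance (continuing the sketch, and using a polynomial basis as a simple stand-in for splines), two design matrices built from exactly the same input can have different $d$:

```python
# Two models on exactly the same input x: a straight-line fit has d = 2
# parameters (intercept + slope); a cubic basis expansion of x has d = 4.
x = rng.normal(size=N)
X_linear = np.column_stack([np.ones(N), x])              # d = 2
X_cubic = np.column_stack([np.ones(N), x, x**2, x**3])   # d = 4
```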
We estimate the error variance $\sigma^2_\epsilon$ from the "largest" (with most parameters) model in the set. Due to the bias-variance tradeoff we expect the largest model to have the smallest bias. From this point on we treat $\hat{\sigma}^2_{\epsilon}$ as known: even though we obtain the estimate from one specific model, we assume that all models in the set have the same error variance $\hat{\sigma}^2_{\epsilon}$.
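Continuing the sketch, one standard way to estimate the error variance from the largest model is $\hat{\sigma}^2_\epsilon = \mathrm{RSS}/(N - d_{\max})$; the $N - d_{\max}$ normalization is my choice of the usual unbiased estimator, not something ESL prescribes here:

```python
# Estimate sigma^2_eps from the largest model in the set: residual sum of
# squares over (N - d_max); unbiased when the largest model has negligible bias.
d_max = X.shape[1]
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_full = np.sum((y - X @ beta_full) ** 2)
sigma2_hat = rss_full / (N - d_max)  # treated as known for every model below
```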
Now that we are working with a set of models indexed by $\alpha$, we modify the formula above accordingly:
$$
\operatorname{AIC}(\alpha) = \overline{\text{err}}(\alpha) + 2\frac{d(\alpha)}{N}\hat{\sigma}^2_{\epsilon}
$$
We should probably call this $C_p(\alpha)$, but, to make difficult concepts even harder, ESL refers to $C_p$ and AIC "collectively" as AIC. So keep in mind that the formula for $\operatorname{AIC}(\alpha)$ is derived from the $C_p$ statistic.
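Continuing the sketch, we can trace $\operatorname{AIC}(\alpha)$ over a nested family in which model $\alpha$ simply uses the first $\alpha$ columns of `X`, so that $d(\alpha) = \alpha$ (the nesting is an assumption of the example):

```python
# AIC(alpha) over a nested family: model alpha uses the first alpha columns
# of X, so d(alpha) = alpha; sigma2_hat comes from the largest model above.
def aic(alpha):
    Xa = X[:, :alpha]
    b, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return np.mean((y - Xa @ b) ** 2) + 2 * (alpha / N) * sigma2_hat

alphas = range(1, X.shape[1] + 1)
best_alpha = min(alphas, key=aic)
print({a: round(aic(a), 3) for a in alphas}, "-> best alpha:", best_alpha)
```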
The math simplifies nicely when the likelihood is Gaussian with known variance and the loss function is squared error.
$$
\begin{aligned}
\frac{\operatorname{AIC}(\alpha)}{\hat{\sigma}^2_\epsilon}
&= \frac{\overline{\text{err}}(\alpha)}{\hat{\sigma}^2_\epsilon} + 2\frac{d(\alpha)}{N} \\
&= \frac{2}{N}\sum_{i=1}^N\frac{\left(y_i-\hat{f}_\alpha(x_i)\right)^2}{2\hat{\sigma}^2_\epsilon} + 2\frac{d(\alpha)}{N} \\
&= \operatorname{const} -\frac{2}{N}\operatorname{loglik}(\alpha) + 2\frac{d(\alpha)}{N}
\end{aligned}
$$
where the constant depends on $\hat{\sigma}^2_\epsilon$ but not on $\alpha$. That's Eq. (7.29) in the last line.
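To make the constant explicit, write out the Gaussian log-likelihood with known variance $\hat{\sigma}^2_\epsilon$:

$$
\operatorname{loglik}(\alpha) = -\frac{N}{2}\log\left(2\pi\hat{\sigma}^2_\epsilon\right) - \sum_{i=1}^N\frac{\left(y_i-\hat{f}_\alpha(x_i)\right)^2}{2\hat{\sigma}^2_\epsilon},
$$

so that

$$
\frac{2}{N}\sum_{i=1}^N\frac{\left(y_i-\hat{f}_\alpha(x_i)\right)^2}{2\hat{\sigma}^2_\epsilon} = -\frac{2}{N}\operatorname{loglik}(\alpha) - \log\left(2\pi\hat{\sigma}^2_\epsilon\right),
$$

i.e. $\operatorname{const} = -\log\left(2\pi\hat{\sigma}^2_\epsilon\right)$.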
We've shown that model selection with $\operatorname{AIC}(\alpha)$ in Eq. (7.30) is equivalent to model selection with $\operatorname{AIC}$ in Eq. (7.29) in the case of Gaussian likelihood with known variance and squared error loss: dividing by $\hat{\sigma}^2_\epsilon > 0$ and adding a constant do not change which $\alpha$ minimizes the criterion.
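As a numerical sanity check (same illustrative sketch as above), the likelihood-based form of Eq. (7.29) selects the same $\alpha$ as $\operatorname{AIC}(\alpha)$:

```python
import math

# Eq. (7.29)-style criterion: -2/N * loglik + 2 * d/N, with the Gaussian
# log-likelihood evaluated at the fixed variance sigma2_hat.
def aic_loglik(alpha):
    Xa = X[:, :alpha]
    b, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    rss = np.sum((y - Xa @ b) ** 2)
    loglik = -N / 2 * math.log(2 * math.pi * sigma2_hat) - rss / (2 * sigma2_hat)
    return -2 / N * loglik + 2 * alpha / N

# Same argmin: the two criteria differ by a positive scale and a constant.
assert min(alphas, key=aic_loglik) == min(alphas, key=aic)
```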
Otherwise the relationship between AIC and expected error is approximate, not exact.
Chapter 7 mentions this several times. For example, Figure 7.4 is about fitting logistic regression with log-likelihood (i.e., entropy) loss and with 0-1 loss. The caption explains:
> Although the AIC formula does not strictly apply here, it does a reasonable job.

(Figure 7.4 caption, The Elements of Statistical Learning)
In the left panel (entropy loss), AIC agrees well with a test-sample estimate of model error for all models but the extremely over-parametrized one; the link between AIC and model performance falls apart in that extreme case. In the right panel (0-1 loss), AIC is not as good an estimate of model error (the green line doesn't track the blue line closely), but you would still select a good model by minimizing AIC (the one with 16 basis functions); the best model according to the test-sample error is the one with 32 basis functions.