Equivalence of AIC and p-values in model selection

Question

In a comment to the answer of this question, it was stated that using AIC in model selection was equivalent to using a p-value of 0.154.

I tried it in R, using a "backward" subset selection algorithm to drop variables from a full specification in two ways: first, by sequentially dropping the variable with the highest p-value and stopping when all p-values are below 0.154; second, by dropping the variable whose removal gives the lowest AIC, until no further improvement can be made.

It turned out that they give roughly the same results when I use a p-value of 0.154 as threshold.

Is this actually true? If so, does anyone know why or can refer to a source which explains it?

P.S. I couldn't ask the person commenting or write a comment myself, because I just signed up. I am aware that this is not the most suitable approach to model selection, inference, etc.

(1) Prognostic modeling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Statistics in Medicine, 19, 1059-1079. (2) True for variables with 1 df, based on the AIC definition, but the equivalent p-value could be lower if the variables have more degrees of freedom. – charles, Mar 07 '14 at 21:35

score 23 · Accepted Answer · answered Mar 07 '14 at 21:09

Variable selection done using statistical testing or AIC is highly problematic. If using $\chi^2$ tests, AIC uses a cutoff of $\chi^2 = 2.0$ for a single-parameter (1 d.f.) variable, which corresponds to $\alpha = 0.157$. AIC when used on individual variables does nothing new; it just uses a more reasonable $\alpha$ than 0.05. A more reasonable (less inference-disturbing) $\alpha$ is 0.5.

Frank Harrell

+1 I spent so long constructing my (now deleted) answer, I didn't even see this one had posted in the meantime. I would have just voted this one up instead. – Glen_b Mar 07 '14 at 21:38
What does "AIC when used on individual variables" mean? I understand how AIC can be seen to be equivalent to a p-value calculated using the likelihood-ratio test, on individual nested models. I assumed that it was related to what the OP said about "throwing out the variable with the highest p-values..." But in this case the p-values seem to relate to whether the predictor has a relationship with the response variable. – Alex Apr 26 '23 at 20:45

retodomax · Answer 2 · 2023-02-02T12:30:49.180

Some more detail on the excellent answer by @Frank Harrell:

The test statistic of a Likelihood-ratio test (LRT) is defined as (Wikipedia)

$$ \lambda_{\text{LR}} = -2(\ell_0 - \ell_A) $$

where $\ell_i$ is the log likelihood of model $i$. Under $H_0$, with $q$ the number of parameters constrained under the null model, asymptotically

$$ \lambda_{\text{LR}} \overset{a}{\sim} \chi^2_q. $$

The AIC is defined as (Wikipedia)

$$ \text{AIC} = 2k - 2\ell $$

where $k$ is the number of estimated parameters and $\ell$ is the log likelihood.

The difference in AIC between the two models (let's say Model $0$ and Model $A$ where the difference in the number of free parameters is $q$) is given by

\begin{align*} \Delta\text{AIC} &= 2k_0 - 2\ell_0 - (2k_A - 2\ell_A) \\ &= -2q - 2(\ell_0 - \ell_A). \end{align*}

Therefore,

$$ \Delta\text{AIC} + 2q = \underbrace{-2(\ell_0 - \ell_A)}_{\lambda_{\text{LR}}}. $$

This shows a direct association between AIC and LRT.

- The AIC of both models will be equal if $\lambda_{\text{LR}} = 2q$.
- The AIC of the null model will be smaller if $\lambda_{\text{LR}} < 2q$.
- The AIC of the alternative model will be smaller if $\lambda_{\text{LR}} > 2q$.
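The three cases above can be checked numerically. The following is a minimal Python sketch; the log-likelihoods and parameter counts are made-up illustrative values, and the helper names `aic` and `compare_by_aic` are hypothetical, not part of any library.

```python
def aic(loglik, k):
    """AIC = 2k - 2 * log-likelihood."""
    return 2 * k - 2 * loglik


def compare_by_aic(loglik_0, k_0, loglik_a, k_a):
    """Return (delta_aic, lambda_lr, q) for nested models 0 (null) and A."""
    q = k_a - k_0                                    # extra free parameters in A
    delta_aic = aic(loglik_0, k_0) - aic(loglik_a, k_a)
    lambda_lr = -2 * (loglik_0 - loglik_a)           # LRT statistic
    return delta_aic, lambda_lr, q


# Hypothetical fits: null model with 3 parameters, alternative with 5.
delta_aic, lambda_lr, q = compare_by_aic(-105.0, 3, -102.5, 5)

# The identity Delta AIC + 2q = lambda_LR holds by construction.
identity_holds = abs(delta_aic + 2 * q - lambda_lr) < 1e-12

# "AIC prefers the alternative" is the same statement as lambda_LR > 2q.
aic_prefers_alternative = delta_aic > 0
lrt_exceeds_threshold = lambda_lr > 2 * q
print(delta_aic, lambda_lr, q, aic_prefers_alternative, lrt_exceeds_threshold)
```

Whatever values are plugged in, the two booleans always agree, which is the sense in which AIC selection is an implicit LRT.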

If we select models by AIC we implicitly apply an LRT and check whether $\lambda_{\text{LR}}$ is larger or smaller than $2q$. The $2q$ threshold corresponds to a specific p-value of the LRT, which can be calculated in R using `pchisq(2*q, df = q, lower.tail = FALSE)`. The following table lists the p-values for some values of $q$.

$$ \begin{array}{rrr} \hline q & \lambda_{\text{LR}} & p\text{-value} \\ \hline 1 & 2 & 0.157 \\ 2 & 4 & 0.135 \\ 3 & 6 & 0.112 \\ 5 & 10 & 0.075 \\ 10 & 20 & 0.029 \\ 20 & 40 & 0.005 \\ \end{array} $$
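For readers without R at hand, the table can be reproduced with the Python standard library alone. This is a sketch, not a replacement for `pchisq`: the helper `chi2_sf` is written here for illustration and assumes an integer number of degrees of freedom. It uses the recurrence $Q(a+1, y) = Q(a, y) + y^a e^{-y} / \Gamma(a+1)$ for the regularized upper incomplete gamma function, starting from $Q(1/2, y) = \operatorname{erfc}(\sqrt{y})$ or $Q(1, y) = e^{-y}$.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-square with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 1:
        a = 0.5
        tail = math.erfc(math.sqrt(y))   # Q(1/2, y)
    else:
        a = 1.0
        tail = math.exp(-y)              # Q(1, y)
    # Step a up to df / 2 using Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# AIC threshold lambda_LR = 2q and the LRT p-value it corresponds to.
for q in (1, 2, 3, 5, 10, 20):
    print(q, 2 * q, round(chi2_sf(2 * q, q), 3))
```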

For example, selecting between two models based on AIC, where the nested model has 3 parameters constrained compared to the alternative one, is equivalent to performing an LRT and rejecting the null model at a significance level of $0.112$.

A similar association can be made between LRT and BIC which is defined as (Wikipedia)

$$ \text{BIC} = \log(n)k - 2\ell $$

where $n$ is the number of observations, $k$ is the number of estimated parameters, and $\ell$ is the log likelihood. Using the same approach as above we see that

$$ \Delta\text{BIC} + \log(n)q = \underbrace{-2(\ell_0 - \ell_A)}_{\lambda_{\text{LR}}}. $$
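Spelled out, the intermediate step mirrors the AIC derivation, again with $q = k_A - k_0$:

\begin{align*} \Delta\text{BIC} &= \log(n)k_0 - 2\ell_0 - (\log(n)k_A - 2\ell_A) \\ &= -\log(n)q - 2(\ell_0 - \ell_A). \end{align*}

So BIC prefers the alternative model exactly when $\lambda_{\text{LR}} > \log(n)q$, a threshold that grows with the sample size.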

In the following you find a table which lists corresponding p-values of the LRT for different values of $n$ and $q$

$$ \begin{array}{rrrr} \hline n & q & \lambda_{\text{LR}} & p\text{-value} \\ \hline 10 & 1 & 2.30 & 0.1292 \\ 10 & 2 & 4.61 & 0.1000 \\ 10 & 3 & 6.91 & 0.0749 \\ 100 & 1 & 4.61 & 0.0319 \\ 100 & 2 & 9.21 & 0.0100 \\ 100 & 3 & 13.82 & 0.0032 \\ \end{array} $$
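Two cases of this table have simple closed forms, which makes the dependence on $n$ easy to see: for $q = 1$ the upper tail of $\chi^2_1$ is $\operatorname{erfc}(\sqrt{x/2})$, and for $q = 2$ the upper tail of $\chi^2_2$ is $e^{-x/2}$, so the implied significance level is exactly $1/n$. A small Python sketch covering only these two cases (the function name `bic_implied_alpha` is hypothetical):

```python
import math


def bic_implied_alpha(n, q):
    """LRT significance level implied by BIC, for q = 1 or q = 2 only."""
    threshold = math.log(n) * q                        # lambda_LR cutoff at Delta BIC = 0
    if q == 1:
        return math.erfc(math.sqrt(threshold / 2.0))   # chi-square(1) upper tail
    if q == 2:
        return math.exp(-threshold / 2.0)              # chi-square(2) upper tail, equals 1/n
    raise ValueError("this sketch only covers q = 1 or q = 2")


for n in (10, 100, 1000):
    print(n, round(bic_implied_alpha(n, 1), 4), round(bic_implied_alpha(n, 2), 4))
```

Unlike AIC, whose implied level for a given $q$ is fixed, the level implied by BIC shrinks as $n$ grows.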
