
Let's assume we have two linear regression models $m_1$ and $m_2$, where $m_2$ is nested in $m_1$, and two data sets $d_1$ and $d_2$ of different sizes.

Calculating the AIC for each model-data pair shows that the following inequality holds:

$$ AIC_{m_2,d_1} - AIC_{m_1,d_1} > AIC_{m_2,d_2} - AIC_{m_1,d_2} $$

Can we conclude from this inequality that the non-nested/full model $m_1$ "improves" on $m_2$ more on data set $d_1$ than on data set $d_2$?

The conclusion might not be valid because the AIC values are calculated on different data sets and therefore might not be comparable. However, since it is AIC differences rather than raw AIC values that are being compared, the conclusion might be valid. What is your take on this?

Funkwecker
  • Q: What happens when you divide the left hand side by the size of d1 and the right hand side by the size of d2? – jwimberley Dec 20 '16 at 14:48
  • No. The AIC is specific to a given data set; it is conceptually wrong to draw conclusions across data sets. One way you can "order" (compare, if you really want to) models across different data sets is to use the Kullback-Leibler divergence; a reference is Perlman (1983), The limiting behavior of multiple roots of the likelihood equation. – Henry.L Dec 20 '16 at 16:08

1 Answer


The AIC criterion scales with the overall size of the data set, and this is true for differences in AIC values as well. The criterion is based on the relationship
$$ -2 \, \mathrm{E}[\log \mathrm{Pr}_{\hat \theta}(Y)] \approx -\frac{2}{N} \, \mathrm{E}[\mathrm{loglik}] + \frac{2d}{N}, $$
where $d$ is the number of parameters in the likelihood function being maximized (Elements of Statistical Learning, equation 7.27). The term on the left is the expected out-of-sample "error" rate, using the log of the probability as the error metric. The right-hand side consists of the in-sample error rate estimated from the maximized log-likelihood, plus the term $2d/N$ correcting for the optimism of the maximized log-likelihood. The most important factor here is the $N$ in the denominator of the right-hand side. The AIC is typically defined as
$$ \mathrm{AIC} = -2 \, \mathrm{loglik} + 2d $$
(although the ESL textbook adds a $1/N$ factor). In this form, the AIC predicts $N$ times the out-of-sample error rate. To compare AIC differences from two samples, you should divide the AIC values by the respective sample sizes to put them on equal terms.
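As a rough illustration of this normalization (a minimal numpy sketch with simulated data, not part of the original answer; the `gaussian_aic` helper and the toy data-generating process are hypothetical), the raw AIC difference between two nested Gaussian linear models grows with the sample size, while the per-observation difference AIC$/N$ settles down:

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC of an ordinary least squares fit with Gaussian errors,
    using the maximum-likelihood estimate of the error variance."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
    return -2.0 * loglik + 2.0 * (p + 1)  # +1 parameter for the error variance

rng = np.random.default_rng(0)

def simulate(n):
    """Toy data from a single hypothetical process: y = 1 + 0.5 x + noise."""
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * x + rng.normal(size=n)
    return y, np.column_stack([np.ones(n), x])

for n in (100, 1000):
    y, X1 = simulate(n)   # m1: intercept + slope (full model)
    X2 = X1[:, :1]        # m2: intercept only (nested in m1)
    diff = gaussian_aic(y, X2) - gaussian_aic(y, X1)
    # The raw AIC difference grows roughly linearly with n,
    # while the per-observation difference diff / n stabilizes.
    print(f"n={n:5d}  AIC diff={diff:9.1f}  AIC diff / n={diff / n:6.3f}")
```

In this sketch the same data-generating process is used for both sample sizes, which is exactly the assumption debated in the comments below.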

jwimberley
  • I disagree; dividing by sample size will not solve any problem. Consider the Gaussian case, where the AIC is equivalent to F-tests for nested models: adjusting for sample size is known to be problematic, as with adjusted R^2. Read my comments above. – Henry.L Dec 20 '16 at 16:10
  • In fact, ESL suggests this adjustment only when sample size is the major concern, and special structure in the data set (say, clusters) will easily break this criterion. – Henry.L Dec 20 '16 at 16:11
  • @Henry.L I meant only that dividing by $N$ was the minimal step necessary for this comparison (it's absolutely better than the comparison without the correction), but I see your point that there are probably bigger problems involved in making this comparison using the AIC. – jwimberley Dec 20 '16 at 16:30
  • @CagdasOzgenc You probably misunderstand my point. I do not think the question is even relevant to sample size. The point is that you cannot compare two models fitted to different data sets using any information criterion. If my memory is correct, Christopher has discussed this question in his IMS monograph (Small Sample Asymptotics). Thanks for the input and happy holiday! – Henry.L Dec 24 '16 at 13:41
  • @Henry.L I cannot attest to conditional distributions, but for unconditional distributions I claim that you can work with two different samples (possibly of different lengths) generated by the same true data process, provided that the sample size is large. In each case AIC/sample size will yield the expected cross-entropy, hence they are comparable IMHO. – Cagdas Ozgenc Dec 24 '16 at 19:54
  • @CagdasOzgenc "...generated by the same true data process provided that the sample size is large." The problem in reality is that you never know whether two data sets are actually generated from the same model/process. Even if they are, the adjustment is no more than a rescaling and hence does not address the problem raised by the OP, since he stated very clearly that the two models are nested. I hope it is clear to you now. It is just very wrong to do so merely because ESL said so; "machine learning" provides many results that are only valid in specific cases. – Henry.L Dec 24 '16 at 20:02
  • @Henry.L A nested model doesn't necessarily mean a conditional model, but since the OP also mentioned regression (a conditional model) I will think about it and get back to you. In the meantime I stick to my claim: dividing by the sample size is not a simple rescaling. It is the same as taking the mean of two different data samples. – Cagdas Ozgenc Dec 24 '16 at 20:06
  • @CagdasOzgenc (1) By "rescale" I mean rescaling the SS from both models, since it is just F-tests; probably I should have used the term more rigorously. (2) My point is that the AIC can only be used to compare two models fitted to one data set. By assuming a large sample from the same model, you are effectively assuming the two data sets are asymptotically the same data set by the central limit theorem, so your comment does not harm my claim. (3) It is not relevant to conditionality at all; I do not think you have understood my point or the OP's question correctly. – Henry.L Dec 24 '16 at 20:22
  • @Henry.L According to your logic we shouldn't do cross-validation studies, because when we split a data set into two subsets (one for training, one for validation) they may, god forbid (pun intended), come from different processes. – Cagdas Ozgenc Dec 24 '16 at 20:43
  • I think you both bring up good points. While I think my answer addresses a "grammatical" mistake in the question -- like must be compared with like -- it might be better to ask why the AIC is being used in this way. Are the separate data sets more than merely statistically different samples? Why not fit the model to the combined data set but include an interaction effect of the data-set type on the extra parameter in the larger of the nested models? – jwimberley Dec 25 '16 at 04:04
  • @RichardHardy Thanks for the recommendation -- I'll consider this, but I think the answers on that page seem deep and comprehensive and I don't fully understand them (though maybe that is your point?) – jwimberley Mar 21 '17 at 13:18
  • @RichardHardy OK, thanks again for the suggestion, I added an answer. I made the modifications that I think were necessary, but let me know if there is anything else you think should be added! – jwimberley Mar 21 '17 at 13:59