I am currently searching for the best ARMA(p,q) model for my conditional mean. When comparing the AIC, BIC and log-likelihood (LL), I see that some models perform better on AIC, some on BIC and some on LL. The ACF and PACF only show significance at lag 1. The variable is Bitcoin log returns. Can someone help me decide which criterion to use for choosing the conditional mean model?
-
The numbers are quite close as point estimates. It is usually recommended to report uncertainties as well, i.e., error bars. – patagonicus Apr 15 '22 at 11:44
-
They correspond to different perspectives, hence none is "better" than the others. – Xi'an Apr 15 '22 at 11:45
-
@Xi'an Could you elaborate on this in an answer? It sounds like a great insight. – patagonicus Apr 15 '22 at 21:28
2 Answers
If you compare two models, one of which is "bigger" than the other (i.e. it has all the parameters of the other one and some more), the log-likelihood will always be at least as large for the bigger model, because with more parameters the data can be fitted better. (I'm assuming here that parameters are fitted by maximum likelihood, and that the numerical procedure is good enough to find a solution that is at least as good for the bigger model, as it theoretically should be.)
This means that the log-likelihood cannot be used to compare models of different sizes (i.e., numbers of parameters), or at least not in a straight "bigger means better" way (but see the answer by dipetkov). Both AIC and BIC add a penalty to the log-likelihood that is meant to account for the fact that the log-likelihood will always increase with more parameters, though they use different principles to do so. The implication of their definitions is that (unless the data set is extremely small) the BIC penalises complexity more strongly than the AIC, meaning that it will usually prefer a smaller model (unless the two approaches agree).
The theoretical justification of AIC and BIC is somewhat involved. Roughly, the AIC is to be preferred if your major aim is prediction quality (a too-big model may still predict well, whereas a too-small one usually doesn't), whereas the BIC is motivated by the idea that there is a not-too-big true model and the aim is to find it. The BIC is often better at that, but it has a higher probability than the AIC of choosing a too-small model, which is bad for prediction. In reality the truth is often not "small"; however, for reasons such as interpretability, smaller models are sometimes preferred even if they predict a little worse, in which case one may prefer the BIC.
So, very roughly, one could say that from the point of view of the AIC it is better to fit a too-big model than a too-small one (for reasons of prediction quality), whereas from the point of view of the BIC, too-big and too-small models are equally bad. A minimal sketch of both penalties is below.
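To make the two penalties concrete, here is a minimal R sketch (the series name `x` is a placeholder for your log-return series; the ARMA order is arbitrary, for illustration only):

fit <- arima(x, order = c(1, 0, 1))   # ARMA(1,1) with mean
ll  <- as.numeric(logLik(fit))        # maximized log-likelihood
k   <- length(coef(fit)) + 1          # parameters, incl. innovation variance
n   <- length(x)                      # sample size (assuming no missing values)
-2 * ll + 2 * k                       # AIC: constant penalty of 2 per parameter
-2 * ll + log(n) * k                  # BIC: penalty grows with log(n)
AIC(fit)                              # the built-ins compute the same quantities
BIC(fit)

For any n with log(n) > 2 (i.e. n > 7), the BIC penalty per parameter exceeds the AIC's, which is why the BIC tends to select smaller models.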
-
Nice answer. Does a larger (more parameters) model always give a larger LL? I can see that it would if the models were closely related and the larger simply has an extra parameter, but what if the models are quite different in their treatment of the parameters? – Michael Lew Apr 15 '22 at 21:16
-
@MichaelLew "Larger" in my sense means "same parameters plus some more", i.e. one model nested in the other. In general, if models are not nested, it may happen that the LL isn't always bigger for a model with more parameters, although more often than not it will be. Counterexample: Imagine a situation in which $y=x+x^2+e$. A model may involve lots of irrelevant variables each with its own (truly zero) regression parameter, but no squared term, and may still be worse for large enough data sets in terms of LL than the model that has simply coefficients for $x$ and $x^2$. – Christian Hennig Apr 15 '22 at 21:45
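The counterexample in the last comment is easy to see in a quick R simulation (simulated data, chosen purely for illustration): a non-nested model with many irrelevant parameters but no squared term ends up with a lower log-likelihood than a smaller model that includes $x^2$.

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + x^2 + rnorm(n)                # true relationship: y = x + x^2 + e
junk <- matrix(rnorm(n * 10), n, 10)   # 10 irrelevant predictors
big   <- lm(y ~ x + junk)              # 12 coefficients, but no x^2 term
small <- lm(y ~ x + I(x^2))            # 3 coefficients, includes x^2
logLik(big)                            # lower, despite having more parameters
logLik(small)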
You can use the log-likelihood to compare models of different size, under one condition. (Though you might still prefer to use the AIC, as recommended by @Christian Hennig.)
While it's true that bigger models fit the data better, if the two models you want to compare are nested, you can use a likelihood ratio test (LRT) to decide whether it is (statistically) justified to include the additional parameters. Two models are nested if the bigger model can be simplified into the smaller one by imposing constraints on its parameters, e.g. by setting some parameters to 0.
The LRT statistic is $$ 2\big\{ \ell(\text{bigger model}) - \ell(\text{smaller model}) \big\} \sim \chi^2_p $$ where $\ell$ is the maximized log-likelihood, $p$ is the number of extra parameters in the bigger model and $\chi^2_p$ is the chi-squared distribution with $p$ degrees of freedom. Intuitively, this is the (asymptotic) distribution of how much better the bigger model fits the data when its extra $p$ parameters only capture noise by chance.
Two models ARMA(p1,q1) and ARMA(p2,q2) are nested as long as p1 ≤ p2 and q1 ≤ q2.
So let's use a likelihood ratio test to compare the smallest and biggest models in your list, ARMA(0,0) and ARMA(2,2).
# log-likelihoods from your list: ARMA(2,2) = 2633.6256, ARMA(0,0) = 2629.7241;
# the bigger model has 4 extra parameters (2 AR + 2 MA)
pchisq(2 * (2633.6256 - 2629.7241), df = 4, lower.tail = FALSE)
#> [1] 0.099
The p-value is 0.1, so (at the usual 5% level) there is no evidence that your data needs either an AR or an MA component. The ARMA(0,0) model has the smallest AIC as well.
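For reference, here is a sketch of the same test computed directly from fitted models (assuming your log-return series is stored in `x`; the log-likelihoods will match the numbers above only for your data):

small <- arima(x, order = c(0, 0, 0))      # ARMA(0,0): just a mean
big   <- arima(x, order = c(2, 0, 2))      # ARMA(2,2)
stat  <- 2 * as.numeric(logLik(big) - logLik(small))
pchisq(stat, df = 4, lower.tail = FALSE)   # df = 4 extra parameters: 2 AR + 2 MA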
