I have been reading Hyndman & Athanasopoulos "Forecasting: Principles and Practice" (newest edition here) recently, and I noticed something that I regard as a possible inconsistency. On the one hand, the book generally advocates model selection based on information criteria (typically AICc, sometimes BIC). On the other hand, it also pays attention to residual diagnostics and tends to override the IC-based selection if, say, the residuals are found to be significantly autocorrelated. The ACF and PACF plots that are employed for residual diagnostics use 95% critical values, and the book takes these as the reference cut-off point. It is the 95% level that bothers me in such situations.
Consider the following example. Suppose the true data generating process is
$$
y_t=0.5y_{t-4}+u_t
$$
with $u_t\sim \text{i.i.d.}N(0,\sigma^2)$.
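For concreteness, here is a minimal sketch of how one might simulate this process and check its lag-4 sample autocorrelation. The helper names (`simulate_seasonal_ar`, `sample_acf`), the seed, the burn-in length and the sample size are my own illustrative choices, not anything taken from the book.

```python
import random

def simulate_seasonal_ar(n, phi=0.5, sigma=1.0, lag=4, burn_in=200, seed=0):
    """Simulate y_t = phi * y_{t-lag} + u_t with u_t ~ i.i.d. N(0, sigma^2)."""
    rng = random.Random(seed)
    y = [0.0] * lag  # arbitrary start values, discarded together with the burn-in
    for _ in range(n + burn_in):
        # y[-lag] is y_{t-lag} for the observation about to be appended
        y.append(phi * y[-lag] + rng.gauss(0.0, sigma))
    return y[-n:]

def sample_acf(y, k):
    """Sample autocorrelation of the series y at lag k."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    numer = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, n))
    return numer / denom

y = simulate_seasonal_ar(n=5000)  # long series, so the estimates are tight
acf1 = sample_acf(y, 1)           # theoretical value is 0
acf4 = sample_acf(y, 4)           # theoretical value is phi = 0.5
print(f"lag 1: {acf1:.3f}, lag 4: {acf4:.3f}")
```

With a short series (say a few dozen observations) the lag-4 estimate is much noisier, which is the situation the example below has in mind.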
Suppose model selection using auto.arima based on AIC suggests ARIMA(0,0,0). The model ARIMA(4,0,0) with the first three lags set to zero has a lower AIC value, but it has not been selected, as it did not belong to the set of models that were considered as candidates. The PACF of the residuals shows that lag 4 is insignificant at the 95% confidence level, just like all the other lags. However, it is significant at the 90% confidence level. (We cannot see this immediately from the graph, as the 90% critical values are not depicted.) Had we used 84% critical values in the PACF, we would have discovered the 4th lag and with it the superior model (superior in terms of forecasting performance measured by expected likelihood on new data). The level of 84% comes from the fact that AIC would prefer including a term in the model if its $p$-value were below 16% (not 5%); see e.g. Glen_b's answer to Why does model selection using AIC yield non-significant p-values for the variables?
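To make the 84% figure concrete, here is a rough sketch of the arithmetic. It assumes two nested models that differ by a single parameter, plain AIC (not AICc), and the usual asymptotic approximations: AIC prefers the larger model when the likelihood-ratio statistic exceeds 2, i.e. when $|z|>\sqrt{2}$. The function `pacf_band` and the sample size `n = 100` are hypothetical, and the band $\pm z/\sqrt{n}$ is the standard large-sample white-noise approximation.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, mean 0 and sd 1

# AIC adds a penalty of 2 per parameter, so one extra term is preferred
# when the LR statistic exceeds 2, which corresponds to |z| > sqrt(2).
aic_z = sqrt(2.0)
aic_p = 2.0 * (1.0 - std_normal.cdf(aic_z))  # two-sided p-value, about 0.157
aic_level = 1.0 - aic_p                      # implied level, about 0.843

def pacf_band(n, level):
    """Half-width of the approximate white-noise band at the given level."""
    return std_normal.inv_cdf(0.5 + level / 2.0) / sqrt(n)

n = 100  # hypothetical sample size, for illustration only
for level in (0.95, 0.90, aic_level):
    print(f"level {level:.3f}: band = +/- {pacf_band(n, level):.3f}")
```

Under these assumptions the AIC-implied band is $\sqrt{2}/\sqrt{n}$ rather than $1.96/\sqrt{n}$, so it is noticeably narrower than the 95% band drawn on the default plots.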
Question: When selecting a model for forecasting, what critical level should be used in diagnostic tests such as whether (partial) autocorrelations of the residuals are zero?
This may be too general to have a simple answer, but I am looking for principles that would suggest an appropriate critical level.