I am aware that a question very similar to mine has already been asked here (Should AIC be reported on training or test data?), but some points remain unclear to me.
The accepted answer states:
On the other hand, when the model is evaluated on test data (not the same as the training data), there is no bias to −2ln(L). Therefore, it does not make sense to penalize it by 2p, so using AIC does not make sense; you can use −2ln(L) directly.
Could someone elaborate on this? I don't see why the number of parameters in a model is relevant for the training data but no longer for the test data.
Is it correct, then, that AIC is only a measure of in-sample performance?
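To make my question concrete, here is a minimal sketch of how I understand the two quantities being compared (this assumes an ordinary least-squares fit with Gaussian errors; the simulated data, variable names, and train/test split are purely illustrative and not from the linked answer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: y = 1 + 2*x + noise
n_train, n_test = 100, 100
x = rng.normal(size=n_train + n_test)
y = 1.0 + 2.0 * x + rng.normal(size=x.size)
X = np.column_stack([np.ones_like(x), x])

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

# Fit OLS on the training data only
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
sigma2 = np.mean((y_tr - X_tr @ beta) ** 2)  # ML estimate of the error variance

def neg2_loglik(X, y):
    """-2 * Gaussian log-likelihood under the model fitted on the training set."""
    resid = y - X @ beta
    return len(y) * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / sigma2

k = beta.size + 1                              # estimated parameters: coefficients plus sigma^2
aic_train = neg2_loglik(X_tr, y_tr) + 2 * k    # in-sample: -2 ln(L) penalized by 2k (AIC)
neg2ll_test = neg2_loglik(X_te, y_te)          # out-of-sample: -2 ln(L) with no penalty

print(f"AIC on training data:  {aic_train:.1f}")
print(f"-2 ln(L) on test data: {neg2ll_test:.1f}")
```

If I read the quoted answer correctly, the first quantity is what AIC is meant to correct, while the second needs no 2k penalty because the test data were not used to fit the model. Is that the right way to think about it?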