
If p-values aren't useful to look at after performing AIC variable selection (see "Why are p-values misleading after performing a stepwise selection?"), what is the right thing to do in a scientific paper in order to say: "these variables are the ones that are (likely to be) important"?

Should we report all the selected variables? Or only the selected variables that have a very small p-value (for example, shifting the standard 0.05 threshold to 0.01; is there a formula for this)?

Any reference is appreciated. I've tried to go through several posts, but they contain lots of conflicting opinions, most of the time without references, which makes it hard to justify why I do what I do.

EDIT: some people also say to avoid AIC variable selection entirely ("You really want to avoid automated model selection methods, if you possibly can. If you must use one, try LASSO or LAR."), which makes me even more confused. What should one do with around 15 metrics when trying to infer which of them are useful for determining whether a patient has a certain disease? I feel like this is a relatively standard problem.
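
For illustration, a minimal sketch of the lasso route mentioned in that quote, assuming scikit-learn and a hypothetical `patients.csv` with a 0/1 `disease` column (all names are made up); it selects variables by shrinking coefficients exactly to zero rather than by p-values:

```python
# Hypothetical sketch: L1-penalised (lasso) logistic regression to screen
# ~15 candidate metrics for a binary disease outcome.
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("patients.csv")                       # hypothetical data file
X = StandardScaler().fit_transform(df.drop(columns="disease"))
y = df["disease"].values                               # 0/1 outcome

# Cross-validation picks the penalty strength; the L1 penalty shrinks
# some coefficients exactly to zero, dropping those variables.
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear",
                             Cs=20, cv=5).fit(X, y)

selected = df.drop(columns="disease").columns[lasso.coef_[0] != 0]
print("retained predictors:", list(selected))
```

Note this still does not yield honest p-values for the retained variables; it only gives a sparse, cross-validated selection.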

EDIT2: I understand that my p-values will be badly biased. What I'm looking for is a way to get unbiased p-values, so that I can say: "these variables are statistically significantly correlated with the outcome variable".
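
One standard way to get p-values that are not contaminated by the search is sample splitting: run the AIC search on one half of the data, then fit the chosen model on the untouched other half and report those p-values. A minimal sketch with statsmodels, assuming a DataFrame `df` with outcome column `y` (names hypothetical):

```python
# Hypothetical sketch: exhaustive AIC search on a training half,
# honest p-values from a held-out half.
import itertools
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.5, random_state=0)
predictors = [c for c in df.columns if c != "y"]

best_aic, best_subset = np.inf, None
for k in range(1, len(predictors) + 1):
    for subset in itertools.combinations(predictors, k):
        model = sm.OLS(train["y"],
                       sm.add_constant(train[list(subset)])).fit()
        if model.aic < best_aic:
            best_aic, best_subset = model.aic, subset

# Refit on the held-out half: these p-values are valid for the subset,
# because the subset is now fixed before this half is ever touched.
final = sm.OLS(test["y"], sm.add_constant(test[list(best_subset)])).fit()
print(final.summary())
```

The price is power: each half has only half the observations, so with 15 predictors the held-out fit may be noisy.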

  • It's difficult to determine what comparison you are asking us to make. What do you mean, exactly, by "AIC variable selection" (there are many ways and many different contexts for employing AIC for this purpose) and what specifically are these "p-values" to which you refer? And what does "consider[ing]" a variable actually amount to? – whuber Jul 18 '22 at 15:05
  • I have around 15 variables, and I compute the AIC (in a linear regression model) for all combinations of those 15 variables (subsets of 1, 2, 3, ... of them, every combination each time), etc. Then I have a set of selected variables, and I fit a final linear regression model (which, to be fair, I have already fit before) and compute the p-values. All of this is done with statsmodels, a Python module. I'm not sure what the p-values actually refer to, but if I remember correctly, it's always the F-statistic when considering linear regression models? – FluidMechanics Potential Flows Jul 18 '22 at 15:12
  • If you have really assessed all $2^{15}=32,768$ different models, then that is a lot of data mining. Any NHST and p-values you calculate after such a step will be badly biased. (You can simulate this yourself.) If you are after prediction, consider using a holdout sample: perform variable selection on the training sample only, then assess the predictions on the holdout sample. Alternatively, use an ML method and report its standard variable importance measures. – Stephan Kolassa Jul 18 '22 at 15:30
  • I did; it took some time to compute, but it wasn't that bad. It's actually $2^{14}$ combinations, because one column is the output. I'm not sure I understand what to do with the holdout sample: I can get an accuracy or something from it, but no p-values?! I might be misunderstanding something. – FluidMechanics Potential Flows Jul 18 '22 at 22:50
  • Just to make it clearer (I added it to the post as well): I understand that my p-values will be badly biased. I'm however looking for a way to get unbiased p-values, so that I can say: "these variables are statistically significantly correlated with the outcome variable". – FluidMechanics Potential Flows Jul 19 '22 at 08:55
  • There is no variable selection method, including the lasso, that has a halfway decent probability of finding the right variables. For whatever measure of variable importance you choose (including the rank across predictors) you can use the bootstrap to get confidence intervals on that measure. For an example of bootstrapping an importance ranking, see Section 5.4 of https://hbiostat.org/doc/rms.pdf. This fully exposes the difficulty of the task, i.e., the fact that the data do not possess sufficient information for making reliable discernments of which predictors are not important. – Frank Harrell Jul 19 '22 at 11:15
  • When you say rank across predictors: in my case, could I use the p-values of the lowest-AIC model to rank my predictors on a particular bootstrapped dataset, and then do it n times to get confidence intervals on each predictor's rank (see the sketch after this thread)? – FluidMechanics Potential Flows Jul 24 '22 at 13:33
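
Following up on the last two comments, a minimal sketch of bootstrapping an importance ranking, in the spirit of Section 5.4 of the rms notes but in Python with hypothetical names: on each resample, rank predictors by |t| in the full OLS model, then summarise the spread of each predictor's rank:

```python
# Hypothetical sketch: bootstrap confidence intervals for importance ranks.
# Assumes a DataFrame `df` with outcome column "y".
import numpy as np
import pandas as pd
import statsmodels.api as sm

predictors = [c for c in df.columns if c != "y"]
n_boot = 500
ranks = np.empty((n_boot, len(predictors)))
rng = np.random.default_rng(0)

for b in range(n_boot):
    idx = rng.integers(0, len(df), size=len(df))   # resample rows with replacement
    boot = df.iloc[idx]
    fit = sm.OLS(boot["y"], sm.add_constant(boot[predictors])).fit()
    tvals = fit.tvalues.drop("const").abs()
    # rank 1 = largest |t| on this resample
    ranks[b] = tvals.rank(ascending=False).reindex(predictors).values

summary = pd.DataFrame(np.percentile(ranks, [2.5, 50, 97.5], axis=0).T,
                       index=predictors, columns=["2.5%", "median", "97.5%"])
print(summary)   # wide intervals show how unstable the ranking really is
```

Per Harrell's point, with ~15 correlated predictors the rank intervals will typically be wide, which is itself the honest answer about how much the data can discern.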

0 Answers