
I'm trying to work out the order of operations when building a GLM.

I have 9 variables I could use as inputs for the model; however, I may find that some are irrelevant to the dependent variable, in which case I won't use them.

I was thinking I'd start off by using all 9 variables to pick a distribution and link function for the GLM. Looking at the dependent variable's data has suggested that an Inverse Gaussian or Gamma distribution would probably be best, and I'm planning on testing the following link functions: Identity, Log, Inverse, and [only for Inverse Gaussian] $\frac{1}{\mu^2}$.

So, I plan on comparing the 7 resulting models, each fitted with all 9 candidate independent variables (4 Inverse Gaussian models, 3 Gamma models), to see which distribution and link function performs best. I believe it'd be best to compute AICs or BICs, as they're useful measures for comparing models that account for both fit and complexity.

From there, I'd work on finding the best subset of the 9 independent variables for predicting my dependent variable. I know there are a lot of different ways to do this, but I plan on using the bestglm() function (from the "bestglm" R package), which with my data will use complete enumeration (Morgan and Tatar, 1972). Again, the BICs or AICs of the models will be used to compare them and choose the best.

I was just wondering if this is the best order to do things in, based on AIC or BIC:

  1. Pick the distribution and link function using all 9 possible independent variables,
  2. Perform independent-variable subset selection with GLMs using the distribution and link function picked in Step 1.

Morgan, J. A. and Tatar, J. F. (1972). Calculation of the residual sum of squares for all possible regressions. Technometrics. DOI: 10.1080/00401706.1972.10488918

Ruby Pa

  • What is your objective? Inference or prediction? – Stephan Kolassa Oct 30 '22 at 07:45
  • This kind of stepwise-style approach to variable selection is problematic, as discussed by Gelman and Harrell. – Dave Oct 30 '22 at 07:48
  • @Dave: that is why I asked about the objective. For inference, I agree (and I suspect that is what the OP wants to do). In prediction, some model fitting can indeed help, as long as one stays clear of overfitting too badly. – Stephan Kolassa Oct 30 '22 at 07:50
  • @StephanKolassa My objective is prediction. – Ruby Pa Oct 30 '22 at 08:24
  • @Dave Is it problematic even for predictive models? – Ruby Pa Oct 30 '22 at 09:40
  • @RubyPa The first criticism in the Harrell link gives a reason not to like stepwise variable selection for pure prediction problems. However, if you are careful about validating your work (e.g., holdout data or bootstrap), it is less obvious that stepwise selection is a bad idea. It might turn out to perform worse than other approaches to handling a large number of variables, such as penalized regression (e.g., ridge), but that would require empirical evidence or a theorem I do not know and kind of doubt exists. // I plan to post a question about this tonight and will link it here. – Dave Oct 30 '22 at 21:00

0 Answers