PCA, stepAIC, and negative binomial regression

Question

I have some output data (around 800 data points) that very nicely fits a negative binomial distribution. I checked using fitdistr() in R and it is a very good fit.

Given this, my plan was to fit a negative binomial regression on other variables so that I can predict the output data. I have around 65 variables available to use. I used glm.nb() in R to fit a model with all of them, and only about 3 of the variables were significant.

I want to drop the variables that aren't contributing very much and have been thinking about using PCA to identify the ones I can cut out. Is this a sensible way of doing it? Or can PCA not be used for a negative binomial model?
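
To make the PCA idea concrete, here is a minimal sketch of one way it could look. It is not run on the real data: the data are simulated stand-ins, the names X, Z, L, y, k and dat are hypothetical, and k = 5 is an arbitrary choice. Note that PCA returns linear combinations of all the predictors rather than a list of variables to cut, so the sketch regresses the count outcome on the first few component scores. The components are computed from the predictors alone, without looking at the outcome.

```r
library(MASS)  # glm.nb(), rnegbin()

set.seed(1)
n <- 800   # number of observations, as in the question
p <- 65    # number of candidate predictors, as in the question

# Simulated stand-in data (hypothetical): 65 correlated predictors
# driven by 5 latent factors, plus a negative binomial count outcome.
Z <- matrix(rnorm(n * 5), n, 5)
L <- matrix(rnorm(5 * p), 5, p)
X <- Z %*% L + matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- rnegbin(n, mu = exp(1 + 0.5 * Z[, 1]), theta = 2)

# PCA on the standardised predictors only; y is never consulted here.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the first k component scores. k is fixed in advance
# (an arbitrary assumption for this sketch), not tuned against y.
k <- 5
scores <- as.data.frame(pca$x[, 1:k])
dat <- cbind(y = y, scores)

# Negative binomial regression on the retained components.
fit <- glm.nb(y ~ ., data = dat)
summary(fit)
```

summary(pca) reports the proportion of variance explained by each component, which is one way to pick k without involving y.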

I have also tried using stepAIC() in R, but I couldn't get it to work. It stopped after a single iteration, whether I started from a simple one-variable model and worked up or worked backwards from the complete model. I have also read people criticising the stepAIC approach regardless. What are the problems with using it?
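
For reference, a minimal sketch of how a forward stepAIC() search on a glm.nb() fit can be set up. The data are simulated stand-ins with only six candidate predictors, and the names dat, null_fit, upper and fwd are hypothetical. When the scope argument is missing, stepAIC() uses the initial model as the upper model, so a forward search from a small model has nothing to add; that is one possible reason for an early stop, though not necessarily what happened here. The sketch shows the call mechanics only and does not address the criticisms of stepwise selection.

```r
library(MASS)  # glm.nb(), stepAIC(), rnegbin()

set.seed(2)
n <- 800

# Simulated stand-in data (hypothetical): six candidate predictors,
# of which only x1 and x2 affect the count outcome.
dat <- as.data.frame(matrix(rnorm(n * 6), n, 6,
                            dimnames = list(NULL, paste0("x", 1:6))))
dat$y <- rnegbin(n, mu = exp(1 + 0.5 * dat$x1 - 0.4 * dat$x2), theta = 2)

# Starting point: intercept-only negative binomial model.
null_fit <- glm.nb(y ~ 1, data = dat)

# Largest model the search may reach: ~ x1 + x2 + ... + x6
upper <- reformulate(paste0("x", 1:6))

# Forward search; `scope` tells stepAIC which terms it may add.
fwd <- stepAIC(null_fit,
               scope = list(lower = ~ 1, upper = upper),
               direction = "forward",
               trace = FALSE)
formula(fwd)
```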

Any tips on both the points above (and negative binomial regression model selection) would be appreciated!

“Stepwise regression model-building is…pants.” -Dr. Alexis Dinno — Dave, Aug 10 '22 at 14:12
Using relationships with Y to choose which X's to model is classic double dipping/cherry picking and invalidates all aspects of the analysis. There are extensive posts about this on this site. The only hope to get a stable analysis in your setting is to start with heavy use of unsupervised learning (data reduction) masked to Y. — Frank Harrell, Aug 10 '22 at 14:42
@FrankHarrell Are you arguing against supervised model selection in general? — John Madden, Aug 10 '22 at 14:44
@FrankHarrell Can you link to some posts about data reduction please? — StatisticsPersonInTraining, Aug 10 '22 at 15:09
