
I don't have a statistical background, so a lot of this stuff confuses me (sorry if this is a bad question).

I am trying to fit a few probit regression models in R on a dataset that consists only of categorical variables: all my dependent and independent variables are categorical.

  1. I have already identified my dependent and independent variables. However, when constructing the probit model, I am pretty sure I need to make sure that all the independent variables included in each model are, well, independent of each other. To check this, I would run the chi-squared test of independence between each pair of independent variables, right?

  2. If the chi-squared test of independence says two predictor variables are not independent, then I think I should put only one of them in my regression model. Is there a test that tells me which predictor variable to include over the other?

  3. I think I have to convert all variables into 0/1 dummy variables before fitting the regression. Can I run the chi-squared test on the variables before converting them into dummy variables?

1 Answer

  1. The chi-squared test is a good test of the independence of two categorical variables, yes.

  2. Generalized linear models like probit regression make no assumptions about feature independence. There are advantages to feature independence, such as narrower confidence intervals and ease of interpretation, but there are also disadvantages to dropping variables, such as biasing coefficient estimates and lowering predictive capacity. It is not a given that you should drop features just because they are related: each could make its own unique contribution to the outcome. Further, feature selection is known to be unstable (small changes in the data can change which features get selected), so determining which features to drop and which to keep is iffy. See Frank Harrell’s post about feature selection.

  3. To a large extent, this depends on the particular software in which you conduct the test. If the function wants 0/1 encoding, you have to do it that way. If the software wants a contingency table, you do it that way. If the software wants a data frame, you do it that way. (A good exercise might be to figure out how to run the test in each of those ways and why they are equivalent.)
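
To make points 2 and 3 concrete, here is a rough sketch in R on simulated data. Everything in it is made up for illustration: the variable names (`churn`, `region`, `plan`), the sample size, and the effect sizes are assumptions, not anything from your dataset. It runs the chi-squared test three ways (two factors, a contingency table, and 0/1 dummy columns) and then fits a probit model that keeps both related predictors.

```r
# Simulated data; all names and numbers below are hypothetical.
set.seed(1)
n <- 500
region <- factor(sample(c("north", "south", "west"), n, replace = TRUE))

# Make plan depend on region so the two predictors are associated.
plan <- factor(ifelse(
  region == "north",
  sample(c("basic", "premium"), n, replace = TRUE, prob = c(0.8, 0.2)),
  sample(c("basic", "premium"), n, replace = TRUE, prob = c(0.4, 0.6))
))

# Binary outcome generated from a probit model using both predictors.
eta <- -0.5 + 0.6 * (plan == "premium") + 0.4 * (region == "south")
churn <- rbinom(n, 1, pnorm(eta))
dat <- data.frame(churn, region, plan)

# (a) Chi-squared test given two factors directly.
t_vectors <- chisq.test(dat$region, dat$plan)

# (b) Chi-squared test given a contingency table.
t_table <- chisq.test(table(dat$region, dat$plan))

# (c) Chi-squared test starting from 0/1 dummy columns: cross-multiplying
#     the two indicator matrices rebuilds the same table of counts.
d_region <- model.matrix(~ region - 1, data = dat)
d_plan <- model.matrix(~ plan - 1, data = dat)
t_dummy <- chisq.test(crossprod(d_region, d_plan))

print(c(vectors = unname(t_vectors$statistic),
        table = unname(t_table$statistic),
        dummy = unname(t_dummy$statistic)))

# Probit fit with both predictors kept in the model. glm() expands
# factors into 0/1 dummy columns itself, so no manual recoding is needed.
fit <- glm(churn ~ region + plan,
           family = binomial(link = "probit"),
           data = dat)
print(summary(fit))

# Likelihood-ratio test for dropping each predictor, given the other.
print(drop1(fit, test = "Chisq"))
```

All three calls end up working from the same table of counts, which is why the statistics agree. The `drop1()` table shows what each predictor contributes with the other already in the model, which is a different question from whether the two predictors are associated with each other.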

Dave