I ran LASSO with logistic regression to obtain a list of "important" variables. For factor variables, I created one-hot encoded dummy variables using the step_dummy function in the tidymodels world.
After running LASSO, I inspected the list of variables that were kept and noticed that some of the dummy variables were deemed "unimportant" by LASSO, i.e., their coefficients were shrunk to exactly 0. Does it make sense to keep only some of the dummy variables (the ones with non-zero coefficients) when fitting a final logistic regression model? For example, for race, 5 indicator variables were created using one-hot encoding: White, Black, Asian, Hispanic, and Other. LASSO deemed only White and Hispanic important and dropped the other 3. Is it OK to include just White and Hispanic in my logistic regression to make predictions?
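To make the question concrete, here is a minimal sketch in plain Python (not the tidymodels code that was actually run) of what a final model containing only the White and Hispanic indicators implies for each race level. The intercept and the coefficients 0.1 and 0.2 are arbitrary, made-up values, and the names `INTERCEPT`, `KEPT_COEFS`, and `logit_p` are hypothetical helpers introduced only for illustration.

```python
import math

# Hypothetical values chosen arbitrarily for illustration; not fitted estimates.
INTERCEPT = -1.0
KEPT_COEFS = {"White": 0.1, "Hispanic": 0.2}  # dummies with non-zero LASSO coefficients
RACE_LEVELS = ["White", "Black", "Asian", "Hispanic", "Other"]


def one_hot(race):
    """Return the five one-hot indicators for a single race level."""
    if race not in RACE_LEVELS:
        raise ValueError(f"unknown race level: {race}")
    return {level: int(level == race) for level in RACE_LEVELS}


def logit_p(race, other_terms=0.0):
    """Linear predictor using only the dummies that were kept."""
    indicators = one_hot(race)
    race_effect = sum(coef * indicators[name] for name, coef in KEPT_COEFS.items())
    return INTERCEPT + race_effect + other_terms


def probability(race, other_terms=0.0):
    """Inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit_p(race, other_terms)))


logits = {race: logit_p(race) for race in RACE_LEVELS}
for race in RACE_LEVELS:
    print(f"{race:9s} logit = {logits[race]:+.2f}  p = {probability(race):.3f}")
```

With the other variables held fixed, Black, Asian, and Other all receive the same linear predictor (the intercept), so keeping only the two non-zero dummies amounts to pooling those three levels into a single reference group, with White and Hispanic measured relative to it.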
logit(p) = intercept + 0.1*White + 0.2*Hispanic + other variables. – user122514 Aug 19 '21 at 16:04

0.1 and 0.2 are completely arbitrary and were only used to show you what I meant. – user122514 Aug 19 '21 at 16:07

If logit(p) = intercept + 0.1*White + 0.2*Hispanic + other variables, then to calculate the logit for people who are Black/Asian/Other, we would have logit(p) = intercept + 0.1*0 + 0.2*0 + other variables = intercept + other variables. I'm not sure why you are saying I'm dropping their data. – user122514 Aug 19 '21 at 16:20