Why and how to match variables using logistic regression?

Question

I have a dataset of ~4.7K records focused on binary classification with 60 features. class 1 is of 1554 records and class 2 is of 3558 records.

Now I would like to find the risk factors that influences the outcome.

I understand that people do matching to ensure that both the classes have similar distribution, so that the comparison results are reliable.

1) Are all my independent variables X1,X2...Xn called as exposure variables?

2) I see people usually do matching based on demographics like Age etc. Is it to infer what factors really influence the outcome if we keep Age constant. Am I right to understand this way?

3) If I put all the variables in logistic regression model, doesn't that account for confounding? Why do I have to do matching?

4) Out of 60 features, I would like to do matching based on 4 variables. How do I do this for my full dataset? Is there any python package to do this?

Can someone help me on how to do this?

Robert Long · Accepted Answer · 2020-01-09T20:05:22.003

1) Are all my independent variables X1,X2...Xn called as exposure variables?

The term "exposure variable" comes mainly from the epidemiology and causal inference literature. It is not completely well defined, but usually there is one (main) exposure, an outcome, and the other variables are either competing exposures, or confounders (some could be mediators but including a mediator in a multivariable regression model is usually ill advised). In your case you are interesred in prediction (classification), not inference and the usual terminology is "predictors" or "independent variables"

2) I see people usually do matching based on demographics like Age etc. Is it to infer what factors really influence the outcome if we keep Age constant. Am I right to understand this way?

Yes, because age is often a confounder so keeping it constant by matching can help to remove confounder bias.

3) If I put all the variables in logistic regression model, doesn't that account for confounding? Why do I have to do matching?

You don't have to do both. Matching is more commonly done when you are interesred in inference (explanation). You should be careful before just adding all your variables into a model without thinking. Again, since you are not interesred in inference, this is less of a concern.

4) Out of 60 features, I would like to do matching based on 4 variables. How do I do this for my full dataset? Is there any python package to do this?

This doesn't really make sense. There are many methods for variable selection. Why do you want to use matching ?

Hi thanks for the response. Upvoted. Regarding 3rd point, you mean to say for multi variable logistic regression, I should do some feature reduction/filtering techniques/selection techniques before I pass all variables in to the model. Am I right? But when I do feature selection, let's say out of 60 features, I get some 6-7 features which gives high accuracy, so do I have to choose only those 6 features? Does that mean the feature selection algorithm has taken care of confounding? — The Great, Jan 10 '20 at 08:36
You're welcome :) to properly handle confounding you need knowledge of the casual relationships. The danger in adding too many variables is overfitting (and maybe multicollinearity). You could use a method such as cross-validation to avoid overfitting. — Robert Long, Jan 10 '20 at 08:54
...confounding is much less of a problem with prediction, than inference. — Robert Long, Jan 10 '20 at 08:56
Since this is related question, thought will check with you. Can help me with this please? https://stats.stackexchange.com/questions/497837/how-to-calculate-28-day-mortality — The Great, Nov 24 '20 at 15:37

Why and how to match variables using logistic regression?

1 Answers1