I have a dataset of ~4.7K records for a binary classification problem with 60 features. Class 1 has 1554 records and class 2 has 3558 records.
Now I would like to find the risk factors that influence the outcome.
I understand that people do matching to ensure that both classes have similar distributions, so that the comparison results are reliable.
1) Are all my independent variables X1, X2, ..., Xn called exposure variables?
2) I see that people usually match on demographics such as age. Is that so they can infer which factors really influence the outcome when age is held constant? Am I right to understand it this way?
3) If I put all the variables in a logistic regression model, doesn't that account for confounding? Why do I have to do matching?
4) Out of the 60 features, I would like to match on 4 variables. How do I do this for my full dataset? Is there a Python package for this?
Can someone help me on how to do this?
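To make question 4 concrete, here is a minimal sketch of what I imagine matching on 4 variables could look like, using plain pandas. This is exact 1:1 matching within strata, which is only one of several matching approaches, and I am not sure it is the right one for my case. The column names (`age_band`, `sex`, `smoker`, `region`, `outcome`), the helper `exact_match`, and the toy data are all hypothetical placeholders, not my real dataset. A continuous variable such as age would have to be binned first for this to work.

```python
import numpy as np
import pandas as pd


def exact_match(df, outcome, match_cols, seed=0):
    """Hypothetical helper: 1:1 exact matching within strata.

    Within each combination of values of match_cols, keep the same
    number of rows from each outcome class by randomly sampling the
    larger class down to the size of the smaller one.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for _, stratum in df.groupby(match_cols):
        groups = [g for _, g in stratum.groupby(outcome)]
        if len(groups) < 2:
            continue  # only one class in this stratum: nothing to match
        n = min(len(g) for g in groups)
        for g in groups:
            idx = rng.choice(g.index.to_numpy(), size=n, replace=False)
            kept.append(df.loc[idx])
    return pd.concat(kept) if kept else df.iloc[0:0]


# Toy data standing in for the real dataset (assumed structure).
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "age_band": rng.integers(0, 4, n),   # age already binned into 4 bands
    "sex": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
    "region": rng.integers(0, 3, n),
    "x5": rng.normal(size=n),            # one of the remaining features
    "outcome": rng.choice([1, 2], size=n, p=[0.3, 0.7]),
})

match_cols = ["age_band", "sex", "smoker", "region"]
matched = exact_match(toy, "outcome", match_cols)

# After matching, both classes have the same size overall and per stratum.
print(matched["outcome"].value_counts())
```

With something like this, the matched sample is balanced on the 4 chosen variables by construction, at the cost of discarding the unmatched records from the larger class. Is this the right idea, or is there a more standard way to do it?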
multivariable logistic regression, I should do some feature reduction/filtering/selection before I pass all the variables into the model. Am I right? But when I do feature selection, say out of 60 features I get some 6-7 features which give high accuracy, do I have to choose only those 6-7 features? Does that mean the feature selection algorithm has taken care of confounding? – The Great Jan 10 '20 at 08:36