I have a classification problem where the dataset is about 1 million rows, but the target group is only about 0.6% of it. I have about 40 features, including both categorical and continuous ones, and some of the continuous features are correlated with each other. The way I used to do feature selection for problems like this was to run a random forest model, giving extra weight to the target group, and then use the feature importances. But it's not working very well for this problem. I've also noticed that this method usually labels all categorical features as unimportant, when logically some of them seem like very good features. So I wanted to see if there are better ways of doing this.
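For reference, a minimal sketch of the weighted-random-forest approach described above, assuming scikit-learn; the data and column names are hypothetical stand-ins. One thing worth noting: the impurity-based importances it prints are known to favour continuous and high-cardinality features over binary ones, which may be part of why the categorical features score low.

```python
# Minimal sketch of the approach in the question, assuming scikit-learn.
# `X` and `y` are hypothetical stand-ins for the real ~40-feature data
# (categoricals already one-hot encoded) and the 0/1 target.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 10_000
X = pd.DataFrame({
    "cont_a": rng.normal(size=n),
    "cont_b": rng.normal(size=n),            # correlated with cont_a in real data
    "cat_flag": rng.integers(0, 2, size=n),  # a binary categorical feature
})
y = (rng.random(n) < 0.006).astype(int)      # ~0.6% positive class

# Extra weight on the rare target group via class_weight.
rf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced",  # or an explicit {0: 1, 1: w} dict
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)

# Impurity-based importances; these tend to undervalue binary features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```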
- Reweighting instances is very similar to oversampling, and there are major questions about oversampling as a practice. On the one hand, you could reduce the dimensionality among your correlated continuous features using PCA. On the other hand, you may simply not have a very strong signal. When you write that some features should "logically" be very good, are you relying on domain knowledge for this judgment? Reliably picking out 0.6% requires quite a strong signal in your data. – Stephan Kolassa Jan 29 '24 at 07:35
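To illustrate the PCA suggestion in that comment: a minimal sketch, assuming scikit-learn, where `X_cont` is a hypothetical stand-in for the correlated continuous columns only.

```python
# Sketch of the PCA suggestion, assuming scikit-learn.
# `X_cont` stands in for the correlated continuous columns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
# Five noisy copies of one signal, i.e. strongly correlated columns.
X_cont = np.hstack([base + 0.1 * rng.normal(size=(1000, 1)) for _ in range(5)])

reducer = make_pipeline(
    StandardScaler(),        # PCA is scale-sensitive, so standardize first
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
)
X_reduced = reducer.fit_transform(X_cont)
print(X_cont.shape, "->", X_reduced.shape)  # correlated columns collapse
```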
- Sorry, I just saw your comment. The reason I say the categorical variables should be important is that in my industry they are regarded as very important features for these kinds of problems. Also, these are binary categories, and I can see that among the rows that have a value of 1 for these features, the 0.6% mentioned above can go as high as 5%. – peiman razavi Jan 29 '24 at 16:35
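A quick way to verify that observation is to compare the overall target rate with the rate conditional on the flag; a sketch assuming pandas, with hypothetical column names:

```python
# Conditional-rate check described in the comment above, assuming pandas.
# `df`, `flag`, and `target` are hypothetical stand-ins.
import pandas as pd

df = pd.DataFrame({"flag": [0, 0, 1, 1, 0, 1],
                   "target": [0, 0, 1, 0, 0, 0]})

overall = df["target"].mean()                           # ~0.6% in the real data
conditional = df.loc[df["flag"] == 1, "target"].mean()  # reportedly up to ~5%
print(f"P(target) = {overall:.3%}, P(target | flag=1) = {conditional:.3%}")
```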
- Thanks. Are you using accuracy as an evaluation metric? If the conditional probability of being the target class is 5% in the presence of a particular predictor, then the accuracy-optimizing "hard" classification is still "non-target". This thread and this thread might be helpful. – Stephan Kolassa Jan 29 '24 at 16:54
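To spell out the arithmetic behind that point: even at a 5% conditional probability, always predicting "non-target" maximizes accuracy within that subgroup, so accuracy cannot reward the feature. Evaluating predicted probabilities with a proper scoring rule such as log loss is the usual alternative. A small sketch:

```python
# With P(target | flag=1) = 5%, the accuracy-maximizing hard classification
# for that subgroup is still "non-target", so the signal is invisible
# to accuracy.
p = 0.05  # conditional probability of the target class

acc_always_nontarget = 1 - p  # correct on the 95% of non-targets
acc_always_target = p         # correct only on the 5% of targets
print(acc_always_nontarget, acc_always_target)  # 0.95 vs 0.05

# Scoring predicted probabilities (e.g. from predict_proba) with log loss
# instead of accuracy does reward moving the estimate from 0.6% to 5%.
```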