First, you probably do not need to do anything about the class imbalance. Much of the apparent problem with imbalance comes from using improper scoring rules like accuracy, which depend on a (typically arbitrary) threshold. If you are tired of your imbalanced problem predicting that all or almost all cases belong to the majority category, the first thought should be to change the threshold rather than the data. (Probably even better would be to use one of the proper scoring rules discussed in the link, but if you must use a threshold, I suspect you can change it to suit your needs.)

Further, downsampling to fix what is likely a non-problem is probably the worst idea of all. Setting aside issues of experimental design before you collect data, which Dikran Marsupial discusses in his excellent answer to the linked question, downsampling discards precious data in order to solve that non-problem. (There could be computational reasons to downsample, such as making your dataset fit into memory or run in a reasonable amount of time, though the better plan might be to upgrade your hardware.)
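To make the threshold point concrete, here is a minimal sketch, assuming scikit-learn; the toy data, the logistic regression model, and the 0.25 cutoff are illustrative choices on my part, not recommendations. The point is simply to work with predicted probabilities, judge them with a proper scoring rule, and, if you must make hard classifications, move the cutoff rather than resample the data.

```python
# A minimal sketch, assuming scikit-learn; the toy data, the model, and the
# 0.25 cutoff are illustrative choices, not recommendations.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Work with predicted probabilities and evaluate them with a proper scoring
# rule (the Brier score here) instead of threshold-based accuracy.
proba = model.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, proba))

# If a hard decision is unavoidable, move the cutoff to suit your costs
# instead of resampling the data; 0.5 is not sacred.
threshold = 0.25
print("Flagged positive at 0.50:", (proba >= 0.5).sum())
print("Flagged positive at 0.25:", (proba >= threshold).sum())
```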
However, even when an idea is a poor one, it can be helpful to know how it would fit into a reasonable workflow, if only to show its drawbacks.
I see arguments in favor of either order:

1. Fiddle with the data to create the new observations first. Then you have your synthetic data, you treat them as if they were real, and you do the rest of the modeling, preprocessing included, as if the synthetic data were real.

2. Do your preprocessing on the real data first. Then synthesize new data based on the preprocessed real data, since the undesirable features of the data will already have been removed.
The former makes the most sense to me. You are doing the rest of the modeling as if the synthetic data were real, so why not pretend they are real when it comes to the preprocessing, too? This also has the advantage of ensuring that your synthetic data adhere to the properties you want, whereas the latter approach allows nastiness to creep in that you are then unable to preprocess away.
However, I do believe you can treat the order of operations as a hyperparameter and go with whichever leads to the better out-of-sample performance.
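If you do go down this road and want to treat the ordering as a hyperparameter, the sketch below shows one way to compare the two orders, assuming scikit-learn plus the imbalanced-learn package; SMOTE stands in for "fiddle with the data to create new data" and StandardScaler stands in for the preprocessing, both purely as placeholders for whatever you actually have in mind.

```python
# A sketch, assuming scikit-learn and imbalanced-learn; SMOTE and
# StandardScaler are placeholder choices for the synthesis and preprocessing
# steps, not recommendations.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies samplers during fit only
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Order 1: synthesize first, then preprocess real and synthetic data together.
synth_then_prep = Pipeline([
    ("synth", SMOTE(random_state=0)),
    ("prep", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Order 2: preprocess the real data, then synthesize from the preprocessed data.
prep_then_synth = Pipeline([
    ("prep", StandardScaler()),
    ("synth", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Treat the ordering as a hyperparameter: score both with a proper scoring
# rule (log loss here) and keep whichever validates better.
for name, pipe in [("synthesize-then-preprocess", synth_then_prep),
                   ("preprocess-then-synthesize", prep_then_synth)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_log_loss")
    print(f"{name}: mean negative log loss = {scores.mean():.4f}")
```

Note that imbalanced-learn's Pipeline only applies the sampler when fitting, so the validation folds stay untouched; if you roll your own resampling, make sure the synthetic data never leak into the folds you evaluate on.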
Overall, though, the best move is probably not to undersample (or even oversample) at all, and to use proper statistical methods on data that represent the reality of your situation.