
I am new to data science & machine learning.

I am using the Weka platform to work on a classification problem with an imbalanced dataset. My question is: can I apply a feature selection method to a balanced copy of the dataset and then use the resulting subset of features on the original (imbalanced) dataset?

If my question is not clear, I will explain it with the following detailed steps:

  1. I made two copies of the dataset: the original imbalanced dataset and a balanced copy.
  2. I applied a feature selection method to the balanced copy. (At the end of this step, a subset of features is selected.)
  3. In the imbalanced copy, I retained the selected features and removed the unselected ones.

For example: Assume that you have an imbalanced dataset with five features: a, b, c, d, and e. You balanced the entire dataset and then applied a feature selection method, which selected three features: a, b, and c. After that, you went back to the original (imbalanced) dataset and removed features d and e. Then you completed your procedures on the imbalanced dataset with features a, b, and c.
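To make this concrete, here is a rough sketch of the procedure using Weka's Java API (I actually work in the Weka GUI; the file name "data.arff", the supervised Resample filter, and the CFS + BestFirst feature selection are only placeholders for whatever oversampling and feature selection methods are actually used):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;
import weka.filters.unsupervised.attribute.Remove;

public class SelectOnBalancedCopy {
    public static void main(String[] args) throws Exception {
        // Load the original (imbalanced) dataset; "data.arff" is a placeholder name.
        Instances original = DataSource.read("data.arff");
        original.setClassIndex(original.numAttributes() - 1);

        // Step 1: balanced copy. The supervised Resample filter with a uniform class
        // bias stands in here for whatever oversampling method is actually used.
        Resample balance = new Resample();
        balance.setBiasToUniformClass(1.0);
        balance.setSampleSizePercent(100.0);
        balance.setInputFormat(original);
        Instances balanced = Filter.useFilter(original, balance);

        // Step 2: feature selection on the balanced copy only (CFS + BestFirst as an example).
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(balanced);
        int[] selected = selector.selectedAttributes(); // 0-based, includes the class index

        // Step 3: keep only the selected attributes (plus the class) in the ORIGINAL dataset.
        Remove keepSelected = new Remove();
        keepSelected.setAttributeIndicesArray(selected);
        keepSelected.setInvertSelection(true); // invert = keep these, remove the rest
        keepSelected.setInputFormat(original);
        Instances reduced = Filter.useFilter(original, keepSelected);

        System.out.println(reduced.numAttributes() + " attributes retained in the imbalanced data");
    }
}
```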

Is this procedure correct?

Muneera
  • How did you get the balanced dataset? Why do you want to do the above? – Tim Feb 10 '23 at 08:31
  • By using an oversampling technique. Because I want to apply these two things:
    1. cross-validation on the imbalanced dataset;
    2. feature selection on the same dataset, but after balancing it.
    – Muneera Feb 10 '23 at 08:43

1 Answer


It is usually best to train a model on a sample that is close in distribution to the population you later want to apply the trained model to. Of course, some deviations between the distributions will always happen, simply by sampling variance. However, strong differences between the two datasets can lead to a model that is biased, i.e., systematically wrong. This can happen, for instance, if you purposely pick a training sample that is balanced (often called "oversampling the minority class" and similar). See Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
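To illustrate the effect, here is a minimal sketch using Weka's Java API (placeholder file name, logistic regression, and the supervised Resample filter standing in for the oversampling step): the model trained on the balanced copy will typically push its predicted probabilities towards 50%, i.e., it is miscalibrated for the original class distribution, while the model trained on the original data tracks the true base rate.

```java
import weka.classifiers.functions.Logistic;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class BalancedTrainingShift {
    public static void main(String[] args) throws Exception {
        // "data.arff" is a placeholder for the original, imbalanced dataset.
        Instances original = DataSource.read("data.arff");
        original.setClassIndex(original.numAttributes() - 1);

        // Balanced copy (supervised Resample with uniform class bias as a stand-in
        // for whatever oversampling method is used).
        Resample balance = new Resample();
        balance.setBiasToUniformClass(1.0);
        balance.setInputFormat(original);
        Instances balanced = Filter.useFilter(original, balance);

        // Same model, two training sets.
        Logistic trainedOnOriginal = new Logistic();
        trainedOnOriginal.buildClassifier(original);
        Logistic trainedOnBalanced = new Logistic();
        trainedOnBalanced.buildClassifier(balanced);

        // Compare the observed frequency of the first class with the mean predicted
        // probability each model assigns to it on the original data (in-sample,
        // purely to illustrate the shift in predicted probabilities).
        double baseRate = 0, pOriginal = 0, pBalanced = 0;
        int n = original.numInstances();
        for (int i = 0; i < n; i++) {
            Instance inst = original.instance(i);
            baseRate += (inst.classValue() == 0.0) ? 1.0 : 0.0;
            pOriginal += trainedOnOriginal.distributionForInstance(inst)[0];
            pBalanced += trainedOnBalanced.distributionForInstance(inst)[0];
        }
        System.out.printf("base rate %.3f | trained on original %.3f | trained on balanced %.3f%n",
                baseRate / n, pOriginal / n, pBalanced / n);
    }
}
```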

Stephan Kolassa
  • I have one imbalanced dataset (with about 1500 instances and 35 features); it can be considered a sample. I ran my experiment on it: I balanced it, then applied a FS method, then removed the unselected features from the imbalanced dataset. – Muneera Feb 10 '23 at 07:43
  • I edited my question and added an example. – Muneera Feb 10 '23 at 07:52