Please correct me if I am wrong. The appropriate order should be:
- SMOTE
- Feature selection (e.g., by using a wrapper method)
- Model selection (e.g., by selecting the model with highest AUC)
Then evaluate the performance of that model on the test set.
Thank you.
Thank you so much @DikranMarsupial for the useful comments. I would like to answer my own question.
The appropriate order is:
1. Sampling data: first I split the data into a training set and a test set, stratifying by the outcome; then a sampling method such as oversampling, undersampling, or SMOTE may be applied to the training set only.
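As a minimal sketch of this step (the data here are made up, and I use simple random oversampling via scikit-learn's `resample`; SMOTE from the imbalanced-learn package would be applied at the same point):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 90 negatives, 10 positives
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)

# Split first, stratifying by the outcome, so the test set keeps the
# original class ratio and is never resampled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample the minority class in the training set only
# (SMOTE would replace this block in the real workflow)
n_majority = int((y_train == 0).sum())
minority_up = resample(X_train[y_train == 1], replace=True,
                       n_samples=n_majority, random_state=0)
X_train_bal = np.vstack([X_train[y_train == 0], minority_up])
y_train_bal = np.array([0] * n_majority + [1] * n_majority)

print(np.bincount(y_train))      # imbalanced training counts
print(np.bincount(y_train_bal))  # balanced after oversampling
```

The point is that resampling happens strictly after the split, so no synthetic or duplicated observations leak into the test set.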
2. Feature selection, by combining selectors. Below is code from an online course that I am imitating:
2a. First, selection with RandomForest
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfe_rf = RFE(estimator=RandomForestClassifier(), n_features_to_select=12, verbose=1)
rfe_rf.fit(X_train, y_train)
rf_mask = rfe_rf.support_
2b. Then with a gradient boosting classifier
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingClassifier

rfe_gb = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=12, verbose=1)
rfe_gb.fit(X_train, y_train)
gb_mask = rfe_gb.support_
2c. Finally, count the votes
import numpy as np

votes = np.sum([rf_mask, gb_mask], axis=0)
print(votes)
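To make "continue with the selected variables" concrete, one option is to keep only the columns both selectors voted for. A sketch with hypothetical masks (the real `rf_mask` and `gb_mask` come from `rfe_rf.support_` and `rfe_gb.support_` above):

```python
import numpy as np

# Hypothetical boolean support masks from the two RFE selectors
rf_mask = np.array([True, True, False, True, False])
gb_mask = np.array([True, False, False, True, True])

votes = np.sum([rf_mask, gb_mask], axis=0)  # 2 = chosen by both selectors
keep = votes == 2                           # keep only unanimous features

X_train = np.arange(20).reshape(4, 5)       # stand-in for the real X_train
X_train_selected = X_train[:, keep]
print(X_train_selected.shape)               # only the unanimous columns remain
```

Relaxing the threshold to `votes >= 1` would instead keep every feature that at least one selector picked.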
3. Model selection: continue with the selected variables (e.g., by selecting the model with the highest AUC).
I think the above sequence is appropriate because the model chosen in step 3 is trained on a valid (more balanced) dataset. However, I am concerned that the resampled dataset is balanced only with respect to the outcome; it might therefore be better to drop some variables before applying the resampling technique.
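For step 3 itself, a minimal sketch of choosing the model with the highest cross-validated AUC on the training data (the dataset here is synthetic and stands in for the resampled, feature-selected training set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the resampled, feature-selected training set
X_train, y_train = make_classification(n_samples=200, n_features=6, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated AUC for each candidate model
aucs = {name: cross_val_score(m, X_train, y_train, cv=5, scoring="roc_auc").mean()
        for name, m in models.items()}
best_name = max(aucs, key=aucs.get)
print(aucs)
print("selected:", best_name)
```

Only after this choice is made would the winning model be refit and evaluated once on the held-out test set.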
– sinhvienhamhoc Jun 03 '22 at 05:02