
Please correct me if I am wrong. The appropriate order should be:

  1. SMOTE
  2. Feature selection (e.g., by using a wrapper method)
  3. Model selection (e.g., by selecting the model with highest AUC)

Then evaluate the performance of that model on the test set.

Thank you.

Thank you so much @DikranMarsupial for the useful comments. I would like to answer my own question.

The appropriate order is:

  1. Sampling the data (for example, split the data into a training set and a test set, stratifying by the outcome; then a resampling method such as oversampling, undersampling, or SMOTE may be applied to the training set only). A minimal sketch of this step follows.
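
This is only a sketch, not a prescription: it assumes a feature matrix X, a binary outcome y, an 80/20 split, and the SMOTE implementation from the imbalanced-learn package; the names and random_state values are placeholders.

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stratified split keeps the outcome proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Resample the training set only; the test set is left untouched
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)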

  2. Feature selection: by combining selectors. Below is the code from an online course that I imitate:

2a. First, selection with RandomForest

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Recursive feature elimination with a random forest, keeping 12 features
rfe_rf = RFE(estimator=RandomForestClassifier(), n_features_to_select=12, verbose=1)
rfe_rf.fit(X_train, y_train)

# Boolean mask of the features retained by the random forest selector
rf_mask = rfe_rf.support_

2b. Then with a gradient boosting classifier

from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingClassifier

# Repeat the elimination with a gradient boosting classifier
rfe_gb = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=12, verbose=1)
rfe_gb.fit(X_train, y_train)
gb_mask = rfe_gb.support_

2c. Finally, count the votes

import numpy as np

# Each feature gets one vote per selector that retained it (0, 1, or 2)
votes = np.sum([rf_mask, gb_mask], axis=0)

print(votes)

  3. Continue with the selected variables.
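
As a minimal sketch of this last step (assuming X_train and X_test are pandas DataFrames with the same columns that were passed to the selectors), one could keep only the features that both selectors agreed on:

# Keep the features that received a vote from both RFE selectors
final_mask = votes == 2
X_train_sel = X_train.loc[:, final_mask]
X_test_sel = X_test.loc[:, final_mask]
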
  • You're asking for a fish instead of asking to learn to fish. Can you explain why you think this is the right order, and why an alternative ordering seems wrong to you? – Arya McCarthy Jun 03 '22 at 04:43
  • Welcome to Cross Validated! Why do you think you need to do SMOTE or feature selection? – Dave Jun 03 '22 at 04:53
  • Thank you for asking me to elaborate on my question. I think the above sequence is appropriate because the model chosen in step 3 is based on a valid (more balanced) dataset. However, I am concerned that the resampled dataset is only balanced in terms of the outcome; therefore, it might be better if I drop some variables before performing the resampling technique. – sinhvienhamhoc Jun 03 '22 at 05:02
  • @Dave Thank you for welcoming me. I do feature selection to exclude less important variables. This will help reduce overfitting and improve the accuracy of the model. And the training set should be more balanced (by a resampling technique) so that the results will be more reliable. – sinhvienhamhoc Jun 03 '22 at 05:17
  • @sinhvienhamhoc if you use cross-validation for feature or model selection, you need to perform the SMOTE independently in each fold of the cross-validation to prevent information about the test partition leaking into the training partition, see https://stats.stackexchange.com/questions/346534/smote-data-balance-before-or-during-cross-validation/577520#577520 (a pipeline sketch of per-fold resampling is included after this comment thread). However, SMOTE is unlikely to improve the performance of a modern classifier, such as the SVM, provided the hyper-parameters are tuned properly. Why do you need SMOTE? – Dikran Marsupial Jun 03 '22 at 05:30
  • Also, for modern classifiers, feature selection is quite likely to make performance worse rather than better (see https://stats.stackexchange.com/questions/2306/feature-selection-for-final-model-when-performing-cross-validation-in-machine/2317#2317 ) as regularisation is a better way to prevent the over-fitting due to uninformative attributes. – Dikran Marsupial Jun 03 '22 at 05:33
  • Can you explain why AUC is the appropriate performance metric for your application? – Dikran Marsupial Jun 03 '22 at 05:34
  • @Dikran Marsupial Thank you for your comments. First comment: the data has many more controls than cases, so I need SMOTE or other sampling methods to deal with that imbalance. When the dataset is imbalanced, it seems that the model tends to assign observations to the majority group. I do not perform cross-validation for feature or model selection; however, you would perform SMOTE before the selection, wouldn't you? – sinhvienhamhoc Jun 03 '22 at 12:09
  • @DikranMarsupial Third comment: I think all performance metrics are calculated based on the confusion matrix; the matrix will change when we change the threshold, but the AUC will not. For that reason, I use AUC values to choose the final model, and I will report other metrics as well. I am still thinking about your second comment. Thanks and happy weekend. – sinhvienhamhoc Jun 03 '22 at 12:10
  • @sinhvienhamhoc having an imbalanced dataset does not mean you need to use resampling. Most classification methods have no problem with imbalanced datasets. The reason for performing resampling is that misclassification costs are not equal, which is why I asked what metric is important for your application; it is best to choose the metric according to the needs of the task, not technical difficulties with methods. Do you know the costs of false-positives and false-negatives? – Dikran Marsupial Jun 03 '22 at 14:04
  • BTW if you are uncertain about misclassification costs, I'd strongly advise using a probabilistic classifier rather than e.g. an SVM. If you are not using cross-validation, how are you performing the model and feature selection? – Dikran Marsupial Jun 03 '22 at 14:05
  • @DikranMarsupial Thank you for your comments. I will study the costs of false-positives and false-negatives, choose the appropriate metrics for my model, and then use them (instead of AUC) for model selection. – sinhvienhamhoc Jun 05 '22 at 03:34
  • But I have no idea about your comment that "most classification methods have no problems with imbalanced datasets." Would you please let me know where I can find the reference for it? – sinhvienhamhoc Jun 05 '22 at 03:38
  • And, actually, I intend to use a logistic regression model (not an SVM) because I want to draw a nomogram using the rms package. Not using cross-validation, I just select the features with 0-fold cross-validation (is that right?). – sinhvienhamhoc Jun 05 '22 at 03:45
  • @sinhvienhamhoc see my questions here: https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance and here: https://stats.stackexchange.com/questions/559294/are-there-imbalanced-learning-problems-where-re-balancing-re-weighting-demonstra Even when I offered a modest bonus, nobody could show how to diagnose a problem due to imbalance and nobody could give a reproducible example of imbalance negatively affecting accuracy. Generally if the classifier ignores the minority class it is because that is the optimal solution. – Dikran Marsupial Jun 05 '22 at 07:41
  • I don't know what you mean by 0-fold cross-validation. If you mean feature selection on the training set, then it is very likely you will make the performance of your model worse by overfitting the selection criterion. If you are going to do feature selection, you need to have an unbiased performance estimate to determine when to stop and whether you are making your model better or worse. – Dikran Marsupial Jun 05 '22 at 07:43
  • @DikranMarsupial I apologize for the late response to your comment. I don't quite understand what you said. For example, regarding the selection of variables, in the articles I read the authors rely on p-values or a step-wise approach. Then, when I read more of the literature, I realized that there is a difference between inference and prediction in statistics. And what you say seems to be machine learning... I must study much more... – sinhvienhamhoc Jun 08 '22 at 05:26
  • @sinhvienhamhoc when you perform feature selection, no matter how you do it, you will be optimising some statistic evaluated over a finite dataset. If there is any noise in that data (or sampling variation), then there is a chance that the statistic may be optimised in ways that exploit the noise/variation rather than genuinely improve the model of the underlying structure of the data. This is the same for prediction or inference. The more degrees of freedom you have, the easier it is for this type of over-fitting to occur. Feature selection involves one DoF per feature, – Dikran Marsupial Jun 08 '22 at 08:44
  • which means it is vulnerable to making your model worse rather than better. Ridge regression, on the other hand, has only one (continuous) additional DoF and tends to be more robust. This recommendation can be found in Miller's monograph on feature subset selection. Just because others rely on p-values and step-wise selection does not mean it is a good idea, or that it will improve your model. – Dikran Marsupial Jun 08 '22 at 08:46
  • @DikranMarsupial Thank you for your explanation. I understand the idea that feature selection may make the model worse in terms of performance metrics. – sinhvienhamhoc Jun 08 '22 at 18:39
  • May I ask you about the ridge regression that you mentioned? As I read about hyperparameter tuning, logistic regression has 'l1' and 'l2' penalty options. Is that similar to what you said? – sinhvienhamhoc Jun 08 '22 at 18:45
  • Yes, that is correct: L2 is the equivalent of ridge regression and L1 is essentially equivalent to the LASSO method. Both are very useful tools to have in the toolbox! (A short example is sketched after this comment thread.) – Dikran Marsupial Jun 08 '22 at 20:50
  • @DikranMarsupial Thank you very much. – sinhvienhamhoc Jun 09 '22 at 00:25
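
Regarding per-fold SMOTE in the comments above: the following is only a sketch, assuming the imbalanced-learn Pipeline (which applies the sampler to the training portion of each fold only); the logistic regression estimator, the AUC scoring, and the 5 folds are placeholder choices, not recommendations.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE is re-fitted inside each training fold, so nothing about the
# held-out fold leaks into the resampling step
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(scores.mean())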

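Regarding the L1/L2 penalties mentioned at the end of the thread: the sketch below shows penalized logistic regression in scikit-learn, with the single regularisation hyper-parameter C tuned by cross-validation; the C grid, solver, and scaler are illustrative assumptions.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# penalty="l2" corresponds to ridge, penalty="l1" to the LASSO;
# smaller C means stronger regularisation
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", solver="liblinear"),
)

grid = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)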