Good morning,
I am doing the following procedure:
Split into Train and Test Datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_pre,
    y,
    random_state=0,
    stratify=y,
    train_size=training_fraction
)
# (imputation of X_train into X_imputed_train_df happens here; code omitted)
Apply SMOTE or Some Other Balancing Algorithm
# balancing_algorithm is an imblearn sampler instance,
# e.g. SMOTE() or ADASYN() from imblearn.over_sampling
X_imputed_train_df, y_train = balancing_algorithm.fit_resample(
    X_imputed_train_df,
    y_train
)
Apply Sequential Feature Selection
from sklearn.model_selection import StratifiedShuffleSplit
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sss = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)
sfs = SFS(estimator=lr1,
          k_features='best',
          forward=boolean_sfs,
          floating=False,
          scoring='f1',
          cv=sss)
sfs = sfs.fit(X_imputed_train_df, y_train)
selected_feature_indices = list(sfs.k_feature_idx_)
# .iloc preserves the original column names, so no renaming is needed afterwards
sfsFinal = X_imputed_train_df.iloc[:, selected_feature_indices]
Do Hyperparameter Tuning
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe_lr = Pipeline([('lr2', lr2)])
lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,
                              scoring='f1',
                              cv=sss)
lr_grid_search.fit(sfsFinal, y_train)
My question is: I am applying SMOTE once to my whole training set up front, but not separately to the training folds of the cross-validation. Is this a must? If so, how could I incorporate it into my code?
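From what I have read, imblearn's Pipeline is supposed to apply fit_resample only to the training part of each fold during cross-validation, so I imagine the fix looks something like the sketch below (the up-front fit_resample call from step 2 would then be dropped; SMOTE(random_state=0) is just an illustrative choice, and lr1, lr2, boolean_sfs, sss, and lr_param_grid are the objects defined above). Is this the right idea?
# Sketch: wrap the sampler and the estimator in an imblearn Pipeline, so the
# resampling runs only when the pipeline is fit, i.e. only on the training
# part of each CV fold; the validation folds stay un-resampled.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# Feature selection: each SFS training fold is resampled internally
sfs = SFS(estimator=ImbPipeline([('smote', SMOTE(random_state=0)),
                                 ('lr1', lr1)]),
          k_features='best',
          forward=boolean_sfs,
          floating=False,
          scoring='f1',
          cv=sss)
sfs = sfs.fit(X_imputed_train_df, y_train)  # un-resampled training data
sfsFinal = X_imputed_train_df.iloc[:, list(sfs.k_feature_idx_)]

# Hyperparameter tuning: same idea inside GridSearchCV
pipe_lr = ImbPipeline([('smote', SMOTE(random_state=0)),
                       ('lr2', lr2)])
lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,  # keys prefixed with 'lr2__'
                              scoring='f1',
                              cv=sss)
lr_grid_search.fit(sfsFinal, y_train)
That way, if I understand correctly, every validation fold and the held-out test set would only ever contain real, un-resampled observations.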
Thank you very much.
Isn't the end-point getting an F1 score from your test data? See what you think about that after you read the link from Harrell or Kolassa. :) – Dave Jan 31 '24 at 10:10