Good morning,

I am doing the following procedure:

Split into a Train and a Test Dataset

X_train, X_test, y_train, y_test = train_test_split(
    X_pre,
    y,
    random_state=0,
    stratify=y,
    train_size=training_fraction
)

Apply SMOTE or Some other Balancing Algorithm

X_imputed_train_df, y_train = balancing_algorithm.fit_resample(
    X_imputed_train_df,
    y_train
)

Apply Sequential Feature Selection

sss = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)
sfsLR = SFS(estimator=lr1,
            k_features='best',
            forward=boolean_sfs,
            floating=False,
            scoring='f1',
            cv=sss)

sfs = sfsLR.fit(X_ADASYN3, labels6)
selected_feature_indices = list(sfs.k_feature_idx_)
sfsFinal = X_ADASYN3.iloc[:, selected_feature_indices]
sfsFinal.columns = X_ADASYN3.columns[selected_feature_indices]

Do Hyperparameter Tuning

pipe_lr = Pipeline([('lr2', lr2)])
pipe_lr.fit(sfsFinal, labels6)
lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,
                              scoring='f1',
                              cv=sss)

My question is: I am applying SMOTE once, before cross-validation, rather than only to the training portion of each cross-validation split. Is resampling inside each split a must? If so, how could I incorporate it into my code?

Thank you very much.

Dave