Good morning,
I am doing the following procedure:
Split into Train and Test Datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_pre,
    y,
    random_state=0,
    stratify=y,
    train_size=training_fraction
)
# (imputation of X_train into X_imputed_train_df happens here; code omitted)
Apply SMOTE or Some Other Balancing Algorithm
# balancing_algorithm is an imblearn sampler instance,
# e.g. SMOTE() or ADASYN() from imblearn.over_sampling
X_imputed_train_df, y_train = balancing_algorithm.fit_resample(
    X_imputed_train_df,
    y_train
)
Apply Sequential Feature Selection
from sklearn.model_selection import StratifiedShuffleSplit
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sss = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)
sfs = SFS(estimator=lr1,
          k_features='best',
          forward=boolean_sfs,
          floating=False,
          scoring='f1',
          cv=sss)
sfs = sfs.fit(X_imputed_train_df, y_train)
selected_feature_indices = list(sfs.k_feature_idx_)
# .iloc preserves the original column names, so no renaming is needed afterwards
sfsFinal = X_imputed_train_df.iloc[:, selected_feature_indices]
Do Hyperparameter Tuning
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe_lr = Pipeline([('lr2', lr2)])
lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,
                              scoring='f1',
                              cv=sss)
lr_grid_search.fit(sfsFinal, y_train)
My question is: I am applying SMOTE once to my whole training set up front, but not separately to the training folds of the cross-validation. Is this a must? If so, how could I incorporate it into my code?
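From what I have read, imblearn's Pipeline is supposed to apply fit_resample only to the training part of each fold during cross-validation, so I imagine the fix looks something like the sketch below (the up-front fit_resample call from step 2 would then be dropped; SMOTE(random_state=0) is just an illustrative choice, and lr1, lr2, boolean_sfs, sss, and lr_param_grid are the objects defined above). Is this the right idea?
# Sketch: wrap the sampler and the estimator in an imblearn Pipeline, so the
# resampling runs only when the pipeline is fit, i.e. only on the training
# part of each CV fold; the validation folds stay un-resampled.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# Feature selection: each SFS training fold is resampled internally
sfs = SFS(estimator=ImbPipeline([('smote', SMOTE(random_state=0)),
                                 ('lr1', lr1)]),
          k_features='best',
          forward=boolean_sfs,
          floating=False,
          scoring='f1',
          cv=sss)
sfs = sfs.fit(X_imputed_train_df, y_train)  # un-resampled training data
sfsFinal = X_imputed_train_df.iloc[:, list(sfs.k_feature_idx_)]

# Hyperparameter tuning: same idea inside GridSearchCV
pipe_lr = ImbPipeline([('smote', SMOTE(random_state=0)),
                       ('lr2', lr2)])
lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,  # keys prefixed with 'lr2__'
                              scoring='f1',
                              cv=sss)
lr_grid_search.fit(sfsFinal, y_train)
That way, if I understand correctly, every validation fold and the held-out test set would only ever contain real, un-resampled observations.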
Thank you very much.
Isn't the end-point getting an F1 score from your test data? See what you think about that after you read the link from Harrell or Kolassa. :) – Dave Jan 31 '24 at 10:10