
I'm currently hyperparameter tuning my model and returning the model with the lowest error. Before I start the hyperparameter tuning process, I make sure my validation and test data are balanced by removing rows of the class that occurs most often. This is that code:

import numpy as np

#Get the class counts
vali_weight = np.unique(y_validation, return_counts=True)[1]
test_weight = np.unique(y_test, return_counts=True)[1]

#Calculate how many rows need to be removed
vali_remove_count = vali_weight[0] - vali_weight[1]
test_remove_count = test_weight[0] - test_weight[1]

#Re-merge data
#Validation
xv = X_validation.copy()
xv["TARGET"] = y_validation
xv = xv.drop(xv.query('TARGET == 0').sample(vali_remove_count).index)

#Test
xt = X_test.copy()
xt["TARGET"] = y_test
xt = xt.drop(xt.query('TARGET == 0').sample(test_remove_count).index)
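(Note on reproducibility: I don't seed sample here, so a different subset of rows is dropped on every run; pinning random_state would make the evaluation sets stable across runs. A minimal variant, seed value arbitrary:)

#Deterministic variant of the same undersampling (seed value is arbitrary)
xv = xv.drop(xv.query('TARGET == 0').sample(vali_remove_count, random_state=42).index)
xt = xt.drop(xt.query('TARGET == 0').sample(test_remove_count, random_state=42).index)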

#Re-split data
y_validation = xv["TARGET"]
xv.drop(columns=["TARGET"], inplace=True)
X_validation = xv.copy()

y_test = xt["TARGET"]
xt.drop(columns=["TARGET"], inplace=True)
X_test = xt.copy()

#Get the class counts again to confirm the balance
vali_weight = np.unique(y_validation, return_counts=True)[1]
test_weight = np.unique(y_test, return_counts=True)[1]
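As a quick sanity check that the undersampling worked, both counts should now be equal; a minimal check:

#Both classes should now occur equally often
assert vali_weight[0] == vali_weight[1], "validation set is still imbalanced"
assert test_weight[0] == test_weight[1], "test set is still imbalanced"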

For the training data, I'm using sample weights during the training process:

from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
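For context, compute_sample_weight with class_weight='balanced' assigns each sample the weight n_samples / (n_classes * count(its class)), so minority-class rows are upweighted. A toy example:

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

#With three 0s and one 1: class 0 -> 4/(2*3) ≈ 0.667, class 1 -> 4/(2*1) = 2.0
print(compute_sample_weight(class_weight='balanced', y=np.array([0, 0, 0, 1])))
#[0.66666667 0.66666667 0.66666667 2.        ]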

After the tuning is complete, I train another model with the best parameters found during tuning to validate that everything is correct:

from xgboost import XGBClassifier

clf = XGBClassifier(objective="binary:logistic",
                    booster="gbtree",
                    max_depth=bp['max_depth'],
                    gamma=bp['gamma'],
                    max_leaves=bp['max_leaves'],
                    reg_alpha=bp['reg_alpha'],
                    reg_lambda=bp['reg_lambda'],
                    colsample_bytree=bp['colsample_bytree'],
                    min_child_weight=bp['min_child_weight'],
                    learning_rate=bp['learning_rate'],
                    n_estimators=200,  #bp['n_estimators']
                    subsample=bp['subsample'],
                    random_state=bp['seed'])

sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

evaluation = [(X_train, y_train), (X_validation, y_validation)]

clf.set_params(
    eval_metric=['aucpr', 'logloss'],
    early_stopping_rounds=100
).fit(X_train, y_train,
      sample_weight=sample_weights,
      eval_set=evaluation,
      verbose=True)
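For reference, this is roughly how the logloss plot at the bottom is produced (a minimal sketch; in xgboost's sklearn wrapper, evals_result() keys follow the order of eval_set, so validation_0 is the training set and validation_1 is the validation set):

import matplotlib.pyplot as plt

results = clf.evals_result()  #per-iteration metrics for each eval_set entry
rounds = range(len(results['validation_0']['logloss']))
plt.plot(rounds, results['validation_0']['logloss'], label='Training')
plt.plot(rounds, results['validation_1']['logloss'], label='Validation')
plt.xlabel('Boosting round')
plt.ylabel('Logloss')
plt.legend()
plt.show()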

train_pred = clf.predict(X_train)
vali_pred = clf.predict(X_validation)
test_pred = clf.predict(X_test)

from sklearn.metrics import mean_absolute_error, accuracy_score, classification_report

train_err = mean_absolute_error(y_train, train_pred)
train_acc = accuracy_score(y_train, train_pred)
vali_err = mean_absolute_error(y_validation, vali_pred)
vali_acc = accuracy_score(y_validation, vali_pred)
test_err = mean_absolute_error(y_test, test_pred)
test_acc = accuracy_score(y_test, test_pred)

print(f"Train MAE: {train_err}")
print(f"Train ACC: {train_acc}")
print("--------------------------")
print(f"Validation MAE: {vali_err}")
print(f"Validation ACC: {vali_acc}")
print("--------------------------")
print(f"Test MAE: {test_err}")
print(f"Test ACC: {test_acc}")
print("--------------------------")
print(classification_report(y_test, test_pred))
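(Side note on my metrics: with hard 0/1 predictions, mean absolute error is just the misclassification rate, so each MAE/ACC pair above always sums to 1. Probability-based scores are arguably more informative for an imbalanced problem; a minimal sketch:)

from sklearn.metrics import log_loss, average_precision_score

test_proba = clf.predict_proba(X_test)[:, 1]  #positive-class probabilities
print(f"Test logloss: {log_loss(y_test, test_proba)}")
print(f"Test average precision: {average_precision_score(y_test, test_proba)}")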

I am consistently getting little to no movement in my validation logloss, but I can see the training logloss is behaving as expected. Without looking at my data (it's private), what could be the cause of this issue?

[Logloss plot: blue = training, orange = validation]
