I am trying to train models to predict whether somebody will get breast cancer. It is a binary classification problem, using limited features that replicate the data a patient's primary care physician would have on them.
I have the following steps:
- Fill in missing values: the mean for continuous columns, 10 for discrete columns (the discrete values run 0-6, so 10 falls outside the observed range)
- Feature selection (correlation-based, computed on the whole dataset); the first sketch after this list shows the imputation and selection steps
- Train/test split (80/20)
- SMOTENC to balance the classes (originally about 4,000 vs 50,000), applied to the training data only
- One-hot encode the discrete columns in train and test
- Standard-scale train and test
- Then --> 10-fold cross-validation to determine the best hyperparameters for KNN (k = 1-30) and a Logistic Regression classifier (solver, penalty, C); the second sketch below shows these steps
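For reference, here is roughly what the imputation and feature-selection steps look like. This is a simplified sketch: the file name, column names, label name, and the 0.05 threshold are all placeholders, not my real ones.

```python
# Sketch of the imputation + correlation-based selection steps.
import pandas as pd

cont_cols = ["age", "bmi"]                # hypothetical continuous features
disc_cols = ["family_history", "parity"]  # hypothetical discrete features

df = pd.read_csv("patients.csv")          # placeholder for the real data

# Mean-impute continuous columns; fill discrete ones with 10,
# which lies outside their 0-6 value range.
df[cont_cols] = df[cont_cols].fillna(df[cont_cols].mean())
df[disc_cols] = df[disc_cols].fillna(10)

# Correlation with the label, computed on all rows (before any split).
corr = df.corr(numeric_only=True)["cancer"].abs()
selected = corr[corr > 0.05].index.drop("cancer")
X, y = df[list(selected)], df["cancer"]
```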
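And the rest of the pipeline, again simplified. It continues from the sketch above (so it assumes both column groups survive the selection step), and the hyperparameter grids shown are the ones I am searching over.

```python
# Continues from the sketch above (X, y, cont_cols, disc_cols already set).
from imblearn.over_sampling import SMOTENC
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# SMOTENC on the training data only; it takes the positional indices
# of the discrete (categorical) columns.
cat_idx = [X_train.columns.get_loc(c) for c in disc_cols]
X_res, y_res = SMOTENC(categorical_features=cat_idx,
                       random_state=0).fit_resample(X_train, y_train)

# One-hot encode the discrete columns, standard-scale the continuous ones.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), disc_cols),
    ("scale", StandardScaler(), cont_cols),
])

# 10-fold CV for hyperparameters, run on the SMOTE-resampled training data.
lr = GridSearchCV(
    Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))]),
    {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l2"],
     "clf__solver": ["lbfgs", "liblinear"]},
    scoring="roc_auc", cv=10)
lr.fit(X_res, y_res)

knn = GridSearchCV(
    Pipeline([("pre", pre), ("clf", KNeighborsClassifier())]),
    {"clf__n_neighbors": list(range(1, 31))},
    scoring="roc_auc", cv=10)
knn.fit(X_res, y_res)

print("LR test AUROC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))
print("KNN test AUROC:", roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))
```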
I am getting decent AUROC on the training data (0.7 for LR, 0.85 for KNN), but the models seem to have completely overfit: I am getting 0.5 AUROC on the test data.

I believe the problem is SMOTE. I have very class-imbalanced data (7:93), and using SMOTE seems to make the models overfit. Without SMOTE they are just poor models, and I get about 0.52 AUROC on both train and test. I am at a loss as to what I can do next or what to change about what I've done. Any suggestions, pages to read, or advice in general would be greatly appreciated.

I am relatively new to ML, and this is for a large university project. I cannot work out whether the problem is my data (perhaps it just won't train well given the minimal features I have) or something I am doing. I am currently training a Random Forest model, as I have read that it handles imbalanced data better, and I also plan to look into XGBoost; a sketch of the class-weighting idea I have been reading about follows below.
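In case it helps, this is the kind of class-weighting approach I have been reading about as an alternative to SMOTE. A minimal sketch, assuming default hyperparameters otherwise; scale_pos_weight just encodes my 93:7 class ratio.

```python
# Minimal sketch: reweight the minority class instead of oversampling it.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights classes inversely to their frequency.
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0)
lr = LogisticRegression(class_weight="balanced", max_iter=1000)

# If xgboost is installed, scale_pos_weight plays the same role there:
# from xgboost import XGBClassifier
# xgb = XGBClassifier(scale_pos_weight=93 / 7, eval_metric="auc")
```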