I am trying to train models to predict whether somebody will get breast cancer. It is a binary classification problem, using limited features that replicate the data a patient's primary care physician would have on them.
I have the following steps:
- Fill in missing values: the mean for continuous columns, 10 for discrete columns (the discrete values run 0-6, so 10 falls outside the observed range)
- Feature selection (correlation-based, computed on the whole dataset); the first sketch after this list shows the imputation and selection steps
- Train/test split (80/20)
- SMOTENC to balance the classes (originally about 4,000 vs 50,000), applied to the training data only
- One-hot encode the discrete columns in train and test
- Standard-scale train and test
- Then --> 10-fold cross-validation to determine the best hyperparameters for KNN (k = 1-30) and a Logistic Regression classifier (solver, penalty, C); the second sketch below shows these steps
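For reference, here is roughly what the imputation and feature-selection steps look like. This is a simplified sketch: the file name, column names, label name, and the 0.05 threshold are all placeholders, not my real ones.

```python
# Sketch of the imputation + correlation-based selection steps.
import pandas as pd

cont_cols = ["age", "bmi"]                # hypothetical continuous features
disc_cols = ["family_history", "parity"]  # hypothetical discrete features

df = pd.read_csv("patients.csv")          # placeholder for the real data

# Mean-impute continuous columns; fill discrete ones with 10,
# which lies outside their 0-6 value range.
df[cont_cols] = df[cont_cols].fillna(df[cont_cols].mean())
df[disc_cols] = df[disc_cols].fillna(10)

# Correlation with the label, computed on all rows (before any split).
corr = df.corr(numeric_only=True)["cancer"].abs()
selected = corr[corr > 0.05].index.drop("cancer")
X, y = df[list(selected)], df["cancer"]
```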
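And the rest of the pipeline, again simplified. It continues from the sketch above (so it assumes both column groups survive the selection step), and the hyperparameter grids shown are the ones I am searching over.

```python
# Continues from the sketch above (X, y, cont_cols, disc_cols already set).
from imblearn.over_sampling import SMOTENC
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# SMOTENC on the training data only; it takes the positional indices
# of the discrete (categorical) columns.
cat_idx = [X_train.columns.get_loc(c) for c in disc_cols]
X_res, y_res = SMOTENC(categorical_features=cat_idx,
                       random_state=0).fit_resample(X_train, y_train)

# One-hot encode the discrete columns, standard-scale the continuous ones.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), disc_cols),
    ("scale", StandardScaler(), cont_cols),
])

# 10-fold CV for hyperparameters, run on the SMOTE-resampled training data.
lr = GridSearchCV(
    Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))]),
    {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l2"],
     "clf__solver": ["lbfgs", "liblinear"]},
    scoring="roc_auc", cv=10)
lr.fit(X_res, y_res)

knn = GridSearchCV(
    Pipeline([("pre", pre), ("clf", KNeighborsClassifier())]),
    {"clf__n_neighbors": list(range(1, 31))},
    scoring="roc_auc", cv=10)
knn.fit(X_res, y_res)

print("LR test AUROC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))
print("KNN test AUROC:", roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))
```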
I am getting decent AUROC on the training data (0.7 for LR, 0.85 for KNN), but the models seem to have completely overfit: I am getting 0.5 AUROC on the test data.

I believe the problem is SMOTE. I have very class-imbalanced data (7:93), and using SMOTE seems to make the models overfit. Without SMOTE they are just poor models, and I get about 0.52 AUROC on both train and test. I am at a loss as to what I can do next or what to change about what I've done. Any suggestions, pages to read, or advice in general would be greatly appreciated.

I am relatively new to ML, and this is for a large university project. I cannot work out whether the problem is my data (perhaps it just won't train well given the minimal features I have) or something I am doing. I am currently training a Random Forest model, as I have read that it handles imbalanced data better, and I also plan to look into XGBoost; a sketch of the class-weighting idea I have been reading about follows below.
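In case it helps, this is the kind of class-weighting approach I have been reading about as an alternative to SMOTE. A minimal sketch, assuming default hyperparameters otherwise; scale_pos_weight just encodes my 93:7 class ratio.

```python
# Minimal sketch: reweight the minority class instead of oversampling it.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights classes inversely to their frequency.
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0)
lr = LogisticRegression(class_weight="balanced", max_iter=1000)

# If xgboost is installed, scale_pos_weight plays the same role there:
# from xgboost import XGBClassifier
# xgb = XGBClassifier(scale_pos_weight=93 / 7, eval_metric="auc")
```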