
I have a binary classification problem with two classes, 0 and 1. To train an XGBoost classification model, I use a balanced data set (50% 0s, 50% 1s).

In reality, 1s are much more abundant than 0s. After applying my newly trained model to realistically distributed test data, I see very solid recall but poor precision for the less abundant 0 class.

To mitigate this effect, I tried other optimization metrics. In particular, I was interested in optimizing precision, F1 or ROC (AUC).

For ROC I used the following code:

# load required packages (xgboost must be installed for method = "xgbTree")
library(caret)

# load in training data (50/50 balanced)
training <- readRDS("Train_Data.rds")

# 3-fold CV (one repeat); twoClassSummary provides the ROC/AUC metric and needs classProbs = TRUE
fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                           classProbs = TRUE, savePredictions = TRUE,
                           summaryFunction = twoClassSummary)

# Set-up grid search for hyperparameter tuning
tune_grid <- expand.grid(
  nrounds = seq(from = 200, to = 1000, by = 100),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 4, 5, 6, 10),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# train the model, tuning over tune_grid and selecting hyperparameters by ROC (AUC)
xGFit1 <- train(target ~ ., data = training, method = "xgbTree",
                tuneGrid = tune_grid, trControl = fitControl, metric = "ROC")

When using F1 or precision, I replaced twoClassSummary with prSummary and changed the metric in the train() function to either "F" or "Precision".
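For reference, a minimal sketch of that variant (the object names prControl and xGFit2 are only illustrative; prSummary additionally requires the MLmetrics package, and training/tune_grid are reused from above):

# prSummary reports AUC, Precision, Recall and F; it also needs classProbs = TRUE
prControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                          classProbs = TRUE, savePredictions = TRUE,
                          summaryFunction = prSummary)

# optimize F1 ("F"); use metric = "Precision" to optimize precision instead
xGFit2 <- train(target ~ ., data = training, method = "xgbTree",
                tuneGrid = tune_grid, trControl = prControl, metric = "F")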

Unfortunately, while the absolute numbers in my confusion matrix vary a little bit, the recall and precision values remain unchanged (when rounded to whole percentages).

Did I do something wrong?

When optimizing for ROC, F1 or precision, would it be better to use a realistically distributed training set in order to see an effect on the test data?

Arne
  • I believe that "Why is accuracy not the best measure for assessing classification models?" addresses your question, because every problem with accuracy that thread examines also holds for F1, precision and most other KPIs. (ROC is a slightly different topic.) I would in particular say that my advice there would be helpful. On balancing, see "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?" – Stephan Kolassa Apr 30 '20 at 14:16
  • Alright, this is very good reasoning for why my metrics do not deliver different results. As a short follow-up: do you have a go-to way to address a classification problem in which you try to maximize precision (as well as recall) for imbalanced data, leaving aside any manipulation of the threshold value? – Arne May 05 '20 at 06:24
  • To be honest, I have never thought about this, because it seems like optimizing the wrong KPI. We should aim for calibrated probabilistic predictions, not for high precision. I like analogies: maximizing precision instead of improving probabilistic predictions seems to me like keeping a patient's temperature at precisely 37°C, regardless of his overall health - even if we have to heat the corpse to do so. – Stephan Kolassa May 05 '20 at 17:20
  • Sorry for yet another follow-up: are you an R user? Doesn't R optimize for exactly such simple KPIs, which you do not see as fit for model evaluation? In case you are using R/caret, how do you define your train() call, and what is your evaluation metric? – Arne May 20 '20 at 06:09
  • Yes, I use R. I do little classification, but I am not aware of any base functions that optimize for F1 score or accuracy - classifiers usually optimize the likelihood, which is a form of the log score, which is a proper scoring rule (and which I would also use in training). User-contributed packages can, of course, do something cough different. – Stephan Kolassa May 20 '20 at 07:50
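
Picking up on that last comment, here is a minimal sketch of what tuning against the log loss (a proper scoring rule) could look like in caret. This is not code from the thread; the object names llControl and xGFitLL are illustrative, and training/tune_grid are reused from the question:

# mnLogLoss is caret's built-in log-loss summary; it also needs classProbs = TRUE
llControl <- trainControl(method = "repeatedcv", number = 3, repeats = 1,
                          classProbs = TRUE, savePredictions = TRUE,
                          summaryFunction = mnLogLoss)

# select hyperparameters by the smallest log loss (hence maximize = FALSE)
xGFitLL <- train(target ~ ., data = training, method = "xgbTree",
                 tuneGrid = tune_grid, trControl = llControl,
                 metric = "logLoss", maximize = FALSE)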

0 Answers