I have unbalanced data, so I want to oversample observations from the minority class and then apply logistic regression to the training set. After that, I would like to perform cross-validation. My question is: when should I separate the data into training and test sets? After oversampling? Any help will be appreciated.
-
Welcome to Cross Validated! Statisticians do not see class imbalance as such a problem, and there is no need to solve a non-problem. It might be helpful if you say why you find the imbalance problematic. https://stats.stackexchange.com/questions/357466 https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/ https://stats.stackexchange.com/questions/558942/ https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jun 12 '22 at 16:13
-
Thank you, Dave. The problem is that even though my model has a good accuracy score, the sensitivity is 0%. That is why I would like to use an oversampling method. – lola Jun 12 '22 at 16:16
-
The links I posted discuss in detail why accuracy is a highly problematic metric. – Dave Jun 12 '22 at 16:19
-
Yes, but my question is when I should use the ROSE method. For example, if I use logistic regression then my code would look like: library(caret); crossValSettings <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE); crossVal <- train(as.factor(admit) ~ gpa + gre + rank, data = myData, family = "binomial", method = "glm", trControl = crossValSettings, tuneLength = 2); pred <- predict(crossVal, newdata = myData); confusionMatrix(data = pred, myData$admit). How can I fit the ROSE method in there? – lola Jun 12 '22 at 16:28
-
Proper statistical methods say that you probably shouldn’t. Please read the links. The Twitter post by Frank Harrell is the most concise. // It’s really hard to say what the best practice is for applying something that’s generally a poor practice. – Dave Jun 12 '22 at 16:46
-
How large is the dataset? If you have sufficient data, it is possible that this is the optimal behaviour from the perspective of accuracy; see my question here: https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance . It is unlikely that resampling will make the accuracy of a logistic regression model better. If the dataset is small enough for maximum likelihood to be significantly biased, there is unlikely to be enough data to estimate how much resampling is required. – Dikran Marsupial Jun 12 '22 at 17:11
-
BTW, if you perform resampling it must be done on the training set for each fold of the cross-validation, but not on the test partition. Otherwise the test partition is not representative of operational conditions. – Dikran Marsupial Jun 12 '22 at 17:13
-
Note I also asked a question about whether there were reproducible examples where rebalancing improves accuracy. There were no answers, even though a modest bonus was on offer: https://stats.stackexchange.com/questions/559294/are-there-imbalanced-learning-problems-where-re-balancing-re-weighting-demonstra That is some evidence that class imbalance does not justify resampling on that basis. – Dikran Marsupial Jun 12 '22 at 17:18
1 Answer
As some links I posted discuss, class imbalance usually is not a problem. Therefore, attempts to solve a non-problem are somewhere between superfluous and damaging. Consequently, it is difficult to suggest where you would apply such a technique.
However, I have applied poor practices in regression models specifically to show how they fail, so there is a place for knowing where in the workflow such a practice would occur.
Any messing with the data would come after you have set aside the out-of-sample data, since out-of-sample data are there to mimic the real-world application of your model to data that might not even exist yet (such as Siri or Alexa being expected to do speech recognition on words spoken by people who have yet to speak their first words). This is what Dikran Marsupial means in his comment about doing the resampling in the training folds of a cross-validation but not in the test partition.
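To make that concrete, here is a minimal sketch of fold-wise oversampling coded by hand, assuming a data frame myData with the admit, gpa, gre, and rank variables from your comment (names carried over purely for illustration). ROSE is applied to the training fold only; the held-out fold is scored untouched:

    # Minimal sketch: resample inside each training fold only.
    # Assumes myData with a binary outcome `admit`, as in the question.
    library(ROSE)

    myData$admit <- factor(myData$admit)  # ROSE and glm want a two-level factor

    set.seed(42)
    k <- 10
    folds <- sample(rep(seq_len(k), length.out = nrow(myData)))

    for (i in seq_len(k)) {
      trainFold <- myData[folds != i, ]
      testFold  <- myData[folds == i, ]   # never touched by ROSE

      # Oversample/synthesize only within the training fold
      balanced <- ROSE(admit ~ gpa + gre + rank, data = trainFold)$data

      fit  <- glm(admit ~ gpa + gre + rank, data = balanced, family = binomial)
      prob <- predict(fit, newdata = testFold, type = "response")
      # ...accumulate predictions/metrics on the untouched test fold here...
    }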
I don’t know where exactly that is in your particular package. It might be that you will have to code the splitting and cross-validation yourself (probably not, but maybe). However, it would be cheating to do ROSE on data that are supposed to be hidden.
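That said, if I am reading caret's "Subsampling for Class Imbalances" vignette correctly, trainControl() does take a sampling argument (valid values include "down", "up", "smote", and "rose"), and the resampling is then applied inside each training resample only, never to the held-out partition. If you want to see where such a step would plug in, a sketch reusing the hypothetical variables from your comment might look like:

    # Sketch using caret's built-in subsampling hook; per caret's
    # vignette, sampling happens inside each training resample only.
    library(caret)

    myData$admit <- factor(myData$admit)

    crossValSettings <- trainControl(method   = "repeatedcv",
                                     number   = 10,
                                     repeats  = 5,
                                     sampling = "rose",  # requires the ROSE package
                                     savePredictions = TRUE)

    crossVal <- train(admit ~ gpa + gre + rank,
                      data      = myData,
                      method    = "glm",
                      family    = "binomial",
                      trControl = crossValSettings)

    crossVal  # resampled performance, estimated on the untouched folds

Note that this reports performance from the held-out folds, rather than re-predicting the very rows the model was fit on as the code in your comment does; predicting on data the model has already seen is exactly the kind of cheating to avoid.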