I have built a logistic regression model in R. The class that I want to predict is very unbalanced (99 vs 1).
My first finding is that this logistic model does a better job if I train it on a balanced (50 - 50) train set instead of on the whole, unbalanced (99 - 1) train set. Judging by sources on the internet, this is a common way to deal with unbalanced data (source).
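For context, I created the balanced train set by downsampling the majority class, roughly like this (a simplified sketch; the object and column names full_train and target are placeholders for my real data):

set.seed(42)  # for reproducibility of the downsampling

minority <- full_train[full_train$target == 1, ]
majority <- full_train[full_train$target == 0, ]

# Keep all minority cases plus an equally sized random sample of majority cases
majority_sample <- majority[sample(nrow(majority), nrow(minority)), ]
balanced_train  <- rbind(minority, majority_sample)

# Fit the logistic regression on the balanced sample
mylogit <- glm(target ~ ., data = balanced_train, family = binomial)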
But I have doubts about the next two questions:
First question: To assess my model's performance, I used a confusion matrix. I played around with the threshold (when a prediction is classified as 1 or as 0). See this code:
library(caret)
predictions <- predict(mylogit, test_set, type = "response")
# confusionMatrix() expects factors with matching levels
confusionMatrix(data = factor(as.numeric(predictions > 0.5), levels = c(0, 1)),
                reference = factor(test_set$target, levels = c(0, 1)))
# Here I played around with the 0.5 cutoff, to decide when
# my model is performing best on my test set.
So I tweaked the 0.5 cutoff until I had the best confusion matrix score on my test set. Is this valid?
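Concretely, the "playing around" looked roughly like this (a simplified sketch using the predictions, test_set and target from the code above; the cutoff grid is arbitrary, and in practice I also looked at the sensitivity and specificity of each confusion matrix, not just accuracy):

library(caret)

# Sweep a range of cutoffs and record a summary metric for each
cutoffs <- seq(0.1, 0.9, by = 0.05)
accuracy <- sapply(cutoffs, function(ct) {
  pred_class <- factor(as.numeric(predictions > ct), levels = c(0, 1))
  cm <- confusionMatrix(data = pred_class,
                        reference = factor(test_set$target, levels = c(0, 1)))
  cm$overall["Accuracy"]
})
data.frame(cutoff = cutoffs, accuracy = as.numeric(accuracy))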
Second question: If my results (in terms of prediction / confusion matrix) are bad, can I still use the model to see the influence of the factors (coefficients) on the target? So only for describing the data, and not for predicting? And why?
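To be clear, by "describing" I mean looking at the fitted coefficients rather than at predictions, for example:

# Coefficients, standard errors, z-values and p-values of the fitted model
summary(mylogit)

# Exponentiated coefficients can be read as odds ratios
exp(coef(mylogit))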
In addition, "you should not use an artificially balanced sample". Do you have some background / examples why we should not do that? As in my post and on datacamp.com it looks like a common way to train ML models... (before evaluating them on a test set. Many thanks in advance!