I have built a logistic regression model in R. The class that I want to predict is very unbalanced (99 vs 1).
My first finding is that this logistic model does a better job if I train it on a balanced (50 - 50) train set instead of on the whole, unbalanced (99 - 1) train set. Judging by sources on the internet, this is a common way to deal with unbalanced data (source).
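For context, I created the balanced train set by downsampling the majority class, roughly like this (a simplified sketch; the object and column names full_train and target are placeholders for my real data):

set.seed(42)  # for reproducibility of the downsampling

minority <- full_train[full_train$target == 1, ]
majority <- full_train[full_train$target == 0, ]

# Keep all minority cases plus an equally sized random sample of majority cases
majority_sample <- majority[sample(nrow(majority), nrow(minority)), ]
balanced_train  <- rbind(minority, majority_sample)

# Fit the logistic regression on the balanced sample
mylogit <- glm(target ~ ., data = balanced_train, family = binomial)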
But I have doubts about the next two questions:
First question: To assess my model's performance, I used a confusion matrix. I played around with the threshold (when a prediction is classified as 1 or as 0). See this code:
library(caret)
predictions <- predict(mylogit, test_set, type = "response")
# confusionMatrix() expects factors with matching levels
confusionMatrix(data = factor(as.numeric(predictions > 0.5), levels = c(0, 1)),
                reference = factor(test_set$target, levels = c(0, 1)))
# Here I played around with the 0.5 cutoff, to decide when
# my model is performing best on my test set.
So I tweaked the 0.5 cutoff until I had the best confusion matrix score on my test set. Is this valid?
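Concretely, the "playing around" looked roughly like this (a simplified sketch using the predictions, test_set and target from the code above; the cutoff grid is arbitrary, and in practice I also looked at the sensitivity and specificity of each confusion matrix, not just accuracy):

library(caret)

# Sweep a range of cutoffs and record a summary metric for each
cutoffs <- seq(0.1, 0.9, by = 0.05)
accuracy <- sapply(cutoffs, function(ct) {
  pred_class <- factor(as.numeric(predictions > ct), levels = c(0, 1))
  cm <- confusionMatrix(data = pred_class,
                        reference = factor(test_set$target, levels = c(0, 1)))
  cm$overall["Accuracy"]
})
data.frame(cutoff = cutoffs, accuracy = as.numeric(accuracy))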
Second question: If my results (in terms of prediction / confusion matrix) are bad, can I still use the model to see the influence of the factors (coefficients) on the target? So only for describing the data, and not for predicting? And why?
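To be clear, by "describing" I mean looking at the fitted coefficients rather than at predictions, for example:

# Coefficients, standard errors, z-values and p-values of the fitted model
summary(mylogit)

# Exponentiated coefficients can be read as odds ratios
exp(coef(mylogit))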
In addition, "you should not use an artificially balanced sample". Do you have some background / examples why we should not do that? As in my post and on datacamp.com it looks like a common way to train ML models... (before evaluating them on a test set. Many thanks in advance!