Hello, everyone!

I am new to machine learning, and I am now interested in classification modeling.

I used logistic regression, linear discriminant analysis (LDA), and naive Bayes in my notebook DataCamp Certification - Travel Insurance as a tool for exploratory data analysis. One thing that caught my attention is the difference in AUC among the three models, more specifically between logistic regression and the other two techniques. I use the yardstick package to calculate this measure.

Now I am trying to apply the same strategy to Titanic - Machine Learning from Disaster. When I modeled the data with logistic regression in my notebook, I was surprised by a very low AUC (0.1464), despite my score of 0.7584 in the competition.

It is worth noting that some terms in the first notebook show a low p.value. In the second notebook, by contrast, all terms show p.value < 0.05.

So, I would like to understand this situation, because it worries me a lot!

Thank you all for your time!

  • Welcome to CV.SE. 1. Please note that values of AUC-ROC below 0.50 suggest that the labels during evaluation are likely inverted. – usεr11852 Apr 21 '22 at 00:37
  • An AUC less than 0.5 means the model could do better by just switching the predicted labels. Are you sure there is no bug in your code which may code positive responses as 0 and negative responses as 1? If your outcome is a factor, this can happen easily. – Demetri Pananos Apr 21 '22 at 02:59
  • @Gregory Oliveira Your links are broken. – frank Apr 21 '22 at 07:12
  • @frank I changed them. I tried to send you directly to the cells that I mean. – GregOliveira Apr 21 '22 at 10:13
  • I'll inspect again, but I had tried to look at this issue of changing labels before. If successful, I'll let you know here. Thank you! – GregOliveira Apr 21 '22 at 10:16
  • The concordance probability (aka $c$-index and AUROC) is not sensitive enough for comparing two models as it does not sufficiently reward extreme predictions that are correct. See this for sensitive measures. – Frank Harrell Apr 21 '22 at 11:04
  • @FrankHarrell I’m getting a “page not found” error. Is there a typo in the link URL? – Dave Apr 21 '22 at 11:09
  • You mean p-values for the confidence of factors? I can't say for sure, because I don't see the code here (yes, please provide all of the materials in the question for further convenience), but when the default hypothesis for the coefficients of a linear (or logistic) regression is that the coefficients are confident, the lower the p-value the better. – taciturno Apr 21 '22 at 11:32
  • @Dave links are working here... :-/ Try to go with my Kaggle profile... https://www.kaggle.com/gregoryoliveira – GregOliveira Apr 21 '22 at 12:46
  • @taciturno I mentioned the p.value only because it may be important information about the influence on AUC. – GregOliveira Apr 21 '22 at 12:47
  • @usεr11852 I inspected the Titanic notebook and the factors are correct. :-/ – GregOliveira Apr 21 '22 at 12:48
  • @GregoryOliveira there's no need to post your whole notebooks or links to notebooks; it's just not convenient and it makes the problem too broad to answer. Please make a new question, specify the problem, and put a few lines of code right there. – taciturno Apr 21 '22 at 13:14
  • You did a good job (+1); just the function used had a bit of an issue. I comment on this further in my answer below. – usεr11852 Apr 21 '22 at 14:33
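
The point raised in the comments, that an AUC below 0.5 usually signals inverted labels, can be checked on toy data. The sketch below is only an illustration in plain Python, not the code from either notebook: the function name `auc_concordance`, the `positive` argument, and the data are all made up for this example. It computes AUC as the concordance probability (the chance that a randomly chosen positive case gets a higher score than a randomly chosen negative one) and shows that swapping which class is treated as positive turns an AUC of `a` into `1 - a`.

```python
from itertools import product


def auc_concordance(labels, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    `positive` names the label value treated as the event of interest.
    """
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    concordant = 0.0
    for p, n in product(pos, neg):
        if p > n:
            concordant += 1.0
        elif p == n:
            concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical data: 1 = event, scores = predicted probability of the event.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.7, 0.6]

auc_correct = auc_concordance(labels, scores, positive=1)
# Same scores, but the wrong class is treated as the event.
auc_inverted = auc_concordance(labels, scores, positive=0)

print(auc_correct, auc_inverted, auc_correct + auc_inverted)
```

If the same thing happened in the Titanic notebook, the reported 0.1464 would correspond to 1 - 0.1464 = 0.8536 once the event class is set correctly. In yardstick, as far as I know, the class treated as the event is the first factor level by default and can be changed with the `event_level` argument of `roc_auc()`, so that is worth checking before concluding the model is poor.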