I'm using a binomial GLM to model what kind of student would pass or fail a certain class. When I use predict with type "response" on my model, I see a vector of probabilities. Per my understanding, probabilities near 1 = passing while probabilities near 0 = failing. My book initially said to use a cutoff of 0.5 to determine pass or fail. Essentially, if the predicted probability is greater than 0.5, then we say pass, and if less than 0.5, then fail.
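As a minimal sketch of the 0.5 rule described above, written in plain Python for illustration (in R the same comparison would be applied to the vector returned by predict with type "response"): the probabilities are made-up values, the `classify` helper is hypothetical, and treating a probability of exactly 0.5 as "Fail" is an assumption, since the rule as stated does not cover ties.

```python
# Hypothetical predicted pass probabilities (made-up values for illustration).
predicted_prob = [0.91, 0.48, 0.50, 0.12, 0.67]


def classify(probs, cutoff=0.5):
    """Label a probability "Pass" if it is strictly above the cutoff, else "Fail".

    Assumption: a probability exactly equal to the cutoff is labelled "Fail".
    """
    return ["Pass" if p > cutoff else "Fail" for p in probs]


labels = classify(predicted_prob)
print(labels)
```

Changing the `cutoff` argument is all it takes to try a different threshold, which is why the choice of cutoff matters more than the mechanics.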

Then, the book said that there may be a more accurate way to gauge the cutoff. I thought I should generate a vector of random numbers ~ uniform[0,1] the same length as my data. Then I would compare each predicted probability and see if it's greater than the respective randomly generated uniform number. Is what I'm doing correct or even necessary? I'm not exactly sure why we shouldn't use 0.5 in the first place.
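One way to reason about the random-cutoff idea is to compare expected accuracy per student. The sketch below is plain Python and rests on a strong assumption: each predicted probability p is taken to be the student's true probability of passing, and the uniform draw is independent of the outcome. Under that assumption, a fixed 0.5 cutoff is right with probability max(p, 1 - p), while comparing p with a uniform draw predicts "Pass" with probability p and is right with probability p² + (1 - p)². The function names and probabilities are hypothetical.

```python
# Hypothetical predicted probabilities, assumed equal to the true pass probabilities.
probs = [0.9, 0.7, 0.5, 0.2]


def acc_fixed(p, cutoff=0.5):
    """Expected accuracy when predicting "Pass" iff p > cutoff."""
    # A "Pass" prediction is correct with probability p, a "Fail" one with 1 - p.
    return p if p > cutoff else 1 - p


def acc_random(p):
    """Expected accuracy when predicting "Pass" iff p > u, with u ~ Uniform[0, 1]."""
    # "Pass" is predicted with probability p and is then correct with probability p;
    # "Fail" is predicted with probability 1 - p and is then correct with probability 1 - p.
    return p * p + (1 - p) * (1 - p)


for p in probs:
    print(f"p={p:.1f}  fixed cutoff: {acc_fixed(p):.2f}  random cutoff: {acc_random(p):.2f}")
```

Under these assumptions the random comparison never beats the fixed cutoff on expected accuracy; it amounts to simulating outcomes from the model rather than classifying.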

  • If you're modeling a low probability outcome and you use a cutoff of .5, then all your predictions will be zero, so there can be no such general rule. First question is why do you need a cutoff? – Heteroskedastic Jim Nov 18 '18 at 02:56
  • I'm using response as the prediction type which gives me probabilities, but the target variable is binomial ("Fail" or "Pass"). – mistersunnyd Nov 18 '18 at 20:27
  • The target variable is fail/pass but why do you need to go from probabilities to fail/pass? The question remains. What is your goal when you attempt this? – Heteroskedastic Jim Nov 18 '18 at 20:31
  • Hmm, is there a method for R to output just the fail/pass result without any probabilities even if I'm using "response" as the prediction type? – mistersunnyd Nov 18 '18 at 20:33
  • There is no canned method that is defensible. That is why I keep asking for your motivations for doing this. Any defensible method for turning probabilities into 0/1 has to be context-dependent. – Heteroskedastic Jim Nov 18 '18 at 21:04
  • I am predicting with a binomial GLM, and the only output of the predictions is probabilities (as far as I know). Since the target variable is either 0/1, I have to manipulate the probabilities in order to get the 0/1 result that I desire. I must have some kind of cutoff so that each probability becomes 0/1. I don't know if there's another way to get what I need. – mistersunnyd Nov 18 '18 at 21:59
  • You say you "must", and I now believe that you believe you "must", but without some rationale, I don't believe you "must". Logistic regression is a model for the probabilities underlying the Bernoulli outcome. Beyond that, logistic regression does not help. Are you trying to assess prediction accuracy? Classify people based on your model? What is your goal? If you have a goal in mind, please edit it into the question. – Heteroskedastic Jim Nov 18 '18 at 22:08
  • My goal is classification. Given a bunch of information about students in a school (age, gender, family, extracurricular, etc.), I want to predict whether a student would fail or pass a certain class. I've also run a decision tree to model the outcome. I want to use a GLM just as another option for a model. – mistersunnyd Nov 18 '18 at 22:10
  • I think your question already has an answer here: https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models. Also see http://www.fharrell.com/post/classification/ as a starting point if you really need to classify. – Heteroskedastic Jim Nov 18 '18 at 22:22
  • I completely agree with what the two posts say, but the target variable that I'm provided is already binary. Therefore, I'm not sure how I would run predictions given that my results are in probabilities. What if I do this: since about 2/3 of the students pass (given), if the prediction probability is > 2/3, then I assign that as a pass? – mistersunnyd Nov 19 '18 at 00:13
  • Both of those articles refer specifically to the binary outcome situation. Why do you need classification so much? If you read those articles, then you'd know that's an inadequate approach. – Heteroskedastic Jim Nov 19 '18 at 01:25
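The first comment's point about low-probability outcomes can be illustrated with a small sketch in plain Python. The probabilities are made up to mimic a rare outcome, and the alternative cutoffs are arbitrary values chosen only to show how much the predictions depend on the threshold; none of them is a recommendation.

```python
# Hypothetical predicted probabilities for a rare outcome (all below 0.5).
rare_event_probs = [0.02, 0.10, 0.31, 0.05, 0.44]


def count_positive(probs, cutoff):
    """Count how many cases would be labelled 1 with a strict 'p > cutoff' rule."""
    return sum(1 for p in probs if p > cutoff)


# Arbitrary cutoffs, chosen for illustration only.
for cutoff in (0.5, 0.25, 0.05):
    print(f"cutoff={cutoff}: {count_positive(rare_event_probs, cutoff)} predicted positives")
```

With a 0.5 cutoff every case is predicted as 0 even though the model assigns clearly different risks, which is why no single cutoff works as a general rule.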

1 Answer

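You need to generate an ROC curve, which plots the true positive rate against the false positive rate for every possible cutoff, so you can see the trade-off each cutoff implies. A 0.5 cutoff is often the default but is not necessarily the best one.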
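A minimal sketch of how the points of an ROC curve can be computed by hand, in plain Python. The labels and probabilities are made up, the `roc_points` helper is hypothetical, and picking the cutoff that maximizes TPR - FPR (Youden's J) is only one common criterion, not something the answer prescribes; a cutoff chosen from problem-specific costs may differ.

```python
def roc_points(y_true, y_prob):
    """Return a list of (cutoff, fpr, tpr), one per distinct predicted probability.

    A case is predicted positive when its probability is >= the cutoff.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for cutoff in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= cutoff and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= cutoff and y == 0)
        points.append((cutoff, fp / n_neg, tp / n_pos))
    return points


# Hypothetical data: 1 = pass, 0 = fail, with made-up predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_prob = [0.95, 0.85, 0.70, 0.65, 0.40, 0.35, 0.20, 0.10]

points = roc_points(y_true, y_prob)

# Youden's J = TPR - FPR; one possible way to pick a cutoff from the curve.
best_cutoff, best_fpr, best_tpr = max(points, key=lambda t: t[2] - t[1])
print(f"cutoff={best_cutoff}, FPR={best_fpr}, TPR={best_tpr}")
```

Plotting the `fpr` values against the `tpr` values gives the curve itself; in practice the cutoff should be chosen on data the model was not fitted to.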

HEITZ