
I fit a logistic regression model with an unbalanced population in R.

The problem I am getting is that precision is 0.4 and recall is 0.0018, so I want to modify the classification threshold in order to bring the two indicators (precision and recall) closer together.

Do you have any function in R to modify the cutoff? I have seen some workarounds in Python, but the code I need is in R.

jfcb
  • Questions that are only about software (e.g. error messages, code or packages, etc.) are generally off topic here. If you have a substantive machine learning or statistical question, please edit to clarify. – gung - Reinstate Monica Mar 01 '20 at 16:25
  • The docs suggest that predictions give probabilities. As such, this question actually has nothing to do with glm. I think you may want to ask a question "How do I round a value in [0,1] according to a given cut-off value $x \in [0,1]$ so that anything less than $x$ returns 0 and anything greater than $x$ returns 1?" If you need to do that in R, it probably belongs on StackOverflow. – Him Mar 02 '20 at 17:11
  • Yes, my question is exactly that: "How do I round a value in [0,1] according to a given cut-off value x∈[0,1] so that anything less than x returns 0 and anything greater than x returns 1?" I really need that! Thanks – jfcb Mar 03 '20 at 18:54

1 Answer


Don't use thresholds at all.

Don't use precision and recall. Every criticism that applies to accuracy applies equally to precision and recall.

Unbalanced datasets are not a problem if you use appropriate quality measures (i.e., not accuracy, precision or recall).
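
For instance, here is a minimal sketch of one such measure, the Brier score (a proper scoring rule). The names fit and test are placeholders for a fitted model and held-out data, not objects from the question:

    # Assumes fit is a glm(..., family = binomial) model and test is a
    # held-out data frame with a 0/1 outcome column y (placeholder names)
    p <- predict(fit, newdata = test, type = "response")  # predicted probabilities

    # Brier score: mean squared distance between the predicted probability and
    # the observed 0/1 outcome; lower is better, and it is a proper scoring rule
    mean((p - test$y)^2)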

If you still feel you need to work with thresholds, simply use predicted probabilities with predict(..., type="response") (see ?predict.glm) and compare them with your threshold.
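
For instance, continuing with the placeholder names fit and test from the sketch above:

    p <- predict(fit, newdata = test, type = "response")  # probabilities in [0, 1]

    threshold <- 0.5                         # whatever cutoff you settle on
    pred_class <- as.integer(p > threshold)  # 1 above the threshold, 0 otherwise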

Stephan Kolassa
  • "if you use appropriate quality measures" Would you suggest anything in particular? Although analysis via a single cut off point may be quite narrow, the set of confusion matrices at every cut off contains a lot of information about the performance of a classifier. For binary classification problems, accuracy, precision and recall are essentially the confusion matrix. – Him Mar 01 '20 at 17:27
  • @Scott: proper scoring rules are the tool of choice, see the first two links. Per both links, I do not believe confusion matrices are useful. They are improper and misleading, and their very simplicity makes them doubly dangerous. – Stephan Kolassa Mar 01 '20 at 17:31
  • Neither of those links seems to suggest anything. They merely repeat your claim that confusion matrices are not useful. – Him Mar 01 '20 at 17:38
  • Indeed, the answer to this question "What are the consequences of deciding to treat a new observation as class 1 vs. 0? Do I then send out a cheap marketing mail to all 1s? Or do I apply an invasive cancer treatment with big side effects?" goes hand-in-hand with the contents of the confusion matrix since it tells you how many people without cancer are getting your invasive cancer treatment. – Him Mar 01 '20 at 18:59
  • Do you have some example code of:

    "If you still feel you need to work with thresholds, simply use predicted probabilities with predict(..., type="response") (see ?predict.glm) and compare them to your threshold"

    Thanks a lot in advance

    – jfcb Mar 01 '20 at 19:38
  • @Scott: have you looked at my answer at the second thread I linked? There are multiple paragraphs on scoring rules as alternatives to accuracy, and as I write, the exact same criticisms that apply to accuracy apply equally to precision and recall, and therefore also to the confusion matrix. Also, the first link explicitly addresses that very often we will not have two possible actions (treating a case as "positive" vs. "negative"), but more (collect more data if we are unsure). – Stephan Kolassa Mar 02 '20 at 09:27
  • @josecorti: predict.glm(..., type="response") will give you probabilistic predictions, i.e., numbers between zero and one. You can compare them to a threshold value using straightforward comparison operators like < and work with index vectors in R. If you have questions about this, then it might be best to ask a separate question at StackOverflow in the R tag, since that is not about statistics any more. Feel free to link to that new question here, then I'll try to take a look. – Stephan Kolassa Mar 02 '20 at 09:30
  • @Scott: on the shortcomings of the confusion matrix, here is another example. Yes, I do write about this often, I'll admit it's a bugbear of mine. – Stephan Kolassa Mar 02 '20 at 09:32
  • Many of your criticisms are fair. However, I think that it is often necessary to use the confusion matrix precisely because of the reasons that your answers say not to. Very frequently, we make a decision based on the outcome of a classifier, and the intermediate probability calculation is irrelevant to how the model ends up being used. Arguably, people should take the pseudo-probabilistic output into account somehow, but ML systems being what they are these days, that rarely happens: the machine makes a decision totally independently of any human input. – Him Mar 02 '20 at 14:50
  • It's also worth noting that many of the criticisms of accuracy are criticisms of using accuracy only. The same applies to using only recall and precision, but those criticisms go away when one considers the entire confusion matrix. Not all criticisms, but one can never address every possible criticism of anything, I suppose. – Him Mar 02 '20 at 15:05