
I am dealing with a very unbalanced binary classification problem: 1% positives, 99% negatives. The training set is around 10 million rows with 40 columns. I choose the decision threshold (cutoff) on the training set so as to match the number of positives in-sample. However, when I go out of sample, I underpredict the number of positives by around 20%. What can I do to fix this? I would like to match the number of positives out of sample.
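For concreteness, here is a minimal sketch of the matching idea described above, using simulated scores rather than the actual model's output (all names here are illustrative): the cutoff that reproduces the in-sample positive rate is simply the corresponding quantile of the predicted probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predicted probabilities (stand-ins for real model scores):
p_train = rng.beta(0.5, 20, size=100_000)
prevalence = 0.01  # 1% positives

# Cutoff chosen so the number of predicted positives matches the number
# of actual positives in-sample: the (1 - prevalence) quantile of scores.
threshold = np.quantile(p_train, 1 - prevalence)

predicted_positive_rate = (p_train >= threshold).mean()
print(round(predicted_positive_rate, 4))  # approximately 0.01 by construction
```

This guarantees the match only on the data used to set the cutoff; any score drift out of sample shows up directly as the under- or over-prediction described above.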

My AUC is decently high, around 94%, but precision and recall are around 30% (these numbers hold both in-sample and out-of-sample). I am currently not using an intercept in my model (to reduce the risk of overfitting); should I change this?

I am using binary cross entropy loss for training.

I don't know if I am missing something. Perhaps I should look into getting better features to improve precision and recall, or is it simply hard to get a good match because the problem is so unbalanced? Would a randomized (stochastic) threshold make sense or solve anything? What other ideas can I try?

Most of my features (columns) are discrete one-hot encodings, and I can clearly see that for some of them, even in-sample, I am not matching the number of positives well; for example, one bin overestimates while another underestimates. Is there anything I can change about the training (e.g., the loss function) or something else to nudge the model toward matching the predicted vs. actual counts in each bin during training?
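To make the per-bin diagnostic concrete, here is a small sketch with made-up data and column names (none of these come from the actual project): summing the predicted probabilities within each bin gives the expected number of positives there, which can then be compared with the actual count.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Illustrative data: one categorical feature with three bins and a
# deliberately miscalibrated "model".
df = pd.DataFrame({
    "bin": rng.choice(["A", "B", "C"], size=30_000),
    "y": rng.random(30_000) < 0.01,
})
# Pretend probabilities: the model overestimates in A, underestimates in C.
bias = df["bin"].map({"A": 1.5, "B": 1.0, "C": 0.5})
df["p"] = 0.01 * bias

# Expected vs. actual positive counts per bin; a large gap flags the
# bins where the model is miscalibrated.
check = df.groupby("bin").agg(expected=("p", "sum"), actual=("y", "sum"))
check["gap"] = check["expected"] - check["actual"]
print(check)
```

A table like this, computed on held-out data, pinpoints which one-hot levels drive the aggregate 20% shortfall.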

Thank you

Update: In some comments below, some people suggest I don't even need a threshold. Then, how do I obtain a binary 0 vs 1 decision, which is my ultimate goal, without using a threshold? What strategies and other approaches are there for the decision-making process?

  • Why do you need a threshold at all? – Dave May 22 '22 at 02:36
  • Because at the end of the day, I need a binary decision, 0 or 1. This is the business (application) need for this project. – user623949 May 22 '22 at 02:36
  • Needing to make a binary decision has little to do with using a threshold during analysis. And the awful effects of tampering with the sample sizes in the sample have been discussed extensively on this site. See this and other articles on the blog. – Frank Harrell May 22 '22 at 02:56
  • I see. Can you please share some other resources, what else can I do other than using a threshold? In other words, what are my steps forward after I have trained a logistic regression model, what can I do with it to arrive at a binary decision? – user623949 May 22 '22 at 03:42
  • For another resource, he wrote a book called Regression Modeling Strategies. I believe that a substantial proportion of data science questions posted on here (and elsewhere on the internet) discuss issues that are addressed and clarified in the book. Additionally, he has other blog posts related to this topic. // I do not understand your logic about omitting the intercept. Could you please clarify? Do you just mean that you want to lower the parameter count and are willing to sacrifice the intercept to do so? – Dave May 22 '22 at 04:09
  • Yes, that was the idea, to have fewer params. I am considering adding it back, do you think it has any chance of helping materially with this problem. Sorry, I do not have that book handy at the moment. I will paraphrase my original question and try to get more input so I can make progress on my project before acquiring that book. – user623949 May 22 '22 at 13:49
  • Question for @FrankHarrell: I read your blog post here: https://www.fharrell.com/post/classification/ Can you please explain what you mean by "(2) recalibrating the intercept (only) for another dataset with much higher prevalence" and how can I apply it to my situation? I am still at a loss and not sure how to go about getting binary decision, i.e. how to establish a decision-making process that meets my criteria described above. – user623949 May 22 '22 at 22:54
  • Asking about that quote from the blog would be a wonderful question to post! // The gist of several of Frank Harrell’s blog posts is that it is best to get out of the binary world and think in terms of accurate estimates of event probability. This is extremely different from how much of machine learning is presented, yes, and Harrell addresses that on his blog, too. – Dave May 22 '22 at 23:46
  • Yes, I went over some of his thoughts. They are great, sadly my application is such that I have to return to my boss a list of 0s and 1s, it comes down to that. I would really appreciate some input on how to do that based on my situation described above. – user623949 May 23 '22 at 03:11
  • Too bad your boss is engaged in dichotomous thinking. Optimal decisions require understanding gray zones and probabilities in the middle where the best decision is no decision. – Frank Harrell May 23 '22 at 12:14
  • @FrankHarrell: I only use probabilistic rules in my work, but in many cases people are forced to make a black-vs-white call. Yes, when we plan ahead it is obviously wrong to ignore gray zones, but for a specific unit of analysis (patient, house, car, whatever) we either "treat" or not. – usεr11852 May 23 '22 at 14:02
  • That is not the case in general. Decisions are delayed. Decisions are implemented on a provisional basis. Doctors do 1-month trials of drugs or start the patient on a lower dose when there is uncertainty. Most of all people delay decisions until they get more data when probabilities are in the middle. You have oversimplified how decisions are handled. – Frank Harrell May 23 '22 at 19:14
  • I recommend that you clearly, cleanly and consciously draw a distinction between the probabilistic statistical modeling aspect and the subsequent decision, which takes as inputs probabilistic predictions, but also costs of decisions. Note I am not writing "costs of misclassifications". See here: Reduce Classification Probability Threshold – Stephan Kolassa Sep 14 '22 at 06:34
  • @StephanKolassa I don’t mean to be the arrogant punk I warn against being in my answer, but did you mean to award a bounty? – Dave Sep 18 '22 at 06:05
  • @Dave: so I did. It seemed appropriate. – Stephan Kolassa Sep 18 '22 at 06:55
  • @StephanKolassa I don’t see a bounty. Perhaps something went awry because I edited the post while you set up the bounty? – Dave Sep 18 '22 at 07:20
  • @Dave: hm. I do see it, both here and on the bounties tab. I plan on awarding it in two or three days and keep the thread visible until then. – Stephan Kolassa Sep 18 '22 at 07:24
  • @StephanKolassa That makes sense. I thought something went wrong because I edited the answer with Tim’s link as you set up the bounty, but I see your point about keeping this visible in the bounty tab as long as possible. (And thanks for the points! I’ve learned a lot of this because of you!) – Dave Sep 18 '22 at 07:28

1 Answer


Your question indicates a number of common errors in regression modeling.

  1. Omission of the intercept in order to have fewer parameters and less overfitting

Particularly in an imbalanced problem where you want the model to be skeptical about membership in the minority class, the intercept is important. I understand the idea of model parsimony in hopes of avoiding overfitting, but the intercept plays a unique role in the model by affecting the outcome in the same way no matter what the features are.

No other parameter in the model can do this, so if you're going to do away with a parameter, that isn't the one you want to lose unless you have an excellent reason for doing so (such as knowing that the outcome is zero when the features all are zero).
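As a small illustration of this point (simulated noise data, not the asker's model): dropping the intercept forces the predicted probability at the all-zero feature vector to be exactly 0.5, no matter how rare the positive class is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Pure-noise features with a 1% base rate: the intercept is the only
# parameter that can encode that low base rate.
X = rng.normal(size=(20_000, 3))
y = rng.random(20_000) < 0.01

with_int = LogisticRegression(max_iter=1000).fit(X, y)
no_int = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X, y)

# Predicted probability at the all-zero feature vector:
x0 = np.zeros((1, 3))
p_with = with_int.predict_proba(x0)[0, 1]    # near the 1% base rate
p_without = no_int.predict_proba(x0)[0, 1]   # forced to exactly 0.5
print(p_with, p_without)
```

The intercept-free model has no way to be "skeptical by default," which is exactly what a 1%-prevalence problem demands.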

  2. Pure linearity

You don't state anything about this in the question, but you are allowed to have nonlinear features like splines in your generalized linear model. By allowing the probability to increase, decrease, and then increase again (for instance) as a feature increases, you allow for better predictions than you would get from forcing the straight-line fit that I suspect you are using.

  3. Parsimony

In one of his videos, the same Frank Harrell as in the comments mentions that parsimony is the enemy of predictive accuracy. My answer here gets into some of why that might be the case and links to additional material (by Frank Harrell). If you're worried about overfitting but want to have a lot of variables, interactions, and nonlinear features (e.g., splines), use some kind of penalized estimation, such as a ridge penalty. This fights overfitting without getting rid of any of the parameters.
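A minimal example of the penalization point (simulated data; the C values below are placeholders to be tuned by cross-validation): in scikit-learn's logistic regression, smaller C means a stronger ridge penalty, which shrinks coefficients toward zero without removing any parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(5_000, 40))
y = rng.random(5_000) < 1 / (1 + np.exp(-X[:, 0]))

# Same model, two penalty strengths; C is the inverse penalty strength.
loose = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
tight = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

# The heavily penalized fit has uniformly smaller coefficients,
# fighting overfitting while keeping all 40 features in the model.
print(np.abs(loose.coef_).sum(), np.abs(tight.coef_).sum())
```

In practice one would pick C via cross-validation (e.g., `LogisticRegressionCV`) rather than hard-coding it as above.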

All of these and more are discussed in Frank Harrell's Regression Modeling Strategies textbook.

Finally, by having a good model that considers all of these issues, you can, if you must, apply a threshold to make dichotomous decisions based on the known costs of mistakes, and that is how you should pick your threshold. I would argue that, if you don't know the costs of your mistakes, you have no business making classifications and should only give predicted probabilities.
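With known costs the optimal cutoff follows directly from expected cost (the cost figures below are hypothetical placeholders): classify as positive whenever the expected cost of saying "0" exceeds the expected cost of saying "1".

```python
# Hypothetical costs: a missed positive costs nine times a false alarm.
cost_fp = 1.0
cost_fn = 9.0

# Classify as positive when p * cost_fn > (1 - p) * cost_fp,
# i.e. when p exceeds cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)
print(threshold)  # 0.1
```

Note that this threshold is derived from the decision problem, not tuned to match class counts, which is exactly the separation of modeling and decision-making argued for above.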

In the extreme, where it is unacceptable to mistake a $0$ for a $1$ or a $1$ for a $0$, the best dichotomous decision might be to classify every instance as one class so that you never make the unacceptable mistake.

Additionally, don't be surprised when you find yourself unable to make accurate dichotomous decisions for instances whose predicted probabilities are near the threshold. The best decision in such a situation might be no decision. This is related to what Kolassa means when he writes that "a cost optimal decision might also well include more than one threshold!"
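One way to sketch such a two-threshold rule (the cutoffs below are illustrative, not recommendations): act only when the predicted probability is clearly on one side, and abstain in the gray zone.

```python
def decide(p, low=0.3, high=0.7):
    """Three-way decision rule with an abstention band.

    Thresholds are illustrative; in practice they would come from the
    costs of wrong decisions and of delaying a decision.
    """
    if p >= high:
        return 1
    if p <= low:
        return 0
    return None  # gray zone: defer, gather more data

print([decide(p) for p in (0.05, 0.5, 0.95)])  # [0, None, 1]
```

Returning `None` makes the "no decision" outcome explicit so downstream code must handle it rather than silently forcing a 0-or-1 call.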

If I may pontificate at the end of my answer, your post suggests that you face a common issue for data scientists. You don't have to knuckle under to every bad idea someone has. You're the expert; act like it. This doesn't mean being an arrogant punk, but you're allowed to explain why someone is wrong. This is scary the first time you have to do it, and it is also when you are most valuable as a statistician or regression expert. (It gets less scary as you do it more.)

Dave
  • +1 Btw, in #1 it is worth mentioning this thread that discusses it in detail https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model – Tim Sep 14 '22 at 06:48