2

I know this question has been asked in multiple ways, but I have not yet found an answer that explicitly applies to my case. I wish to train a classification model to predict who is most likely to perform action X. Moreover, I wish to use the classifier's output (not 0 or 1, but the score between 0 and 1 that the model outputs) to rank the customers from the most to the least likely, thus not caring about thresholds etc. However, I have a very small number of positive samples: depending on the specifics, sometimes 0.1% of the population, sometimes 1%.

On one hand, I have read on Stack Exchange and similar forums that resampling might not even make sense in general; on the other hand, I wonder whether, without any resampling, the model will, figuratively speaking, pay enough attention to those edge cases and to what type of customers are in the "positive" group.

What do the more statistically inclined among you think about this?

EDIT: As the original question might've been imprecise, let me give an example. Right now, 1% of people bought product X. We want to market product X to our customer base, but instead of doing a random/semi-random send, we want to target people who the model predicts are most likely to buy it.

  • 1
    It depends on your goal, what are the costs of misclassification? – Ggjj11 Aug 17 '23 at 13:21
  • Costs of misclassification are minuscule; it's just about the model helping to select the customers most likely to perform a certain action (and, with some Shapley values and such, explaining which features the model considers important). Those are the 2 things I care about. I won't use the classifying (as in, binary) part per se. – BloodthirstyPlatypus Aug 17 '23 at 13:24
  • 1
    If there is no (abstract) cost of misclassification, why not always propose the majority action? Obviously, in your head you think that is not good, so you implicitly assign it a high cost and want to avoid this behavior. If you can specify it in numbers, you can train a model which minimizes this cost function. No? – Ggjj11 Aug 17 '23 at 13:27
  • Oh, that's on me for not being precise with it. Sorry. I'll add it as "EDIT" to the main post too:

    This is not about being 99.9% correct. Let's use an example: right now, 1% of people buy product X naturally. You want to send emails to 10,000 people to buy it, but instead of doing it randomly, you want to use the model to target the customers who are most similar to those who already bought it. Hence making the model. So the 10,000 highest-scored people will get the email; the question is whether sampling affects the quality of the model in this case.

    – BloodthirstyPlatypus Aug 17 '23 at 13:32

2 Answers

0

Good probability models seek out the true probabilities: their predictions reflect how often the event actually occurs (the predictions are “calibrated”). This is true whether the categories are balanced or not.

When you have considerable imbalance, you are telling the model to be skeptical of membership in the minority category. This makes sense. In the absence of highly compelling evidence of membership in the minority class, it’s probably the case that the observation belongs to the majority class. Consequently, your model might not ever make a prediction above a probability of $0.4$. However, if almost no one responds to the advertisement, think about how much of a win it is to get a prediction like that. Instead of the probability of that individual responding being the proverbial one-in-a-million, the probability is better than one-in-three. Sure, the more likely outcome is that this individual will not respond, but such an individual is so much more likely to respond than usual that it might be worth advertising.
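A small simulation can make this concrete (all numbers made up, not the OP's data): even when the model knows the true probabilities, heavy imbalance keeps every prediction modest, yet ranking by those probabilities still concentrates responders far above the base rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One made-up feature; the true response probability is a logistic function
# of it with a strongly negative intercept, so only ~1% of customers respond.
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(x - 5.5)))
y = rng.random(n) < p_true          # simulated responses

base_rate = y.mean()

# Rank everyone by probability and take the 10,000 highest-scored customers.
top = np.argsort(p_true)[::-1][:10_000]
top_rate = y[top].mean()

print(f"base rate:       {base_rate:.4f}")
print(f"top-10k rate:    {top_rate:.4f}")
print(f"max probability: {p_true.max():.3f}")
```

Note that no individual probability ever gets anywhere near 0.5 here, yet the top-ranked group responds at several times the base rate, which is exactly the win described above.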

If you have a very small number of minority-category observations, you might have some issues. The King and Zeng paper mentioned in the linked answer addresses sampling techniques to be efficient in collecting members of the minority category and then make corrections later. If you already have data, their ideas do not really apply. If you already have data and find yourself having many members of the minority category despite the imbalance, whatever estimation issues there are when the minority category has a small size have likely been overcome, meaning that techniques like ROSE and SMOTE introduce additional sources of error to fix an issue that has already been fixed.
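If you do resample anyway, a back-of-the-envelope sketch shows why the outputs come out inflated and how a prior-odds correction (in the spirit of the King and Zeng intercept correction) undoes it. The numbers here are hypothetical: a 1% base probability with positives duplicated 50 times.

```python
def oversampled_probability(p, k):
    # Duplicating every positive example k times multiplies the prior odds
    # of the positive class by k; a well-specified logistic model absorbs
    # this as a log(k) shift in the intercept, inflating every prediction.
    odds = p / (1 - p)
    return k * odds / (1 + k * odds)

def corrected_probability(p_resampled, k):
    # Inverse transform: divide the odds by k to undo the inflation.
    odds = p_resampled / (1 - p_resampled)
    return (odds / k) / (1 + odds / k)

inflated = oversampled_probability(0.01, 50)    # roughly 0.34 instead of 0.01
recovered = corrected_probability(inflated, 50)
```

Because the shift is monotone, the ranking of customers is unchanged, which is another reason resampling buys little when, as in the question, only the ranking matters.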

If “care about edge cases” means that you want to make a point not to miss out on sending an advertisement to someone who is particularly likely to respond, it might be the case that sending an ad to everyone is the best approach. Given what my email inbox looks like, this appears to be a real way for marketing people to approach the problem, perhaps with a reasonable amount of success. If being this extreme is not viable, then accurate probabilities can guide you to good decisions about how likely someone is to respond to an advertisement and whether it is worth the cost of sending the ad. You will get these accurate probabilities by modeling the true amount of skepticism to have about the probability of response, not by tricking your model outputs into being artificially high by balancing the categories.
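As a toy illustration of that decision (the prices here are entirely hypothetical), with calibrated probabilities the send/don't-send choice reduces to an expected-value comparison:

```python
def worth_sending(p_response, value_per_sale=40.0, cost_per_email=0.02):
    # Send the ad only when the expected value of a response
    # exceeds the cost of sending the email.
    return p_response * value_per_sale > cost_per_email

# With these numbers, a 1-in-100 responder clears the bar
# (0.01 * 40 = 0.40 > 0.02) while a 1-in-a-million responder does not.
```

A model whose outputs have been artificially inflated by balancing would tell you to send to nearly everyone, regardless of the true economics.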

Dave
  • 62,186
0

Quoting your edit:

Right now, 1% of people bought product X. We want to market product X to our customer base, but instead of doing a random/semi-random send, we want to target people who the model predicts are most likely to buy it.

This is a very important detail. In this case, you are essentially entering into the world of recommender systems. This means your observed values are not merely random samples from the population, but rather you selected which outcomes to observe when you matched the viewer/customer to the item.

There can be a very complicated feedback loop here. Suppose from your initial round of data, you observe that a particular subset has affinity for your product. If from there on out, you want to optimize buys per offer, you are likely to only target that particular subset. In an extreme case, this means you will only collect data from that particular subset and will never generate data that helps you learn about other subsets of the population.

A field of research on exploration within recommender systems is starting to emerge, which is a bit astounding given how long we've had recommender systems and how important this issue is. The basic idea is that we want to use our previous knowledge to target customers well (i.e., users with high predicted engagement), but also reserve a small fraction of our traffic to explore subsets that might be good (i.e., users whose engagement rates we are more uncertain about), with the acknowledgement that this exploration will have a lower immediate ROI but will help grow ROI in the future.
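A minimal sketch of that explore/exploit split, in the style of an epsilon-greedy policy (all names and parameters here are hypothetical, not from any particular library):

```python
import random

def choose_targets(scores, n_send, eps=0.1, seed=0):
    # scores: dict mapping customer id -> predicted response probability.
    # Send most of the budget to the top-scored customers, but reserve a
    # fraction eps of the sends for randomly chosen other customers, so we
    # keep collecting data outside the subset the model already favors.
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_explore = int(n_send * eps)
    exploit = ranked[: n_send - n_explore]   # highest-scored customers
    pool = ranked[n_send - n_explore:]       # everyone else
    explore = rng.sample(pool, min(n_explore, len(pool)))
    return exploit + explore
```

Even a small eps breaks the feedback loop described above: the exploration traffic keeps producing labeled outcomes for customer segments the exploit-only policy would never touch.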

Cliff AB
  • 20,980