Is under sampling the majority population useful to predict a rare event if I limit the probabilistic classifier over 0.85?

Question

I want to predict which customers are likely to purchase Kid Cudi's new album, which comes out in a few weeks, so I can perform targeted marketing. This event hasn't happened before, but I have data on a very similar event: album sales for Adele last year. My data for Adele looks like this, where the target variable is "adele CDs":

╔══════════╦════════════════╦═══════════════════╦══════════════╦═══════════════════╦═══════════════╗
║ customer ║ past purchases ║ total money spent ║ customer age ║ vip member status ║ adele CDs     ║
╠══════════╬════════════════╬═══════════════════╬══════════════╬═══════════════════╬═══════════════╣
║        1 ║              2 ║               400 ║           22 ║ yes               ║ purchased     ║
║        2 ║              1 ║               134 ║           19 ║ yes               ║ none          ║
║        3 ║             13 ║              1050 ║           44 ║ no                ║ none          ║
║        4 ║              4 ║               677 ║           33 ║ no                ║ none          ║
║        5 ║              4 ║               500 ║           62 ║ no                ║ none          ║
║        6 ║              7 ║               900 ║           27 ║ no                ║ purchased     ║
║        7 ║              3 ║               345 ║           21 ║ yes               ║ none          ║
╚══════════╩════════════════╩═══════════════════╩══════════════╩═══════════════════╩═══════════════╝

This would be great, except that the number of potential customers is massive (100,000) and the number of CDs sold is tiny (500). Every predictive model I apply assigns 100% of customers to the majority class (no Adele CDs) and 0% to the purchasing class.

However, if I massively under sample the majority class and keep all "yes" purchases, I can reduce my data to 500 no customers and 500 yes customers, then I see the model predicts younger customers being more likely to buy CDs as well as other patterns.

After under sampling, I was thinking I would output a probabilistic classifier with random forest, but make it such that the output has to be over a higher threshold (say 0.85) to classify, else no sale. What would you do in this situation?

What were the exact problems you faced when using random forests? — Tim Biegeleisen, Dec 03 '16 at 07:16
You should try to predict the probability that somebody will buy (logistic regression), and then target those for whom that probability is sufficiently high, even if it is below 50%. There are many posts about this here ... http://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning http://stats.stackexchange.com/questions/116632/how-to-partition-a-training-set-when-i-have-a-big-class-imbalance There are many Qs here about this, but few good answers ... — kjetil b halvorsen, Dec 03 '16 at 16:23
Random forests work great, but they are only working if I under sample. I am aware this topic is discussed in other questions, but I believe the topic has few answers that deeply explain the techniques involved. — barker, Dec 03 '16 at 20:03
If that is a problem with random forest, then maybe try a learning technique for which it is not a problem, like logistic regression? — kjetil b halvorsen, Dec 04 '16 at 15:30
Thanks, what benefit would logistic regression provide given that I can use random forests to output probabilistic classifiers? — barker, Dec 05 '16 at 16:57
@barker Logistic regression is easy to interpret. Often it is better to go with a simpler model. If you reduced the size of your sample to limit positive and negative outcomes, you should also be aware of retrospective sampling and what it means for you: https://www.statsdirect.com/help/basics/prospective.htm — M Waz, Jul 08 '19 at 20:34
The problem is that random forests are bad probabilistic classifiers: they are poorly calibrated. xgboost-type algorithms work better. — seanv507, Apr 24 '23 at 16:58
https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem/613848#613848 — Dikran Marsupial, Apr 24 '23 at 17:05
I hope we can settle this general issue, as a lot of people continue to waste time on over- and under-sampling of data in a way that distorts the data and renders classifiers irrelevant to new data that was not manipulated in this artificial way. It doesn’t matter which predictive method is used; it’s always bad statistical practice to molest the data at hand. Think of the simple case of a binary logistic regression. You’ll destroy the intercept, making all predictions bad. Forced-choice classifiers are destroyed in a different way. — Frank Harrell, Feb 02 '24 at 13:20

score 0 · Answer 1 · answered Apr 24 '23 at 16:43

It would probably make more sense to predict either the probability of CD purchase or at least the relative rankings of probability to purchase.

If you have the relative rankings, you know (or predict) the top $N$ people most likely to make a purchase. If you have a certain budget to spend on advertising, you can spend it on the customers most likely to make a purchase. Harrell discusses that on his blog when he refers to a "lift curve". An advantage of using the rankings over the probabilities is that you do not have to waste resources trying to pin down the exact probabilities.
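The ranking idea above can be sketched in a few lines of Python. This is only an illustration: the function names `top_n_customers` and `cumulative_gain` are hypothetical, and the scores are made-up numbers standing in for whatever a model fitted on the full, unbalanced data would output. Only the purchase labels come from the table in the question.

```python
def top_n_customers(scores, n):
    """Return the n customer ids with the highest predicted scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


def cumulative_gain(scores, purchased, n):
    """Fraction of all actual purchasers that fall in the top n by score."""
    total = sum(purchased.values())
    if total == 0:
        return 0.0
    hits = sum(purchased[c] for c in top_n_customers(scores, n))
    return hits / total


# Hypothetical model scores for the seven customers in the question's table.
# None exceeds 0.5, yet the ordering is still usable for targeting.
scores = {1: 0.09, 2: 0.03, 3: 0.01, 4: 0.02, 5: 0.005, 6: 0.07, 7: 0.04}
# Labels from the table: customers 1 and 6 purchased.
purchased = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0}

top_two = top_n_customers(scores, 2)
gain_at_one = cumulative_gain(scores, purchased, 1)
gain_at_two = cumulative_gain(scores, purchased, 2)
```

Plotting `cumulative_gain` against `n` gives the kind of gain or lift curve the answer mentions. Note that only the order of the scores matters here, not their absolute values.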

If you have the probabilities, you can do this ranking. However, it gives additional information. For instance, if there is a precipitous drop in probability to make a purchase before you have exhausted your budget, you can choose to save money instead of advertising to people who are unlikely to respond. Conversely, you can produce evidence that you should have a larger budget if you have people with a high predicted probability to respond yet not enough budget to advertise to them. Finally, if you get a result that no one is likely to make a purchase, it seems valuable to know that an upcoming release is likely to be a flop.
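A minimal sketch of how probabilities add to the ranking, assuming a flat cost per contact and a minimum probability below which advertising is judged not worthwhile. The function `plan_campaign`, its parameters, and all the numbers are hypothetical and chosen only to illustrate the reasoning above.

```python
def plan_campaign(probabilities, budget, cost_per_contact, min_probability):
    """Pick customers in descending probability order.

    Returns (selected, unspent, missed):
      selected: affordable customers at or above min_probability
      unspent:  budget left over after contacting them
      missed:   customers at or above min_probability that the budget cannot cover
    """
    max_contacts = int(budget // cost_per_contact)
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    selected = [cid for cid, p in ranked[:max_contacts] if p >= min_probability]
    missed = [cid for cid, p in ranked[max_contacts:] if p >= min_probability]
    unspent = budget - len(selected) * cost_per_contact
    return selected, unspent, missed


# Made-up predicted probabilities with a sharp drop after the second customer.
probabilities = {"a": 0.30, "b": 0.25, "c": 0.02, "d": 0.01}

# Budget covers three contacts, but only two clear the floor: money is saved.
saved = plan_campaign(probabilities, budget=3.0, cost_per_contact=1.0,
                      min_probability=0.05)

# Budget covers one contact, but two clear the floor: evidence for more budget.
short = plan_campaign(probabilities, budget=1.0, cost_per_contact=1.0,
                      min_probability=0.05)
```

If `selected` and `missed` both come back empty, the model is predicting that nobody is likely to buy, which is the "flop" signal described above.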

Neither the rankings nor the full probability predictions are heavily affected by this kind of class imbalance. Consequently, discarding precious data is probably unwise. Your result of no one being predicted to buy an Adele CD comes from the fact that, under the hood, your model is applying a threshold, likely requiring a predicted probability of at least $0.5$ to be classified as making a purchase. When you model the probability of purchase (or at least the relative rankings), there is no such threshold.
