I want to predict which customers are likely to purchase Kid Cudi's new album in a few weeks so that I can run targeted marketing. This event hasn't happened before, but I have data from a very similar one: album sales for Adele last year. My Adele data looks like this, with "adele CDs" as the target variable:
╔══════════╦════════════════╦═══════════════════╦══════════════╦═══════════════════╦═══════════════╗
║ customer ║ past purchases ║ total money spent ║ customer age ║ vip member status ║ adele CDs     ║
╠══════════╬════════════════╬═══════════════════╬══════════════╬═══════════════════╬═══════════════╣
║ 1        ║ 2              ║ 400               ║ 22           ║ yes               ║ purchased     ║
║ 2        ║ 1              ║ 134               ║ 19           ║ yes               ║ none          ║
║ 3        ║ 13             ║ 1050              ║ 44           ║ no                ║ none          ║
║ 4        ║ 4              ║ 677               ║ 33           ║ no                ║ none          ║
║ 5        ║ 4              ║ 500               ║ 62           ║ no                ║ none          ║
║ 6        ║ 7              ║ 900               ║ 27           ║ no                ║ purchased     ║
║ 7        ║ 3              ║ 345               ║ 21           ║ yes               ║ none          ║
╚══════════╩════════════════╩═══════════════════╩══════════════╩═══════════════════╩═══════════════╝
This would be great, except that the classes are badly imbalanced: my pool of potential customers is massive (100,000) and the number of CDs sold is tiny (500). Every predictive model I apply assigns 100% of customers to the majority class (no Adele CD) and 0% to the purchasing class.
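To make the imbalance concrete, here is a quick back-of-the-envelope check of what an "always predict none" model scores. I'm assuming the 100,000 customers include the 500 buyers; the variable names are just for illustration.

```python
# Assumed counts: 100,000 customers in total, 500 of whom purchased.
n_customers = 100_000
n_buyers = 500
n_non_buyers = n_customers - n_buyers

# A model that predicts "none" for everyone gets every non-buyer right
# and every buyer wrong.
accuracy = n_non_buyers / n_customers
recall_buyers = 0 / n_buyers

print(f"accuracy: {accuracy:.3f}")           # accuracy: 0.995
print(f"buyer recall: {recall_buyers:.3f}")  # buyer recall: 0.000
```

So a model can look 99.5% accurate while finding none of the customers I actually care about, which is why accuracy alone is not a useful score here.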
However, if I heavily undersample the majority class and keep all of the purchasers, I can reduce my data to 500 non-purchasing customers and 500 purchasing customers. The model then picks up patterns, for example that younger customers are more likely to buy CDs.
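Roughly, the undersampling step I have in mind looks like the sketch below. The function name `undersample`, the `adele_cds` key, and the synthetic rows are placeholders for illustration, not my real data or pipeline.

```python
import random

def undersample(rows, label_key="adele_cds", positive="purchased", seed=0):
    """Keep every positive row plus an equally sized random sample of the rest."""
    positives = [r for r in rows if r[label_key] == positive]
    negatives = [r for r in rows if r[label_key] != positive]
    rng = random.Random(seed)
    # Sample without replacement; cap at the number of negatives available.
    kept_negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + kept_negatives
    rng.shuffle(balanced)
    return balanced

# Synthetic stand-in: 500 buyers among 100,000 customers.
rows = [
    {"customer": i, "adele_cds": "purchased" if i < 500 else "none"}
    for i in range(100_000)
]
balanced = undersample(rows)
print(len(balanced))  # 1000
```

Fixing the seed keeps the sample reproducible. The obvious cost is that 99,000 of the non-purchasers are thrown away, so the model only ever sees a small slice of the majority class.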
After undersampling, I was thinking of fitting a random forest as a probabilistic classifier, but only classifying a customer as a buyer when the predicted probability is over a high threshold (say 0.85), and otherwise predicting no sale. What would you do in this situation?
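The thresholding rule itself would be something like this sketch. The probabilities are made-up values standing in for whatever the forest outputs (for example the positive-class column of scikit-learn's `predict_proba`, which is an assumption about tooling on my part), and `classify_with_threshold` is a hypothetical helper.

```python
def classify_with_threshold(probabilities, threshold=0.85):
    """Label a customer "purchased" only when P(purchase) is over the threshold."""
    return ["purchased" if p > threshold else "none" for p in probabilities]

# Hypothetical predicted probabilities of purchase for five customers.
probs = [0.92, 0.40, 0.85, 0.86, 0.10]
labels = classify_with_threshold(probs)
print(labels)  # ['purchased', 'none', 'none', 'purchased', 'none']
```

One thing I'm unsure about: because the model would be trained on a 50/50 sample, its probabilities are relative to that balanced base rate rather than the real one of roughly 0.5%, so a cutoff like 0.85 may not mean what it appears to mean on the full customer list.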