I'm trying to model propensity to buy a product, and the target is highly imbalanced (99.9% vs. 0.1%). Here are the details of my dataset:
- The target is a flag indicating whether the client bought the product within the following 20 days;
- If the client buys the product, they receive a flag of 1, and 0 otherwise;
- The dataset has 99.9% of 0s and 0.1% of 1s on each day.
- For example, in the day-1 database of 1k clients, just 1 client bought the product;
- In the day-10 database of 2k clients, just 2 clients bought the product;
- To model this problem, I tried two approaches:
- I undersampled the majority class (0), and only after that did I create the train and test sets. That approach gave me a very good F1 score (about 0.9). However, when I evaluated the model on future data, I got a very low F1 score (0.1, with high recall and low precision).
- I think this is because the "real-life data" isn't balanced, while the test set I used for the F1 calculation was balanced by the undersampling.
- I undersampled only the training set, and evaluated on a test set that had not been undersampled. Now I get a realistic F1 score (0.1, with high recall and low precision);
- I'm using transactional features on other products and sociodemographic features;
- I tested the Gradient Boosting and Random Forest algorithms;
- The model is intended to score clients daily.
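To make the second approach concrete, here is a minimal sketch of the pipeline I mean: split first, undersample the majority class in the training data only, then evaluate on the untouched (still imbalanced) test set. The data below is synthetic and illustrative, not my real dataset, and I use a higher positive rate than my real 0.1% so the toy example has enough positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)

# Fake feature matrix: 20,000 clients, 10 features, a rare positive class
# whose rate increases when the first feature is large (illustrative signal).
n = 20_000
X = rng.normal(size=(n, 10))
y = (rng.random(n) < 0.005 + 0.02 * (X[:, 0] > 1)).astype(int)

# Split FIRST, so the test set keeps the real-world class imbalance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Undersample the majority class (0) in the TRAINING data only.
pos_idx = np.flatnonzero(y_tr == 1)
neg_idx = np.flatnonzero(y_tr == 0)
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])

clf = GradientBoostingClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])

# Scores on the untouched, imbalanced test set are the realistic estimates.
pred = clf.predict(X_te)
print("F1:", f1_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall:", recall_score(y_te, pred))
```

Because the classifier is trained on a balanced sample but scored on the imbalanced test set, this setup reproduces the high-recall / low-precision pattern I described above.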
My question is: is either of my approaches correct? And how can I get a better score?
Thanks.