I'm trying to model propensity to buy a product, and the target is highly imbalanced (99.9% vs. 0.1%). Here are the details of my dataset:
- The target is a flag indicating whether the client bought the product within the following 20 days;
- If the client buys the product, they receive a flag of 1, and 0 otherwise;
- The dataset has 99.9% of 0s and 0.1% of 1s on each day.
- For example, in the day-1 database of 1k clients, just 1 client bought the product;
- In the day-10 database of 2k clients, just 2 clients bought the product;
- To model this problem, I tried two approaches:
- I undersampled the majority class (0), and only after that did I create the train and test sets. That approach gave me a very good F1 score (about 0.9). However, when I evaluated the model on future data, I got a very low F1 score (0.1, with high recall and low precision).
- I think this is because the "real-life data" isn't balanced, while the test set I used for the F1 calculation was balanced by the undersampling.
- I undersampled only the training set, and evaluated on a test set that had not been undersampled. Now I get a realistic F1 score (0.1, with high recall and low precision);
- I'm using transactional features on other products and sociodemographic features;
- I tested the Gradient Boosting and Random Forest algorithms;
- The model is intended to score clients daily.
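To make the second approach concrete, here is a minimal sketch of the pipeline I mean: split first, undersample the majority class in the training data only, then evaluate on the untouched (still imbalanced) test set. The data below is synthetic and illustrative, not my real dataset, and I use a higher positive rate than my real 0.1% so the toy example has enough positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)

# Fake feature matrix: 20,000 clients, 10 features, a rare positive class
# whose rate increases when the first feature is large (illustrative signal).
n = 20_000
X = rng.normal(size=(n, 10))
y = (rng.random(n) < 0.005 + 0.02 * (X[:, 0] > 1)).astype(int)

# Split FIRST, so the test set keeps the real-world class imbalance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Undersample the majority class (0) in the TRAINING data only.
pos_idx = np.flatnonzero(y_tr == 1)
neg_idx = np.flatnonzero(y_tr == 0)
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_keep])

clf = GradientBoostingClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])

# Scores on the untouched, imbalanced test set are the realistic estimates.
pred = clf.predict(X_te)
print("F1:", f1_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall:", recall_score(y_te, pred))
```

Because the classifier is trained on a balanced sample but scored on the imbalanced test set, this setup reproduces the high-recall / low-precision pattern I described above.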
My question is: is either of my approaches correct? And how can I get a better score?
Thanks.