I've constructed a training dataset by creating monthly timestamps covering January 2021 to December 2022. However, a significant issue has arisen: the classes are severely imbalanced, with 240,000 rows for active clients versus only 500 for churned clients (roughly 480 to 1). Because of this imbalance, the algorithms I've tried yield a very low recall, not exceeding 0.2.
I have tried several algorithms. The decision tree classifier gives excellent recall on the training data, but recall drops significantly on the test data, which clearly points to overfitting.
However, for other algorithms like XGBoost or logistic regression, the recall is low both in training and testing.
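To make the comparison concrete, here is a minimal sketch of this kind of train/test recall check. It runs on synthetic data with a similar imbalance, not on my real client table: the sample size, churn rate, feature counts, and the helper name `compare_recall` are all placeholder assumptions, and XGBoost is left out to keep the example to scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_recall(n_samples=50_000, churn_rate=0.002, seed=0):
    """Fit two classifiers on imbalanced synthetic data and report recall.

    All sizes and rates here are illustrative assumptions, not real figures.
    """
    # Synthetic stand-in for the client table: class 1 plays the role of "churned".
    X, y = make_classification(
        n_samples=n_samples,
        n_features=10,
        n_informative=4,
        weights=[1 - churn_rate, churn_rate],
        flip_y=0.0,
        random_state=seed,
    )
    # Stratify so the rare class appears in both splits in the same proportion.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = {
            "train": recall_score(y_train, model.predict(X_train)),
            "test": recall_score(y_test, model.predict(X_test)),
        }
    return results


if __name__ == "__main__":
    for name, scores in compare_recall().items():
        print(f"{name}: train recall={scores['train']:.2f}, "
              f"test recall={scores['test']:.2f}")
```

An unpruned decision tree can memorise the training rows, so its training recall says little on its own. The gap between the train and test columns is what I am reading as overfitting.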
How can I address this issue in my customer attrition model? I've been trying various models for a while now, but nothing seems to work.