Churn prediction model

Question

I've constructed a training dataset by creating monthly timestamps covering January 2021 to December 2022. However, a significant issue has arisen: there is a substantial class imbalance (240,000 rows for active clients versus only 500 for churned clients). With this imbalance, every algorithm I have tried yields very low recall, never exceeding 0.2.

I have tried several algorithms. The decision tree classifier provides excellent recall for the training data, but for the test data, it drops significantly. Clearly, it's an overfitting issue.

However, for other algorithms like XGBoost or logistic regression, the recall is low both in training and testing.

I want to know how to address this issue of customer attrition. I've been trying various models for a while now, but nothing seems to work.

Dave · Answer 1 · 2023-08-30T10:53:46.640

Class imbalance is almost never a problem, and perceived problems with imbalance typically indicate issues with model evaluation instead of the imbalance itself.
The imbalance tells the model to be skeptical about the minority class, which can be formalized with Bayes’ theorem (even without doing Bayesian modeling).
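
As a rough worked illustration of that skepticism, assume the row counts in the question approximate the base rate of churn (an assumption, since the rows are monthly snapshots rather than distinct clients). Bayes’ theorem in odds form, for observed features $x$, then reads:

$$
\frac{P(\text{churn}\mid x)}{P(\text{active}\mid x)}
= \frac{P(x\mid \text{churn})}{P(x\mid \text{active})}\times\frac{P(\text{churn})}{P(\text{active})},
\qquad
\frac{P(\text{churn})}{P(\text{active})}\approx\frac{500}{240{,}000}=\frac{1}{480}.
$$

Under that assumption, the features would have to be about 480 times more likely for a churner than for an active client before the predicted probability of churn exceeds $0.5$.
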
Models often output values on a continuum that can be interpreted as a probability. The logistic regression you ran is among those models. While those predicted probability values can be useful on their own and evaluated for prediction quality, converting them to categorical classifications requires some kind of decision rule, typically a threshold where predictions above it are categorized one way and those below it the other way. Consequently, while more sophisticated methodology will consider the probability predictions themselves, if you must consider the categorical predictions, at least consider changing the threshold to tune your model to give desirable values of, for instance, precision, recall, and specificity. In fact, when you plot receiver operating characteristic (ROC) curves or precision-recall (PR) curves, that threshold is being varied to give the multiple values of precision, recall, and specificity.
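
To make the role of the threshold concrete, here is a minimal sketch in plain Python. The helper name `threshold_metrics` and the ten labels and probabilities are made up for illustration; they are not from the question's data or from any library.

```python
def threshold_metrics(y_true, y_prob, threshold):
    """Precision, recall, and specificity when scores >= threshold are flagged as churn."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_churn = prob >= threshold
        if predicted_churn and label == 1:
            tp += 1
        elif predicted_churn and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    # Guard against empty denominators (e.g. nothing flagged at a very high threshold).
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"precision": precision, "recall": recall, "specificity": specificity}


# Hypothetical labels (1 = churned) and predicted churn probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.15, 0.20, 0.35, 0.60]

for t in (0.5, 0.3, 0.1):
    print(t, threshold_metrics(y_true, y_prob, t))
```

On this toy data, lowering the threshold from 0.5 to 0.1 raises recall from one in three churners to all three, at the price of two false alarms. The model is the same throughout; only the decision rule changes.
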
If you just want high recall (synonymous with sensitivity), then you can set that threshold to be $-\infty$ and classify everyone as churning. Then your recall is a perfect $100\%$, and you don’t have to spend your time trying out multiple models or even collect any data at all. If this is unacceptable (as it typically is), then there must be some cost associated with classifying someone as a churner when they will not churn, and that leads to assessments of the predicted probability values or at least a consideration of the decision rule/threshold giving the predicted categories.
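
One way to turn such costs into a decision rule is sketched below. It assumes the predicted probabilities are reasonably well calibrated and that correct decisions cost nothing. The function names and the cost figures (5 for a wasted retention offer, 95 for a missed churner) are hypothetical, chosen only for illustration.

```python
def cost_threshold(cost_false_positive, cost_false_negative):
    """Probability above which flagging a client has lower expected cost than not flagging."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_flag(prob_churn, cost_false_positive, cost_false_negative):
    """Compare the expected cost of flagging a client against the expected cost of ignoring them."""
    expected_cost_if_flagged = (1 - prob_churn) * cost_false_positive
    expected_cost_if_ignored = prob_churn * cost_false_negative
    return expected_cost_if_flagged < expected_cost_if_ignored


# Hypothetical costs: a wasted retention offer costs 5, a missed churner costs 95.
print(cost_threshold(5, 95))     # threshold implied by these costs
print(should_flag(0.08, 5, 95))  # a client with an 8% predicted churn probability
print(should_flag(0.03, 5, 95))  # a client with a 3% predicted churn probability
```

With these made-up costs the implied threshold is far below 0.5, so a client with a small predicted churn probability can still be worth acting on.
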
You don’t have to use the software default from a predict method that just gives the category with the highest predicted probability (sometimes called an “argmax” decision rule). Especially in an imbalanced problem, the predicted probability values are likely to be on the small side, since that minority category is just generally unlikely. As you vary the threshold, you might find stronger performance in terms of recall. You might even be able to achieve strong recall while maintaining strong performance in terms of specificity or precision. The ROC and (arguably especially) PR curves can help you see what kind of performance is possible just by changing the threshold.
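
The threshold sweep behind a PR curve can be sketched in plain Python as follows. The helper names and the toy data are hypothetical; in practice a library routine such as scikit-learn's `precision_recall_curve` or `roc_curve` would typically do the sweep for you.

```python
def pr_points(y_true, y_prob):
    """Return (threshold, precision, recall) for each distinct score, highest threshold first."""
    total_positives = sum(y_true)  # assumes at least one positive label
    points = []
    for t in sorted(set(y_prob), reverse=True):
        true_positives = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        flagged = sum(1 for p in y_prob if p >= t)  # at least 1, since t is one of the scores
        points.append((t, true_positives / flagged, true_positives / total_positives))
    return points


def best_threshold_for_recall(y_true, y_prob, min_recall):
    """Among thresholds reaching min_recall, pick the one with the highest precision."""
    candidates = [pt for pt in pr_points(y_true, y_prob) if pt[2] >= min_recall]
    return max(candidates, key=lambda pt: pt[1]) if candidates else None


# Hypothetical labels (1 = churned) and predicted churn probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.15, 0.20, 0.35, 0.60]

for threshold, precision, recall in pr_points(y_true, y_prob):
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")

print(best_threshold_for_recall(y_true, y_prob, min_recall=0.9))
```

Choosing the threshold this way should be done on validation data rather than the test set, so that the reported recall and precision are not optimistically biased.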

While these are not so explicitly related to the exact posted question, Frank Harrell, founding chairman and current professor of Biostatistics at Vanderbilt University, has two good blog posts related to what I’ve discussed above.

Classification vs. Prediction

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules
