I'm using survival analysis to help identify customers at risk of churning. So far I've been using a heavily imbalanced dataset in which 99.5% of customers belong to the majority class (i.e. not churned) and 0.5% have churned. To improve my model, I randomly undersampled the majority class so that the ratio of minority to majority examples is 1:3, which improved results compared with using the full dataset.
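For concreteness, here is a minimal sketch of the undersampling step, using only the standard library. The function name `undersample_majority`, the `churned` field, and the toy data are hypothetical, not my actual pipeline; the only thing taken from the description above is the target of three non-churned customers per churned one.

```python
import random

def undersample_majority(rows, is_event, majority_per_minority=3, seed=0):
    """Keep every churned row and a random subset of non-churned rows."""
    minority = [r for r in rows if is_event(r)]
    majority = [r for r in rows if not is_event(r)]
    # Cap at the number of majority rows actually available.
    n_keep = min(len(majority), majority_per_minority * len(minority))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return minority + rng.sample(majority, n_keep)

# Toy data (assumption): 5 churned customers out of 1000, i.e. 0.5%.
customers = [{"id": i, "churned": i < 5} for i in range(1000)]
balanced = undersample_majority(customers, lambda r: r["churned"])
n_churned = sum(r["churned"] for r in balanced)
print(n_churned, len(balanced) - n_churned)
```

Note that this only drops rows; the duration and censoring status of every retained customer are left untouched.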
However, to avoid discarding data points that may be important, I would like to try oversampling techniques such as SMOTE, but I don't know how well they would suit survival models, because the duration (the target variable) depends on a churn date/time, which is the result of the difference of two variables in the dataset (date_client_churned + date_client_joined). Isn't generating new examples going to distort the distribution of these two variables and so produce erratic results? Has anybody dealt with this kind of imbalance in a dataset for survival analysis, and if so, what was your approach? Is it really necessary to balance the dataset for survival analysis?
For reference, the survival models that I've used are: Cox Proportional Hazards, Random Survival Forests and Gradient Boosted Trees with Cox loss.
Just to clarify how I compute the duration variable: for time t = 0 I use the date a customer signs a contract with the company, given by t_born. For the end date I use either a) the churn date t_churn, if t_churn < t_study, where t_study is the end of the study period (t_study = t_born + 6, t_born + 12, or t_born + 24 months), or b) t_study otherwise. The duration column is therefore t_churn - t_born if t_churn < t_study, and t_study - t_born otherwise.
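The rule above can be sketched in a few lines of standard-library Python. The helper names `add_months` and `duration_and_event` are hypothetical, and two details are assumptions not stated above: durations are measured in days, and an event flag (1 = churn observed, 0 = censored at t_study) is returned alongside the duration, since survival models need both.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Return d shifted forward by whole months, clamping the day to the month's end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def duration_and_event(t_born, t_churn, horizon_months):
    """Return (duration in days, event flag) for one customer.

    t_churn is None for customers who have not churned.
    """
    t_study = add_months(t_born, horizon_months)
    if t_churn is not None and t_churn < t_study:
        return (t_churn - t_born).days, 1  # churn observed within the window
    return (t_study - t_born).days, 0      # censored at the end of the window

t_born = date(2021, 1, 15)
churned_early = duration_and_event(t_born, date(2021, 3, 15), 6)
never_churned = duration_and_event(t_born, None, 6)
churned_late = duration_and_event(t_born, date(2021, 9, 1), 6)
print(churned_early, never_churned, churned_late)
```

A customer who churns after t_study gets the same duration and flag as one who never churns, which is what option b) implies.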
duration should be the difference of two dates, date_client_churned - date_client_joined, to represent the length of time that the individual was a client. Please edit the question to explain what you are using for the reference time = 0 for individual customers and why you are using the sum of those two dates for the duration survival time. Please do that by editing the question, as comments are easy to overlook and can be deleted. Also, note that this might better be handled by a cure model that accounts for individuals that never have events. – EdM Aug 09 '22 at 15:00