I'm trying to build a cancellation predictor for telecom data. I am using both static data (e.g. location, device, number of complaints) and temporal data (i.e. time-series usage). The response variable is whether or not the customer cancelled within the first 2 months of activation. My question is about the response variable.
In addition to whether a customer has cancelled, I also have access to a short text field in which the customer explains why they cancelled. In some cases, the cancellation reason seems legitimately unpredictable (e.g. "My phone got stolen" or "I am merging with my wife's account" or "moving out of the country" or "I was just testing the service").
I was wondering whether I should remove these instances from the training set. No matter how good the prediction model is, it can never predict a phone getting stolen, for example, so it seems that leaving these cases in would only hurt the model's performance.
Is this a correct assumption?
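To make the removal I have in mind concrete, here is a minimal sketch in plain Python. The field names (`cancelled`, `cancel_reason`), the keyword list, and the toy records are all hypothetical and made up for illustration. In practice the free-text reasons would probably need manual review rather than simple keyword matching.

```python
# Hypothetical keywords marking a cancellation reason as "unpredictable".
# These are assumptions for illustration, not a vetted list.
UNPREDICTABLE_KEYWORDS = ("stolen", "merging", "moving out of the country", "testing")


def is_unpredictable(reason):
    """Return True if the free-text reason contains one of the keywords."""
    if not reason:
        return False
    text = reason.lower()
    return any(keyword in text for keyword in UNPREDICTABLE_KEYWORDS)


def drop_unpredictable_cancellations(rows):
    """Keep non-cancellers and cancellers whose reason is not flagged."""
    return [
        row for row in rows
        if not (row["cancelled"] and is_unpredictable(row.get("cancel_reason")))
    ]


# Toy records (made up, not real customer data).
customers = [
    {"id": 1, "cancelled": True, "cancel_reason": "My phone got stolen"},
    {"id": 2, "cancelled": True, "cancel_reason": "Too expensive"},
    {"id": 3, "cancelled": False, "cancel_reason": None},
    {"id": 4, "cancelled": True, "cancel_reason": "I was just testing the service"},
]

training_rows = drop_unpredictable_cancellations(customers)
kept_ids = [row["id"] for row in training_rows]
print(kept_ids)
```

With these toy records, the stolen-phone and testing cancellations are dropped, and the remaining canceller and the non-canceller are kept.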