I'm trying to build a cancellation predictor for telecom data. I am using both static (e.g. location, device, number of complaints) and temporal (e.g. time-series usage) data. The response variable is whether or not the customer cancelled within the first 2 months of activation. My question is about the response variable.

In addition to whether they have cancelled, I also have access to a short text field in which the customer explains why they cancelled. In some cases, the cancellation reason seems legitimately unpredictable (e.g. "My phone got stolen" or "I am merging with my wife's account" or "moving out of the country" or "I was just testing the service").

I was wondering if I should remove these instances from the training set. It seems like no matter how good the prediction model is, it can never predict a phone getting stolen for example. So it seems like if I leave these in, it would only hurt the model's performance.

Is this a correct assumption?

user1893354

1 Answer

It comes down to your brief, but in general I would be cautious about removing the data.

Firstly, there is little incentive for the customer to be entirely accurate about their reason for churning. Is a customer who was 'just testing the service' leaving because of influences entirely outside your control, or only partially outside it? I would say the residuals cover the areas outside your control, but at least some of those events are within your control and so are worth including.

Secondly, are you trying to identify churn events you can influence (i.e. a marketing view of churn), or to track churn from a more financial perspective? If the latter, you will want to include all churn events, though I'd be careful where past history is not going to be a good predictor of future behaviour (e.g. if you used not to offer the iPhone and customers were churning to get it, but you offer it now).

I'm assuming you want to predict the churners so the business can elect to exclude them from some offer. The problem is therefore not exactly predicting churn; it is more about predicting whose churn decision can be influenced by an action of the business, i.e. the problem is calculating:

p(churn | customer_attributes + intervention) - p(churn | customer_attributes + !intervention)

For instance, death causes churn, but there is nothing you can do to prevent a customer's death. Intervention should therefore not change the probability of churn, and you should not be more inclined to intervene just because a customer looks at high risk of death (e.g. is over 80 years old).
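
As a rough illustration of that difference, here is a minimal Python sketch that estimates it per customer segment from past contact history. Everything in it is an assumption made for illustration: the `uplift_by_segment` helper, the segment names and the toy records are hypothetical, and the estimate is only meaningful if past contact was assigned more or less at random within each segment.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate p(churn | contacted) - p(churn | not contacted) per segment.

    records: iterable of (segment, contacted, churned) tuples taken from
    past campaigns (hypothetical layout, not a real schema).
    """
    # For each segment, keep [churned_count, total_count] for both groups.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, contacted, churned in records:
        cell = counts[segment][bool(contacted)]
        cell[0] += int(churned)
        cell[1] += 1

    uplift = {}
    for segment, cells in counts.items():
        (churn_t, n_t), (churn_c, n_c) = cells[True], cells[False]
        if n_t == 0 or n_c == 0:
            continue  # no comparison possible without both groups
        uplift[segment] = churn_t / n_t - churn_c / n_c
    return uplift


# Hypothetical toy history: contact helps one segment and does nothing
# for the other, mirroring the "high risk of death" example.
history = (
    [("many_complaints", True, True)] * 2
    + [("many_complaints", True, False)] * 8
    + [("many_complaints", False, True)] * 5
    + [("many_complaints", False, False)] * 5
    + [("age_over_80", True, True)] * 3
    + [("age_over_80", True, False)] * 7
    + [("age_over_80", False, True)] * 3
    + [("age_over_80", False, False)] * 7
)

result = uplift_by_segment(history)
for segment, value in sorted(result.items()):
    print(f"{segment}: {value:+.2f}")
```

A negative value means contacting the segment lowered its churn rate in the past; a value near zero means the intervention made no measurable difference, however high that segment's raw churn rate is.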

corrin
  • Actually, the point was to predict churn with the purpose of contacting the flagged accounts in an effort to convince them to stay. Also I am not sure how it would be possible to calculate the probabilities that you mentioned, given that the intervention would be a future event. Are you talking about using data from past interventions to calculate these probabilities? – user1893354 Oct 02 '13 at 13:36
  • Hi, that's what I refer to as the 'marketing' use of this rather than the 'finance' one. As for predicting the probabilities, what I was referring to is uplift modelling. I have read a paper on its use in telecommunications, so you should be able to find good case studies to base your work on. – corrin Oct 08 '13 at 01:23