
I have a data set of historical behavior attributes (going back 2 years) for about 2 million customers, and I want to predict each customer's profitability over the coming 365 days. I have about 30 different features. I looked at customer behavior during 2021 and 2022 and added the target variable (profitability) based on 2023. This target is continuous and can also be negative. However, when plotting the distribution of this variable I see that about 65% of the customers have zero profitability.

[Figure: distribution of the target variable]

I tested 8 different regressors (LinearReg, AdaBoost, DecisionTree, RandomForest, GradientBoost, XGBoost, KNearestNeighbor, LGBM) using 5-fold CV with hyperparameter tuning. After predicting on the test set, the best-performing one (GradientBoost) gave the following result:

[Figure: scatter plot of actual vs. predicted profitability]

These results are not promising and I don't know how to proceed. What else can I test? Are there better ways to model zero-inflated distributions, and can such a model be used to make future predictions?
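For reference, my comparison loop looks roughly like the sketch below (a minimal sketch assuming scikit-learn; the data here is a synthetic stand-in with roughly the same zero inflation, not my actual customer data):

```python
# Minimal sketch of the model comparison, assuming scikit-learn.
# X and y are synthetic stand-ins for the 30 behavioral features and the
# 2023 profitability target (~65% exact zeros, heavy tails, can be negative).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 30))                         # placeholder features
y = np.where(rng.random(10_000) < 0.65, 0.0,              # ~65% exact zeros
             rng.standard_t(df=3, size=10_000) * 100.0)   # heavy-tailed, signed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "GradientBoost": GradientBoostingRegressor(),
    "RandomForest": RandomForestRegressor(),
    # ... the remaining regressors (LinearReg, AdaBoost, XGBoost, LGBM, etc.) in the same way
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: CV RMSE {-scores.mean():.1f} (+/- {scores.std():.1f})")
```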

Parseval
    Regression has many meanings and methods. You should make it explicit which ones you've tried. Using these keywords -- best regression method for zero-inflated data -- in a Google search returned many suggestions. It's also discussed on CV, e.g., here https://stats.stackexchange.com/questions/582124/when-should-one-use-a-tweedie-glm-over-a-zero-inflated-glm – user78229 Feb 21 '24 at 13:09
    Rather than blindly looking at statistical methods to apply, can you add more analysis of the business problem you are trying to solve to the question text, and explain why zero profitability is so common? I am guessing you would be better off splitting this into 2 problems: a) do customers make a transaction, and b) given that they did, what was the profitability? See the hurdle model: https://en.wikipedia.org/wiki/Hurdle_model – seanv507 Feb 21 '24 at 13:14
  • @seanv507 Your point wrt blind model sleuthing is appropriate in a hypothesis-testing environment, but wrt predictive modeling? Not so much. See Breiman's famous Two Cultures paper for confirmation, among others. The point is that prediction casts aside most classic regression assumptions. Consider his random forests method or the many ensemble modeling approaches to prediction. In other words, building widely informed predictive models is recognized for delivering accuracy and success. – user78229 Feb 22 '24 at 03:03
  • @MikeHunter: Thanks for your answer. I've now edited in the regressors I've tested. The search results I get are only for zero-inflated data that is either non-negative continuous or discrete count data. In my case I have continuous data. Assume then I somehow can use a model to first classify the customers into neg, 0, pos. If 0 then there will be no regression, if neg, pos we will make a regression. But given the extreme variance (1015879.7) and mean (3.14) isn't it unlikely that any regressor will perform well, even if the zeros were not there? – Parseval Feb 22 '24 at 07:40
    Could you explain the response variable and how it is measured in detail? – utobi Feb 22 '24 at 08:03
    @MikeHunter I think you are missing the point. Rule 1 of any modelling is understanding the business problem; it's just called feature engineering in ML. See the rules of ML. – seanv507 Feb 22 '24 at 09:07
    @utobi: In retail e-commerce, customers purchase stuff online, but many of these purchases are returned, and this is a cost for the company. Example: the customer spends 100 on 5 items at 20 each, but then returns 2 of them. Assume the cost of handling a return is 10 per item; then the total profitability is 100 - 40 - 20 = 40. Now if the customer returns all the items, the profitability, by the same logic, would be -50. – Parseval Feb 22 '24 at 09:17
  • @seanv507 I stand by my comment. That rule is a legacy of 20th c hypothesis testing and thinking. It's intended for PhD grad students and simple-minded corporate types who insist on certainty and the elimination of doubt. Business problems are multi-focal with no one right answer, much less a precisely correct model...not to mention that it's a VUCA world today -- volatile, uncertain, complex and ambiguous. Try and fit a single predictive model to that! See this https://stats.stackexchange.com/questions/215154/variable-selection-for-predictive-modeling-really-needed-in-2016/215235#215235 – user78229 Feb 22 '24 at 09:27

1 Answer


It's apparent that you have a good grasp on the business problem. This response is to suggest alternative ways to think about it.

One thing you haven't mentioned is the industry for which these models are being built. Knowing this would help set expectations about how any model is likely to perform. For example, in direct marketing, strategists expect a large percentage of customers to be one-time shoppers, so given a purchase in one year, the likelihood of a purchase in subsequent years is extremely small. In health insurance, actuaries expect several things: most members will never file a claim, the unhealthiest ~5% of members are likely to drive 20% (or more) of claims, claims are extremely fat-tailed, and so on.

Insurance actuaries have developed a boatload of metrics, some of which may have relevance for your project, e.g., https://guidingmetrics.com/content/insurance-industrys-18-most-critical-metrics/

That said, your current approach reminds me of a problem once presented to me. I was given a composite metric of claims severity (cost) divided by frequency (how many filed). This metric was this company's unique definition of a loss ratio. The objective was to build a predictive model with this composite as the target. I was not given access to the input components, just the composite.

The problem was that this composite contained extreme outliers in both directions, e.g., members with a single hugely expensive claim on the one hand and others with many low-cost claims on the other.

Modeling the composite proved to be extremely difficult. If I had had the components it would have been much easier. I would have built two models and combined them later to form the loss ratio.

The point is that, if possible, decomposing your profitability metric into its component parts, modeling those, and combining them later to reconstruct profitability might be a more workable approach. For example, your metric includes returns and is therefore downstream from actual purchases. As a rule, sales revenue is a less reliable measure of customer behavior than unit sales, and so on.
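As a rough illustration of what that could look like (a minimal sketch assuming scikit-learn; the component names gross_revenue, refund_amount, and handling_cost are hypothetical stand-ins, since I don't know what your metric is actually built from):

```python
# Sketch of the decomposition idea: one model per profitability component,
# recombined afterwards. The component names are hypothetical stand-ins.
from sklearn.ensemble import HistGradientBoostingRegressor

def fit_component_models(X, component_targets):
    """component_targets: dict mapping a component name to its target vector."""
    return {name: HistGradientBoostingRegressor().fit(X, y_comp)
            for name, y_comp in component_targets.items()}

def predict_profitability(models, X):
    # Recombine the predictions: profitability = revenue - refunds - return-handling cost
    return (models["gross_revenue"].predict(X)
            - models["refund_amount"].predict(X)
            - models["handling_cost"].predict(X))
```

A side benefit is that each component is non-negative, which opens the door to methods designed for non-negative, zero-inflated targets (e.g., a Tweedie objective) on the pieces rather than on the signed composite.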

The next suggestion is based on your chart showing the distribution of customer profitability. It's apparent that profitability is extremely fat-tailed. Given that, traditional models based on normally distributed data can be expected to perform poorly since, by definition, they can't predict the tails.

Alternative approaches include:

- A two-part (hurdle-style) model, as raised in the comments: first model whether a customer has non-zero profitability, then model its magnitude only for the customers who do (see the sketch after this list).
- A Tweedie GLM, also mentioned in the comments, bearing in mind that it assumes a non-negative response, so it is more naturally applied to non-negative components of profitability than to the signed composite.
- Decomposing profitability into its components and modeling each separately, as described above.
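Here is a minimal sketch of the two-part idea (assuming scikit-learn's histogram-based gradient boosting, which copes better with ~2 million rows; the combination rule E[y] = P(y ≠ 0) · E[y | y ≠ 0] is the standard two-part formulation):

```python
# Minimal sketch of a two-part (hurdle-style) model, assuming scikit-learn.
# Part 1 predicts whether profitability is non-zero; part 2 predicts its size
# given that it is non-zero; the final prediction multiplies the two.
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor

def fit_two_part(X, y):
    nonzero = y != 0
    clf = HistGradientBoostingClassifier().fit(X, nonzero.astype(int))
    reg = HistGradientBoostingRegressor().fit(X[nonzero], y[nonzero])
    return clf, reg

def predict_two_part(clf, reg, X):
    # E[profitability] = P(non-zero) * E[profitability | non-zero]
    p_nonzero = clf.predict_proba(X)[:, 1]
    return p_nonzero * reg.predict(X)
```

The second-stage regressor still has to cope with both negative and positive values; if that remains difficult, your idea of a three-way split (negative / zero / positive) followed by separate regressions is a natural extension of the same scheme.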

Hope this helps.

user78229