1

I'm in charge of a scoring project and I'm getting average results with my binary classification model.

A large number of reasons could explain this, but I think one of them is the heterogeneity of the population in my training dataset. I think that giving up on this global model for more specific sub-models could help improving the results.

For example : If we were in a retail company trying to find the best advertizing chanel, we could try to build a model for people shopping on stores and another one for people shopping online.

My question then is : What would you do to justify such sub-models approach ? I have a panel of 4/5 variables being regurlaly used for business strategies potentially with a good discriminatory power. My idea would be to use them to prove that somehow clients shopping only online are so different from the others that they should have their own model.

I read about some technics like T-SNE or Mapper Algorithm (I know of course about the more traditional ones). These technics are well documented online, but I hardly find their implementation for my use case.

I also thought about

Thank you for your help !

Son6
  • 11

1 Answers1

0

If there exists a subgroup that differs from the rest of the data, you should be able to identify it using exploratory data analysis. Create plots and calculate descriptive statistics per each subgroup. Are there any visible differences? How do they compare to the same plots and statistics calculated on the whole dataset? If there is a clear subgroup, in many cases it would be also detected if you used some clustering algorithm on the data.

However, you seem to be interested in predictive performance, so why not just train a model on the whole data, and models for subgroups, and compare their performance? For the subgroups, you would simply have a model per subgroup and use it for making predictions for the subgroup, and then concatenate all the predictions, so that they can be compared as a whole to the global model.

Finally, there are models that can account for the clustering in the data. In the case of some models, this can be handled by using a dummy variable for the subgroups, so that the model can figure out how to account for the differences. There are models designed for such nesting like . If you don't know the exact groups and you need to cluster the data, there are models that can cluster your data and find a sub-model for each cluster.

Tim
  • 138,066