I have a classification model that categorises a customer's risk to a lending company over time. For example, a customer may initially appear credit-worthy, but the new data we collect over time may indicate the contrary. More specifically, I want to build a model that can predict the magnitude of the effect that future changes in a given customer's data will have on my initial hypothesis about that customer. Many of the independent variables are expected to change over time, for example cost of living, the nature of banking transactions, and employment data. How can I construct the architectural design of the ML pipeline in production to facilitate model validation as each customer's data changes?
This appears to be an online learning problem, as your features change significantly over time. See https://stats.stackexchange.com/questions/301975/does-online-learning-theory-have-any-real-world-applications?rq=1 – patagonicus Jan 03 '22 at 04:06
1 Answer
There are a few options.
A key assumption of most risk models is that the outcomes are independent conditional on the covariates. So it may suffice to build something like a logistic regression and update your risk estimate as customer data is updated. The main benefit of this approach is its simplicity: in a batch process, update your design matrix and re-score individuals. Using an interpretable model like logistic regression has the added benefit that you can interpret how a change in one variable will lead to a change in the predicted outcome.
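Here is a minimal sketch of that batch re-scoring loop, assuming a pandas DataFrame with a binary default label; the file name, feature columns, and `customer_id` column are hypothetical placeholders for your own schema.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["cost_of_living", "transaction_volume", "months_employed"]  # assumed names

# Fit once on a historical snapshot of customer data.
history = pd.read_csv("customer_history.csv")  # hypothetical file
model = LogisticRegression()
model.fit(history[FEATURES], history["default"])

def rescore(snapshot: pd.DataFrame) -> pd.Series:
    """Return an updated risk estimate per customer from the latest data pull."""
    return pd.Series(
        model.predict_proba(snapshot[FEATURES])[:, 1],
        index=snapshot["customer_id"],
        name="risk",
    )

# In production, each batch run rebuilds the design matrix and re-scores everyone.
# Comparing successive scores quantifies how much the new data moved the
# initial hypothesis about each customer:
# drift = rescore(new_snapshot) - rescore(old_snapshot)
```

Storing the score from each batch run also gives you the history you need to validate how predictions shift as the underlying data changes.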
Another approach would be to use the class of models known as mixed models. It may be the case that risk changes differently for customers A and B as their cost of living changes, for example. We can estimate an effect for each customer, pooling information from all customers to help inform the model about customers we might see in the future.
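A minimal sketch of the per-customer random-effects idea, using statsmodels. For simplicity it fits a Gaussian mixed model to a continuous risk score; a logistic mixed model would be the analogue for a binary default outcome. The file and column names (`risk_score`, `cost_of_living`, `customer_id`) are assumptions, not part of the question.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per customer per observation period.
panel = pd.read_csv("customer_panel.csv")

# Fixed effect: the average impact of cost of living on risk.
# Random intercept and slope per customer: each customer may respond
# differently, but estimates are pooled toward the overall trend.
model = smf.mixedlm(
    "risk_score ~ cost_of_living",
    data=panel,
    groups=panel["customer_id"],
    re_formula="~cost_of_living",
)
result = model.fit()
print(result.summary())

# result.random_effects holds the customer-specific deviations, i.e. how much
# more or less sensitive each customer is to cost-of-living changes than average.
```

The partial pooling is what lets the model say something sensible about a customer with little history: their estimate is shrunk toward the population-level effect until their own data accumulates.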