I am trying to train a machine learning model to predict the probability that a given credit card customer defaults within the six-month window after the observation date. In this context, default means the customer was more than 90 days past due on one or more of their credit cards.
To do this, I have gathered monthly data on every customer's payment behavior over a 12-month period, plus some demographic features for each customer. I believe this falls under 'panel data', since there are several observations per customer over time.
When put into production, the data used to train this model will always be at least six months older than the data it predicts on. This new data will consist mostly of customers who already appeared in training (with similar features and responses), plus new customers the model has never seen before. Ideally, the model should predict well for both groups.
What would be the best way to split this data into train, validation and test sets for modelling? Should I split along the time axis (so customers repeat across sets, but each observation month appears in only one set), or should I use something like sklearn's GroupKFold (so each customer appears in only one set, but all months appear in every set)?
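To make the two options concrete, here is a minimal sketch of both splits on a toy panel dataset. The column names (`customer_id`, `month`, `default`) and the month-8 cutoff are assumptions for illustration only, not my actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy panel data: one row per (customer, month) pair.
rng = np.random.default_rng(0)
n_customers, n_months = 100, 12
df = pd.DataFrame({
    "customer_id": np.repeat(np.arange(n_customers), n_months),
    "month": np.tile(np.arange(1, n_months + 1), n_customers),
    "default": rng.integers(0, 2, size=n_customers * n_months),
})

# Option 1: split along the time axis.
# Customers repeat across sets, but each month appears in only one set.
train_time = df[df["month"] <= 8]
test_time = df[df["month"] > 8]

# Option 2: GroupKFold on customer_id.
# Each customer appears in only one set, but all months appear in both.
gkf = GroupKFold(n_splits=5)
train_idx, test_idx = next(gkf.split(df, df["default"], groups=df["customer_id"]))
train_cust, test_cust = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no customer leaks across the group split.
assert set(train_cust["customer_id"]).isdisjoint(test_cust["customer_id"])
```

Note that option 1 leaks customer identity across sets (the same customer's earlier months are in train), while option 2 leaks calendar time (future months of other customers are in train), which is part of what I am unsure about.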
The reason I have chosen a panel dataset instead of data from a single month is that there seems to be a connection between the month of the year and customers' payment behavior, which I would like to capture. I also suspect there isn't enough data in a single month for the model to learn the problem well, especially since the dataset is heavily imbalanced. But should I be using panel data at all, or is my approach incorrect?
Thanks in advance.