I am trying to train a machine learning model to predict the probability that a given credit card customer defaults within the six-month window after the observation date. In this context, default means the customer was more than 90 days past due on one or more of their credit cards.
To do this, I have gathered monthly data on every customer's payment behavior over a 12-month period, plus some demographic features for each customer. I believe this falls under 'panel data', since there are several observations per customer over time.
When put into production, the data used to train this model will always be at least six months older than the data it predicts on. This new data will consist mostly of customers who already appeared in training (with similar features and responses), plus new customers the model has never seen before. Ideally, the model should predict well for both groups.
What would be the best way to split this data into train, validation and test sets for modelling? Should I split along the time axis (so customers repeat across sets, but each observation month appears in only one set), or should I use something like sklearn's GroupKFold (so each customer appears in only one set, but all months appear in every set)?
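To make the two options concrete, here is a minimal sketch of both splits on a toy panel dataset. The column names (`customer_id`, `month`, `default`) and the month-8 cutoff are assumptions for illustration only, not my actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy panel data: one row per (customer, month) pair.
rng = np.random.default_rng(0)
n_customers, n_months = 100, 12
df = pd.DataFrame({
    "customer_id": np.repeat(np.arange(n_customers), n_months),
    "month": np.tile(np.arange(1, n_months + 1), n_customers),
    "default": rng.integers(0, 2, size=n_customers * n_months),
})

# Option 1: split along the time axis.
# Customers repeat across sets, but each month appears in only one set.
train_time = df[df["month"] <= 8]
test_time = df[df["month"] > 8]

# Option 2: GroupKFold on customer_id.
# Each customer appears in only one set, but all months appear in both.
gkf = GroupKFold(n_splits=5)
train_idx, test_idx = next(gkf.split(df, df["default"], groups=df["customer_id"]))
train_cust, test_cust = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no customer leaks across the group split.
assert set(train_cust["customer_id"]).isdisjoint(test_cust["customer_id"])
```

Note that option 1 leaks customer identity across sets (the same customer's earlier months are in train), while option 2 leaks calendar time (future months of other customers are in train), which is part of what I am unsure about.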
The reason I have chosen a panel dataset instead of data from a single month is that there seems to be a connection between the month of the year and customers' payment behavior, which I would like to capture. I also suspect there isn't enough data in a single month for the model to learn the problem well, especially since the dataset is heavily imbalanced. But should I be using panel data at all, or is my approach incorrect?
Thanks in advance.