
I am building a pipeline for a machine learning project in which I would like to automatically discretize variables containing NAs. These NAs are meaningful in the context of the research, and removing the rows or imputing the values would bias the results; hence the choice of binning.

I use optimal binning, a method presented in [1] and implemented in a Python module, to bin the variables. The method performs the "optimal discretization of a variable into bins given a discrete or continuous numeric target".

In my pipeline:

  • I optimally bin each variable given my continuous target.
  • I one-hot encode the resulting discrete variables.
  • I fit a regression model (e.g. an XGBoost regressor).
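The steps above can be sketched as a scikit-learn pipeline. This is only an illustration under stand-in assumptions: `KBinsDiscretizer` replaces the target-aware optimal binning (and, unlike optimal binning, which I understand places NAs in their own bin, it does not accept NaNs, so they are omitted here), and `GradientBoostingRegressor` replaces the XGBoost regressor to keep the sketch dependency-free:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# KBinsDiscretizer with one-hot output covers steps 1-2 in a single
# transformer; a target-aware optimal binning would also receive y in fit.
pipe = Pipeline([
    ("bin", KBinsDiscretizer(n_bins=5, encode="onehot-dense",
                             strategy="quantile")),
    ("reg", GradientBoostingRegressor(random_state=0)),
])
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```

Wrapping both steps in a single `Pipeline` object matters for the cross-validation question below: scikit-learn then refits the whole chain, binning included, on each training fold.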

If I use K-fold cross-validation, it seems that the binning algorithm provided in the Python module is not rerun on the training set of each fold, but is instead run only once on the whole sample (training + validation). Only afterwards is the cross-validation performed.

My question is: if I use K-fold cross-validation to tune the hyperparameters of my regression model, is it correct to bin the variables on all the data rather than only on the training set? Isn't there a risk of leaking validation data (including its target values) into the training-set features via the automated binning?
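To make the concern concrete, here is a sketch (my own illustration, not the actual module's API) of the leakage-free alternative: a target-aware binning is refit inside each fold on the training data only, so the validation targets never influence the bin edges or bin summaries.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

def fit_bins(x_train, y_train, n_bins=5):
    # Stand-in for a target-aware optimal binning: quantile edges from the
    # training x, per-bin mean of the training y as the bin summary.
    edges = np.quantile(x_train, np.linspace(0, 1, n_bins + 1)[1:-1])
    idx = np.digitize(x_train, edges)
    means = np.array([y_train[idx == b].mean() for b in range(n_bins)])
    return edges, means

def transform(x_new, edges, means):
    # Map new observations to the bin summaries learned on the training fold.
    return means[np.digitize(x_new, edges)]

for train, valid in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    edges, means = fit_bins(x[train], y[train])      # fit on training fold only
    x_valid_enc = transform(x[valid], edges, means)  # apply to validation fold
```

Fitting `fit_bins` once on all 500 points before the loop would be the setup the question describes: the validation fold's x and y would then have shaped the edges and summaries used to build the training features.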

[1] Navas-Palencia, G. (2022). Optimal binning: Mathematical programming formulation (arXiv:2001.08025). arXiv. http://arxiv.org/abs/2001.08025
