
I am building a pipeline for a machine learning project in which I would like to automatically discretize variables containing NAs. These NAs are meaningful in the context of the research, and removing the rows or imputing the values would bias the results; hence the choice of binning.

I use optimal binning, a method presented in [1] and implemented in a Python module, to bin the variables. The method performs the "optimal discretization of a variable into bins given a discrete or continuous numeric target".

In my pipeline:

  • I optimally bin each variable given my continuous target.
  • I one-hot encode the resulting discrete variables.
  • I fit a regression model (e.g. an XGBoost regressor).
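The steps above can be sketched as a scikit-learn pipeline. This is only an illustration under stand-in assumptions: `KBinsDiscretizer` replaces the target-aware optimal binning (and, unlike optimal binning, which I understand places NAs in their own bin, it does not accept NaNs, so they are omitted here), and `GradientBoostingRegressor` replaces the XGBoost regressor to keep the sketch dependency-free:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# KBinsDiscretizer with one-hot output covers steps 1-2 in a single
# transformer; a target-aware optimal binning would also receive y in fit.
pipe = Pipeline([
    ("bin", KBinsDiscretizer(n_bins=5, encode="onehot-dense",
                             strategy="quantile")),
    ("reg", GradientBoostingRegressor(random_state=0)),
])
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```

Wrapping both steps in a single `Pipeline` object matters for the cross-validation question below: scikit-learn then refits the whole chain, binning included, on each training fold.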

If I use K-fold cross-validation, it seems that the binning algorithm provided in the Python module is not rerun on the training set of each fold, but is instead run only once on the whole sample (training + validation). Only afterwards is the cross-validation performed.

My question is: if I use K-fold cross-validation to tune the hyperparameters of my regression model, is it correct to bin the variables on all the data rather than only on the training set? Isn't there a risk of leaking validation data (including its target values) into the training-set features via the automated binning?
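To make the concern concrete, here is a sketch (my own illustration, not the actual module's API) of the leakage-free alternative: a target-aware binning is refit inside each fold on the training data only, so the validation targets never influence the bin edges or bin summaries.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

def fit_bins(x_train, y_train, n_bins=5):
    # Stand-in for a target-aware optimal binning: quantile edges from the
    # training x, per-bin mean of the training y as the bin summary.
    edges = np.quantile(x_train, np.linspace(0, 1, n_bins + 1)[1:-1])
    idx = np.digitize(x_train, edges)
    means = np.array([y_train[idx == b].mean() for b in range(n_bins)])
    return edges, means

def transform(x_new, edges, means):
    # Map new observations to the bin summaries learned on the training fold.
    return means[np.digitize(x_new, edges)]

for train, valid in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    edges, means = fit_bins(x[train], y[train])      # fit on training fold only
    x_valid_enc = transform(x[valid], edges, means)  # apply to validation fold
```

Fitting `fit_bins` once on all 500 points before the loop would be the setup the question describes: the validation fold's x and y would then have shaped the edges and summaries used to build the training features.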

[1] Navas-Palencia, G. (2022). Optimal binning: Mathematical programming formulation (arXiv:2001.08025). arXiv. http://arxiv.org/abs/2001.08025
