
I'm training a Naive Bayes (NB) model on continuous features that need to be binned with equal-frequency discretization.

My question is whether the discretization should be performed

  • separately on the train set and the score set, or

  • on the train and score sets appended together

The second approach seems more natural to me: the train and score sets can have different distributions for each variable, so discretizing them separately would produce different deciles, and therefore different bins for the same variable value in the two datasets.

However, I have not found much material on this topic, so I would like to ask the community whether anyone has knowledge to share or links I could consult.

Best

1 Answer


The answer is neither of the above.

You fit your discretization on the training set. You then apply those cutoffs to the score set.


In the first approach, you’re applying different bins to the training set and score set. They may not line up.

In the second, you’d be peeking at your score set, essentially cheating yourself into a false sense of how good your model is.
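
To make the "fit on train, apply to score" idea concrete, here is a minimal sketch using only NumPy. The helper names `fit_equal_frequency_bins` and `apply_bins`, the choice of 10 bins, and the synthetic normal data are all my own assumptions for illustration, not anything from your setup.

```python
import numpy as np


def fit_equal_frequency_bins(train_values, n_bins=10):
    """Learn interior cut points (quantiles) from the training data only.

    Hypothetical helper for illustration.
    """
    # Interior quantile levels, e.g. 0.1, 0.2, ..., 0.9 for deciles.
    levels = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(train_values, levels)
    # Heavily tied data can produce duplicate edges; keep each edge once.
    return np.unique(edges)


def apply_bins(values, edges):
    """Map values to bin indices 0..len(edges) using already-learned edges."""
    return np.searchsorted(edges, values, side="right")


# Synthetic data (an assumption): the score set is deliberately shifted.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=1000)
score = rng.normal(loc=0.5, scale=1.0, size=200)

# Fit the cut points on the training set only ...
edges = fit_equal_frequency_bins(train, n_bins=10)

# ... then reuse exactly the same cut points on both sets.
train_bins = apply_bins(train, edges)
score_bins = apply_bins(score, edges)
```

With this setup a given raw value always lands in the same bin, whichever dataset it comes from. The bins are equal-frequency on the training set only; the score set will generally fill them unevenly if its distribution differs, and score values outside the training range simply fall into the first or last bin. If you use scikit-learn, `KBinsDiscretizer` with `strategy="quantile"` follows the same fit/transform pattern.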