
I'm training a classifier with a binary target variable.

My data is imbalanced. The problem is that my modeling data (split into train, val, and test sets) is more balanced than the real data (the data in production). There are two reasons for that:

  1. I manually added more positive observations to the train set, so the model learns to recognize them and differentiate them from the negative ones. I prefer to add real positive observations to the training data rather than creating synthetic data points with an oversampling algorithm.
  2. I delete some recent negatives from the training data, because they may become positive later. (For example, if I predict churn, I can add users from the last week who churned to the training data; but users who have not churned yet are excluded from training, and I apply the trained model to them instead.)

Therefore, the proportion of positive labels in my datasets (train, val, and test) is higher than the positive proportion in the real data.
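For reference, here is a minimal pandas sketch of the preparation described above. The file names (`users.csv`, `extra_positives.csv`) and column names (`label`, `event_date`) are hypothetical placeholders, not part of my actual pipeline:

```python
import pandas as pd

# Hypothetical schema: "label" (1 = positive, e.g. churned), "event_date".
df = pd.read_csv("users.csv", parse_dates=["event_date"])

# Step 2: drop recent negatives. Users from the last week who have not
# churned yet may still churn, so they are excluded from training and
# scored later with the trained model instead.
cutoff = df["event_date"].max() - pd.Timedelta(days=7)
recent_unresolved = (df["event_date"] > cutoff) & (df["label"] == 0)
train_pool = df[~recent_unresolved]
to_score_later = df[recent_unresolved]

# Step 1: add extra real positive observations (collected separately),
# raising the positive proportion above the production rate.
extra_positives = pd.read_csv("extra_positives.csv", parse_dates=["event_date"])
train_pool = pd.concat([train_pool, extra_positives], ignore_index=True)

print("positive rate in training pool:", train_pool["label"].mean())
```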

Given this setup, what are your recommendations?


2 Answers


I don't see a problem here. It is good practice to train on a more balanced dataset, especially when the data in production is not as balanced. To assess your model's performance, get familiar with metrics such as precision, recall, F1 score, and the AUC-ROC curve.
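For example, with scikit-learn (assuming a fitted classifier `clf` and a held-out `X_test` / `y_test`):

```python
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)

# Assumes a fitted binary classifier `clf` and a held-out test set.
y_pred = clf.predict(X_test)              # hard 0/1 predictions
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_test, y_prob))  # AUC uses probabilities
```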


Balancing the train set with real positive observations may be a good approach. However, the validation set, and especially the test set, should resemble the real data. The solution is simple: split into train, val, and test sets first, then add the manual positive observations only to the training set. A similar question has been asked here before.
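A sketch of that order of operations with scikit-learn; `X_extra` / `y_extra` are placeholders for the manually collected real positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Split the real-world data first; stratify keeps the production class
# ratio in every split, so val and test stay representative.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

# Only now add the manually collected positives, and only to the train
# set, so val/test keep the real-world positive proportion.
X_train = np.vstack([X_train, X_extra])
y_train = np.concatenate([y_train, y_extra])
```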
