
I'm training a classifier with a binary target variable.

My data is imbalanced. The problem is that my modeling data (split into train, val, and test sets) is more balanced than the real data (the data in production). There are two reasons for that:

  1. I manually added more positive observations to the train set, so the model learns to recognize them and differentiate them from the negative ones. I prefer to add real positive observations to the training data rather than creating synthetic data points with an oversampling algorithm.
  2. I delete some recent negatives from the training data, because they may become positive later. (For example, if I predict churn, I can add users from the last week who churned to the training data; but users who have not churned yet are excluded from training, and I apply the trained model to them instead.)

Therefore, the proportion of positive labels in my datasets (train, val, and test) is higher than the positive proportion in the real data.
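For reference, here is a minimal pandas sketch of the preparation described above. The file names (`users.csv`, `extra_positives.csv`) and column names (`label`, `event_date`) are hypothetical placeholders, not part of my actual pipeline:

```python
import pandas as pd

# Hypothetical schema: "label" (1 = positive, e.g. churned), "event_date".
df = pd.read_csv("users.csv", parse_dates=["event_date"])

# Step 2: drop recent negatives. Users from the last week who have not
# churned yet may still churn, so they are excluded from training and
# scored later with the trained model instead.
cutoff = df["event_date"].max() - pd.Timedelta(days=7)
recent_unresolved = (df["event_date"] > cutoff) & (df["label"] == 0)
train_pool = df[~recent_unresolved]
to_score_later = df[recent_unresolved]

# Step 1: add extra real positive observations (collected separately),
# raising the positive proportion above the production rate.
extra_positives = pd.read_csv("extra_positives.csv", parse_dates=["event_date"])
train_pool = pd.concat([train_pool, extra_positives], ignore_index=True)

print("positive rate in training pool:", train_pool["label"].mean())
```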

Given this setup, what are your recommendations?


2 Answers


I don't see a problem here. It is good practice to train on a more balanced dataset, especially when the data in production is not as balanced. To assess your model's performance, get familiar with metrics such as precision, recall, F1 score, and the AUC-ROC curve.
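For example, with scikit-learn (assuming a fitted classifier `clf` and a held-out `X_test` / `y_test`):

```python
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)

# Assumes a fitted binary classifier `clf` and a held-out test set.
y_pred = clf.predict(X_test)              # hard 0/1 predictions
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_test, y_prob))  # AUC uses probabilities
```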


Balancing the train set with real positive observations may be a good approach. However, the validation set, and especially the test set, should resemble the real data. The solution is simple: split into train, val, and test sets first, then add the manual positive observations only to the training set. A similar question has been asked here before.
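A sketch of that order of operations with scikit-learn; `X_extra` / `y_extra` are placeholders for the manually collected real positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Split the real-world data first; stratify keeps the production class
# ratio in every split, so val and test stay representative.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

# Only now add the manually collected positives, and only to the train
# set, so val/test keep the real-world positive proportion.
X_train = np.vstack([X_train, X_extra])
y_train = np.concatenate([y_train, y_extra])
```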
