I'm training a binary classifier on imbalanced data (the real/production data has ~2% positive labels). Setting aside the questionable efficiency of oversampling/undersampling techniques, I have a lot of training data, so I can manually add real positive observations to the data instead of synthesizing them with an oversampling technique. My assumptions, based mostly on intuition, are:
The model should be trained on a dataset with more than 2% positive labels.
The test set should be as similar as possible to the real data; in this case, it should have the same proportion of positive labels (~2%).
The validation set should be as similar as possible to the test set (see the sketch after this list).
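To make assumptions 2 and 3 concrete, here is a minimal sketch using hypothetical synthetic data (in practice `X, y` would be the real observations): stratified splitting keeps the ~2% positive rate in both the validation and test sets.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the real data: ~2% positive labels.
    X, y = make_classification(
        n_samples=100_000, n_features=20, weights=[0.98, 0.02], random_state=0
    )

    # Stratify so the test set keeps the original ~2% positive proportion (assumption 2).
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # The validation set mirrors the test set's class balance (assumption 3).
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
    )

    print(f"val positive rate:  {y_val.mean():.3f}")
    print(f"test positive rate: {y_test.mean():.3f}")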
When I balanced the training set by manually adding real positive examples (~35% positive labels) and then applied CV to this data, I violated my third assumption, because the positive label proportion in the validation folds was much higher than in the test set.
Another approach I tried was splitting the dataset into one training set, one validation set, and one test set (hold-out validation), so all my assumptions held. However, with this approach the validation set (and the test set) contain few positive observations (fewer than 50), and my concern is that a kind of overfitting will occur: the model will learn to recognize and classify just these few observations as positive, and will have trouble classifying new positive observations.
This process led me to the following approach: create a fixed training set that contains a higher proportion of positive labels than the real data, and evaluate the model on K validation sets that each have the actual minority label ratio. This is similar to CV, with one big difference: the model is trained on the same training set every time.
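A minimal sketch of this scheme, continuing the hypothetical split above (the 35% ratio, K=5, the metric, and the use of negative undersampling as a stand-in for "manually adding real positives" are all illustrative assumptions, not a fixed recipe): the model is trained once on the enriched training set and then scored on K validation sets that each keep the real ~2% positive rate.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)

    # Enrich the training set: keep all positives and subsample negatives so that
    # positives make up ~35% of the training data (here undersampling stands in for
    # adding real positives; what matters for the sketch is the resulting class ratio).
    pos_idx = np.flatnonzero(y_train == 1)
    neg_idx = np.flatnonzero(y_train == 0)
    n_neg = int(len(pos_idx) * 0.65 / 0.35)
    train_idx = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_neg, replace=False)])

    # Trained once, on the same fixed training set.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])

    # K validation sets at the real ~2% ratio, drawn by bootstrapping the held-out
    # validation pool (which already has the real class proportions).
    K, scores = 5, []
    for _ in range(K):
        boot = rng.choice(len(y_val), size=len(y_val), replace=True)
        scores.append(
            average_precision_score(y_val[boot], model.predict_proba(X_val[boot])[:, 1])
        )

    print(f"average precision over {K} validation sets: "
          f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}")

The spread of the K scores would then indicate how stable the evaluation is despite the small number of positives in each validation set.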
Could this approach work, or is there an established, literature-based approach for handling this situation?