I have a highly imbalanced dataset (1000 vs. 60 samples) on which I want to use upsampling. The real-life distribution of the problem (predicting no-shows) is probably also highly imbalanced. My question is twofold:
1) I know that I should keep the class distribution of my training and test sets as close as possible to the real-life distribution. Given that, should I upsample at all?
2) If I should upsample, should I upsample both the training and test sets, or only the training set? Suppose my training set is (750 vs. 45) and my test set is (250 vs. 15). Should I bring them to (750 vs. 750) and (250 vs. 250), or to (750 vs. 750) and (250 vs. 15)?
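To make the second option concrete, here is a minimal sketch of what I mean by upsampling only the training set. It uses only the Python standard library; the helper names (`stratified_split`, `upsample_minority`, `counts`) and the label encoding (0 = show, 1 = no-show) are my own assumptions for illustration, and "upsampling" here means plain random duplication of minority samples rather than any particular library's method.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Split indices into train/test, keeping the class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def upsample_minority(indices, labels, rng):
    """Duplicate minority-class indices (with replacement) until classes match."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    out = []
    for idx in by_class.values():
        out.extend(idx)
        # Draw the missing samples at random from the same class.
        out.extend(rng.choices(idx, k=target - len(idx)))
    return out


def counts(indices, labels):
    """Return (number of class-0 samples, number of class-1 samples)."""
    return (
        sum(labels[i] == 0 for i in indices),
        sum(labels[i] == 1 for i in indices),
    )


rng = random.Random(0)
labels = [0] * 1000 + [1] * 60  # hypothetical encoding: 0 = show, 1 = no-show

# Split first, then upsample the training part only.
train_idx, test_idx = stratified_split(labels, 0.25, rng)
train_up = upsample_minority(train_idx, labels, rng)

print(counts(train_idx, labels))  # (750, 45)
print(counts(train_up, labels))   # (750, 750)
print(counts(test_idx, labels))   # (250, 15)
```

In this sketch the split happens before the upsampling, so no duplicated sample can end up in both sets, and the test set keeps the original (250 vs. 15) ratio.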
Any relevant papers concerning this problem are also welcome.