
I have a dataset made of categorical variables and a binary outcome. Responses about the variables and the outcome were recorded at time 1 and time 2. The data is imbalanced between the two outcomes (10:3) at time 1, and more so at time 2 (40:3).

Does it make sense to create a logistic regression model using the time 1 data, where the variables predict the outcome, and then test the accuracy of that model by seeing how accurately it predicts the time 2 outcomes? Is the imbalance something I need to consider?

So far when I try this, the accuracy of prediction is better for the test set than for the training set, which I think means that imbalance is a problem: accuracy is low for the training set because the imbalance means one outcome is being predicted far more often than the other.
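Here is a minimal sketch of what I'm doing, with hypothetical stand-in data whose class ratios roughly match mine (the real predictors are categorical and one-hot encoded):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical stand-in data: the real predictors are categorical
# (one-hot encoded); class ratios roughly match mine (10:3 and 40:3).
X1 = rng.integers(0, 2, size=(13, 3))
y1 = np.array([0] * 10 + [1] * 3)   # time 1 (training)
X2 = rng.integers(0, 2, size=(43, 3))
y2 = np.array([0] * 40 + [1] * 3)   # time 2 (test)

model = LogisticRegression().fit(X1, y1)
print("time 1 accuracy:", accuracy_score(y1, model.predict(X1)))
print("time 2 accuracy:", accuracy_score(y2, model.predict(X2)))
```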

  • Do you know why the relative frequencies of the two outcomes changed? Without understanding why this happened, how can you know which ratio is more representative of the future? – dipetkov Jun 26 '22 at 12:55
  • @dipetkov, yes the performance on the predictor variables was worse, so the occurrence of the poorer outcome increased – user360879 Jun 26 '22 at 14:58
  • Your description lacks clarity. "The performance on the predictor variables was worse" suggests data drift... You should include all information you have in your question, not just the bits you've decided are relevant. – dipetkov Jun 26 '22 at 16:40

1 Answer


It is not so unusual for in-sample and out-of-sample data to differ in their class ratios, simply through the randomness of allocating observations to the two sets (unless you make a point of stratifying to maintain the class ratio). However, your difference is so great that I have to think something is inherently different about the second time period that makes the first outcome much more likely there than it is in the first period. It does not take many observations for such a difference to be statistically significant.
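If you want to check that formally, here is a minimal sketch using a Fisher exact test, assuming your 10:3 and 40:3 ratios are raw counts; whether the difference reaches significance depends on the actual counts behind the ratios.

```python
from scipy.stats import fisher_exact

# Hypothetical counts: the question's 10:3 and 40:3 ratios taken as raw counts.
# Rows: time period; columns: (first outcome, second outcome).
table = [[10, 3],
         [40, 3]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.3f}")
```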

Consequently, by training in period $1$ and testing in period $2$, you are testing in an inherently different situation.

The reason you have better classification accuracy in the second period than in the first is likely the imbalance in the first period leading the model to predict probabilities on the low side, probably well below $0.5$. Your predictions for the second period will still tend to fall below $0.5$, so a threshold of $0.5$ rounds them to category $0$; since that category is even more common in the second period, you are more likely to get the right answer.

(Or maybe your majority class is coded as $1$, and your predicted probabilities tend to be quite a bit higher than $0.5$. Analogous logic applies.)
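To see the arithmetic, here is a minimal sketch, assuming the question's ratios are raw counts and that every prediction lands on the majority side of the $0.5$ threshold:

```python
import numpy as np

# Hypothetical labels matching the question's ratios, with 0 the majority class.
y_period1 = np.array([0] * 10 + [1] * 3)   # time 1, 10:3
y_period2 = np.array([0] * 40 + [1] * 3)   # time 2, 40:3

# If every predicted probability is below 0.5, the thresholded
# classifier always predicts class 0.
pred1 = np.zeros_like(y_period1)
pred2 = np.zeros_like(y_period2)

print("period 1 accuracy:", (pred1 == y_period1).mean())  # 10/13 ≈ 0.77
print("period 2 accuracy:", (pred2 == y_period2).mean())  # 40/43 ≈ 0.93
```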

If, however, you evaluate the probabilistic predictions in the two time periods, such as with log loss, Brier score, or even a ROC curve, you are likely to find better performance on the training data from period one than on the testing data from period two. I would consider the stronger performance in terms of accuracy to be something of a mirage.
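As a sketch of how such an evaluation might look, with hypothetical labels and predicted probabilities (scikit-learn's metrics are one option):

```python
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

# Hypothetical labels and predicted probabilities for one period;
# in practice these come from your fitted logistic regression.
y_true = [0] * 10 + [1] * 3
p_hat = [0.10, 0.20, 0.15, 0.30, 0.25, 0.10, 0.20, 0.35, 0.30, 0.15,
         0.40, 0.45, 0.30]

print("log loss   :", log_loss(y_true, p_hat))
print("Brier score:", brier_score_loss(y_true, p_hat))
print("ROC AUC    :", roc_auc_score(y_true, p_hat))
```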

If you have some way to predict the drift of the prior probability in each time period, you could perhaps calibrate your out-of-sample probabilities to reflect the class ratio in that period (instead of falsely assuming the class ratio remains constant). If you have that ability, however, you might be more inclined to use those determinants of the prior probability as features in your classifier or probability prediction model.
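One standard way to do that calibration is to shift the predicted odds by the ratio of the prior odds in the two periods. A minimal sketch, assuming you can estimate each period's base rate (here taken from the question's ratios):

```python
import numpy as np

def shift_prior(p, prior_train, prior_new):
    """Rescale predicted probabilities from the training-period base rate
    to a new period's base rate by adjusting the odds by the prior-odds ratio."""
    num = p * prior_new / prior_train
    den = num + (1 - p) * (1 - prior_new) / (1 - prior_train)
    return num / den

# Hypothetical base rates of the minority class, from the question's ratios.
prior_train = 3 / 13   # period 1: 10:3
prior_new = 3 / 43     # period 2: 40:3

p = np.array([0.2, 0.4, 0.6])   # hypothetical period-1-scale predictions
print(shift_prior(p, prior_train, prior_new))  # pulled toward the rarer base rate
```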

If you do not have the ability to model how the prior probability in each time period changes, yet it does, then you are modeling a nonstationary process with no information about the dynamics. Of course your performance will be poor. (Again, your higher accuracy in the second time period is a mirage. If you evaluate the probabilities directly, you are likely to find worse performance in period two than in period one.)

– Dave