I am currently working on a logistic regression problem with an imbalanced dataset. The total number of rows in my input is 51,220 (class_0 = 49,654, class_1 = 1,566). I use 3 predictors (1 continuous and 2 binary). I ran logistic regression with default parameters (glm in R), but the model's predictions were all below 0.5. I thought this might be because of the imbalanced dataset. So, I estimated class weights the way scikit-learn does (inversely proportional to class frequencies in the dataset) and incorporated them into the logistic regression model. This resulted in class_0_weight = 0.52 and class_1_weight = 16.35. After using weights, the model predicts values ranging from 0.02 to 0.97, which seems more reasonable.
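For reference, here is roughly how I did it; this is a minimal sketch, assuming a data frame df with a 0/1 response y and placeholder predictors x1, x2, x3 standing in for my actual variables:

```r
# Class weights computed as scikit-learn's "balanced" mode does:
# n_samples / (n_classes * n_samples_in_class)
n     <- nrow(df)
n_neg <- sum(df$y == 0)
n_pos <- sum(df$y == 1)
w0 <- n / (2 * n_neg)  # ~0.52 for my data
w1 <- n / (2 * n_pos)  # ~16.35 for my data
df$w <- ifelse(df$y == 1, w1, w0)

# Unweighted fit (default parameters)
fit_unweighted <- glm(y ~ x1 + x2 + x3, data = df, family = binomial)

# Weighted fit; glm warns about non-integer #successes with
# non-integer prior weights, but still fits the model
fit_weighted <- glm(y ~ x1 + x2 + x3, data = df, family = binomial,
                    weights = w)
```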
However, although the coefficient estimates didn't change much, their statistical significance changed drastically: all the coefficients became far more significant. For example, the p-value of one of the coefficients was 3.56e-21 before using class weights. With class weights in the model, it plummeted to an extremely low 1.4e-136. Such a big difference in the p-values (and, consequently, in the standard errors and confidence intervals) doesn't seem right to me.
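This is how I compared the two fits, continuing from the hypothetical fit objects in the sketch above:

```r
# Coefficient tables: estimates, standard errors, z values, p-values
summary(fit_unweighted)$coefficients
summary(fit_weighted)$coefficients

# Wald confidence intervals for the coefficients
confint.default(fit_unweighted)
confint.default(fit_weighted)
```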
Do you think there is a problem with the model, or with the way I calculated the class weights? Do you have any suggestions on how to address the class imbalance differently? Thank you!