
I used the ROSE package in R to balance a dataset. I wasn't sure whether over- or undersampling would yield better results, so I split my data into training and test sets (75/25) and then over- and undersampled the training data before running a logit model (a rough sketch of my code is below). My AUC is the same for the over-, under-, and non-sampled models, and I'm not able to rationalize why.

Some details about my sample if it is helpful:

  • I have 3861 observations from BRFSS
  • 84% are class 0, which is why I wanted to try over- and undersampling
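
Roughly what I did is sketched below; dat and y are placeholders for my actual BRFSS data frame and outcome variable.

```r
# Sketch of my workflow ('dat' and 'y' are placeholders for the real
# BRFSS data frame and binary outcome).
library(ROSE)   # ovun.sample(), roc.curve()

set.seed(1)

# 75/25 train/test split
n        <- nrow(dat)
train_id <- sample(n, size = round(0.75 * n))
train    <- dat[train_id, ]
test     <- dat[-train_id, ]

# Balance the training data only; the test set is left untouched
train_over  <- ovun.sample(y ~ ., data = train, method = "over")$data
train_under <- ovun.sample(y ~ ., data = train, method = "under")$data

# Same logit model on each version of the training data
fit_raw   <- glm(y ~ ., data = train,       family = binomial)
fit_over  <- glm(y ~ ., data = train_over,  family = binomial)
fit_under <- glm(y ~ ., data = train_under, family = binomial)

# Test-set AUCs come out the same for all three
roc.curve(test$y, predict(fit_raw,   test, type = "response"))
roc.curve(test$y, predict(fit_over,  test, type = "response"))
roc.curve(test$y, predict(fit_under, test, type = "response"))
```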
snerbs
  • Why use obscure methods like under- and oversampling? Using logistic regression with its native loss function (log loss) is fine. – Michael M Dec 30 '22 at 19:27
  • I think that this page will provide much of what you need to understand your result. Over/under-sampling doesn't end up doing much. Many people wouldn't consider a 16/84 distribution that poorly balanced anyway. With your sample size you have more than 600 observations in your minority class. That's enough to allow some pretty flexible modeling directly, as you can typically fit about 1 parameter per 15 members of the minority class without overfitting. – EdM Dec 30 '22 at 19:27

1 Answer


Changing the prior probability (class ratio), which is essentially all that over- or undersampling does, only affects the intercept of a standard logistic regression model. Shifting the intercept is a monotonic transformation of the log-odds, and the logistic function is itself monotonic, so the predicted probabilities are also just a monotonic transformation of the original ones.
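
To spell that out: suppose the resampling keeps class-1 observations at a relative rate $p_1$ and class-0 observations at a relative rate $p_0$, independently of the predictors $x$ (roughly what over- or undersampling amounts to), and let $S=1$ denote being selected into the resampled data. Then by Bayes' rule

$$\log\frac{P(Y=1\mid x, S=1)}{P(Y=0\mid x, S=1)} \;=\; \log\frac{p_1}{p_0} \;+\; \log\frac{P(Y=1\mid x)}{P(Y=0\mid x)},$$

so the log-odds are shifted by the constant $\log(p_1/p_0)$: only the intercept moves, not the ordering of the predictions.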

Monotonic functions preserve order, and AUC only cares about order. Thus, the AUC does not change.
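
As a quick sanity check with simulated data (not your BRFSS sample), here is what an arbitrary constant shift of the log-odds does to the AUC, using ROSE's roc.curve to report it:

```r
# Simulated check: adding a constant to the log-odds (which is all a change
# in class ratio does to a logistic model) leaves the AUC unchanged.
library(ROSE)   # roc.curve() reports the AUC

set.seed(1)
y  <- rbinom(500, 1, 0.16)        # imbalanced outcome, roughly 16/84
lp <- rnorm(500) + 2 * y          # a made-up linear predictor (log-odds scale)

p_original <- plogis(lp)          # predicted probabilities
p_shifted  <- plogis(lp + log(5)) # same scores after an intercept shift

roc.curve(y, p_original, plotit = FALSE)  # identical AUC ...
roc.curve(y, p_shifted,  plotit = FALSE)  # ... because only the ranking matters
```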

That you achieve the exact same AUC is slightly surprising, since the over- and undersampled training sets are slightly different data sets, but that is the gist of what is happening.

Dave
  • Of likely interest: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – Dave Dec 30 '22 at 19:27