My model has a decently high AUC (90%), but it is biased: it underestimates the probability that $y=1$. This underestimation is also systematic across some of the input features. How can I nudge the bias term, or otherwise address this issue? I am surprised that the model ends up biased despite having an intercept (bias term). My dataset is very unbalanced: only 3% positives (1s) vs 97% negatives (0s). But in y_hat, the proportion of 1s is closer to 2.5%.
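
The gap described above can be quantified as the ratio of the summed predicted probabilities to the number of observed positives, and one simple way to "nudge the bias term" after training is to add a constant to every predicted logit so that the mean predicted probability matches the observed rate. The following is only a minimal sketch under stated assumptions: the data are synthetic stand-ins, and `logit`, `sigmoid`, and `intercept_shift` are hypothetical helper names, not part of any library. Because the shift is monotone, it leaves the ranking of observations (and therefore the AUC) unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p, eps=1e-12):
    # Clip to avoid infinities at exactly 0 or 1.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def intercept_shift(p_hat, target_rate, lo=-20.0, hi=20.0, iters=100):
    """Bisection for a constant delta such that
    mean(sigmoid(logit(p_hat) + delta)) is approximately target_rate."""
    z = logit(p_hat)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The mean is increasing in delta, so bisection applies.
        if sigmoid(z + mid).mean() < target_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic stand-in for an unbalanced problem (assumption, not the asker's data).
rng = np.random.default_rng(0)
true_logits = rng.normal(-4.2, 1.5, size=100_000)
y = rng.binomial(1, sigmoid(true_logits))
p_hat = sigmoid(true_logits - 0.2)  # a model that deliberately under-predicts

ratio_before = p_hat.sum() / y.sum()          # below 1 means under-prediction
delta = intercept_shift(p_hat, y.mean())      # constant added to every logit
p_adj = sigmoid(logit(p_hat) + delta)
ratio_after = p_adj.sum() / y.sum()

print(f"ratio before: {ratio_before:.3f}, delta: {delta:.3f}, ratio after: {ratio_after:.3f}")
```

For an unpenalized maximum-likelihood logistic regression with an intercept, the mean predicted probability on the training data equals the observed rate, so a persistent gap may point to regularization of the intercept, early stopping, or evaluation on data other than the training set.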

  • Binary logistic regression estimates the probability that an observation is in class 1. Based on your description, there are relatively few 1s, so the model is (correctly) assigning a small probability to the 1s. This does not seem to be an example of statistical bias, because that has a technical meaning in statistics which does not appear to apply here. – Sycorax May 04 '22 at 00:28
  • Yes, the real data has a small number of positives (1s), around 3%. But the model estimates even fewer, around 2.5%. Would I be able to somehow nudge the model during training to estimate roughly the same proportion of 1s as there are in the training set? – user623949 May 04 '22 at 01:03
  • Logistic regression doesn't predict 1s or 0s, it predicts probabilities. Reading between the lines, I think you're assigning 1s and 0s to the probabilities by applying some cutoff. You can choose the cutoff however you like; see https://stats.stackexchange.com/questions/25389/obtaining-predicted-values-y-1-or-0-from-a-logistic-regression-model-fit – Sycorax May 04 '22 at 01:05
  • Yes, let me add 2 points of detail:
    1. When I randomly select the cutoff from the interval (0,1), I do get the underestimation (2.5% vs 3%).
    2. When I sum up the probabilities in y_hat, and divide by the sum of y_real, the ratio does turn out to be roughly 2.5 / 3.

    Do you suggest addressing this by lowering the threshold at application time?

    – user623949 May 04 '22 at 01:08
  • Why would you randomly choose the cutoff if your desired result is to have a cutoff that predicts 3% 1s? – Sycorax May 04 '22 at 01:14
  • I thought that because the AUC is high, it wouldn't matter. Could you please point me to a resource on how to correctly calibrate this threshold, or optimize it? Thanks – user623949 May 04 '22 at 01:21
  • https://stats.stackexchange.com/questions/25389/obtaining-predicted-values-y-1-or-0-from-a-logistic-regression-model-fit and links in https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification/127044 and https://stats.stackexchange.com/questions/225843/why-p0-5-cutoff-is-not-optimal-for-logistic-regression – Sycorax May 04 '22 at 01:22
  • Thank you, sorry about the confusion – user623949 May 04 '22 at 01:23
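
The cutoff discussion in the comments can be illustrated with a small sketch. If the goal is a cutoff that labels about 3% of observations as 1s, one option is to take the matching quantile of the predicted probabilities. The sketch also shows why a random cutoff reproduces the 2.5% figure, assuming "random" means a uniform cutoff drawn independently for each observation (an assumption about what was done): the chance that a uniform draw falls below a predicted probability is that probability itself, so the expected share of 1s equals the mean of y_hat. The predicted probabilities here are synthetic, and `target_rate` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predicted probabilities with a mean of about 2.5%
# (Beta(0.5, 19.5) has mean 0.5 / 20 = 0.025); not the asker's data.
p_hat = rng.beta(0.5, 19.5, size=200_000)

# Cutoff chosen so that roughly target_rate of observations are labelled 1.
target_rate = 0.03
cutoff = np.quantile(p_hat, 1.0 - target_rate)
y_pred = (p_hat > cutoff).astype(int)
frac = y_pred.mean()

# Uniform random cutoff per observation: the expected share of 1s
# equals mean(p_hat), regardless of how high the AUC is.
u = rng.uniform(size=p_hat.size)
frac_random = (p_hat > u).mean()

print(f"cutoff: {cutoff:.4f}")
print(f"share of 1s at quantile cutoff: {frac:.4f}")
print(f"share of 1s with random cutoffs: {frac_random:.4f}")
print(f"mean predicted probability: {p_hat.mean():.4f}")
```

AUC depends only on how the observations are ranked, so it says nothing about which cutoff to use or whether the probabilities themselves are well calibrated.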
