
I created a logistic regression model, but my data is very imbalanced (92% vs. 7%), so I created both a balanced and an unbalanced version using sklearn. For my version on the left, I used:

clf = LogisticRegression()

For my weight-balanced version (on the right), I did:

clf = LogisticRegression(class_weight='balanced')

When trying to calculate their odds ratios, I made a confusing discovery: both of them have a coefficient and intercept of 0 (or incredibly close to it, compared to other models with similar data), while their odds ratio is 1. Is there a specific reason?
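To illustrate what I mean by the odds ratio collapsing to 1: with a coefficient this close to zero, the per-unit OR $e^{\beta_1}$ is indistinguishable from 1 after rounding (the slope value below is just assumed for illustration, of the magnitude my models report):

```python
import numpy as np

# a slope of the magnitude my models report (exact value assumed for illustration)
coef = -4.867e-05

# odds ratio per 1-unit increase in X: exp(coef) is essentially 1
odds_ratio = np.exp(coef)
print(round(odds_ratio, 3))             # 1.0

# per 10,000-unit increase in X the same slope is clearly visible
print(round(np.exp(coef * 10_000), 3))  # 0.615
```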

[image: fitted logistic curves for the unweighted model (left) and the class_weight='balanced' model (right)]

Edit (MRE for the left-hand graph):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    import matplotlib.pyplot as plt
    import numpy as np

    # X, y, x_var, y_var come from my dataset (not shared): a single
    # predictor, a binary target, and their column names
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # fit the logistic regression model to the training data
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # predict the labels for the test set
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))

    # plot the data points
    plt.scatter(X_train, y_train)

    # plot the fitted probability curve
    x_range = np.linspace(X_train.min(), X_train.max(), 200).reshape(-1, 1)
    y_range = clf.predict_proba(x_range)[:, 1]
    plt.plot(x_range, y_range, color='red')
    plt.ylabel(y_var)
    plt.title(x_var)
    plt.show()

    print(f"\n Coefficient: {clf.coef_[0][0]} Intercept: {clf.intercept_[0]}"
          f"\n\n Odds Ratio (OR): {np.exp(clf.coef_)[0][0].round(3)}")

Sam333
  • When I feed the coefficients under your left-hand plot into R and recreate the fitted probabilities via $p=\frac{1}{1+\exp(-(\beta_0+\beta_1x))}$, I get a different line, almost a horizontal line at $p\approx 0.5$. Are you sure you do not have an error? Can you share your data with us? – Stephan Kolassa Jan 25 '23 at 07:24
    That said, the fit in your left plot looks reasonable. The one on the right does not. This is because your "balancing" of data biased the model by messing up the data it gets to see. "Unbalanced" data is still no issue for logistic regression, and oversampling will not "solve" a non-problem. – Stephan Kolassa Jan 25 '23 at 07:27
  • @StephanKolassa Unfortunately I can't share the data or be more specific about what it represents, but thanks for your insight. I was wondering: in the first comment you say you get almost a horizontal line at p≈0.5 (isn't that what my second chart, i.e. the balanced data, depicts?). I understand why the data imbalance shouldn't be the issue, but as you said, the chart should then be almost a horizontal line. Or do you suggest that my coefficients and intercepts are wrong and the error is there? – Sam333 Jan 25 '23 at 09:56
  • Judging from the plots, the fit should rather more look like the fit on the left (which I can't reproduce), since there seem to be fewer 1s over large values of income. It's hard to say without the underlying data. – Stephan Kolassa Jan 25 '23 at 09:58
  • @StephanKolassa please find the data in this pastebin (will automatically burn after a couple of day) https://pastebin.com/k5V7tDmi Thank you! – Sam333 Jan 25 '23 at 10:26
  • The numbers under your left graph are rounded to too few decimal places. The sigmoid is well defined, but the SD of X appears to be more than 10,000. Either standardize X or display the OR and intercept out to 8 decimal places. Also, your "balancing" appears to remove the association entirely, since the sigmoid is completely flat. It's really a mystery how "balancing" ever became such a discussion in statistics, and especially in ML. – AdamO Jan 25 '23 at 16:21

1 Answer


I recreated your model in R after putting the data you provided into two vectors X and Y (since you seem to be concerned about protecting your data, I will not reproduce it here):

model <- glm(Y~X,family="binomial")
plot(X,Y,pch=19,las=1)
X_pred <- seq(min(X),max(X),by=100)
lines(X_pred,predict(model,newdata=data.frame(X=X_pred),type="response"),col="red")

model

We see that this reproduces the left hand plot in your question. Per my comments, it makes no sense to "balance" data, so we will stick with this model.

A call summary(model) gives this output (snipped):

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.998e-03  1.175e+00  -0.009    0.993
X           -4.867e-05  3.081e-05  -1.580    0.114

The parameter estimate for the X coefficient matches your picture, but the intercept is quite different, -9.998e-03 against -1.209e-09. Since your picture matches this one, I assume you do have the correct parameter estimates and just had a typo in preparing your picture.

Now, why are these parameter estimates so small? That is just the way it is. Your model is fitted to give a good fit for the probability that Y=1, and your predictor is on a scale of tens of thousands (about $10^4$), so a regression parameter estimate of -4.867e-05 makes sense once it is multiplied by your X. The intercept then follows, given your data. Put differently, the plotted predictions make sense, going from $\hat{P}(Y=1|X=23,300)\approx 0.24$ to $\hat{P}(Y=1|X=58,500)\approx 0.05$ per

predict(model,newdata=data.frame(X=X_pred),type="response")

As to your odds ratios, we don't have enough information. An OR is a ratio between odds, specifically the odds estimated for two different situations, so we need to know for which two situations (i.e., values of X) your ORs were calculated. For instance, the OR between X=23,300 and X=58,500 is about 5.55:

$$ \frac{\frac{0.24}{1-0.24}}{\frac{0.05}{1-0.05}}\approx 5.55 $$

In R:

pp <- predict(model,newdata=data.frame(X=c(23300,58500)),type="response")
(odds_ratio <- (pp[1]/(1-pp[1]))/(pp[2]/(1-pp[2])))
#        1 
# 5.546423 
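For completeness, here is the same calculation in Python, using the intercept and slope from the summary output above (plain NumPy, no sklearn needed):

```python
import numpy as np

# intercept and slope from the R summary output above
b0, b1 = -9.998e-03, -4.867e-05

def prob(x):
    """Fitted P(Y=1 | X=x) under the logistic model."""
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

p1, p2 = prob(23_300), prob(58_500)
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
print(round(odds_ratio, 3))  # 5.547

# the intercept cancels out of the ratio, so for a single predictor
# the OR depends only on the slope and the difference in X
print(round(np.exp(b1 * (23_300 - 58_500)), 3))  # same value
```

The tiny difference from R's 5.546423 comes from using the rounded coefficients from the summary table rather than the full-precision estimates.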
Stephan Kolassa
  • Thank you so much for your detailed answer, I really appreciate it. Unfortunately, I have copied the values from the model correctly, so my intercept is indeed -1.20e-09. Might it be due to a different implementation in the library? I am not familiar with R at all, but in my case I used this library function in Python. – Sam333 Jan 25 '23 at 14:51
  • Especially with very small numbers, there may well be issues with numerical precision. But if sklearn truly used those numbers, then the plot should look different. Can you edit your post with a minimal amount of code to recreate the left-hand plot? – Stephan Kolassa Jan 25 '23 at 15:32
  • I added the MRE to the original question. Note that I only used 80% of my data points for training because I left 20% for testing the model; however, even if I ran the regression on all of the data points, the intercept and coefficient would stay the same for me. – Sam333 Jan 25 '23 at 16:01
  • There wouldn't be any visible difference between $10^{-3}$ and $10^{-9}$ for the intercept here: you would get identical plots to within a fraction of a pixel. One would fully expect numerical difficulties, knowing the solution is a numerical algorithm that likely is terminated once somewhere between 4 and 8 significant decimal digits are found in the estimates and their associated likelihood. – whuber Jan 25 '23 at 16:31
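whuber's point is easy to check numerically: plugging both intercepts (the $-10^{-2}$ from the R fit and the $-10^{-9}$ reported by sklearn) into the fitted sigmoid over the observed range of X moves every probability by less than 0.2 percentage points:

```python
import numpy as np

b1 = -4.867e-05                  # slope, common to both fits
x = np.linspace(23_300, 58_500, 1_000)

def prob(x, b0):
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

p_r  = prob(x, -9.998e-03)       # intercept from the R summary
p_sk = prob(x, -1.209e-09)       # intercept reported by sklearn

print(np.abs(p_r - p_sk).max())  # under 0.002: invisible on the plot
```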