I am trying to understand why the output from logistic regression of these two libraries gives different results.
I am using the dataset from UCLA idre tutorial, predicting admit based
on gre, gpa and rank. rank is treated as categorical variable, so it
is first converted to dummy variable with rank_1 dropped. An intercept
column is also added.
py
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
y, X = dmatrices('admit ~ gre + gpa + C(rank)', df, return_type = 'dataframe')
X.head()
> Intercept C(rank)[T.2] C(rank)[T.3] C(rank)[T.4] gre gpa
0 1 0 1 0 380 3.61
1 1 0 1 0 660 3.67
2 1 0 0 0 800 4.00
3 1 0 0 1 640 3.19
4 1 0 0 1 520 2.93
# Output from scikit-learn
model = LogisticRegression(fit_intercept = False)
mdl = model.fit(X, y)
model.coef_
> array([[-1.35417783, -0.71628751, -1.26038726, -1.49762706, 0.00169198,
0.13992661]])
# corresponding to predictors [Intercept, rank_2, rank_3, rank_4, gre, gpa]
# Output from statsmodels
logit = sm.Logit(y, X)
logit.fit().params
> Optimization terminated successfully.
Current function value: 0.573147
Iterations 6
Intercept -3.989979
C(rank)[T.2] -0.675443
C(rank)[T.3] -1.340204
C(rank)[T.4] -1.551464
gre 0.002264
gpa 0.804038
dtype: float64
The output from statsmodels is the same as shown on the idre website, but I
am not sure why scikit-learn produces a different set of coefficients. Does
it minimize some different loss function? Is there any documentation that
states the implementation?