I am using a Jupyter notebook (Anaconda) to fit a multinomial logistic regression. The target has 3 possible classes {1, 2, 3}, there are 32 independent variables, and I have 50k records with an 80-20 train-test split.
I am pretty new to this.
My questions are:
- Why does the logistic model predict 1 for every record, i.e. behave exactly like the baseline model?
- Why does LogisticRegressionCV with class_weight='balanced' have such a low accuracy, while the one without class_weight has a 'high' accuracy?
What did I do wrong?
baseline model
from sklearn.dummy import DummyClassifier
dummy_model = DummyClassifier(strategy='most_frequent', random_state=0)
dummy_model.fit(x_train, y_train)
result:
accuracy score: 0.9706794756329431
confusion matrix:
[[9700 0 0]
[ 211 0 0]
[ 82 0 0]]
classification report:
precision recall f1-score support
1 0.97 1.00 0.99 9700
2 0.00 0.00 0.00 211
3 0.00 0.00 0.00 82
avg / total 0.94 0.97 0.96 9993
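(For reference, here is a minimal, self-contained sketch of how such scores are produced, with toy data standing in for the real x_train/y_train; the real data has 32 features and a 97/2/1 class split.)

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy imbalanced data standing in for the real x_train/y_train.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.array([1] * 90 + [2] * 7 + [3] * 3)

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
y_pred = dummy.predict(X)

# 'most_frequent' always predicts the majority class, so accuracy
# equals the majority-class proportion (0.90 here).
print(accuracy_score(y, y_pred))
print(confusion_matrix(y, y_pred))
```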
logit regression
from sklearn.linear_model import LogisticRegression
# penalty='l1' needs a solver that supports it (e.g. solver='liblinear',
# which was the default in older scikit-learn versions)
logit_model = LogisticRegression(C=0.05, random_state=18, class_weight='balanced',
                                 penalty='l1', solver='liblinear')
logit_model.fit(x_train, y_train)
result:
accuracy score: 0.9706794756329431
confusion matrix:
[[9700 0 0]
[ 211 0 0]
[ 82 0 0]]
classification report:
precision recall f1-score support
1 0.97 1.00 0.99 9700
2 0.00 0.00 0.00 211
3 0.00 0.00 0.00 82
avg / total 0.94 0.97 0.96 9993
logit regression by GridSearchCV:
from sklearn.model_selection import GridSearchCV
logit_model_base = LogisticRegression(random_state=18)
parameters = {'C': [0.03, 0.05, 0.08, 0.1, 0.3, 0.5, 10], 'penalty': ['l1', 'l2']}
logit_model_best = GridSearchCV(logit_model_base, param_grid=parameters, cv=3)
logit_model_best.fit(x_train, y_train)
result:
accuracy score: 0.9706794756329431
confusion matrix:
[[9700 0 0]
[ 211 0 0]
[ 82 0 0]]
classification report:
precision recall f1-score support
1 0.97 1.00 0.99 9700
2 0.00 0.00 0.00 211
3 0.00 0.00 0.00 82
avg / total 0.94 0.97 0.96 9993
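(One thing worth noting: GridSearchCV as called above ranks candidates by its default scoring, which is plain accuracy for a classifier, so with 97% of records in class 1 every candidate that predicts all 1s ties at the top. A class-sensitive metric can be requested via the scoring parameter; a sketch with toy data, not my real setup:)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy imbalanced data for illustration only.
rng = np.random.RandomState(18)
X = rng.rand(200, 4)
y = np.array([1] * 180 + [2] * 14 + [3] * 6)

params = {'C': [0.1, 1.0]}
# scoring='f1_macro' averages per-class F1, so minority classes count
# as much as the majority class during model selection.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=params,
                      cv=3, scoring='f1_macro')
search.fit(X, y)
print(search.best_params_)
```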
LogisticRegressionCV with class_weight = 'balanced'
from sklearn.linear_model import LogisticRegressionCV
logit_model_cv = LogisticRegressionCV(cv=10, class_weight='balanced')
logit_model_cv.fit(x_train, y_train)
result:
accuracy score: 0.2982087461222856
confusion matrix:
[[2831 3384 3485]
[ 36 104 71]
[ 9 28 45]]
classification report:
precision recall f1-score support
1 0.98 0.29 0.45 9700
2 0.03 0.49 0.06 211
3 0.01 0.55 0.02 82
avg / total 0.96 0.30 0.44 9993
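(For what it's worth, class_weight='balanced' reweights each class by n_samples / (n_classes * count(class)), so the rare classes get very large weights. A sketch of the formula, using the class counts from the supports above, 9700/211/82, purely for illustration:)

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the confusion-matrix supports above
# (test-set counts, used here only to illustrate the weighting).
y = np.repeat([1, 2, 3], [9700, 211, 82])
weights = compute_class_weight('balanced', classes=np.array([1, 2, 3]), y=y)

# Each weight is n_samples / (n_classes * n_class_samples):
# 9993 / (3 * 9700) ~ 0.34, 9993 / (3 * 211) ~ 15.8, 9993 / (3 * 82) ~ 40.6
print(weights)
```

So errors on class 3 cost roughly 120 times as much as errors on class 1, which is why the balanced model sacrifices so much majority-class accuracy.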
LogisticRegressionCV without class_weight = 'balanced'
from sklearn.linear_model import LogisticRegressionCV
logit_model_cv = LogisticRegressionCV(cv=10)
logit_model_cv.fit(x_train, y_train)
result:
accuracy score: 0.9706794756329431
confusion matrix:
[[9700 0 0]
[ 211 0 0]
[ 82 0 0]]
classification report:
precision recall f1-score support
1 0.97 1.00 0.99 9700
2 0.00 0.00 0.00 211
3 0.00 0.00 0.00 82
avg / total 0.94 0.97 0.96 9993