I am trying to compare logistic regression with XGBoost on simulated data. What I found is that the XGBoost AUC is better than that of logistic regression, even when logistic regression predicts the perfect probabilities (the probabilities used to generate the binary outcomes).
Please see details below:
- Simulate X: generate 4 random variables (x1, x2, x3 and x4). See code section A.
- Simulate Y: let log_odds = x1+x2+x3+x4 (setting all coefficients to 1 and the intercept to 0). Then convert log_odds to a probability, and use that probability to generate a binary outcome. See code section B.
- Fit logistic regression. The estimated coefficients are very close to the ones used for the simulation: coef [[0.92180079 1.07390035 0.97258221 0.80164048]], intercept [-0.00462648]. The AUC is 0.834. See code section C.
- Fit XGBoost. The AUC is 0.908. See code section D.
- Simulate a testing set with a different random seed. The logistic regression AUC is 0.836, and the XGBoost AUC is 0.907. See code section E.
As I understand it, by using the simulated probabilities to generate binary outcomes, I am introducing randomness into the data that cannot be modelled or predicted. However, if logistic regression already predicts probabilities that are so close to the simulated ones, how can XGBoost perform better?
Is this a problem of AUC, my test design, or my code? Thank you very much in advance!
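As a reference point for the reasoning above, here is a minimal sketch that scores the simulated outcomes with the true generating probabilities themselves. It assumes the same feature construction as sections A and B; the names `rng`, `true_prob`, `y_sim` and `oracle_auc`, and the use of `np.random.default_rng`, are illustrative choices and not part of the original code, so the number will not match the figures quoted above exactly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative generator and seed (assumption: not the seeding used in the post)
rng = np.random.default_rng(1)
n = 10000

# Same feature construction as in Section A
x1 = rng.random(n) * 4 - 2
x2 = x1 + rng.random(n)
x3 = -rng.random(n) * 1.9
x4 = rng.random(n)

# True probabilities and binary outcomes, as in Section B
true_prob = 1 / (1 + np.exp(-(x1 + x2 + x3 + x4)))
y_sim = rng.binomial(1, true_prob)

# AUC obtained when ranking observations by the true probabilities
oracle_auc = roc_auc_score(y_sim, true_prob)
print("AUC of the true probabilities:", oracle_auc)
```

Since ranking by the true probabilities is, in expectation, the best possible ranking for AUC, a model that beats this value by a wide margin on fresh data would hint that the two models are not being measured on the same metric.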
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.metrics import classification_report
from numpy.random import seed
from numpy.random import rand
Section A: Simulate X
random.seed(1)
seed(1)
n=10000
x1=np.array(rand(n))*4-2
x2=x1+np.array(rand(n))
x3=-np.array(rand(n))*1.9
x4=np.array(rand(n))*1
print(sum(x1<=x2)==n)
df=pd.DataFrame({"x1":x1,"x2":x2,"x3":x3,"x4":x4})
Section B: Simulate Y
def logistic(z):
    return 1 / (1 + np.exp(-z))
lp=x1+x2+x3+x4
prob = logistic(lp)
y = np.random.binomial(1, prob.flatten())
Section C: Fit logistic regression and check AUC
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(df.values,y)
print("coef: ",LR.coef_, LR.intercept_)
print("AUC: ",LR.score(df, y))
Section D: Fit XGBoost and check AUC
from sklearn.tree import DecisionTreeRegressor, plot_tree
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score, auc, log_loss
fit_xgb= XGBClassifier(booster='gbtree', numWorkers=1, n_estimators=10, minChildWeight=15.0, seed=1,
objective='binary:logistic', maxDepth=3,eta=0.08, reg_lambda=10.0, alpha=10.0,
gamma=0.0, colsampleBytree=0.7, subsample=0.8)
fit_xgb.fit(df, y)
xgb_prob = fit_xgb.predict_proba(df)[:,1]
print('XGB AUC is', roc_auc_score(y, xgb_prob))
Section E: Simulate testing set with different random seed, and check AUC
random.seed(10)
seed(10)
n=10000
x1_1=np.array(rand(n))*4-2
x2_1=x1_1+np.array(rand(n))
x3_1=-np.array(rand(n))*1.9
x4_1=np.array(rand(n))*1
print(sum(x1_1<=x2_1)==n)
df_1=pd.DataFrame({"x1":x1_1,"x2":x2_1,"x3":x3_1,"x4":x4_1})
lp_1=x1_1+x2_1+x3_1+x4_1
prob_1 = logistic(lp_1)
y_1 = np.random.binomial(1, prob_1.flatten())
xgb_prob_1 = fit_xgb.predict_proba(df_1)[:,1]
print('XGB AUC on testing set is: ', roc_auc_score(y_1, xgb_prob_1))
print("Logistic regression AUC is: ",LR.score(df_1, y_1))
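One thing that may be worth checking in the code above: for scikit-learn classifiers, `score` returns mean accuracy rather than AUC, while the XGBoost numbers come from `roc_auc_score`. Below is a minimal sketch that puts both metrics side by side for logistic regression. It re-simulates data in the same way as sections A and B; the names `rng`, `X_chk`, `y_chk`, `lr_chk`, `lr_accuracy` and `lr_auc` are illustrative, and the values it prints are not the ones reported in the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Illustrative generator and seed (assumption: not the seeding used in the post)
rng = np.random.default_rng(1)
n = 10000

# Same simulation as sections A and B
x1 = rng.random(n) * 4 - 2
x2 = x1 + rng.random(n)
x3 = -rng.random(n) * 1.9
x4 = rng.random(n)
X_chk = np.column_stack([x1, x2, x3, x4])
y_chk = rng.binomial(1, 1 / (1 + np.exp(-(x1 + x2 + x3 + x4))))

lr_chk = LogisticRegression().fit(X_chk, y_chk)

# score() on a classifier is mean accuracy of the predicted labels
lr_accuracy = lr_chk.score(X_chk, y_chk)

# AUC needs the predicted probabilities of the positive class
lr_auc = roc_auc_score(y_chk, lr_chk.predict_proba(X_chk)[:, 1])

print("LR accuracy (what score() returns):", lr_accuracy)
print("LR AUC (roc_auc_score on predict_proba):", lr_auc)
```

If the same pattern is applied to `LR` and `df_1` from section E, the logistic regression and XGBoost figures would at least be computed with the same metric.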