
I tried to compare logistic regression with XGBoost on simulated data. What I found is that the XGBoost AUC is better than that of logistic regression, even when the logistic regression predicts nearly perfect probabilities (the probabilities used to generate the binary outcomes).

Please see details below:

  1. Simulate X: generate 4 random variables (x1, x2, x3 and x4). See code section A.
  2. Simulate Y: let log_odds = x1+x2+x3+x4 (setting all coefficients to 1 and the intercept to 0). Then convert log_odds to a probability, and use that probability to generate a binary outcome. See code section B.
  3. Fit logistic regression. The estimated coefficients are very close to the ones used for simulation. The AUC is 0.834. coef: [[0.92180079 1.07390035 0.97258221 0.80164048]], intercept: [-0.00462648]. See code section C.
  4. Fit XGBoost. The AUC is 0.908. See code section D.
  5. Simulate a testing set with a different random seed. The logistic regression AUC is 0.836, and the XGBoost AUC is 0.907. See code section E.

As I understand it, when I use the simulated probabilities to generate binary outcomes, I am introducing randomness into the data that cannot be modelled or predicted. However, if the logistic regression already predicts probabilities that are so close to the simulated ones, how could XGBoost achieve better performance?
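One way to check this intuition is to compute the AUC of the true simulated probabilities themselves: since y is drawn from prob, no model can beat the true probabilities in expectation, so their AUC is an approximate ceiling on achievable discrimination. A minimal sketch, assuming prob and y from code section B below:

# Sketch: AUC of the data-generating probabilities (approximate ceiling for any model)
from sklearn.metrics import roc_auc_score

oracle_auc = roc_auc_score(y, prob)
print("AUC of the true probabilities:", oracle_auc)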

Is this a problem with AUC, with my test design, or with my code? Thank you very much in advance!

import random
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.metrics import classification_report
from numpy.random import seed
from numpy.random import rand

Section A: Simulate X

random.seed(1)
seed(1)
n = 10000
# four predictors on different scales; x2 is built from x1, so x1 <= x2 always holds
x1 = np.array(rand(n)) * 4 - 2
x2 = x1 + np.array(rand(n))
x3 = -np.array(rand(n)) * 1.9
x4 = np.array(rand(n)) * 1
print(sum(x1 <= x2) == n)   # sanity check
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})

Section B: Simulate Y

def logistic(z):
    return 1 / (1 + np.exp(-z))

lp = x1 + x2 + x3 + x4                      # linear predictor: all coefficients 1, intercept 0
prob = logistic(lp)                         # true probabilities
y = np.random.binomial(1, prob.flatten())   # draw binary outcomes from the true probabilities

Section C: Fit logistic regression and check AUC

from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(df.values, y)
print("coef: ", LR.coef_, LR.intercept_)
print("AUC: ", LR.score(df, y))

Section D: Fit XGBoost and check AUC

from sklearn.tree import DecisionTreeRegressor, plot_tree
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score, auc, log_loss

fit_xgb = XGBClassifier(booster='gbtree', n_jobs=1, n_estimators=10,
                        min_child_weight=15.0, random_state=1,
                        objective='binary:logistic', max_depth=3, learning_rate=0.08,
                        reg_lambda=10.0, reg_alpha=10.0, gamma=0.0,
                        colsample_bytree=0.7, subsample=0.8)

fit_xgb.fit(df, y)
xgb_prob = fit_xgb.predict_proba(df)[:, 1]
print('XGB AUC is', roc_auc_score(y, xgb_prob))

Section E: Simulate testing set with different random seed, and check AUC

random.seed(10)
seed(10)
n = 10000
x1_1 = np.array(rand(n)) * 4 - 2
x2_1 = x1_1 + np.array(rand(n))
x3_1 = -np.array(rand(n)) * 1.9
x4_1 = np.array(rand(n)) * 1
print(sum(x1_1 <= x2_1) == n)   # sanity check
df_1 = pd.DataFrame({"x1": x1_1, "x2": x2_1, "x3": x3_1, "x4": x4_1})

lp_1 = x1_1 + x2_1 + x3_1 + x4_1
prob_1 = logistic(lp_1)
y_1 = np.random.binomial(1, prob_1.flatten())

xgb_prob_1 = fit_xgb.predict_proba(df_1)[:, 1]
print('XGB AUC on testing set is: ', roc_auc_score(y_1, xgb_prob_1))
print("Logistic regression AUC is: ", LR.score(df_1, y_1))

Watchung
  • Looks like you're evaluating the model on the same data it was trained on. What happens when you use a hold-out set to evaluate the ROC? – Demetri Pananos Apr 12 '23 at 16:15
  • @DemetriPananos In step 5 (see code section E), I changed the random seed from 1 (for the training set) to 10 (for the testing set), so the testing data is generated separately. The training-set AUCs are (Logistic: 0.834; XGBoost: 0.908); the testing-set AUCs are (Logistic: 0.836; XGBoost: 0.907). – Watchung Apr 12 '23 at 16:48
  • Unfortunately, this is primarily a code issue: the code compares mean accuracy against AUC-ROC. Please see my answer below for more details. In general, when comparing probabilistic estimates, the Brier score is preferable, but that isn't the main issue here. (+1 though, fun question) – usεr11852 Apr 12 '23 at 17:49
  • @usεr11852 Thank you very much! I just accepted the answer, and will look into the Brier score as well. – Watchung Apr 12 '23 at 18:09
  • You're not actually fitting a logistic regression; you are fitting a logistic ridge regression. For some reason the default in scikit-learn is to add a ridge penalty, which can help or hurt your ability to perfectly recover a regression relationship. – Björn Apr 12 '23 at 19:35
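On the ridge-penalty point in the last comment above: a minimal sketch of fitting an unpenalized logistic regression in scikit-learn, assuming df and y from the code sections above (penalty=None requires scikit-learn ≥ 1.2; older versions spell it penalty='none'):

# Sketch: scikit-learn's default LogisticRegression applies an L2 (ridge) penalty with C=1.0;
# penalty=None fits a plain maximum-likelihood logistic regression instead.
from sklearn.linear_model import LogisticRegression

LR_unpenalized = LogisticRegression(penalty=None)
LR_unpenalized.fit(df.values, y)
print("unpenalized coef:", LR_unpenalized.coef_, LR_unpenalized.intercept_)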

1 Answer


I have upvoted Sycorax's answer as useful. Nevertheless, I think there is a serious issue with the question: sklearn.linear_model.LogisticRegression.score returns the mean accuracy, not AUC-ROC. If we used LR.predict_proba(df_1)[:,1] to get the predicted probability estimates, the AUC-ROC values in both the training and testing sets would be higher for the "perfect" logistic regression model than for XGBoost. For example, in the testing set, XGBoost's AUC-ROC is 0.9071 and the logistic regression's AUC-ROC is 0.9167.
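To make this concrete, a minimal sketch of the comparison described above, assuming df_1, y_1, LR and fit_xgb from the question's code sections:

from sklearn.metrics import roc_auc_score

# LR.score() returns mean accuracy; for AUC-ROC, score the predicted probabilities instead.
lr_prob_1 = LR.predict_proba(df_1)[:, 1]
xgb_prob_1 = fit_xgb.predict_proba(df_1)[:, 1]
print("Logistic regression AUC-ROC (test):", roc_auc_score(y_1, lr_prob_1))
print("XGBoost AUC-ROC (test):", roc_auc_score(y_1, xgb_prob_1))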

usεr11852
  • Yep, that would do it. – Sycorax Apr 12 '23 at 17:42
  • Still a useful answer from you though! :) I started looking at this thinking: "The Brier score should pick this up, I will talk about having proper scoring rules, etc." then I was like "Oh... darn." – usεr11852 Apr 12 '23 at 17:42
  • Honestly, it's mystifying to me that sklearn implements the score method at all. It's just another sharp corner that people get cut on all the time. Once the library matured to letting you specify a scoring metric to use for pipelines (like CV), the choices baked into the default score became a trap. – Sycorax Apr 12 '23 at 17:50
  • Why would the logistic regression be better in-sample? – Dave Apr 12 '23 at 18:01
  • @Dave: Sorry, I might be misinterpreting your question. I haven't said anything about the in-/out-of-sample distinction in my answer; I just mention the testing set for reference. That said, using a testing/out-of-sample set is simply a way to avoid evaluating the learners on the same data they were trained on. Obviously, given that we know the DGP and we essentially back-fit against it, we should not expect substantial differences in the metrics between in- and out-of-sample evaluations. – usεr11852 Apr 12 '23 at 20:11
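Following up on the Brier-score remarks in the comments: a minimal sketch of the same test-set comparison with a proper scoring rule (lower is better), again assuming y_1, df_1, LR and fit_xgb from the question's code:

from sklearn.metrics import brier_score_loss

# The Brier score is a proper scoring rule for probabilistic predictions (lower is better).
print("Logistic regression Brier score (test):", brier_score_loss(y_1, LR.predict_proba(df_1)[:, 1]))
print("XGBoost Brier score (test):", brier_score_loss(y_1, fit_xgb.predict_proba(df_1)[:, 1]))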