
I've trained a simple xgboost model; I didn't even do hyperparameter tuning:

y=2-as.numeric(as.factor(Train_set[,1]))
dtrain <- xgb.DMatrix(data = data.matrix(Train_set[,-1]), label= y)
xgb_params <- list(
  booster = 'gbtree',
  eta = 0.5,
  max_depth = 5,
  gamma = 10,
  subsample = 0.75,
  colsample_bytree = 0.8,
  objective = 'binary:logistic',
  eval_metric = 'auc')

model <- xgboost(data = dtrain, # the data
                 params = xgb_params,
                 nrounds = 50)

xgb_trn = roc(Train_set[,1], predict(model, dtrain))
plot(xgb_trn, main = 'xgb train')

xgb_tst = roc(Test_set[,1], predict(model, data.matrix(Test_set[,-1])))
plot(xgb_tst, main = 'xgb test')

The data is fine: when I use it to predict another target feature, it works. Here I changed the target variable because I want to predict something else, and all of a sudden every iteration of the model reports a train-AUC of 1.00:

> model <- xgboost(data = dtrain, # the data
+                  params = xgb_params,
+                  nrounds = 50)
[1] train-auc:1.000000 
[2] train-auc:1.000000 
[3] train-auc:1.000000 
[4] train-auc:1.000000 
[5] train-auc:1.000000 
[6] train-auc:1.000000 
[7] train-auc:1.000000 
[8] train-auc:1.000000 
[9] train-auc:1.000000 
[10]    train-auc:1.000000 

Please have a look over this subset of the train set:

structure(list(RESP = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L), levels = c("NoResponse", "Response"
), class = "factor"), gender = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
1, 0, 0, 1, 1), CD4.T.cells = c(-0.0741098696855045, -0.094401270881699, 
0.0410284948786532, -0.163302950330185, -0.167314411991775, -0.0366277340916379, 
-0.167823357941815, -0.0809646843667242, -0.148668434567449, 
-0.0726825919321525, -0.062499826731091, -0.0861178015030313, 
-0.117687306656149, -0.141342090175904, -0.15593285099477), CD8.T.cells = c(-0.178835447722468, 
-0.253897294559596, -0.0372301980787381, -0.230579110769457, 
-0.196933050675633, -0.0550538743643369, -0.162295446209879, 
-0.276178546845023, -0.204485697907641, -0.210241384645447, -0.280984614146961, 
-0.262382296291096, -0.219293669044801, -0.132089229492629, -0.245230934265636
), T.helpers = c(-0.0384421660291032, -0.0275306107582565, 0.186447606591857, 
-0.124972070102036, -0.106812144494277, 0.0686746776877563, 0.00388752358937872, 
-0.0729755869081981, -0.0127651150376793, -0.00167101704571948, 
-0.0969110962088053, 0.0305873691536314, 0.0719413632572499, 
0.0143645056063724, -0.130603441365545), NK.cells = c(-0.0924083910597563, 
-0.172356328661097, -0.0172673823614314, 0.0280649471541352, 
-0.0875076743713435, -0.0518877213975413, -0.0607256681646583, 
-0.184546079512101, -0.114742845076629, -0.0975859552072595, 
-0.174457678991973, -0.189516761836086, -0.166242090299456, -0.0104175102470585, 
-0.0873057138342297), Monocytes = c(-0.0680848706469295, -0.173427291586957, 
-0.0106773958944477, -0.0015805672257001, -0.0737177243152751, 
-0.0674023045286274, -0.0168615569795438, -0.149380203815874, 
-0.13581671068326, -0.0681190929378688, -0.151342317151381, -0.200427316546487, 
-0.102704828499578, 0.000081008521555237, -0.0917016103306532
), Neutrophils = c(-0.0391833488213571, -0.0275279418713283, 
0.0156454755097513, 0.0285160860867748, 0.0252778805872529, 0.0432343965225797, 
0.0506419186309599, -0.0693846217599099, -0.0682613686336473, 
0.00338678597014841, -0.033454713772469, -0.0614961455460419, 
-0.0608931312183019, 0.0126256299876522, 0.0408488806008269), 
    gd.T.Cells = c(-0.162246594987039, -0.297759223265742, -0.0814825699645205, 
    -0.0688779846190755, -0.264420103679214, -0.162709306032616, 
    -0.0969595022926227, -0.292342418053931, -0.296763345807347, 
    -0.221464822653774, -0.215639270235576, -0.323923302627997, 
    -0.226301570680385, -0.115299138190696, -0.209135117178106
    ), Non.plasma.B.cells = c(-0.0384755654971015, -0.114370815587458, 
    0.161268251261644, -0.0571463865006043, -0.0822058328898433, 
    0.114155959200915, 0.0289800756140739, -0.0923514068231641, 
    -0.113300459490648, -0.0721711891176557, -0.16311341310655, 
    -0.101134395440272, -0.0545178013216237, -0.0239839042195814, 
    -0.111084811179261)), row.names = c("Pt1", "Pt10", "Pt101", 
"Pt103", "Pt11", "Pt18", "Pt24", "Pt26", "Pt28", "Pt29", "Pt3", 
"Pt30", "Pt31", "Pt34", "Pt37"), class = "data.frame")

the target feature is RESP.

I'm familiar with overfitting, but is this overfitting, given that both sets fit perfectly?

  • The first thing to do is start looking for bugs. The most likely cause is that your data accidentally contains a feature that determines the target, or a preprocessing error accidentally encodes the target. On the other hand, some problems are just easy. There's really no way for a person who doesn't have access to every step of the data collection and processing pipeline to write a definitive answer. – Sycorax Dec 13 '22 at 16:14

1 Answer


This happens because y has the opposite encoding from Train_set[,1]. Your y maps a 2 in Train_set[,1] to 0 and a 1 to 1, but then you build a ROC curve with Train_set[,1], which has the opposite orientation compared to y. This reverses the ROC AUC value, so a model with AUC = 0 will report AUC = 1 and vice versa. (To see why, work out an example AUC calculation by hand, then reverse the labels and do it again. See: How to calculate Area Under the Curve (AUC), or the c-statistic, by hand.)
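The label-flipping effect is easy to verify numerically. Below is a small self-contained sketch (with made-up scores) that computes AUC by hand as the fraction of positive/negative pairs the scores rank correctly; flipping the labels yields exactly 1 minus the original AUC:

```r
# AUC as the probability that a random positive outranks a random negative
# (ties count 1/2). Labels and scores here are arbitrary toy values.
auc_by_hand <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

labels <- c(1, 1, 1, 0, 0)
scores <- c(0.9, 0.8, 0.3, 0.6, 0.2)

a1 <- auc_by_hand(labels, scores)      # AUC with the original labels
a2 <- auc_by_hand(1 - labels, scores)  # AUC with the labels flipped
# a1 + a2 equals 1: flipping the labels reverses the AUC
```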

To fix this, don't switch the orientation of your labels. Either use y everywhere, or use Train_set[,1] everywhere.
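As a minimal sketch of the consistent approach (using synthetic data and made-up feature names, not your actual Train_set), derive a single 0/1 label vector once and pass that same vector both to xgb.DMatrix and to roc():

```r
library(xgboost)
library(pROC)

# Synthetic stand-in for the real data frame
set.seed(1)
RESP <- factor(sample(c("NoResponse", "Response"), 100, replace = TRUE))
X    <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c("f1", "f2")))

# ONE label vector, used everywhere: 1 = Response, 0 = NoResponse
y <- as.numeric(RESP == "Response")

dtrain <- xgb.DMatrix(data = X, label = y)
model  <- xgboost(data = dtrain, nrounds = 10,
                  objective = "binary:logistic", verbose = 0)

# Same y (not the raw factor) when computing the ROC curve
roc_trn <- roc(y, predict(model, dtrain))
```

Testing y == "Response" also makes the encoding explicit, instead of relying on the alphabetical level order that as.numeric(as.factor(...)) produces.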

Sycorax