
I've trained a simple xgboost model; I didn't even do hyperparameter tuning:

y=2-as.numeric(as.factor(Train_set[,1]))
dtrain <- xgb.DMatrix(data = data.matrix(Train_set[,-1]), label= y)
xgb_params <- list(
  booster = 'gbtree',
  eta = 0.5,
  max_depth = 5,
  gamma = 10,
  subsample = 0.75,
  colsample_bytree = 0.8,
  objective = 'binary:logistic',
  eval_metric = 'auc')

model <- xgboost(data = dtrain, # the data
                 params = xgb_params,
                 nrounds = 50)

xgb_trn = roc(Train_set[,1], predict(model, dtrain))
plot(xgb_trn, main = 'xgb train')

xgb_tst = roc(Test_set[,1], predict(model, data.matrix(Test_set[,-1])))
plot(xgb_tst, main = 'xgb test')

The data is fine: when I use it to predict another target feature, it works. Here I changed the target variable because I want to predict something else, and all of a sudden every iteration of the model reports a train-AUC of 1.00:

> model <- xgboost(data = dtrain, # the data
+                  params = xgb_params,
+                  nrounds = 50)
[1] train-auc:1.000000 
[2] train-auc:1.000000 
[3] train-auc:1.000000 
[4] train-auc:1.000000 
[5] train-auc:1.000000 
[6] train-auc:1.000000 
[7] train-auc:1.000000 
[8] train-auc:1.000000 
[9] train-auc:1.000000 
[10]    train-auc:1.000000 

Please have a look over this subset of the train set:

structure(list(RESP = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L), levels = c("NoResponse", "Response"
), class = "factor"), gender = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
1, 0, 0, 1, 1), CD4.T.cells = c(-0.0741098696855045, -0.094401270881699, 
0.0410284948786532, -0.163302950330185, -0.167314411991775, -0.0366277340916379, 
-0.167823357941815, -0.0809646843667242, -0.148668434567449, 
-0.0726825919321525, -0.062499826731091, -0.0861178015030313, 
-0.117687306656149, -0.141342090175904, -0.15593285099477), CD8.T.cells = c(-0.178835447722468, 
-0.253897294559596, -0.0372301980787381, -0.230579110769457, 
-0.196933050675633, -0.0550538743643369, -0.162295446209879, 
-0.276178546845023, -0.204485697907641, -0.210241384645447, -0.280984614146961, 
-0.262382296291096, -0.219293669044801, -0.132089229492629, -0.245230934265636
), T.helpers = c(-0.0384421660291032, -0.0275306107582565, 0.186447606591857, 
-0.124972070102036, -0.106812144494277, 0.0686746776877563, 0.00388752358937872, 
-0.0729755869081981, -0.0127651150376793, -0.00167101704571948, 
-0.0969110962088053, 0.0305873691536314, 0.0719413632572499, 
0.0143645056063724, -0.130603441365545), NK.cells = c(-0.0924083910597563, 
-0.172356328661097, -0.0172673823614314, 0.0280649471541352, 
-0.0875076743713435, -0.0518877213975413, -0.0607256681646583, 
-0.184546079512101, -0.114742845076629, -0.0975859552072595, 
-0.174457678991973, -0.189516761836086, -0.166242090299456, -0.0104175102470585, 
-0.0873057138342297), Monocytes = c(-0.0680848706469295, -0.173427291586957, 
-0.0106773958944477, -0.0015805672257001, -0.0737177243152751, 
-0.0674023045286274, -0.0168615569795438, -0.149380203815874, 
-0.13581671068326, -0.0681190929378688, -0.151342317151381, -0.200427316546487, 
-0.102704828499578, 0.000081008521555237, -0.0917016103306532
), Neutrophils = c(-0.0391833488213571, -0.0275279418713283, 
0.0156454755097513, 0.0285160860867748, 0.0252778805872529, 0.0432343965225797, 
0.0506419186309599, -0.0693846217599099, -0.0682613686336473, 
0.00338678597014841, -0.033454713772469, -0.0614961455460419, 
-0.0608931312183019, 0.0126256299876522, 0.0408488806008269), 
    gd.T.Cells = c(-0.162246594987039, -0.297759223265742, -0.0814825699645205, 
    -0.0688779846190755, -0.264420103679214, -0.162709306032616, 
    -0.0969595022926227, -0.292342418053931, -0.296763345807347, 
    -0.221464822653774, -0.215639270235576, -0.323923302627997, 
    -0.226301570680385, -0.115299138190696, -0.209135117178106
    ), Non.plasma.B.cells = c(-0.0384755654971015, -0.114370815587458, 
    0.161268251261644, -0.0571463865006043, -0.0822058328898433, 
    0.114155959200915, 0.0289800756140739, -0.0923514068231641, 
    -0.113300459490648, -0.0721711891176557, -0.16311341310655, 
    -0.101134395440272, -0.0545178013216237, -0.0239839042195814, 
    -0.111084811179261)), row.names = c("Pt1", "Pt10", "Pt101", 
"Pt103", "Pt11", "Pt18", "Pt24", "Pt26", "Pt28", "Pt29", "Pt3", 
"Pt30", "Pt31", "Pt34", "Pt37"), class = "data.frame")

the target feature is RESP.

I'm familiar with overfitting, but is this overfitting, given that both sets fit perfectly?

  • The first thing to do is start looking for bugs. The most likely cause is that your data accidentally contains a feature that determines the target, or a preprocessing error accidentally encodes the target. On the other hand, some problems are just easy. There's really no way for a person who doesn't have access to every step of the data collection and processing pipeline to write a definitive answer. – Sycorax Dec 13 '22 at 16:14

1 Answer


This happens because y has the opposite encoding from Train_set[,1]. Your y maps a 2 in Train_set[,1] to 0 and a 1 to 1, but then you build a ROC curve with Train_set[,1], which has the opposite orientation compared to y. This reverses the ROC AUC value, so a model with AUC = 0 will report AUC = 1 and vice versa. (To see why, work out an example AUC calculation by hand, then reverse the labels and do it again. See: How to calculate Area Under the Curve (AUC), or the c-statistic, by hand.)
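The label-flipping effect is easy to verify numerically. Below is a small self-contained sketch (with made-up scores) that computes AUC by hand as the fraction of positive/negative pairs the scores rank correctly; flipping the labels yields exactly 1 minus the original AUC:

```r
# AUC as the probability that a random positive outranks a random negative
# (ties count 1/2). Labels and scores here are arbitrary toy values.
auc_by_hand <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

labels <- c(1, 1, 1, 0, 0)
scores <- c(0.9, 0.8, 0.3, 0.6, 0.2)

a1 <- auc_by_hand(labels, scores)      # AUC with the original labels
a2 <- auc_by_hand(1 - labels, scores)  # AUC with the labels flipped
# a1 + a2 equals 1: flipping the labels reverses the AUC
```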

To fix this, don't switch the orientation of your labels. Either use y everywhere, or use Train_set[,1] everywhere.
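As a minimal sketch of the consistent approach (using synthetic data and made-up feature names, not your actual Train_set), derive a single 0/1 label vector once and pass that same vector both to xgb.DMatrix and to roc():

```r
library(xgboost)
library(pROC)

# Synthetic stand-in for the real data frame
set.seed(1)
RESP <- factor(sample(c("NoResponse", "Response"), 100, replace = TRUE))
X    <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c("f1", "f2")))

# ONE label vector, used everywhere: 1 = Response, 0 = NoResponse
y <- as.numeric(RESP == "Response")

dtrain <- xgb.DMatrix(data = X, label = y)
model  <- xgboost(data = dtrain, nrounds = 10,
                  objective = "binary:logistic", verbose = 0)

# Same y (not the raw factor) when computing the ROC curve
roc_trn <- roc(y, predict(model, dtrain))
```

Testing y == "Response" also makes the encoding explicit, instead of relying on the alphabetical level order that as.numeric(as.factor(...)) produces.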

Sycorax