I've trained a simple xgboost model; I didn't even tune the hyperparameters:
library(xgboost)
library(pROC)  # assumed source of roc()

# Encode the target as 0/1: NoResponse -> 1, Response -> 0
y <- 2 - as.numeric(as.factor(Train_set[, 1]))
dtrain <- xgb.DMatrix(data = data.matrix(Train_set[, -1]), label = y)

xgb_params <- list(
  booster = 'gbtree',
  eta = 0.5,
  max_depth = 5,
  gamma = 10,
  subsample = 0.75,
  colsample_bytree = 0.8,
  objective = 'binary:logistic',
  eval_metric = 'auc')

model <- xgboost(data = dtrain,  # the data
                 params = xgb_params,
                 nrounds = 50)

# ROC curve on the training set
xgb_trn <- roc(Train_set[, 1], predict(model, dtrain))
plot(xgb_trn, main = 'xgb train')

# ROC curve on the test set
xgb_tst <- roc(Test_set[, 1], predict(model, data.matrix(Test_set[, -1])))
plot(xgb_tst, main = 'xgb test')
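For reference, this small sketch shows what the label encoding in the first line does. It assumes the target is a factor with the levels NoResponse and Response, as in the sample further down; `resp` and `y_demo` are toy names made up for the illustration.

```r
# Toy factor with the same two levels as RESP
resp <- factor(c("NoResponse", "Response", "NoResponse"),
               levels = c("NoResponse", "Response"))

# as.numeric() on a factor returns the level index (1 or 2),
# so 2 - index gives NoResponse -> 1 and Response -> 0
y_demo <- 2 - as.numeric(resp)
print(y_demo)
```

So with this encoding the class labelled 1 is NoResponse, not Response.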
The data itself is fine: when I use it to predict a different target feature, the model works. Here I changed the target variable because I want to predict something else, and all of a sudden every iteration of the model has a train-AUC of 1.00:
> model <- xgboost(data = dtrain, # the data
+ params = xgb_params,
+ nrounds = 50)
[1] train-auc:1.000000
[2] train-auc:1.000000
[3] train-auc:1.000000
[4] train-auc:1.000000
[5] train-auc:1.000000
[6] train-auc:1.000000
[7] train-auc:1.000000
[8] train-auc:1.000000
[9] train-auc:1.000000
[10] train-auc:1.000000
Please have a look at this subset of the training set:
structure(list(RESP = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L), levels = c("NoResponse", "Response"
), class = "factor"), gender = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
1, 0, 0, 1, 1), CD4.T.cells = c(-0.0741098696855045, -0.094401270881699,
0.0410284948786532, -0.163302950330185, -0.167314411991775, -0.0366277340916379,
-0.167823357941815, -0.0809646843667242, -0.148668434567449,
-0.0726825919321525, -0.062499826731091, -0.0861178015030313,
-0.117687306656149, -0.141342090175904, -0.15593285099477), CD8.T.cells = c(-0.178835447722468,
-0.253897294559596, -0.0372301980787381, -0.230579110769457,
-0.196933050675633, -0.0550538743643369, -0.162295446209879,
-0.276178546845023, -0.204485697907641, -0.210241384645447, -0.280984614146961,
-0.262382296291096, -0.219293669044801, -0.132089229492629, -0.245230934265636
), T.helpers = c(-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.106812144494277, 0.0686746776877563, 0.00388752358937872,
-0.0729755869081981, -0.0127651150376793, -0.00167101704571948,
-0.0969110962088053, 0.0305873691536314, 0.0719413632572499,
0.0143645056063724, -0.130603441365545), NK.cells = c(-0.0924083910597563,
-0.172356328661097, -0.0172673823614314, 0.0280649471541352,
-0.0875076743713435, -0.0518877213975413, -0.0607256681646583,
-0.184546079512101, -0.114742845076629, -0.0975859552072595,
-0.174457678991973, -0.189516761836086, -0.166242090299456, -0.0104175102470585,
-0.0873057138342297), Monocytes = c(-0.0680848706469295, -0.173427291586957,
-0.0106773958944477, -0.0015805672257001, -0.0737177243152751,
-0.0674023045286274, -0.0168615569795438, -0.149380203815874,
-0.13581671068326, -0.0681190929378688, -0.151342317151381, -0.200427316546487,
-0.102704828499578, 0.000081008521555237, -0.0917016103306532
), Neutrophils = c(-0.0391833488213571, -0.0275279418713283,
0.0156454755097513, 0.0285160860867748, 0.0252778805872529, 0.0432343965225797,
0.0506419186309599, -0.0693846217599099, -0.0682613686336473,
0.00338678597014841, -0.033454713772469, -0.0614961455460419,
-0.0608931312183019, 0.0126256299876522, 0.0408488806008269),
gd.T.Cells = c(-0.162246594987039, -0.297759223265742, -0.0814825699645205,
-0.0688779846190755, -0.264420103679214, -0.162709306032616,
-0.0969595022926227, -0.292342418053931, -0.296763345807347,
-0.221464822653774, -0.215639270235576, -0.323923302627997,
-0.226301570680385, -0.115299138190696, -0.209135117178106
), Non.plasma.B.cells = c(-0.0384755654971015, -0.114370815587458,
0.161268251261644, -0.0571463865006043, -0.0822058328898433,
0.114155959200915, 0.0289800756140739, -0.0923514068231641,
-0.113300459490648, -0.0721711891176557, -0.16311341310655,
-0.101134395440272, -0.0545178013216237, -0.0239839042195814,
-0.111084811179261)), row.names = c("Pt1", "Pt10", "Pt101",
"Pt103", "Pt11", "Pt18", "Pt24", "Pt26", "Pt28", "Pt29", "Pt3",
"Pt30", "Pt31", "Pt34", "Pt37"), class = "data.frame")
The target feature is RESP.
I'm familiar with overfitting, but is this overfitting, given that both sets are fit perfectly?
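One check that might help narrow this down is whether any single column separates the target perfectly on its own. Below is a minimal base-R sketch of that idea, not a diagnosis: `feature_auc` is a hypothetical helper (not part of xgboost or pROC), and `toy` is made-up data standing in for `Train_set`, with a `leak` column that deliberately copies the label.

```r
# Hypothetical helper: rank-based (Mann-Whitney) AUC of one numeric
# feature against a 0/1 label, reported regardless of direction.
feature_auc <- function(x, y) {
  r  <- rank(x)                      # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(auc, 1 - auc)
}

# Made-up data: `noise` is random, `leak` is a copy of the target.
set.seed(1)
toy <- data.frame(
  RESP  = factor(rep(c("NoResponse", "Response"), times = c(6, 4))),
  noise = rnorm(10)
)
toy$leak <- as.numeric(toy$RESP == "Response")

# Same encoding as in the training code: NoResponse -> 1, Response -> 0
y <- 2 - as.numeric(toy$RESP)

# One AUC per feature column; a value of 1 means that column alone
# separates the two classes perfectly.
aucs <- sapply(toy[, -1], feature_auc, y = y)
print(round(aucs, 3))
```

On the real data the same call would be `sapply(Train_set[, -1], feature_auc, y = y)`, assuming all feature columns are numeric. A column with an AUC of exactly 1 would be worth inspecting before concluding anything about overfitting.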