I have a dataset of 1.1 million observations and 14 variables. The response is binary (0 or 1). It was suggested to me that I use Gradient Boosted Trees to build my logistic model.
Using xgb.cv from xgboost in R, I'm attempting to estimate the best hyperparameters on a holdout of 2/3 of the data. However, the code takes forever to run: a single fit with learning rate = 0.5, depth = 7, 5 folds and 10000 trees took 13 hours. I can't imagine how long it will take to loop over different learning rates and depths.
How could I make the process faster? I guess that reducing the number of trees to 2500 would make sense, based on my error curve. Will reducing the number of folds help? Is it really necessary to do bootstrapping?
My current code looks like this, for reference:
etas = c(0.75, 0.5, 0.1)
max.depths = c(11, 9, 7, 5, 3)
fitAssessmentLst = list()
lstPos = 0
for(eta in etas){
  for(max.depth in max.depths){
    lstPos = lstPos + 1
    # 5-fold CV for this (eta, max.depth) pair; always runs the full 10000 rounds
    x = xgb.cv(params = list(objective = "binary:logistic", eta = eta,
                             max.depth = max.depth, nthread = 3),
               data = train_data.xgbdm,
               nrounds = 10000,
               prediction = FALSE,
               showsd = TRUE,
               nfold = 5,          # note: the argument is nfold, not nfolds
               verbose = 0,
               print.every.n = 1,
               early.stop.round = NULL)
    fitAssessmentLst[[lstPos]] = list(eta = eta, max.depth = max.depth, assessmentTbl = x)
  }
}
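Edit: following the early-stopping suggestion in the comments below, here is a sketch of what I think that would look like. The 50-round patience is just a guess on my part, and the argument name depends on the package version (newer releases of the xgboost R package call it early_stopping_rounds rather than early.stop.round):

for(eta in etas){
  for(max.depth in max.depths){
    lstPos = lstPos + 1
    x = xgb.cv(params = list(objective = "binary:logistic", eta = eta,
                             max.depth = max.depth, nthread = 3),
               data = train_data.xgbdm,
               nrounds = 10000,        # now an upper bound, not a fixed cost
               nfold = 5,
               early.stop.round = 50,  # stop once test error stalls for 50 rounds
               maximize = FALSE,       # the error metric should be minimized
               verbose = 0)
    fitAssessmentLst[[lstPos]] = list(eta = eta, max.depth = max.depth, assessmentTbl = x)
  }
}

That way the slow settings (small eta, large depth) no longer pay for all 10000 rounds.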
Comments:

Use early.stop.round along with a hold out or cross validation so that the algorithm will stop short of your 10000 trees. This is a very nice feature of xgboost you should utilize. – Matthew Drury Sep 07 '16 at 16:37

I found the rBayesianOptimization package that uses xgb.cv. I'll give it a shot. – jgadoury Sep 07 '16 at 16:57

Any suggestion on what k to use for early.stop.round? – jgadoury Sep 07 '16 at 17:00

I tried xgb.cv using 1/10 of my data, which seems to really decrease the runtime, but I'm worried that my hyperparameters won't translate well to the full dataset. I tried Bayes Optimization and it was still slow as hell. Around 100 minutes per iteration using 2500 rounds. – jgadoury Sep 08 '16 at 19:35

Is there a way to subsample within xgb.cv? I tried to find an option but I don't see it. Do you recommend using sample before each iteration? – jgadoury Sep 08 '16 at 19:50
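On the subsampling question in the last comment: as far as I know xgb.cv has no option to subsample the whole training set, but xgboost itself can randomly subsample rows and columns per tree via its standard subsample and colsample_bytree parameters, and drawing a smaller DMatrix up front is easy. A minimal sketch, where train_X and train_y are hypothetical stand-ins for the full feature matrix and the 0/1 labels:

library(xgboost)

# Stochastic boosting: each tree is grown on a random fraction of rows and
# columns, which speeds up every round and adds regularization.
params = list(objective = "binary:logistic",
              eta = 0.1, max.depth = 7, nthread = 3,
              subsample = 0.5,          # fraction of rows per tree
              colsample_bytree = 0.8)   # fraction of columns per tree

# Tune on a random 10% subset, then refit the chosen settings on everything.
# train_X / train_y are placeholder names, not from the original code.
set.seed(42)
idx = sample(nrow(train_X), size = floor(0.1 * nrow(train_X)))
small.xgbdm = xgb.DMatrix(data = train_X[idx, ], label = train_y[idx])

cvres = xgb.cv(params = params, data = small.xgbdm,
               nrounds = 2500, nfold = 5, verbose = 0)

Whether hyperparameters tuned on 10% carry over to 1.1 million rows is the open question; the built-in subsample/colsample route avoids that worry because the tuning still sees the full dataset.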