Assume there is an artificial dataset that allows a perfect (linear) separation into good and bad clients. Why is a method such as xgboost not able to identify the perfect decision boundary?
On the left is the sample data, which consists of ~10,000 data points that can be trivially separated into good and bad. The plot on the right shows the forecasts of the xgboost model. Why can the model not identify the diagonal as a perfect linear separator?
Edit: After the comments, I reran the code with nrounds set to 10,000 and 100,000. This is the result:
rm(list = ls())
library(tidyverse)
library(xgboost)
#Generate input data
x <- y <- seq(0.01,0.99,0.01)
inputtable_xgb <- expand.grid(x = x, y = y) %>%
  mutate(target = ifelse(y < x, 1, 0),
         `Target Label` = ifelse(y < x, "Good", "Bad"))
#xgboost
set.seed(1)
dtrain <- xgb.DMatrix(data.matrix(inputtable_xgb[,c("x","y")]), label=inputtable_xgb[,"target"],missing = -999)
param <- list(objective = "binary:logistic", min_child_weight = 15,eta= .05,max_depth= 10,subsample= 0.75,colsample_bytree= 0.75,eval_metric= "auc")
clf <- xgb.train(params= param, data= dtrain, nrounds= 100,verbose= 1,maximize= FALSE, nthread = 4)
#fill a matrix with forecast values
stepsize <- 101
contour_xgb <- matrix(0,nrow = stepsize,ncol = stepsize)
values <- seq(0,1,by = 0.01)
for (i in 1:stepsize) {
  for (j in 1:stepsize) {
    example_data <- data.frame(x = values[i], y = values[j])
    dtest <- xgb.DMatrix(data.matrix(example_data), missing = -999)
    #no early stopping is used, so all trees are used for prediction
    contour_xgb[i, j] <- predict(clf, dtest)
  }
}
#generate plot for the input data and the model forecasts
inputtable_xgb %>%
  ggplot(aes(x = x, y = y, color = `Target Label`)) +
  geom_point() +
  labs(title = "Artificial data of good and bad clients")
image(contour_xgb,main = "Contour plot of probability forecasts", xlab = "x", ylab = "y")
contour(contour_xgb, add = TRUE,labcex= 1)

if $p<0.5$, then BAD, else GOOD). If you look at the probability estimates from XGBoost, the 0.5 line falls on the diagonal, so the method has identified the correct separation! – darXider Oct 12 '17 at 14:27

rm(list = ls()) is bad form. It is not necessary for your example, and no one copy/pasting your code wants to accidentally run that line! – Gregor Thomas Oct 12 '17 at 14:37
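
The point about the 0.5 line can be checked numerically rather than read off the contour plot. The following is only a minimal sketch: it rebuilds the same grid as in the question, trains with a reduced parameter list (an assumption made to keep the example short, not the exact settings above), turns the predicted probabilities into hard labels with the rule "if $p<0.5$ then BAD, else GOOD", and compares them with the true labels. The names `grid`, `p`, `predicted` and `accuracy` are introduced here for illustration only.

```r
library(xgboost)

# Same grid and labelling rule as in the question: "Good" (1) where y < x
grid <- expand.grid(x = seq(0.01, 0.99, 0.01), y = seq(0.01, 0.99, 0.01))
grid$target <- as.numeric(grid$y < grid$x)

set.seed(1)
dtrain <- xgb.DMatrix(data.matrix(grid[, c("x", "y")]), label = grid$target)

# Reduced parameter list (assumption: chosen for brevity, not taken from the question)
clf <- xgb.train(params = list(objective = "binary:logistic", eta = 0.05, max_depth = 10),
                 data = dtrain, nrounds = 100, verbose = 0)

# Predict all grid points in a single call, then apply the 0.5 cut-off
p <- predict(clf, dtrain)
predicted <- as.numeric(p >= 0.5)

# Share of points whose hard label agrees with the true label
accuracy <- mean(predicted == grid$target)
print(accuracy)

# Confusion table of true versus predicted labels
print(table(truth = grid$target, predicted = predicted))
```

If the accuracy is close to 1, the remaining structure in the contour plot reflects how far the probabilities are from 0 or 1 near the diagonal, not a misplaced class boundary.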