8

Experts, maybe you know how to calculate a confidence interval for xgboost? The classic formula with the t-distribution can't help, because my data aren't normally distributed. Or doesn't that matter?

If you can suggest some literature, that would be very useful, but approaches in R and Python (in the context of the xgb library) are good too.

Perhaps it looks like this, but how is it computed? I also found this; is it right or not?

P.S.: I can't add pictures related to my data (link limit), sorry.

Lu Wao
  • 121
  • Is this a classification problem? When you say the data is not normal do you mean multivariate normal? – Michael R. Chernick Jan 12 '17 at 06:01
  • @MichaelChernick No, it's a regression problem. I think the data could be called multivariate normal, because it contains information about different cities and subsidiaries. Therefore, my confidence interval relates to the distribution for each city. – Lu Wao Jan 12 '17 at 06:39
  • The problem is not stated clearly. There was no way to tell this was a regression problem. I got the impression it was classification based on looking at your links. If it is regression, is there just one predictor variable and one dependent variable? If that is the case, is it the use of the t distribution for the regression parameters that you are talking about? It could also be for a particular fitted value of y (dependent variable) given x (predictor variable), or a prediction interval for a new value of y. – Michael R. Chernick Jan 12 '17 at 07:12
  • @MichaelChernick The model has one dependent variable and more than 30 independent variables. Yes, xgb works on trees (which were initially meant to solve classification problems), but I used it for regression. – Lu Wao Jan 12 '17 at 07:56
  • Here is a post on calculating prediction intervals in the case of random forests. A similar approach could work for GBMs. – ab90hi Jan 12 '17 at 07:58
  • So this is a regression tree or ensemble of regression trees as in random forests? – Michael R. Chernick Jan 12 '17 at 08:16
  • @MichaelChernick This is an ensemble of regression trees. – Lu Wao Jan 12 '17 at 08:32
  • @ab90hi So in that post you state that the source data isn't normally distributed? I ask because I checked your y with shapiro.test(), and the result was negative. – Lu Wao Jan 12 '17 at 08:38
  • Did you check the normality of Y or of the residuals of Y? Because I think the residuals of the model do follow a normal distribution in the example. – ab90hi Jan 12 '17 at 08:56
  • @ab90hi I checked the normality of Y. My residuals are normal; the question is about the distribution of the source data. – Lu Wao Jan 12 '17 at 09:02
  • 1
    @ab90hi But thanks for your answer; now I know that R automatically computes the wrong interval :) – Lu Wao Jan 12 '17 at 11:27

1 Answer

4

So, this is the answer! (mirror)

To build confidence limits for non-normally distributed data, you first need to fit a quantile regression rather than the squared-error ("linear") regression that xgboost fits by default. To do this, customize the 'objective' variable, either by using the derivatives derived in the article or by simply copying its Python code. You also need to change the gradient function and the Hessian function. Once everything is programmed, fit a quantile regression for the 50th quantile (this will be the initial regression), and then two more quantile regressions for the two boundaries of the interval (for example, the 95th and 5th quantiles). As a result, you get not only a more accurate model for the initial regression, but also the desired intervals.
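The steps above can be sketched roughly as follows. This is not the code from the linked article, just a minimal pure-Python illustration under stated assumptions: the function names `pinball_grad_hess` and `fit_constant_quantile` are made up for this example, a single constant prediction stands in for the boosted trees, and the Hessian is replaced by a constant because the true second derivative of the quantile (pinball) loss is zero almost everywhere. With xgboost itself, the same gradient/Hessian pair would be vectorized with NumPy and returned from a callable of the form `(preds, dtrain)` passed as the `obj` argument of `xgb.train`.

```python
def pinball_grad_hess(y_true, y_pred, alpha):
    """Gradient and surrogate Hessian of the pinball loss w.r.t. the predictions.

    Loss per point: alpha * (y - p) if y > p, else (1 - alpha) * (p - y).
    """
    grad = [-alpha if y > p else 1.0 - alpha for y, p in zip(y_true, y_pred)]
    # Assumption: the true second derivative is 0, so a constant 1.0 is used
    # as a stand-in to keep Newton-style updates well defined.
    hess = [1.0] * len(grad)
    return grad, hess


def fit_constant_quantile(y, alpha, n_rounds=300, learning_rate=10.0):
    """Toy 'model': one constant prediction updated by boosting-style steps."""
    pred = sum(y) / len(y)  # start from the mean
    for _ in range(n_rounds):
        grad, hess = pinball_grad_hess(y, [pred] * len(y), alpha)
        pred -= learning_rate * sum(grad) / sum(hess)
    return pred


# Toy target values 1..100 (illustrative data, not from the question).
y = [float(v) for v in range(1, 101)]

# One fit for the median plus one per interval boundary, as described above.
interval = {alpha: fit_constant_quantile(y, alpha) for alpha in (0.05, 0.5, 0.95)}
print(interval)
```

Note that the band between the 5th and 95th quantile fits is a nominal 90% interval for the response itself, so it behaves like a prediction interval rather than a confidence interval for the mean.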

Franck Dernoncourt
  • 46,817
Lu Wao
  • 121
  • 4
    We are trying to build a permanent repository of high-quality statistical information in the form of questions & answers. Thus, we're wary of link-only answers, due to linkrot. Can you post a full citation & a summary of the information at the link, in case it goes dead? – T.E.G. May 25 '17 at 11:29