
I am running an XGBoost model to predict the global economic cost of invasive species. My training set is only about 3000 data points.

I am bootstrapping my predictions, and went with the default 1000 samples. I see other questions in which answers recommend at least 1000 samples (Rule of thumb for number of bootstrap samples).

My problem is that the hyperparameter grid search is going too slowly, and the whole thing is estimated to take a month to run. I don't currently have access to a bigger computer to increase parallelization. If a co-author or reviewer tells me to change anything in the model, it will take another month to run again.

So far I have run 43 samples, and calculated the mean R-squared for an increasing number of samples (plotted below). I see the mean R-squared is close to stabilizing by 43 samples. Could I get away with only 50 or 100 bootstraps if I show the R-squared doesn't change much after this, or are there other reasons I should run it 1000 times?

[Plot: mean R-squared against number of bootstrap samples]
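For context, a minimal sketch of the convergence check described above, extended to the percentile endpoints that the confidence intervals will actually use. This is not the question's actual pipeline: `r2_values`, `preds`, and the placeholder data are assumptions standing in for the real bootstrap outputs.

```python
import numpy as np

# Placeholder data standing in for the real outputs: r2_values holds the
# R-squared of each completed bootstrap replicate (43 so far), and preds
# holds the per-replicate predictions for which 95% intervals are wanted.
rng = np.random.default_rng(0)
r2_values = rng.normal(0.7, 0.05, size=43)
preds = rng.normal(0.0, 1.0, size=(43, 10))

# Running mean of R-squared after b replicates (the quantity plotted above).
running_mean_r2 = np.cumsum(r2_values) / np.arange(1, len(r2_values) + 1)

# Running 2.5%/97.5% percentile endpoints for one test point: these are the
# statistics that need far more replicates to stabilize than the mean does,
# because tail quantiles are estimated from very few bootstrap draws.
point = 0
running_lo = [np.percentile(preds[:b, point], 2.5) for b in range(2, len(preds) + 1)]
running_hi = [np.percentile(preds[:b, point], 97.5) for b in range(2, len(preds) + 1)]
```

The point of tracking the endpoints separately is that a stable running mean does not imply stable interval limits, which is what the comments below get at.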

  • This answer has some pointers; it'll depend quite a bit on the sampling variability. You also didn't mention which statistic you're bootstrapping: a mean will stabilize more quickly than most, but you don't need resampling for the mean itself. (By that I mean that you probably won't be looking at just the mean of your bootstrap distribution.) – PBulls Feb 16 '24 at 11:58
  • I want to get mean and 95% confidence intervals for my predictions. – Gabriel De Oliveira Caetano Feb 16 '24 at 12:23
  • And is that confidence interval then based on the sample quantiles, standard deviation, ... of your bootstrap distribution? The answer to that question will determine for which statistic you'll have to quantify the sampling error, and with that you can approximate what the precision of your bootstrap is per the above link (or what number of resamples you need for a desired precision). – PBulls Feb 16 '24 at 12:31
  • On the sample quantiles of the bootstrap distribution. At least that is what I understood from this answer in a different question (https://datascience.stackexchange.com/a/26370). – Gabriel De Oliveira Caetano Feb 16 '24 at 15:46
  • OK, I tried following the procedure in the link mentioned. If I calculated it correctly, B_1 for a symmetrical 95% confidence interval, with r=5% and τ=5%, is 1390.586. So I would have to run more than 1000 bootstrap samples just to estimate the necessary number of bootstrap samples. I guess I will have to bite the bullet and do it if there is no other way. At least it might save me time if B_opt is smaller than 1000 and I have to run the model again in the future. – Gabriel De Oliveira Caetano Feb 16 '24 at 16:36
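A minimal sketch of one pragmatic alternative to computing B_1 up front: repeat the whole bootstrap a few times with different seeds at a candidate B and look at how much the 2.5% and 97.5% percentile endpoints move between repeats; that spread is the Monte Carlo error of the interval at that B. The function `fake_predict` below is an assumed stand-in for the real XGBoost refit-and-predict step, not part of the question.

```python
import numpy as np

def bootstrap_percentile_interval(predict_fn, B, seed):
    """Run B bootstrap draws of one quantity of interest and return the
    2.5% and 97.5% percentiles. predict_fn(rng) stands in for refitting
    the model on a resample and producing the prediction of interest."""
    rng = np.random.default_rng(seed)
    draws = np.array([predict_fn(rng) for _ in range(B)])
    return np.percentile(draws, [2.5, 97.5])

# Placeholder for the real pipeline: a noisy statistic so the sketch runs
# on its own.
def fake_predict(rng):
    return rng.normal(loc=10.0, scale=2.0)

# Repeat the whole bootstrap a few times at each candidate B; the standard
# deviation of the endpoints across repeats is the Monte Carlo error at that B.
for B in (100, 500, 1000):
    endpoints = np.array([bootstrap_percentile_interval(fake_predict, B, seed)
                          for seed in range(5)])
    print(B, endpoints.std(axis=0))  # SD of the 2.5% and 97.5% endpoints
```

If the endpoint variability at B = 100 is already small relative to the precision needed in the paper, that is a direct empirical argument for the smaller B; if not, the comparison shows how much is gained by going to 1000.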

0 Answers