Is it possible to get different results for bootstrap sample statistics using different software?
-
Absolutely, you can get different results even from the same software. It's all about the randomness (and implementation). – user2974951 Apr 07 '22 at 10:14
-
Will the difference still exist for a large number of bootstrap samples, e.g. more than 5000? – Ray Hope Apr 07 '22 at 10:15
-
It depends a lot on the data and how variable it is. 5000 is a good number, but you would really have to try to know. If you run it twice with 5000 and get very similar results, then you are probably good. – user2974951 Apr 07 '22 at 10:19
-
Searching our site produces several good answers. – whuber Apr 07 '22 at 18:51
2 Answers
You can get different results from the same software! Run the boot package in R with set.seed(1) before your bootstrap code and then with set.seed(2). Your results should differ at least a little.
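A minimal sketch of this point in base R (using `sample` rather than the `boot` package, and a small made-up data set) — two seeds, two slightly different bootstrap estimates of the standard error of the mean:

```r
# Hypothetical illustration: bootstrap standard error of the mean
# for a small made-up data set, under two different seeds.
x <- c(4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2)

boot_se <- function(data, B = 1000) {
  means <- replicate(B, mean(sample(data, length(data), replace = TRUE)))
  sd(means)  # bootstrap estimate of the SE of the sample mean
}

set.seed(1); se1 <- boot_se(x)
set.seed(2); se2 <- boot_se(x)
c(se1, se2)  # close, but not identical
```

The two estimates differ only in the third decimal place or so, but they do differ — exactly the behavior described above.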
If you then run a bootstrap in Python (or SAS, Stata, etc.), you will be taking yet a third set of bootstrap samples, giving a third result.
Suppose I have a sample x of size $n = 900$ with the
summary statistics below:
summary(x); sd(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.68 45.30 50.08 49.96 54.42 70.49
[1] 6.858865 # sample SD
Then the (estimated) standard error is $S/\sqrt{n} = 6.859/30 = 0.229.$ Assuming that the data are normal, we would have the 95% t confidence interval $\bar X \pm 1.963\, S/\sqrt{n},$ where $1.963$ cuts probability $0.025$ from the upper tail of the (symmetrical) Student's t distribution with 899 degrees of freedom.
qt(.975, 899)
[1] 1.962606
By computation or from the t.test procedure in R, we get the CI $(49.51, 50.41).$
The margin of error of this CI is about $1.963(0.229)=0.4495.$
t.test(x)$conf.int
[1] 49.51063 50.40805
attr(,"conf.level")
[1] 0.95
If the sample truly is from a normal distribution, then this is a valid 95% CI for the population mean $\mu$ and we are done.
However, if we have reason to doubt that the data are normal we might find a 95% nonparametric bootstrap CI for $\mu.$ There are many possible styles of bootstrap CIs. I will illustrate one.
The observed sample mean is $49.96.$
a.obs = mean(x); a.obs
[1] 49.95934
We take many (3000) re-samples of size 900 with replacement from x in order to get an idea of the sampling error of the sample mean. For each re-sample we find the deviation of its mean from the observed mean of the original sample.
set.seed(2021)
d = replicate(3000, mean(sample(x,900,rep=T))-a.obs)
The deviations $d$ are mainly between $\pm 0.452,$ which is not much different from the margin of error of the 95% t confidence interval above. The 95% nonparametric bootstrap CI is $(49.51, 50.41),$ which is in agreement with the 95% t CI above.
UL = quantile(d, c(.975,.025))
UL
97.5% 2.5%
0.4520234 -0.4522723
a.obs - UL
97.5% 2.5%
49.50732 50.41161
If I run this same bootstrap procedure twice again (unknown seeds) I get bootstrap CIs $(49.49425,\, 50.42611)$ and $(49.49007,\, 50.41792),$ which are about the same as the first one above, for practical purposes.
Why bootstrap CIs from the same sample are not all exactly the same:
Because bootstrapping depends on random re-sampling you can't expect exactly the same result every time. Experience has shown that 2000 or 3000 re-samples are enough to get nearly reproducible results.
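A quick way to check this (a sketch, not part of the original analysis; the data here are simulated fresh, so the exact numbers will differ from those above): run the same B = 3000 bootstrap twice and compare the endpoints.

```r
# Run the same bootstrap CI procedure twice (different random draws)
# and check that the endpoints nearly agree.
set.seed(101)                # hypothetical seed for the simulated data
x <- rnorm(900, 50, 7)       # data generated as in the answer's setup
a.obs <- mean(x)

boot_ci <- function(B = 3000) {
  d <- replicate(B, mean(sample(x, 900, replace = TRUE)) - a.obs)
  a.obs - quantile(d, c(.975, .025))
}

ci1 <- boot_ci()
ci2 <- boot_ci()
rbind(ci1, ci2)  # endpoints typically agree to about two decimal places
```

With B = 3000 the two runs differ only by Monte Carlo noise in the tails of d, which is small relative to the CI width.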
If you try bootstrapping with very small samples, you might get a larger variety of bootstrap CIs.
Also, if you use a different style of nonparametric bootstrap (there are several possibilities), you may get a somewhat different result.
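For instance, the CI above is a "basic" (pivot) bootstrap built from deviations; one common alternative style is the percentile bootstrap, which takes quantiles of the re-sampled means directly. A sketch on freshly simulated data (so the exact numbers will differ from those above):

```r
# Percentile bootstrap CI: quantiles of the re-sampled means themselves
set.seed(2022)                          # hypothetical seed
x <- rnorm(900, 50, 7)                  # simulated data like the answer's
m <- replicate(3000, mean(sample(x, 900, replace = TRUE)))
quantile(m, c(.025, .975))              # percentile-style 95% CI for the mean
```

For symmetric sampling distributions the percentile and basic intervals nearly coincide; for skewed ones they can differ noticeably.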
Finally, if the data are not normal, you can't expect a bootstrap CI to give the same results as a t CI. In this case the bootstrap CI is usually preferred, because the t CI assumes normal data.
Notes: (1) The fictitious data for my bootstrap CIs above were normal, sampled in R as follows:
set.seed(407)
x = rnorm(900, 50, 7)
(2) It is important to understand that re-samples used in bootstrapping do not provide additional information. They are part of the analysis, not part of the experiment. In the example above, the re-samples took the place of (a) the formula for the estimated standard error of the sample mean and (b) looking in a printed t table to find the constant 1.963.