
So I estimated a particular (custom-made) statistic $\Phi$ by bootstrapping: I drew 1000 resamples from the original dataset to generate 1000 different $\Phi$s. The issue is that saving those 1000 bootstrap statistics was not at all memory efficient, so I decided to just save the summary statistics of the bootstrap samples. Saving the samples themselves is also difficult because there are a lot of iterations of the bootstrap.

So, for example, for each iteration I have 1000 samples of my dataset, from which I get 1000 different $\phi$, and from those I calculate the summary (mean, median, etc.).

I would eventually report the summary statistic, but I also have to analyze the distribution of my bootstraps.

I have saved the mean, standard deviation, median, first and third quartiles, the min and max, as well as the 5th and 95th percentiles. I do presume that the bootstrap distribution should look normal, but I wanted a robust way to regenerate my 1000 samples from these summary statistics for further analysis.

Here is what I have tried so far. Assuming the bootstrap sampling distribution to be normal, I used the truncnorm distribution in Python (scipy.stats.truncnorm), specified the min, max, mean, and standard deviation, and generated 1000 samples. Then I find the indices of the corresponding percentiles (median and the others) in those 1000 samples and just change them to the summary statistics I have.
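A minimal sketch of this, assuming scipy is available; all numeric summary values below are placeholders, not numbers from my real analysis:

```python
import numpy as np
from scipy.stats import truncnorm

# Saved summary statistics (placeholder values)
mean, std = 0.52, 0.07
lo, hi = 0.31, 0.74          # saved min and max of the bootstrap statistics

# truncnorm takes the bounds in standard-deviation units around loc
a, b = (lo - mean) / std, (hi - mean) / std
samples = np.sort(truncnorm.rvs(a, b, loc=mean, scale=std,
                                size=1000, random_state=0))

# Overwrite the values at the saved quantile positions so the
# reconstructed sample reproduces those summary statistics exactly
saved_quantiles = {0.05: 0.41, 0.25: 0.47, 0.50: 0.52, 0.75: 0.57, 0.95: 0.63}
for q, value in saved_quantiles.items():
    samples[int(q * (len(samples) - 1))] = value
```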

I searched various StackOverflow threads for an answer, but this is the best I have come up with so far.

It would be helpful if I could get further insights on this.

  • If you have access to the original data and know the random seed that was used, generating the samples is a deterministic process that can be repeated. – Tim May 31 '23 at 10:12
  • @Tim yes, computer-generated "random" numbers are indeed deterministic! That is why I suggested setting the seed in my answer :-) just didn't want to reveal the brutal truth of deterministic random numbers so directly... – Ute May 31 '23 at 10:17
  • Why do you "find the index of the corresponding percentiles (medians and the others) in those 1000 samples, and just change them to the summary statistics I have."? If you save the summary statistics in a separate array, you would not need much more memory. – Ute May 31 '23 at 10:24
  • This is for re-creating the samples of my custom statistic. The samples are not indices but rather a complex matrix operation that I do on my dataset. The samples have resampled indices on which I calculate my custom statistic.

    I did indeed set a seed, but why would I repeat the calculation of my custom statistic all over again? It takes over 2 days for my big dataset.

    – cwanderroycbooks May 31 '23 at 14:04
  • @cwanderroycbooks, your last comment is actually pretty important for understanding your problem. Maybe you could add this to your question. – Ute May 31 '23 at 14:22
  • If your data do not have indices, you can generate them, just numbering all your observations. Then you can do bootstrap resampling without calculating statistics, only getting the resampled indices. You could save them on disk if you run short on memory. Afterwards, you use the indices to create bootstrap samples by selecting the items from the original sample according to the indices. Only then calculate the time consuming statistics. Then you have both the bootstrap sample and the statistics. – Ute May 31 '23 at 14:27
  • How to solve your problem depends on the details. As an example, if your samples are large and univariate, then techniques described at https://stats.stackexchange.com/questions/35220 might achieve great compression. When the samples have more complex structures, you might be able to exploit those structures -- but we can't tell from your abstract description of the problem. Could you elaborate a little? – whuber May 31 '23 at 15:20
  • Thanks, but sorry, I don't get it. What's the use of saving the indices? I am already doing that and calculating my statistic from those samples. The idea is that I have 1000 samples from my dataset (1000 different shuffles of my dataset) from which I generate 1000 different statistics, and this I have to do multiple times.

    Since I cannot save those 1000 statistics each time, I rather save their summary. But I need to analyze and plot the distribution, so I need to regenerate the distribution.

    I have modified my question to reflect this.

    – cwanderroycbooks Jun 01 '23 at 02:30
  • It helps a lot that you edited your question :-) "... my bootstraps." Do you mean the 1000 different $\phi$ that you generated? – Ute Jun 01 '23 at 06:22

1 Answer


It is not feasible in general to retrieve the bootstrap samples from summary statistics: there is simply not enough information in them. You therefore need some other way to keep information about the actual bootstrap samples, or to regenerate them quickly.

Approach 1: keep bootstrap sample information

In a situation where the problem is that data items are "bulky", I'd split the bootstrapping into two steps. This requires that you have a dataset consisting of identifiable items (rows), and that you can pick the $i$-th element easily.

1.) Generate samples of indices that point to elements from the original sample to be included in a particular bootstrap sample. You can also save these to disk if you need to repeat the experiment multiple times, and free / reuse memory.

2.) Select the bootstrap samples one by one from the original data according to the indices generated in step 1.), and only then calculate the complicated summary statistics (see the sketch below).
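A minimal sketch of this two-step scheme; the dataset, its size, and `expensive_statistic` are hypothetical stand-ins for your actual data and your custom statistic:

```python
import numpy as np

def expensive_statistic(sample):
    # stand-in for the costly custom statistic (the complex matrix operation)
    return sample.mean()

rng = np.random.default_rng(42)     # fixed seed, so the run is reproducible
data = rng.normal(size=5000)        # stand-in for the real dataset
n_boot = 1000

# Step 1: draw and save only the resampled indices - small integer arrays
# that are cheap to keep on disk even across many repetitions
indices = rng.integers(0, len(data), size=(n_boot, len(data)), dtype=np.int32)
np.save("bootstrap_indices.npy", indices)

# Step 2 (possibly much later): reload the indices, rebuild each bootstrap
# sample, and only then run the expensive statistic on it
indices = np.load("bootstrap_indices.npy")
phis = np.array([expensive_statistic(data[idx]) for idx in indices])
```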

Approach 2: regenerate bootstrap sample information

Alternatively, if saving indices on disk is not an option, regenerate the samples from scratch, starting from the same state of the random number generator (RNG). In Python, use random.seed; in R, set.seed. You can then strip your code of all time-consuming calculations of summary statistics and should retrieve the same bootstrap samples again in a very short time.
Caveat: this approach only makes sense if calculating the summary statistics does not involve further simulations, such as MCMC. Simulation-based statistics call the random number generator again, which would advance the RNG state uncontrollably between bootstrap resampling steps.
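As a sketch of the idea (sizes are placeholders, and numpy's generator seeding is used here rather than random.seed, but the principle is the same): restarting the RNG from the same seed replays the index draws exactly, so the bootstrap samples can be rebuilt without recomputing any statistics.

```python
import numpy as np

def resample_indices(seed, n_obs=5000, n_boot=1000):
    # restarting from the same seed replays the exact same index draws
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_obs, size=(n_boot, n_obs), dtype=np.int32)

first_run = resample_indices(seed=42)    # the original (expensive) run
second_run = resample_indices(seed=42)   # cheap regeneration later on
assert (first_run == second_run).all()   # identical bootstrap samples
```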

Ute
  • That's the issue, right? Estimating the statistic itself takes 2 days on the server. I can try saving the samples, but the memory will fill up quickly. That's why I am saving the bootstrap statistics.

    I don't see how saving the indices of the bootstrap helps. I would have to re-calculate my custom statistic all over again anyway to get my samples. What's the point of wasting 2 days?

    – cwanderroycbooks May 31 '23 at 14:01
  • No, you could save the indices and the statistic separately. To reconstruct the samples, you do not need to estimate the statistics. Say your original sample has 5 items; then you can save the indices of each bootstrap sample, which could be (1,5,3,1,2), (4,4,1,3,2), etc. If you save these indices, you can also use them afterwards to calculate the statistics.

    I understand your question as: you are replacing some values in the bootstrap sample with the statistics you calculated.

    – Ute May 31 '23 at 14:10
  • Disclosure: I do use approach 1) myself sometimes, when dealing with time-consuming statistics. And I (almost) always save the start value of the random number generator together with bootstrapped or simulated results, just in case. For an R user, time can really be an issue. – Ute May 31 '23 at 14:57
  • @cwanderroycbooks If you have already spent a lot of computing time on the statistics, then use approach 2, since you know the start value of your random number generator. I hope in that case that the complicated statistics did not involve extra random number generation in a black box (an MCMC package or so); then approach 2 would be wrong, because every calculation of the statistics advances the random number generator in an uncontrolled way. – Ute May 31 '23 at 15:01
  • Sorry, I think there is a misunderstanding. I am not worried about saving the samples, but rather my statistic. Suppose I am calculating a statistic A (using complex linear algebra) from a dataset. Now, to bootstrap, I generate 1000 different realizations of that dataset, which generates 1000 different As. Since I cannot save the As, I rather save their summary. – cwanderroycbooks Jun 01 '23 at 02:32
  • What do you want to retrieve then, @cwanderroycbooks? The statistic A, without calculating it again? – Ute Jun 01 '23 at 05:18
  • Yes, or at least an approximation of it. – cwanderroycbooks Jun 01 '23 at 06:09
  • Have you generated more than one set of summary statistics, and if yes, can you assume they come from the same distribution? – Ute Jun 01 '23 at 06:29