
I am currently analysing measurement data and have the problem that the data are obviously non-normal. Since most methods I want to use require normality (especially statistical process control and process capability), I first tried to transform my data using a power transform, but this did not work.

The data I have consists of means of ~1500 samples each. Of these means I have around 500-700 samples for multiple days and different measurements. Here is an example of how such a measurement might be distributed. Depending on the day and the type of measurements, the different distributions differ greatly.

Example image of a measurement

Obviously this is only a sample and does not fully represent the underlying process. That is why I wanted to use bootstrapping to estimate the true population parameters of the underlying process, which could then be used to define statistical limits. However, I am not completely sure how valid such an approach is for my use case. I know that, due to the central limit theorem, the means of my bootstrapped groups are going to follow a normal distribution, which I was able to observe.

Bootstrapped measurement

I am not a statistician and relatively new to everything in the field of data science and statistics so I definitely lack some of the fundamentals and intuition needed to evaluate such things. When looking at the bootstrapped distribution, the mean is consistent with the total mean of all samples, but the distribution just does not seem wide enough. In my samples I have many values which are below 0.45 and many more which are above 0.47 but to my understanding, according to the bootstrapped distribution such values would be highly unlikely. I am more than sure that there is something I do not understand correctly about bootstrapping and the relation to population parameters so I hope someone is able to explain the error in my train of thought.

Krautsultan
  • By bootstrapping your sample (which is composed of means) and then taking the average of these means, and repeating this n times, you will get the bootstrap distribution of the mean of means, which is not what you want. You want to get the distribution of your population (from your sample). – user2974951 Aug 16 '23 at 09:24
  • @user2974951 Sadly I don't have access to the raw data, only the means. But I would also be more than happy to have the distribution parameter for the means. Does that mean however that without the raw data, even an estimation of the means is not possible? – Krautsultan Aug 16 '23 at 09:28
  • Your bootstrapped results shown in the second chart are for the mean. It is no surprise at all that the bootstrapped means have almost exactly the same mean as the original sample (that is inevitably their expectation given the original sample from which the bootstrap is taken), or that their distribution is closer to a normal distribution than the original sample (related to the central limit theorem), or that their standard deviation is about $\frac{1}{\sqrt n}$ times the original sample standard deviation. – Henry Aug 16 '23 at 22:49

1 Answer


The bootstrap is used for estimating the uncertainty of an estimator. In your case, you seem to have calculated the mean of means and used the bootstrap to find the distribution of that mean of means. This is not the distribution of the means themselves; you already approximated that distribution with the histogram. This is why the bootstrap distribution does not cover the whole range of the means. Recall that if the standard deviation of the means is $\sigma$, then the standard error of their mean is $\sigma / \sqrt n$, that is, $\sqrt n$ times smaller, and this is what you observed.
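To make the difference in spread concrete, here is a minimal sketch in Python with NumPy. The data are simulated and purely hypothetical (a skewed stand-in for roughly 600 observed means), so the numbers are not from the question; the point is only that the bootstrap distribution of the mean has a standard deviation close to the sample standard deviation divided by $\sqrt n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the observed means: skewed, clearly non-normal.
means = 0.45 + 0.02 * rng.gamma(shape=2.0, scale=0.5, size=600)

n = means.size
n_boot = 2000  # number of bootstrap resamples (arbitrary choice)

# Resample the means with replacement and record the mean of each resample.
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = means[idx].mean(axis=1)

sample_sd = means.std(ddof=1)          # spread of the means themselves
boot_sd = boot_means.std(ddof=1)       # spread of the bootstrapped mean of means
expected_se = sample_sd / np.sqrt(n)   # standard error predicted by sigma / sqrt(n)

print(f"SD of the sample of means:      {sample_sd:.5f}")
print(f"SD of the bootstrap means:      {boot_sd:.5f}")
print(f"sample SD / sqrt(n):            {expected_se:.5f}")
```

The last two printed values should be close to each other and much smaller than the first, which is why the bootstrap histogram looks far too narrow to describe the individual means.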

The bootstrap is not a method for estimating the true distribution or the true mean of the data. If you want to estimate the mean of the raw data given the sample means $m_i$, you additionally need to know the size of each sample. In that case, the mean of the raw data can be obtained as the weighted mean of means, where the sample sizes $n_i$ are used as weights:

$$ \tilde m = \frac{\sum_i m_i n_i}{\sum_i n_i} $$

If the sample sizes are the same, $n_1 = n_2 = \dots$, this is the same as the arithmetic average of the means. If the sample sizes are not the same and you use the unweighted average anyway, the larger samples have less impact on the final estimate than they should, leading to a biased estimate.
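As a small illustration of the weighted mean of means, the sketch below uses made-up values for the sample means and sample sizes (the arrays `m` and `n_i` are hypothetical, not taken from the question) and compares the weighted result with the plain arithmetic average.

```python
import numpy as np

# Hypothetical sample means m_i and the sample sizes n_i they were computed from.
m = np.array([0.452, 0.468, 0.471])
n_i = np.array([1500, 1500, 500])

# Weighted mean of means: sum(m_i * n_i) / sum(n_i).
weighted = np.average(m, weights=n_i)

# Plain arithmetic average, which ignores the sample sizes.
unweighted = m.mean()

# With equal sample sizes the two estimates coincide.
equal_sizes = np.average(m, weights=np.full(m.size, 1500))

print(f"weighted mean of means:   {weighted:.6f}")
print(f"unweighted mean of means: {unweighted:.6f}")
print(f"weighted, equal sizes:    {equal_sizes:.6f}")
```

Here the smallest sample happens to have the largest mean, so the unweighted average gives it too much influence and comes out higher than the weighted estimate.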

See also a similar thread that we had recently: Can any quantity be bootstrapped?

Tim
  • Thank you for the great answer! It definitely seems like I misunderstood the usage of bootstrapping to find population parameters. – Krautsultan Aug 16 '23 at 09:48
  • It doesn't make sense to discuss this problem until the context for the problem is clearly stated. What is the purpose of caring about what distribution the raw data have? Are you using a parametric method on the data? Which one? If you seek data transformations, how do you incorporate the proper amount of transformation uncertainty in the final analysis? Why not use a nonparametric or semiparametric method? Start with the nonparametric CDF estimate (empirical cumulative distribution function estimator). Why do your data consist of means? Analyze the raw data, usually. – Frank Harrell Aug 16 '23 at 11:27