I have around 30 classes of variables with different means and variances (though the means aren't too far from each other; think 4-7), and the distributions are right-skewed. I am trying to do hypothesis testing on the sum of variables drawn from these classes.

For example, sometimes I have 20 values sampled at random from these classes, and other times I have 50. When I plot the couple hundred sums I get a distribution that is close to normal.

Looking around, I found the Lyapunov version of the central limit theorem. The only thing I'm getting stuck on is the denominator in the normalization formula.

Normally I would take the standard deviation of the variables making up the sum and use that, but is that appropriate in this case? I believe it is but I'd like some confirmation or a source that goes over an application of Lyapunov to real data.

Edits based on comments:

Additional information: the distributions are positive and right-skewed.

I've estimated the means of all the classes using a couple of prior years of data, and can find the estimated standard deviations if needed.

What I'm trying to do is tell whether a sum of n random variables drawn from a combination of different classes is statistically different from the sum of the estimated means of their classes.

So, suppose I have 20 values randomly chosen from the 30 classes (the same class can be chosen more than once). I add up the values to get a sum $X = x_1 + \dots + x_{20}$. Is this significantly different from the sum of the means of the classes represented here, $\mu = \mu_1 + \dots + \mu_{20}$?

  • do you know the means and variances of your classes? or can you at least estimate them? – Aksakal Sep 09 '21 at 15:29
  • There are a number of troubling aspects to this problem statement that need clarification. Is it really true you are lumping sums of 20 values in with sums of 50 values in your plot? Why? Second, no version of the CLT applies to data, because it's a statement about what eventually happens when a sample size becomes arbitrarily large. It would be more constructive for you to rewrite your question in terms of what you are really trying to do: tell us what the variables are, what your hypothesis is, and what test you are applying. – whuber Sep 09 '21 at 15:30
  • I added some additional comments. – Scott White Sep 09 '21 at 15:45
  • Are you perhaps asking how to obtain the distribution of the sum of $20$ randomly, independently chosen values from the $30$ classes? – whuber Sep 09 '21 at 16:32
  • @whuber more so I'm asking how I can check if the sum of 20 different values from the 30 classes is significant. – Scott White Sep 09 '21 at 16:37
  • That is equivalent to asking how to evaluate the percentage point function (inverse of the distribution function) at any arbitrary value associated with your level of significance. That is, it is tantamount to asking how to compute (or at least approximate) the entire distribution. The two most natural and easy methods are simulation and convolution, both of which are amply illustrated in other threads. For examples see https://stats.stackexchange.com/questions/69898, and search our site for convolution – whuber Sep 09 '21 at 16:44
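The simulation route mentioned in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: each class is modeled as a gamma distribution matched to its estimated mean and standard deviation (gamma is just one positive, right-skewed choice; nothing in the question says the classes are gamma), and the function name `simulate_sum_pvalue`, its parameters, and the example numbers are hypothetical.

```python
import random


def simulate_sum_pvalue(observed_sum, class_means, class_sds,
                        n_sims=20_000, seed=0):
    """Monte Carlo two-sided p-value for a sum of independent draws.

    class_means / class_sds hold one entry per summand (repeat a class's
    values if that class appears more than once in the sum).
    ASSUMPTION: each summand is gamma distributed with the given mean and
    standard deviation. This is a stand-in for "positive and right-skewed".
    """
    rng = random.Random(seed)
    # Gamma with mean m and sd s has shape (m/s)^2 and scale s^2/m.
    params = [((m / s) ** 2, s ** 2 / m)
              for m, s in zip(class_means, class_sds)]
    center = sum(class_means)
    observed_gap = abs(observed_sum - center)

    hits = 0
    for _ in range(n_sims):
        total = sum(rng.gammavariate(shape, scale) for shape, scale in params)
        # Count simulated sums at least as far from the expected sum
        # as the observed one (a simple two-sided convention).
        if abs(total - center) >= observed_gap:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Made-up numbers for four summands, purely for illustration.
    means = [5.0, 6.0, 4.0, 7.0]
    sds = [1.5, 2.0, 1.0, 2.5]
    print(simulate_sum_pvalue(30.0, means, sds))
```

Because the summands are skewed, measuring "extremeness" by absolute distance from the expected sum is itself a choice; comparing against the upper and lower simulated quantiles separately would be an equally reasonable alternative.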

1 Answer

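If the variables are independent, then you can apply Lyapunov's CLT. Simply look at the difference $\Delta=\sum_i x_i-\sum_i\mu_i$ and compare it to $s_n=\sqrt{\sum_i\sigma_i^2}$, the standard deviation of the sum under the null hypothesis; something like $|\Delta|>2s_n$ would be significant at roughly the 5% level (two-sided). Note that the $\sigma_i$ are the standard deviations of the classes the values came from (estimated from your prior data), not the sample standard deviation of the values in the sum.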
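As a rough sketch of that comparison in code: the function below computes $\Delta$, $s_n$, the ratio $z=\Delta/s_n$, and a two-sided p-value from the normal approximation. The name `sum_z_test` and the example numbers are made up for illustration, and the class means and standard deviations are assumed to be known or already estimated from earlier data.

```python
import math


def sum_z_test(values, class_means, class_sds):
    """Normal-approximation test of sum(values) against sum(class_means).

    class_means / class_sds hold one entry per observed value: the mean
    and standard deviation of the class that value was drawn from.
    ASSUMPTION: the values are independent and the sum is close enough
    to normal for the approximation to be useful.
    """
    delta = sum(values) - sum(class_means)
    # Standard deviation of the sum: root of the summed class variances.
    s_n = math.sqrt(sum(sd ** 2 for sd in class_sds))
    z = delta / s_n
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return delta, s_n, z, p_value


if __name__ == "__main__":
    # Hypothetical data: four values with their classes' means and sds.
    values = [5.1, 6.3, 4.2, 7.9]
    means = [5.0, 6.0, 4.0, 7.0]
    sds = [1.5, 2.0, 1.0, 2.5]
    delta, s_n, z, p = sum_z_test(values, means, sds)
    print(f"delta={delta:.3f}  s_n={s_n:.3f}  z={z:.3f}  p={p:.3f}")
```

With only 20 skewed summands the normal approximation may be loose in the tails, so it is worth checking the result against a simulation of the sum.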

Aksakal