What statistical test should be used for comparing population means of skewed data?

Question

I am trying to find out whether the population means of 2 sets of data are significantly different at the 10% level. The 2 sets of data are the same size (25 sample points each) and are taken from the same sample but at different times. I plotted each set and found that both are highly skewed towards the left, meaning that I cannot use a t-test to compare the means. Does anyone have suggestions on what test I could use to find out whether the 2 sets of data are significantly different?

This is an ideal setting for a permutation test. See https://stats.stackexchange.com/questions/43958 for an explanation, details, and code. — whuber, Dec 26 '21 at 16:52
Two issues: (1) You say "...taken from the same sample but at different times..." (emphasis mine). Do you mean taken from the same population? (2) Are the two samples of size 25 paired, so that, for example, the first observation in the first sample is for the same 'individual' as the first observation in the second? Both of the previous comments seem to assume your data are paired. The answer does not assume pairing. I am not completely sure either way about that. Please clarify. — BruceET, Dec 26 '21 at 18:46

Glen_b · Answer 1 · 2021-12-27T02:01:39.007

When you say "from the same sample" it's not 100% clear whether you mean that you have two observations on each subject (paired data) or not. You should clarify.

There are several possibilities, of which I'll list some:

1. If the variable is one for which you have a suitable distributional model, you may be able to derive a suitable test (e.g. a test based on the likelihood ratio, or one that's asymptotically equivalent to it). Taking account of a suitable model if you have one may lead to more powerful tests than other approaches.

   If you want to add a little safety on significance level you could combine this (as a way to choose a statistic) with a permutation test; this has the advantage of offering good power if your model is close to correct while still keeping the significance level close to correct if your model is not as accurate.
2. A test based on resampling, either a permutation test or a bootstrap test. At these sample sizes I'd be looking at a permutation test; if the distribution is continuous and the other assumptions hold (sufficient to have exchangeability under the null) this will guarantee your significance level is almost exactly correct (though power may not be all that great if the distribution is quite far from normal).
3. If you're prepared to make some additional assumptions, including about the form of the alternative (assumptions that would make it correspond to a test of means), you may be able to use a suitable rank-based test, like a signed rank test (for paired data) or a Wilcoxon-Mann-Whitney (for unpaired data).
4. If your distribution is not so strongly skew or heavy tailed you may get pretty close to the desired significance level with a t-test (presumably paired). As with item 2, even if the significance level is fine, there may be an issue with power.
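
For the resampling option in the list above, here is a minimal sketch of a two-sample (unpaired) permutation test on the difference in means, using only the Python standard library. The function name `permutation_test_unpaired`, the number of permutations, and the made-up left-skewed data are all assumptions for illustration; it is a Monte Carlo approximation rather than an exact enumeration of all splits.

```python
import random


def permutation_test_unpaired(x, y, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Assumes the two samples are exchangeable under the null hypothesis.
    Returns (observed difference in means, approximate p-value).
    """
    rng = random.Random(seed)
    n_x = len(x)
    observed = sum(x) / n_x - sum(y) / len(y)
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled observations to the two groups.
        rng.shuffle(pooled)
        group_a, group_b = pooled[:n_x], pooled[n_x:]
        diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical left-skewed data (negated exponentials), 25 points per sample.
data_rng = random.Random(1)
x = [-data_rng.expovariate(1.0) for _ in range(25)]
y = [-data_rng.expovariate(1.0) - 0.5 for _ in range(25)]
diff, p = permutation_test_unpaired(x, y)
print(f"observed difference = {diff:.3f}, approximate p = {p:.4f}")
```

At the 10% level in the question, you would reject the null when the returned p-value is below 0.10. If the data are paired, this unpaired version is not the right one to use.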

If I was faced with this situation and didn't know much about the likely population distribution, I'd probably lean toward a straight permutation test. If your data are paired, this would presumably involve a one-sample permutation test on pair-differences.
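
If the data do turn out to be paired, the one-sample permutation test on pair-differences can be sketched as a sign-flip test: under the null, each difference is as likely to be positive as negative. The sketch below uses only the Python standard library; the function name `sign_flip_test`, its defaults, and the example data are assumptions for illustration. With 25 pairs there are 2^25 (about 33.5 million) sign patterns, so it samples them at random rather than enumerating them all.

```python
import random


def sign_flip_test(first, second, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo sign-flip test on paired differences.

    Assumes first[i] and second[i] are measurements on the same subject and
    that the differences are symmetric about zero under the null.
    Returns (observed mean difference, approximate p-value).
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(n_perm):
        # Flip the sign of each difference independently with probability 1/2.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical paired data: 25 subjects measured at two times.
data_rng = random.Random(2)
time1 = [-data_rng.expovariate(1.0) for _ in range(25)]
time2 = [v + data_rng.gauss(0.3, 0.5) for v in time1]
mean_diff, p = sign_flip_test(time1, time2)
print(f"mean difference = {mean_diff:.3f}, approximate p = {p:.4f}")
```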

score 0 · Answer 2 · answered Dec 26 '21 at 16:46

Try bootstrapping. Here's one approach. Lump all samples into one bag of 50 observations. Run $K$ simulations as follows. In simulation $k$, sample 25 observations $x_{ki}$ with replacement from the bag, and another 25 observations $z_{ki}$ with replacement. Calculate the difference $\Delta_k=\bar x_k-\bar z_k$ for each simulation, where $\bar x_k,\bar z_k$ are the averages (sample means) of the two resamples. Now you can estimate the distribution of $\Delta_k$, compare it to the difference of the sample means of the original two samples, and run tests of significance. For instance, if the original difference of means was $\Delta_o=\hat\mu_1-\hat\mu_2$, then you could compare it to the standard deviation or a percentile of the distribution of $\Delta_k$, under the null hypothesis that the population means are equal ($\mu_1=\mu_2$).
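
The pooled bootstrap described above might look roughly like this in Python (standard library only). The function name `pooled_bootstrap_test`, the number of simulations, and the example data are assumptions for illustration. Note that this treats the two samples as unpaired, and it uses the percentile-style comparison: the p-value is the share of simulated differences at least as large in magnitude as the observed one.

```python
import random


def pooled_bootstrap_test(x, y, n_boot=10_000, seed=0):
    """Two-sided bootstrap test of equal means, resampling from the pooled bag.

    Returns (observed difference in means, approximate p-value).
    """
    rng = random.Random(seed)
    bag = list(x) + list(y)  # the pooled "bag" of all observations
    observed = sum(x) / len(x) - sum(y) / len(y)
    extreme = 0
    for _ in range(n_boot):
        # Draw two resamples with replacement from the same bag,
        # which imposes the null hypothesis of a common distribution.
        xs = rng.choices(bag, k=len(x))
        zs = rng.choices(bag, k=len(y))
        delta = sum(xs) / len(xs) - sum(zs) / len(zs)
        if abs(delta) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_boot + 1)


# Hypothetical left-skewed data (negated exponentials), 25 points per sample.
data_rng = random.Random(3)
x = [-data_rng.expovariate(1.0) for _ in range(25)]
y = [-data_rng.expovariate(1.0) - 0.5 for _ in range(25)]
delta_o, p = pooled_bootstrap_test(x, y)
print(f"observed difference = {delta_o:.3f}, approximate p = {p:.4f}")
```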
