In a skewed sample with a large n, does the Central Limit Theorem dictate that a t-test can be used, even if the mean cannot be interpreted?

Question

I understand that, for a highly skewed population and sample, the sampling distribution of the mean can still be approximately normal if the sample size is large, according to the Central Limit Theorem (CLT). This means that, arguably, the normality assumption of a t-test holds despite a skewed sample, because what the t-test requires is that the sampling distribution of the mean is normal.

However, we also know that the Mean statistic (x bar) can be affected by extreme scores, hence the use of the median in national average salary data.

If the mean is misleading, can it still be used in t-tests, even if the CLT suggests the normality assumption (of the sampling distribution of the sample mean) is met? Or is it the case that the standard error of the mean is robust to sample skewness, according to the CLT, and thus we can be confident that our confidence intervals are accurate? Could a type II error not occur, whereby two populations appear equal because one is highly skewed and its tail pulls the mean, even if the two populations have different underlying central tendencies?

What is relevant is not "the sampling distribution" (of what? Of the data itself?), but the sampling distribution of the mean. This may be helpful. What do you mean by "the mean statistic can be inaccurate"? If your extreme scores are bona fide observations, then they come from the data generating process and therefore influence the expectation, which we estimate using the sample mean. — Stephan Kolassa, Aug 23 '23 at 09:40
Hi Stephan, thanks for this helpful comment. I've tightened up the language to reflect the comments. I agree that, presuming no erroneous outliers and that all data are valid, the mean is "accurate" in what it is trying to do, i.e. sigma x divided by n. This still raises the question, though: if the tails of the skewed distribution "drag" the mean upwards, we no longer have a mean that represents the "bulk", nor necessarily a standard deviation that indicates percentiles in the same way as for a normal distribution. How does this influence our reporting and use of inferential stats? — Josh Blake, Aug 23 '23 at 10:06
If we are interested in the mean of a distribution, then we should work with the mean. Yes, that is tautological. If we believe the mean is too strongly influenced by the tails, then we should not be interested in the mean in the first place, but (say) in the median. Thus, I would say that our substantive question should govern which parameter we are interested in, and this in turn should govern our estimators and tests. Re your last comment, if we are interested in percentiles but have a skewed distribution, then "mean + x SDs" is simply the wrong tool to use, and we should use others. — Stephan Kolassa, Aug 23 '23 at 10:17
"However, we also know that the Mean statistic (x bar) can be biased": You have to be careful with this kind of claim. Unless the mean is undefined, $\bar X$ is an unbiased estimator in the technical sense of bias. — Dave, Aug 23 '23 at 10:19
"If we believe the mean is too strongly influenced by the tails, then we should not be interested in the mean in the first place". Indeed, which makes me wonder if the debate about CLT is a distraction from the more fundamental threat that skewness presents to parametric tests (e.g. t-test), that these tests rely on a tool that is not valid for its intended purpose? Surely the debate about whether large samples reduce the threat to the normality assumption in a t-test is wholly subsumed by the threat of skewness to the utility of the mean? Is pronounced skewness an automatic "no" for t-tests? — Josh Blake, Aug 23 '23 at 11:04
This question has been covered many, many times on this site. The CLT doesn't help us in applied statistics. It is strictly a limit theorem and may not result in an accurate enough approximation even for $n > 50,000$ if there is heavy skewness. The problem for your case lies more with the SD than with the sample mean. SD is not a good dispersion measure for asymmetric data distributions. — Frank Harrell, Aug 23 '23 at 11:37
Thanks for your comment Frank. Does heavy skewness not also represent a problem for the meaningfulness of the mean (and therefore standard error and parametric tests)? The specific question, which I do not think has been asked so much, is that posed above: should we be much more concerned about the meaningfulness of the mean than what the CLT says we can or cannot do? Presumably, if we are concerned about the mean as an indicator of central tendency, then this trumps CLT. Similarly, if not, then presumably the violation of normality is small anyway, and we do not need to worry about the CLT — Josh Blake, Aug 23 '23 at 12:50
"Meaningfulness" depends on the application. If one is interested in the total of a finite (but very large) population, then that's what one needs to know and estimating anything else is irrelevant. The moral is that the statistical characteristics of a sample should not (solely) determine the underlying subject matter question. — whuber, Aug 23 '23 at 13:52
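
The point raised in the comments, that the CLT is only a limit statement and the approximation can still be poor under heavy skewness, can be probed with a small simulation. What follows is a minimal sketch under stated assumptions, not something taken from the thread: the data are assumed to be lognormal, the function name `t_interval_coverage` and all parameter values are hypothetical, and the normal 0.975 quantile stands in for the t quantile (at n = 200 the two critical values differ by less than 1%). It estimates how often a nominal 95% interval for the mean actually contains the true mean.

```python
import math
import random
import statistics


def t_interval_coverage(n, sigma, reps=2000, seed=0):
    """Estimate coverage of a nominal 95% interval for the mean.

    Data are drawn from lognormal(mu=0, sigma); larger sigma means
    heavier right skew. Returns the fraction of simulated intervals
    that contain the true population mean.
    """
    rng = random.Random(seed)
    true_mean = math.exp(sigma ** 2 / 2)  # mean of lognormal(0, sigma)
    # Assumption: normal quantile used in place of the t quantile.
    crit = statistics.NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(reps):
        x = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        m = sum(x) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
        se = sd / math.sqrt(n)
        if m - crit * se <= true_mean <= m + crit * se:
            hits += 1
    return hits / reps


# Same sample size, two very different degrees of skewness.
mild = t_interval_coverage(n=200, sigma=0.25)
heavy = t_interval_coverage(n=200, sigma=2.0)
print(f"mild skew  (sigma=0.25): estimated coverage {mild:.3f}")
print(f"heavy skew (sigma=2.00): estimated coverage {heavy:.3f}")
```

Comparing the two printed values at the same n shows whether "large n" alone is enough for the interval to behave as advertised. This only addresses the coverage question; whether the mean is the right target at all is the separate, substantive question discussed in the comments above.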
