Let $X = [94, 10, 100, 100, 16, 14, 100, 100, 70, 88, 100, 100, 12, 100, 100, 58, 32, 100, 32, 36, 98, 0, 100, 100, 100]$
where $X$ contains students' scores (between 0 and 100); note the many full marks.
The question is: which statistics best describe these data, given that they are clearly non-Gaussian?
Option 1
If I fit a Gaussian by maximum likelihood I get
sample mean = 70.4 and SD = 37.96, so mean ± 1 SD gives the interval [32.44, 108.36].
If I instead fit a Gaussian to the data $X$ using normfit in MATLAB and obtain 95% confidence bounds on the mean and standard deviation, I get
$$
\begin{aligned}
\mu &= 70.4, & \mathrm{CI}_{95\%} &= [54.73,\ 86.06] \\
\sigma &= 37.96, & \mathrm{CI}_{95\%} &= [29.64,\ 52.80]
\end{aligned}
$$
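For reference, a minimal MATLAB sketch of how Option 1's numbers can be reproduced (normfit requires the Statistics Toolbox; the variable names are my own):

```matlab
% Scores of the 25 students
X = [94 10 100 100 16 14 100 100 70 88 100 100 12 ...
     100 100 58 32 100 32 36 98 0 100 100 100];

mu    = mean(X);                      % sample mean, 70.4
sigma = std(X);                       % sample SD (N-1 denominator), ~37.96
oneSD = [mu - sigma, mu + sigma];     % ~[32.44, 108.36]

% Gaussian fit with 95% confidence bounds on mu and sigma
[muhat, sigmahat, muci, sigmaci] = normfit(X, 0.05);
```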
Option 2
On the other hand, what if I use a left/right SD instead? That is, report two values, $SD_{left}$ and $SD_{right}$, where
$$
\begin{aligned}
SD_{left} &= \sqrt{\frac{1}{N_{left}} \sum_i I(X_i < \mu)\,(X_i - \mu)^2} &&= 49.94 \\
SD_{right} &= \sqrt{\frac{1}{N_{right}} \sum_i I(X_i \ge \mu)\,(X_i - \mu)^2} &&= 29.45
\end{aligned}
$$
where $\mu = 70.4$ is the mean, $I$ is the indicator function (1 if its argument is true, 0 otherwise), $N_{left} = \sum_i I(X_i < \mu) - 1 = 9$ is the number of samples below the mean (minus 1 to reduce bias, as in the usual $N-1$ correction), and $N_{right} = \sum_i I(X_i \ge \mu) - 1 = 14$ is the number of samples at or above the mean.
In this case the interval around the mean, $[\mu - SD_{left},\ \mu + SD_{right}]$, is [20.46, 99.85], instead of the previous result, [32.44, 108.36].
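The left/right SDs can be computed along the same lines; a sketch, reusing `X` and `mu` from the snippet above:

```matlab
left  = X(X <  mu);                   % the 10 samples below the mean
right = X(X >= mu);                   % the 15 samples at or above the mean

N_left  = numel(left)  - 1;           % 9
N_right = numel(right) - 1;           % 14

SD_left  = sqrt(sum((left  - mu).^2) / N_left);    % ~49.94
SD_right = sqrt(sum((right - mu).^2) / N_right);   % ~29.45

lrInterval = [mu - SD_left, mu + SD_right];        % ~[20.46, 99.85]
```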
Which one shall I go for, 1 or 2?
For the left SD, we take only the samples in $X$ that are less than the mean (70.4) and use them in the SD formula, adjusting $N$ to $N_{left}$.
My question is really about which SD value is more informative: the maximum-likelihood estimate (which assumes a well-behaved Gaussian) or the left/right estimate (which still uses the maximum-likelihood mean of a Gaussian)?
– deepML Aug 01 '12 at 16:25

"In this paper ... the use of the left and right variance is proposed and an index of asymmetry based on them is introduced. Several examples demonstrate its usefulness. The question of evaluating more accurately the dispersion of data about ..."
http://faculty.kfupm.edu.sa/math/anwarj/Research/39IJMEST2004r.pdf
– deepML Aug 01 '12 at 17:12